Load packages
### load packages
options( warn = -1)
library(ggplot2)
suppressMessages(library(dplyr))
suppressMessages(library(statsr))
suppressMessages(library(corrplot))
suppressMessages(library(gridExtra))
Load data
load("movies.Rdata")
This project consists of 6 parts:
The data set, movies, comprises 32 variables and 651 randomly sampled movies produced and released before 2016.
This is an observational study: the researchers did not introduce any intervention that would modify the data. Some variables are only informative. Because this is not an experimental study, we cannot establish causality from any of the results; correlation does not prove causation.
The results can be generalized only to movies produced or released by companies inside the United States of America.
Is the audience score influenced by, or related to, other variables? Which variable(s) most influence the audience?
The main idea is to analyze whether the audience score (audience_score) is influenced by different predictors, and so to answer the research question. I selected 4 numerical predictors (imdb_rating, critics_score, runtime, imdb_num_votes) and 4 categorical predictors (audience_rating, genre, mpaa_rating, best_actor_win).
Answering this question will support, or not, my assumption that the public does not pay much attention to the criticism of experts or of the Internet Movie Database, but that an actor with an Oscar from the Academy could have a strong influence on audience opinion.
Answering the research question could help a company focus on the variables that influence the final user, the audience.
Because the research question defined above involves only specific variables, one predicted variable (audience_score) and eight predictors (four numerical: imdb_rating, critics_score, runtime, imdb_num_votes; four categorical: audience_rating, genre, mpaa_rating, best_actor_win), I took a subset of the original data frame so that the exploratory data analysis starts from a reduced data frame containing only the selected variables.
NewMovies<- movies %>% select(audience_score, imdb_rating, audience_rating,genre, critics_score,
runtime, mpaa_rating, imdb_num_votes, best_actor_win) %>%
filter (mpaa_rating == "G" | mpaa_rating == "PG"|mpaa_rating == "PG-13"|
mpaa_rating == "R"| mpaa_rating == "Unrated")
NewMovies<-na.omit(NewMovies)
There are several techniques for dealing with missing values; I decided to use simple truncation, that is, dropping every row that contains a missing value (na.omit), as shown above.
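As a minimal sketch of what this truncation does, the toy data frame below (hypothetical values, not taken from the movies data) shows that na.omit() keeps only complete rows, and that complete.cases() can report beforehand how many rows would be lost.

```r
# Toy data frame with hypothetical values (not from the movies data set)
toy <- data.frame(
  audience_score = c(80, 55, NA, 62),
  runtime        = c(120, NA, 95, 101)
)

# TRUE for rows that have no missing value in any column
complete_rows <- complete.cases(toy)
n_dropped <- sum(!complete_rows)

# na.omit() keeps only the complete rows
toy_clean <- na.omit(toy)

cat("rows before:", nrow(toy),
    "| rows after:", nrow(toy_clean),
    "| dropped:", n_dropped, "\n")
```

Checking the number of dropped rows first is a cheap way to see whether truncation discards a large share of the sample.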
smp_size <- floor(0.75 * nrow(NewMovies));set.seed(123)
train_ind <- sample(seq_len(nrow(NewMovies)), size = smp_size)
SubSetMovies <- NewMovies[train_ind, ]
test <- NewMovies[-train_ind, ]
hist(SubSetMovies$audience_score, main="Histogram for audience Score (Predicted Variable)",
xlab="Audience Score", border="black",breaks=50, xlim=c(0,100), las=1)
summary(SubSetMovies$audience_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 46.00 65.00 62.48 80.00 97.00
FIGURE 1 & TABLE 1: Distribution of the predicted variable (Bin 15). The median audience score is 65 (mean 62.48), with a minimum of 11 and a maximum of 97. No movie scored below 11.
par(mfrow=c(2,2))
hist(SubSetMovies$critics_score, main="(A) Histogram for Critics Score",
xlab="Critics score on Rotten Tomatoes", border="black", breaks=50, xlim=c(0,100), las=1)
hist(SubSetMovies$runtime, main="(B) Histogram for Runtime",
xlab="Runtime", border="black", breaks=50, xlim=c(0,200),
las=1)
hist(SubSetMovies$imdb_rating, main="(C) Histogram for imdb_rating",
xlab="Internet Movie Database rating", border="black", breaks=50, xlim=c(0,10),
las=1)
hist(SubSetMovies$imdb_num_votes, main="(D) Histogram for imdb_num_votes",
xlab="Internet Movie Database number of votes", border="black", breaks=200, xlim=c(0,300000),
las=1)
FIGURE 2: Histograms for the predictor numerical variables. Critics score and IMDB num votes show a broad distribution.
It is good practice to standardize the predictor/explanatory variables before proceeding so that they have a mean of zero (“centering”) and standard deviation of one (“scaling”). It ensures that the estimated coefficients are all on the same scale, making it easier to compare effect sizes during modeling analysis.
SubSetMovies$critics_scoreS <- scale(SubSetMovies$critics_score, center = TRUE, scale = TRUE)
SubSetMovies$runtimeS <- scale(SubSetMovies$runtime, center = TRUE, scale = TRUE)
SubSetMovies$imdb_ratingS <- scale(SubSetMovies$imdb_rating, center = TRUE, scale = TRUE)
SubSetMovies$imdb_num_votesS <- scale(SubSetMovies$imdb_num_votes, center = TRUE, scale = TRUE)
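To make the effect of standardization concrete, here is a small sketch on a hypothetical vector (the values are made up for illustration). It checks that scale() matches the manual formula (x - mean) / sd and that the result has mean zero and standard deviation one. Note that scale() returns a one-column matrix, so as.numeric() is used here to get a plain vector.

```r
# Hypothetical scores, for illustration only
x <- c(12, 45, 67, 80, 93)

# scale() returns a one-column matrix; convert it to a plain numeric vector
z <- as.numeric(scale(x, center = TRUE, scale = TRUE))

# The same transformation written out by hand
z_manual <- (x - mean(x)) / sd(x)

cat("mean of z:", round(mean(z), 10), "| sd of z:", round(sd(z), 10), "\n")
cat("matches manual formula:", isTRUE(all.equal(z, z_manual)), "\n")
```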
par(mfrow=c(2,2))
hist(SubSetMovies$critics_scoreS, main="(A) Standardized Critics Score",
xlab="Critics score (standardized)", border="black", xlim=c(-2,2), las=1, breaks=20)
hist(SubSetMovies$runtimeS, main="(B) Standardized Runtime",
xlab="Runtime (standardized)", border="black", xlim=c(-2,2), las=1, breaks=20)
hist(SubSetMovies$imdb_ratingS, main="(C) Standardized imdb_rating",
xlab="IMDB rating (standardized)", border="black", xlim=c(-2,2), las=1, breaks=20)
hist(SubSetMovies$imdb_num_votesS, main="(D) Standardized imdb_num_votes",
xlab="IMDB number of votes (standardized)", border="black", xlim=c(-2,2),
las=1, breaks=50)
FIGURE 3: Histograms for the standardized predictor variables
Matrix_Numerical_Variables <- SubSetMovies[names(SubSetMovies) %in% c('audience_score','critics_score', 'runtime', 'imdb_rating','imdb_num_votes')]
corr.matrix <- cor(Matrix_Numerical_Variables)
corrplot(corr.matrix, method="number")
FIGURE 4: The correlation plot of the numerical variables shows a high correlation between two predictors (critics_score and imdb_rating, correlation of 0.77), a sign of collinearity that needs to be analyzed. Collinearity can compromise the parameter estimates of the final regression model; ideally, the predictors should be independent of each other.
p1 <- ggplot(aes(x=mpaa_rating), data=SubSetMovies) + geom_bar(aes(y=100*(..count..)/sum(..count..))) + ylab('Frequency (%)') +
ggtitle('(A) MPAA rating') + coord_flip()
p2 <- ggplot(aes(x=genre), data=SubSetMovies) + geom_bar(aes(y=100*(..count..)/sum(..count..))) + ylab('Frequency (%)') +
ggtitle('(B) Genre of movies') + coord_flip()
p3 <- ggplot(aes(x=best_actor_win), data=SubSetMovies) + geom_bar(aes(y=100*(..count..)/sum(..count..))) + ylab('Frequency (%)') +
ggtitle('(C) Actor ever won an Oscar') + coord_flip()
p4 <- ggplot(aes(x=audience_rating), data=SubSetMovies) + geom_bar(aes(y=100*(..count..)/sum(..count..))) + ylab('Frequency (%)') +
ggtitle('(D) Audience rating') + coord_flip()
grid.arrange(p1, p2, p3, p4, ncol=2)
FIGURE 5: Bar plots for categorical variables.
Plot (A): only the ratings R, PG-13, PG, Unrated and G were kept as levels of the categorical variable mpaa_rating (the MPAA rating of the movie).
Plot (B) shows the categorical variable genre; Drama has the highest frequency. Plot (C) shows the categorical variable best_actor_win (whether an actor in the movie ever won an Oscar); it was included to consider whether the presence of Oscar-winning actors influences public opinion of a movie. Plot (D) shows the categorical variable audience_rating.
As discussed above, this project considers four categorical variables as predictors, so we need to examine the possible association between each of them and the predicted variable.
3.5.1: Audience score vs. mpaa_rating
boxplot(audience_score~mpaa_rating, data=SubSetMovies, main='Audience score vs. mpaa_rating',
xlab='mpaa_rating', ylab='Audience Score')
FIGURE 6
by(SubSetMovies$audience_score, SubSetMovies$mpaa_rating, summary)
## SubSetMovies$mpaa_rating: G
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.0 53.5 70.0 66.0 84.0 92.0
## ------------------------------------------------------------
## SubSetMovies$mpaa_rating: NC-17
## NULL
## ------------------------------------------------------------
## SubSetMovies$mpaa_rating: PG
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 49.25 65.00 62.84 79.00 93.00
## ------------------------------------------------------------
## SubSetMovies$mpaa_rating: PG-13
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 43.25 55.00 55.72 69.75 94.00
## ------------------------------------------------------------
## SubSetMovies$mpaa_rating: R
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.00 44.00 66.00 62.28 80.00 97.00
## ------------------------------------------------------------
## SubSetMovies$mpaa_rating: Unrated
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.00 78.00 84.50 79.92 88.25 96.00
TABLE 2
FIGURE 6 & TABLE 2: The categorical variable mpaa_rating, with the levels considered in this study (G, PG, PG-13, R and Unrated), seems to have a reasonable association with audience score.
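The boxplot and group summaries only suggest an association. One possible way to quantify it, which is not part of the original analysis, is a one-way ANOVA of the score on the categorical variable. The sketch below uses R's built-in iris data as a stand-in, since it has the same structure (numeric response, factor predictor). With the project data the call would presumably be aov(audience_score ~ mpaa_rating, data = SubSetMovies), which is an assumption and was not run here.

```r
# Stand-in example on the built-in iris data (numeric response ~ factor)
fit_aov <- aov(Sepal.Length ~ Species, data = iris)

# summary() returns a list; its first element is the ANOVA table
aov_table <- summary(fit_aov)[[1]]

# p-value of the F-test for the factor (first row of the table)
p_value <- aov_table[["Pr(>F)"]][1]

# Eta-squared: share of the total variation explained by the factor
eta_sq <- aov_table[["Sum Sq"]][1] / sum(aov_table[["Sum Sq"]])

cat(sprintf("p-value = %.3g, eta-squared = %.3f\n", p_value, eta_sq))
```

A small p-value indicates that at least one group mean differs; eta-squared gives a rough effect size that is easier to compare across categorical predictors than boxplots alone.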
3.5.2: Audience score vs. genre
boxplot(audience_score~genre, data=SubSetMovies, main='Audience score vs. genre',
xlab='genre', ylab='Audience Score')
FIGURE 7
by(SubSetMovies$audience_score, SubSetMovies$genre, summary)
## SubSetMovies$genre: Action & Adventure
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 37.25 53.00 53.79 65.25 94.00
## ------------------------------------------------------------
## SubSetMovies$genre: Animation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 55.50 63.00 59.29 67.50 88.00
## ------------------------------------------------------------
## SubSetMovies$genre: Art House & International
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.00 48.00 65.50 62.50 78.75 86.00
## ------------------------------------------------------------
## SubSetMovies$genre: Comedy
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.00 38.50 51.00 53.76 68.50 93.00
## ------------------------------------------------------------
## SubSetMovies$genre: Documentary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 57.00 78.50 86.00 83.94 89.50 96.00
## ------------------------------------------------------------
## SubSetMovies$genre: Drama
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 52.00 71.00 65.41 80.00 95.00
## ------------------------------------------------------------
## SubSetMovies$genre: Horror
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.00 34.25 42.00 45.00 53.25 84.00
## ------------------------------------------------------------
## SubSetMovies$genre: Musical & Performing Arts
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 55.00 73.25 79.00 78.60 87.00 95.00
## ------------------------------------------------------------
## SubSetMovies$genre: Mystery & Suspense
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.00 40.00 53.00 55.50 70.25 97.00
## ------------------------------------------------------------
## SubSetMovies$genre: Other
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.00 53.00 76.00 67.75 87.25 91.00
## ------------------------------------------------------------
## SubSetMovies$genre: Science Fiction & Fantasy
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 26.00 42.50 69.00 61.83 83.50 85.00
TABLE 3
FIGURE 7 & TABLE 3: The categorical variable genre shows a reasonable association with the dependent variable.
3.5.3: Audience score vs. best_actor_win
boxplot(audience_score~best_actor_win, data=SubSetMovies, main='Audience score vs. best_actor_win',
xlab='Actor ever won an Oscar', ylab='Audience Score')
FIGURE 8
by(SubSetMovies$audience_score, SubSetMovies$best_actor_win, summary)
## SubSetMovies$best_actor_win: no
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 45.50 65.00 62.17 79.50 96.00
## ------------------------------------------------------------
## SubSetMovies$best_actor_win: yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 49.00 68.00 64.16 81.00 97.00
TABLE 4
3.5.4: Audience score vs. audience_rating
boxplot(audience_score~audience_rating, data=SubSetMovies, main='Audience score vs. audience_rating',
xlab='Audience Rating', ylab='Audience Score')
FIGURE 9
by(SubSetMovies$audience_score, SubSetMovies$audience_rating, summary)
TABLE 5
4.1 Analysis of collinearity between predictors critics_score & imdb_rating
FIGURE 4, above, shows a high correlation between two numerical predictors (critics_score and imdb_rating), and it is good practice to analyze possible collinearity during regression modeling.
basic.lm<-lm(SubSetMovies$imdb_rating ~SubSetMovies$critics_score)
(prelim_plot <- ggplot(SubSetMovies, aes(x =imdb_rating, y = critics_score)) +
geom_point() +
geom_smooth(method = "lm"))
FIGURE 9: Approximately linear, positive relationship between the two numerical predictors
summary(basic.lm)
##
## Call:
## lm(formula = SubSetMovies$imdb_rating ~ SubSetMovies$critics_score)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.88805 -0.39053 0.06126 0.42999 2.52186
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.758027 0.073307 64.91 <2e-16 ***
## SubSetMovies$critics_score 0.030028 0.001144 26.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.718 on 484 degrees of freedom
## Multiple R-squared: 0.5874, Adjusted R-squared: 0.5865
## F-statistic: 689 on 1 and 484 DF, p-value: < 2.2e-16
TABLE 5: Summary of relationship between predictors
FIGURE 9 and TABLE 5 show the collinearity (correlation) between these two predictors.
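One way to put a number on this collinearity is the variance inflation factor, VIF = 1 / (1 - R^2), where R^2 comes from regressing one predictor on the other. The sketch below simply plugs in the multiple R-squared reported in TABLE 5. The threshold mentioned in the comment is a common rule of thumb and an assumption on my part, not something established in this analysis.

```r
# Multiple R-squared of imdb_rating ~ critics_score, as reported in TABLE 5
r_squared <- 0.5874

# With a single predictor, |correlation| is the square root of R-squared
# (about 0.77, in line with the correlation plot)
r <- sqrt(r_squared)

# Variance inflation factor for either predictor in a two-predictor setting
vif <- 1 / (1 - r_squared)

# Rule of thumb (assumption): VIF above roughly 5 to 10 is often read as
# problematic collinearity
cat(sprintf("correlation = %.3f, VIF = %.2f\n", r, vif))
```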
basic.lm1<-lm(SubSetMovies$audience_score ~SubSetMovies$critics_score)
p1a<-prelim_plot <- ggplot(SubSetMovies, aes(x =critics_score, y = audience_score)) +
geom_point() +
ggtitle('(A) Audience score vs critics score')+
geom_smooth(method = "lm")
basic.lm2<-lm(SubSetMovies$audience_score ~SubSetMovies$imdb_rating)
p2a<-prelim_plot <- ggplot(SubSetMovies, aes(x =imdb_rating, y = audience_score)) +
geom_point() + ggtitle('(B) Audience score vs imdb rating')+
geom_smooth(method = "lm")
fit1<-lm(audience_score~critics_score, data = SubSetMovies);
fit2<-lm(audience_score~imdb_rating, data =SubSetMovies)
p3a<-ggplot(data = fit1, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")+ ggtitle('(C) Residuals')
p4a<-ggplot(data = fit2, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")+ ggtitle('(D) Residuals')
grid.arrange(p1a, p2a, p3a,p4a, ncol=2)
FIGURE 10: Relationship between the predicted variable and each of the predictors that present collinearity (plots A and B), and the corresponding residuals (plots C and D).
summary(basic.lm1);summary(basic.lm2)
##
## Call:
## lm(formula = SubSetMovies$audience_score ~ SubSetMovies$critics_score)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.457 -10.032 0.865 9.817 43.683
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.17048 1.47518 22.49 <2e-16 ***
## SubSetMovies$critics_score 0.51049 0.02302 22.18 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.45 on 484 degrees of freedom
## Multiple R-squared: 0.504, Adjusted R-squared: 0.503
## F-statistic: 491.8 on 1 and 484 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = SubSetMovies$audience_score ~ SubSetMovies$imdb_rating)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.129 -6.944 1.154 5.706 51.616
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -39.892 2.795 -14.27 <2e-16 ***
## SubSetMovies$imdb_rating 15.793 0.425 37.16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.45 on 484 degrees of freedom
## Multiple R-squared: 0.7404, Adjusted R-squared: 0.7399
## F-statistic: 1381 on 1 and 484 DF, p-value: < 2.2e-16
TABLE 6: Summary of the relationship between the predicted variable and each of the predictors that present collinearity.
Collinearity: We can deal with collinearity in different ways, including principal component analysis (PCA) and other methods, but in this project I decided, based on the results presented in FIGURE 10 and TABLE 6, to drop one of the two collinear variables. The question is which one, and why? If we eliminated the variable with the lower R-squared, that would be critics_score (R2 = 0.504, Table 6), and we would continue with imdb_rating (R2 = 0.740, Table 6). BUT the residual plots (FIGURE 10, plots C and D) show that the fit of the predicted variable on imdb_rating presents heteroscedasticity (non-constant variance of the residuals), so I decided to keep the predictor with the lower R-squared, critics_score, whose residuals show homoscedasticity (constant variance). In short: a lower R-squared, but with homoscedasticity.
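The judgment about heteroscedasticity above is visual. As a hedged complement, the sketch below shows how a studentized Breusch-Pagan (Koenker) test can be computed by hand in base R: regress the squared residuals on the predictor and compare n * R^2 with a chi-squared distribution. The data are simulated so that the noise grows with x; nothing here was run on the movies data, and the variable names are hypothetical.

```r
set.seed(42)

# Simulated data where the error spread grows with x (heteroscedastic by design)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)

# Main regression
fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the same predictor
aux <- lm(resid(fit)^2 ~ x)

# Studentized Breusch-Pagan statistic: n * R-squared of the auxiliary fit,
# compared with a chi-squared distribution (df = number of predictors = 1)
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

cat(sprintf("BP statistic = %.2f, p-value = %.3g\n", lm_stat, p_value))
```

A small p-value rejects constant variance. Applied to the two single-predictor fits, such a test could back up the visual comparison of plots C and D.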
4.2 Model selection
We will use backward elimination as the stepwise model selection method. We start with a full model, that is, a model that includes all candidate covariates (predictors), and then drop variables one at a time until a parsimonious model is reached. I will focus on p-values and adjusted R-squared.
full_model<-lm(audience_score ~ critics_score+ runtime + imdb_num_votes+genre + best_actor_win+audience_rating+mpaa_rating, data =SubSetMovies )
summary(full_model)
##
## Call:
## lm(formula = audience_score ~ critics_score + runtime + imdb_num_votes +
## genre + best_actor_win + audience_rating + mpaa_rating, data = SubSetMovies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.6512 -5.8570 0.9251 6.2862 21.3303
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.405e+01 3.821e+00 8.909 < 2e-16 ***
## critics_score 1.936e-01 1.897e-02 10.204 < 2e-16 ***
## runtime 2.774e-02 2.577e-02 1.076 0.282382
## imdb_num_votes 1.476e-05 4.400e-06 3.355 0.000858 ***
## genreAnimation -1.575e+00 4.190e+00 -0.376 0.707184
## genreArt House & International 2.096e+00 2.917e+00 0.718 0.472929
## genreComedy 8.682e-01 1.710e+00 0.508 0.611942
## genreDocumentary 6.923e+00 2.448e+00 2.828 0.004887 **
## genreDrama 1.283e+00 1.511e+00 0.849 0.396458
## genreHorror -7.970e-01 2.422e+00 -0.329 0.742218
## genreMusical & Performing Arts 5.944e+00 3.170e+00 1.875 0.061406 .
## genreMystery & Suspense -8.058e-01 1.928e+00 -0.418 0.676204
## genreOther -6.718e-02 2.869e+00 -0.023 0.981329
## genreScience Fiction & Fantasy 1.014e+00 3.799e+00 0.267 0.789700
## best_actor_winyes 7.622e-01 1.165e+00 0.654 0.513243
## audience_ratingUpright 2.722e+01 1.030e+00 26.418 < 2e-16 ***
## mpaa_ratingPG -2.998e+00 2.963e+00 -1.012 0.312202
## mpaa_ratingPG-13 -3.957e+00 3.054e+00 -1.295 0.195832
## mpaa_ratingR -3.775e+00 2.946e+00 -1.281 0.200657
## mpaa_ratingUnrated -3.291e+00 3.383e+00 -0.973 0.331049
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.711 on 466 degrees of freedom
## Multiple R-squared: 0.8264, Adjusted R-squared: 0.8193
## F-statistic: 116.8 on 19 and 466 DF, p-value: < 2.2e-16
TABLE 7 Summary of full_model. From Table 7, the predictor chosen for exclusion from the model is runtime (p-value = 0.28), which leads to full_model2.
full_model2<-lm(audience_score ~ critics_score+ imdb_num_votes+genre + best_actor_win+audience_rating+mpaa_rating, data =SubSetMovies )
summary(full_model2)
##
## Call:
## lm(formula = audience_score ~ critics_score + imdb_num_votes +
## genre + best_actor_win + audience_rating + mpaa_rating, data = SubSetMovies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.5299 -5.7521 0.9312 6.1541 21.4881
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.649e+01 3.077e+00 11.859 < 2e-16 ***
## critics_score 1.952e-01 1.892e-02 10.313 < 2e-16 ***
## imdb_num_votes 1.619e-05 4.194e-06 3.862 0.000128 ***
## genreAnimation -1.906e+00 4.180e+00 -0.456 0.648539
## genreArt House & International 2.150e+00 2.918e+00 0.737 0.461553
## genreComedy 7.039e-01 1.704e+00 0.413 0.679667
## genreDocumentary 6.658e+00 2.436e+00 2.733 0.006514 **
## genreDrama 1.436e+00 1.505e+00 0.954 0.340616
## genreHorror -9.950e-01 2.415e+00 -0.412 0.680537
## genreMusical & Performing Arts 6.308e+00 3.152e+00 2.001 0.045948 *
## genreMystery & Suspense -6.691e-01 1.924e+00 -0.348 0.728222
## genreOther 5.084e-02 2.868e+00 0.018 0.985861
## genreScience Fiction & Fantasy 1.079e+00 3.799e+00 0.284 0.776632
## best_actor_winyes 1.031e+00 1.138e+00 0.906 0.365528
## audience_ratingUpright 2.728e+01 1.029e+00 26.504 < 2e-16 ***
## mpaa_ratingPG -2.842e+00 2.960e+00 -0.960 0.337497
## mpaa_ratingPG-13 -3.611e+00 3.038e+00 -1.188 0.235244
## mpaa_ratingR -3.577e+00 2.940e+00 -1.216 0.224435
## mpaa_ratingUnrated -3.052e+00 3.376e+00 -0.904 0.366407
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.712 on 467 degrees of freedom
## Multiple R-squared: 0.826, Adjusted R-squared: 0.8193
## F-statistic: 123.2 on 18 and 467 DF, p-value: < 2.2e-16
TABLE 8 Summary of full_model2.
From TABLE 8, we can eliminate the entire categorical variable mpaa_rating to arrive at full_model3.
full_model3<-lm(audience_score ~ critics_score+ imdb_num_votes+genre + best_actor_win+audience_rating, data =SubSetMovies )
summary(full_model3)
##
## Call:
## lm(formula = audience_score ~ critics_score + imdb_num_votes +
## genre + best_actor_win + audience_rating, data = SubSetMovies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.9970 -5.8276 0.8967 6.0744 21.5227
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.336e+01 1.418e+00 23.523 < 2e-16 ***
## critics_score 1.997e-01 1.830e-02 10.910 < 2e-16 ***
## imdb_num_votes 1.550e-05 4.090e-06 3.789 0.000171 ***
## genreAnimation 6.546e-01 3.524e+00 0.186 0.852719
## genreArt House & International 1.700e+00 2.838e+00 0.599 0.549386
## genreComedy 3.174e-01 1.671e+00 0.190 0.849440
## genreDocumentary 6.320e+00 2.137e+00 2.958 0.003252 **
## genreDrama 9.006e-01 1.430e+00 0.630 0.529077
## genreHorror -1.519e+00 2.337e+00 -0.650 0.515974
## genreMusical & Performing Arts 5.798e+00 3.100e+00 1.870 0.062050 .
## genreMystery & Suspense -1.261e+00 1.839e+00 -0.686 0.493173
## genreOther -2.519e-01 2.844e+00 -0.089 0.929470
## genreScience Fiction & Fantasy 1.294e+00 3.778e+00 0.343 0.732010
## best_actor_winyes 1.044e+00 1.127e+00 0.926 0.354872
## audience_ratingUpright 2.726e+01 1.024e+00 26.620 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.692 on 471 degrees of freedom
## Multiple R-squared: 0.8253, Adjusted R-squared: 0.8202
## F-statistic: 159 on 14 and 471 DF, p-value: < 2.2e-16
TABLE 9 Summary of full_model3. Table 9 shows that the categorical variable best_actor_win can be discarded in this research. We cannot remove only one level of a categorical variable; we need to remove the entire variable.
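Because a categorical variable enters or leaves the model as a whole, the per-level p-values in summary() are not the most natural basis for the decision. A possible alternative, sketched here on the built-in mtcars data rather than on the movies data, is drop1() with an F-test, which reports one test per term, so a factor with several levels gets a single p-value. The model and variables below are stand-ins chosen for illustration.

```r
# Stand-in data: treat the number of cylinders as a categorical predictor
cars <- transform(mtcars, cyl = factor(cyl))

# Full model with three numerical predictors and one factor (3 levels)
full <- lm(mpg ~ wt + hp + qsec + cyl, data = cars)

# One F-test per term: each row shows the effect of dropping that whole term
term_tests <- drop1(full, test = "F")
print(term_tests)

# The factor is tested as a unit: 3 levels correspond to 2 degrees of freedom
cyl_df <- term_tests["cyl", "Df"]
```

In a backward elimination step, the term with the largest p-value in this table would be the candidate to drop.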
full_model4<-lm(audience_score ~ critics_score+ imdb_num_votes+genre + audience_rating, data =SubSetMovies )
summary(full_model4)
##
## Call:
## lm(formula = audience_score ~ critics_score + imdb_num_votes +
## genre + audience_rating, data = SubSetMovies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.0750 -5.9142 0.8498 6.0660 21.3457
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.346e+01 1.414e+00 23.659 < 2e-16 ***
## critics_score 2.000e-01 1.829e-02 10.932 < 2e-16 ***
## imdb_num_votes 1.587e-05 4.070e-06 3.899 0.000111 ***
## genreAnimation 6.875e-01 3.523e+00 0.195 0.845379
## genreArt House & International 1.600e+00 2.835e+00 0.564 0.572913
## genreComedy 2.981e-01 1.671e+00 0.178 0.858426
## genreDocumentary 6.243e+00 2.135e+00 2.925 0.003613 **
## genreDrama 9.767e-01 1.427e+00 0.684 0.494059
## genreHorror -1.638e+00 2.333e+00 -0.702 0.483044
## genreMusical & Performing Arts 5.797e+00 3.099e+00 1.870 0.062067 .
## genreMystery & Suspense -1.064e+00 1.826e+00 -0.582 0.560530
## genreOther -2.410e-01 2.844e+00 -0.085 0.932506
## genreScience Fiction & Fantasy 1.145e+00 3.774e+00 0.303 0.761649
## audience_ratingUpright 2.724e+01 1.024e+00 26.609 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.69 on 472 degrees of freedom
## Multiple R-squared: 0.825, Adjusted R-squared: 0.8202
## F-statistic: 171.2 on 13 and 472 DF, p-value: < 2.2e-16
TABLE 10 Summary of full_model4.
At this point we need to choose between a more parsimonious model and a higher adjusted R-squared when considering the elimination of an entire categorical variable for which only some levels have a low p-value, i.e. genre (see Table 10). This variable has 11 levels, but only 2 of them (genreMusical & Performing Arts and genreDocumentary) have a low p-value. The idea is to eliminate the entire categorical variable and observe the change in adjusted R-squared. If the change is negligible, we will prefer the parsimonious model; if the change is substantial, we will put the variable back into the model and prefer the higher adjusted R-squared.
Elimination of the categorical variable genre from the model
full_model5<-lm(audience_score ~ critics_score+ imdb_num_votes+ audience_rating, data =SubSetMovies )
summary(full_model5)
##
## Call:
## lm(formula = audience_score ~ critics_score + imdb_num_votes +
## audience_rating, data = SubSetMovies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.5720 -6.0496 0.5664 6.1679 25.4470
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.319e+01 8.944e-01 37.112 <2e-16 ***
## critics_score 2.174e-01 1.726e-02 12.601 <2e-16 ***
## imdb_num_votes 1.133e-05 3.873e-06 2.926 0.0036 **
## audience_ratingUpright 2.807e+01 9.974e-01 28.143 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.753 on 482 degrees of freedom
## Multiple R-squared: 0.8187, Adjusted R-squared: 0.8176
## F-statistic: 725.6 on 3 and 482 DF, p-value: < 2.2e-16
TABLE 11 Summary of full_model5.
Table 11 shows only a small change in adjusted R-squared (from 0.8202 to 0.8176) and in R-squared (from 0.825 to 0.8187). In this scenario the parsimonious model wins, and the final model has only three variables: two numerical (critics_score and imdb_num_votes) and one categorical (audience_rating).
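The comparison of a model with and without a whole factor can also be written as a nested-model comparison. The sketch below, again on the built-in mtcars data as a stand-in, puts the two adjusted R-squared values side by side and adds a partial F-test via anova(). The F-test is an addition for illustration; the analysis above relies on adjusted R-squared only.

```r
# Stand-in data: number of cylinders as a categorical predictor
cars <- transform(mtcars, cyl = factor(cyl))

# Nested models: with and without the factor
with_factor    <- lm(mpg ~ wt + hp + cyl, data = cars)
without_factor <- lm(mpg ~ wt + hp, data = cars)

# Adjusted R-squared of both models, side by side
adj_r2 <- c(
  with_factor    = summary(with_factor)$adj.r.squared,
  without_factor = summary(without_factor)$adj.r.squared
)
print(round(adj_r2, 4))

# Partial F-test: does the factor as a whole improve the fit?
comparison <- anova(without_factor, with_factor)
print(comparison)
```

If the adjusted R-squared barely moves and the partial F-test is not significant, both criteria point toward the smaller model.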
4.3 Goodness of Fit in the final Model
px1<-ggplot(data = full_model5, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")+ ggtitle('(A) Residuals')
px2<-ggplot(data = full_model5, aes(sample = .resid)) +
stat_qq()+ ggtitle('(B) Sample versus theoretical')
grid.arrange(px1,px2, ncol=2)
Full_residuals = resid(full_model5)
p5a<-(hist(Full_residuals, main="Residuals distribution for the final model (full_model5)",
xlab="Residuals",
border="black",breaks=20, xlim=c(-20,20),
las=1))
FIGURE 11: Goodness of fit. Plot (A) shows approximate homoscedasticity, and the histogram shows a roughly normal distribution of the residuals around zero.
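The normality of the residuals is judged visually here. As a rough numerical complement, the sketch below runs a Shapiro-Wilk test and computes the correlation between theoretical and sample quantiles (a simple measure of how straight the Q-Q plot is). It uses a stand-in model on the built-in mtcars data; with the project model one would presumably pass resid(full_model5) instead, which is an assumption and was not run here.

```r
# Stand-in model on built-in data
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- resid(fit)

# Shapiro-Wilk test: a small p-value suggests departure from normality
sw <- shapiro.test(res)

# Q-Q coordinates without drawing the plot, then their correlation:
# values close to 1 mean the points lie close to a straight line
qq <- qqnorm(res, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)

cat(sprintf("Shapiro-Wilk p-value = %.3f, Q-Q correlation = %.3f\n",
            sw$p.value, qq_cor))
```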
The final model is:
audience_score = 33.2 + 0.22*critics_score + 0.000011*imdb_num_votes + 28.1*audience_rating:Upright
We can pick a movie from the held-out test sample and predict its audience score with the final model, quantifying the uncertainty around the prediction with an appropriate interval.
newmovie <- test %>% select(audience_score,critics_score,audience_rating, imdb_num_votes)
MovieSelected<-newmovie[sample(nrow(newmovie), 1), ]
MovieSelected
## # A tibble: 1 x 4
## audience_score critics_score audience_rating imdb_num_votes
## <dbl> <dbl> <fct> <int>
## 1 92 82 Upright 315051
predict(full_model5, MovieSelected)
## 1
## 82.66101
predict(full_model5, MovieSelected, interval = "prediction", level = 0.95)
## fit lwr upr
## 1 82.66101 65.33573 99.98629
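The point prediction can be reproduced by hand from the coefficients in TABLE 11 and the values of the selected movie. The sketch below does the arithmetic; the small difference from 82.66101 comes from using the rounded coefficients printed in the table. The object names are mine and only serve the illustration.

```r
# Coefficients as printed in TABLE 11 (rounded)
coefs <- c(
  intercept      = 33.19,
  critics_score  = 0.2174,
  imdb_num_votes = 1.133e-05,
  upright        = 28.07
)

# Values of the selected movie; audience_rating "Upright" is coded as 1
movie <- c(critics_score = 82, imdb_num_votes = 315051, upright = 1)

# Linear predictor: intercept plus coefficient times value for each term
pred_by_hand <- coefs[["intercept"]] +
  coefs[["critics_score"]]  * movie[["critics_score"]] +
  coefs[["imdb_num_votes"]] * movie[["imdb_num_votes"]] +
  coefs[["upright"]]        * movie[["upright"]]

cat(sprintf("prediction by hand = %.2f\n", pred_by_hand))
```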
The observed audience score (92) falls within the 95% prediction interval (65.34 to 99.99).
The final model presented in this document predicted the audience score of the selected test movie within its 95% prediction interval. The study is observational, so we cannot prove causation. Given the size of the sample, the results cannot be generalized.
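A single movie gives only an anecdotal check. A fuller evaluation would use the whole held-out sample, for example the root mean squared error and the share of observed scores that fall inside their 95% prediction intervals. The sketch below illustrates the idea on simulated data with hypothetical names; with the project objects the call would presumably be predict(full_model5, newdata = test, interval = "prediction"), which was not run here.

```r
set.seed(1)

# Simulated stand-in data: one numeric predictor and a noisy linear response
n <- 200
dat <- data.frame(x = runif(n, 0, 100))
dat$y <- 30 + 0.5 * dat$x + rnorm(n, sd = 10)

# 75/25 split, mirroring the split used in this project
idx <- sample(seq_len(n), size = floor(0.75 * n))
train_set <- dat[idx, ]
holdout   <- dat[-idx, ]

# Fit on the training part, predict on the held-out part
fit <- lm(y ~ x, data = train_set)
pred <- predict(fit, newdata = holdout, interval = "prediction", level = 0.95)

# Root mean squared error on the held-out data
rmse <- sqrt(mean((holdout$y - pred[, "fit"])^2))

# Share of held-out observations inside their 95% prediction interval
coverage <- mean(holdout$y >= pred[, "lwr"] & holdout$y <= pred[, "upr"])

cat(sprintf("test RMSE = %.2f, interval coverage = %.2f\n", rmse, coverage))
```

Coverage close to 0.95 would suggest that the prediction intervals are well calibrated on unseen movies.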
Discussion: These results show some interesting points:
1. An actor with an Oscar from the Academy appears not to have any influence on the audience score, whereas I had assumed the opposite.
2. The length of the movie (runtime) appears irrelevant to how the audience scores it.
3. The critics' score (critics_score) appears to have a strong association with how the public scores a movie.
Olinto J. Linares-Perdomo