Load packages

### load packages
options(warn = -1)
library(ggplot2)
suppressMessages(library(dplyr))
suppressMessages(library(statsr))
suppressMessages(library(corrplot))
suppressMessages(library(gridExtra))

Load data

load("movies.Rdata")

This project consists of 6 parts.

The data set, movies, comprises 32 variables and 651 randomly sampled movies produced and released before 2016.

This is an observational study: the researchers did not introduce any intervention that would modify the data. Some variables are purely informative. Because this is not an experimental study, we cannot establish causality from any of the results; correlation does not prove causation.

The results can be generalized only to the subset of movies produced or released by companies in the United States of America.

2. Research Question.

Is the audience score influenced by, or related to, other variables? Which variable(s) most influence the audience?

The main idea is to analyze whether the audience score (audience_score) is influenced by different predictors, and so try to answer the research question. I selected four numerical predictors (imdb_rating, critics_score, runtime, imdb_num_votes) and four categorical predictors (audience_rating, genre, mpaa_rating, best_actor_win).

Answering this question will support, or refute, my assumption that the public does not pay much attention to the criticism of experts or of the Internet Movie Database (IMDb), but that an actor in the movie with an Oscar from the Academy could have a strong influence on the audience's opinion.

Answering the research question could help a company focus on the variables that affect or influence the end user, the audience.

3. Exploratory Data Analysis (EDA)

The research question defined above involves only specific variables: one predicted variable (audience_score) and eight predictors (four numerical: imdb_rating, critics_score, runtime, imdb_num_votes; and four categorical: audience_rating, genre, mpaa_rating, best_actor_win). I therefore decided to start the exploratory data analysis from a reduced data frame that includes only the selected variables.

NewMovies <- movies %>%
  select(audience_score, imdb_rating, audience_rating, genre, critics_score,
         runtime, mpaa_rating, imdb_num_votes, best_actor_win) %>%
  filter(mpaa_rating %in% c("G", "PG", "PG-13", "R", "Unrated"))

# Drop every row that still contains a missing value
NewMovies <- na.omit(NewMovies)

There are several techniques for dealing with missing values; I decided to use simple truncation (dropping every row that contains a missing value), as shown above.
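As a minimal sketch of what this truncation does, the toy data frame below (hypothetical values, not taken from movies; the names toy, toy_complete and rows_dropped are illustrative) shows how na.omit() discards every row with at least one NA and how the number of discarded rows could be counted.

```r
# Toy data frame with hypothetical values (not from the movies data set)
toy <- data.frame(
  audience_score = c(73, 81, NA, 55, 64),
  runtime        = c(80, NA, 101, 95, 118),
  genre          = c("Drama", "Comedy", "Drama", "Horror", "Other")
)

# Number of missing values per column
na_per_column <- colSums(is.na(toy))

# na.omit() keeps only the rows with no missing value in any column
toy_complete <- na.omit(toy)

# Rows discarded by the truncation
rows_dropped <- nrow(toy) - nrow(toy_complete)

print(na_per_column)
print(rows_dropped)
```

Comparing nrow() before and after na.omit() on the real data frame in the same way would show how many movies the truncation removes.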

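# Reproducible 75% / 25% split into training and test samples
set.seed(123)
smp_size <- floor(0.75 * nrow(NewMovies))
train_ind <- sample(seq_len(nrow(NewMovies)), size = smp_size)
SubSetMovies <- NewMovies[train_ind, ]  # training sample
test <- NewMovies[-train_ind, ]         # held-out test sample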

Analysis of Numerical Variables: Predicted and predictors

3.1. Predicted variable distribution

hist(SubSetMovies$audience_score,  main="Histogram for audience Score (Predicted Variable)", 
     xlab="Audience Score",  border="black",breaks=50, xlim=c(0,100), las=1)

summary(SubSetMovies$audience_score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   46.00   65.00   62.48   80.00   97.00

FIGURE 1 & TABLE 1: Predicted variable distribution. The median of the predicted variable (audience score) is 65 (mean 62.48), with a minimum of 11 and a maximum of 97. No movie scored below 11.

3.2. Predictor variables distributions

par(mfrow=c(2,2))
hist(SubSetMovies$critics_score, main="(A) Histogram for Critics Score",
     xlab="Critics Score on Rotten Tomatoes", border="black", breaks=50, xlim=c(0,100), las=1)

hist(SubSetMovies$runtime, main="(B) Histogram for Run time",
     xlab="Run time", border="black", breaks=50, xlim=c(0,200),
     las=1)

hist(SubSetMovies$imdb_rating, main="(C) Histogram for imdb_rating",
     xlab="Internet Movie Database rating", border="black", breaks=50, xlim=c(0,10),
     las=1)
hist(SubSetMovies$imdb_num_votes, main="(D) Histogram for imdb num votes",
     xlab="Internet Movie Database number of votes", border="black", breaks=200, xlim=c(0,300000),
     las=1)

FIGURE 2: Histograms for the predictor numerical variables. Critics score and IMDB num votes show a broad distribution.

It is good practice to standardize the predictor/explanatory variables before proceeding so that they have a mean of zero (“centering”) and standard deviation of one (“scaling”). It ensures that the estimated coefficients are all on the same scale, making it easier to compare effect sizes during modeling analysis.

SubSetMovies$critics_scoreS <- scale(SubSetMovies$critics_score, center = TRUE, scale = TRUE)
SubSetMovies$runtimeS <- scale(SubSetMovies$runtime, center = TRUE, scale = TRUE)
SubSetMovies$imdb_ratingS <- scale(SubSetMovies$imdb_rating, center = TRUE, scale = TRUE)
SubSetMovies$imdb_num_votesS <- scale(SubSetMovies$imdb_num_votes, center = TRUE, scale = TRUE)
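A small sketch of what the standardization does, on a hypothetical vector rather than the movie data (the names x, x_std and x_manual are illustrative). It also shows that scale() returns a one-column matrix, which can be flattened with as.numeric() if a plain numeric column is preferred.

```r
# Hypothetical scores, used only to illustrate scale()
x <- c(10, 35, 50, 65, 90)

# scale() returns a one-column matrix; as.numeric() drops the matrix attributes
x_std <- as.numeric(scale(x, center = TRUE, scale = TRUE))

# Equivalent manual computation: subtract the mean, divide by the sample sd
x_manual <- (x - mean(x)) / sd(x)

print(round(mean(x_std), 10))       # mean of the standardized values
print(sd(x_std))                    # standard deviation of the standardized values
print(all.equal(x_std, x_manual))   # do both computations agree?
```

Because standardization is a linear transformation, it changes the location and scale of each variable but not the shape of its histogram.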

par(mfrow=c(2,2))
hist(SubSetMovies$critics_scoreS, main="(A) Standardized Critics Score",
     xlab="Critics Score on Rotten Tomatoes (standardized)", border="black", xlim=c(-2,2), las=1, breaks=20)

hist(SubSetMovies$runtimeS, main="(B) Standardized Run time",
     xlab="Run time (standardized)", border="black", xlim=c(-2,2), las=1, breaks=20)

hist(SubSetMovies$imdb_ratingS, main="(C) Standardized imdb_rating",
     xlab="Internet Movie Database rating (standardized)", border="black", xlim=c(-2,2), las=1, breaks=20)

hist(SubSetMovies$imdb_num_votesS, main="(D) Standardized imdb num votes",
     xlab="Internet Movie Database number of votes (standardized)", border="black", xlim=c(-2,2),
     las=1, breaks=50)

FIGURE 3: Histograms for the standardized predictor variables

3.3 Correlation between numerical variables

Matrix_Numerical_Variables <- SubSetMovies[names(SubSetMovies) %in% c('audience_score','critics_score', 'runtime', 'imdb_rating','imdb_num_votes')]
corr.matrix <- cor(Matrix_Numerical_Variables)
corrplot(corr.matrix, method="number")

FIGURE 4: The correlation plot between numerical variables shows a high correlation between two predictors (critics_score and imdb_rating, with a correlation of 0.77), an instance of collinearity that needs to be analyzed. Collinearity can compromise the parameter estimates of the final regression model; predictors should be independent of each other.

3.4 Bar plot for categorical variables

p1 <- ggplot(aes(x=mpaa_rating), data=SubSetMovies) + geom_bar(aes(y=100*(..count..)/sum(..count..))) + ylab('Frequency (%)') +
  ggtitle('(A) MPAA rating') + coord_flip()

p2 <- ggplot(aes(x=genre), data=SubSetMovies) + geom_bar(aes(y=100*(..count..)/sum(..count..))) + ylab('Frequency (%)') +
  ggtitle('(B) Genre of movies') + coord_flip()

p3 <- ggplot(aes(x=best_actor_win), data=SubSetMovies) + geom_bar(aes(y=100*(..count..)/sum(..count..))) + ylab('Frequency (%)') +
  ggtitle('(C) Actor ever won an Oscar') + coord_flip()

p4 <- ggplot(aes(x=audience_rating), data=SubSetMovies) + geom_bar(aes(y=100*(..count..)/sum(..count..))) + ylab('Frequency (%)') +
  ggtitle('(D) Audience rating') + coord_flip()

grid.arrange(p1, p2, p3, p4, ncol=2)

FIGURE 5: Bar plots for categorical variables.

Plot (A): Only the ratings R, PG-13, PG, Unrated and G were kept as levels of the categorical variable mpaa_rating (the MPAA rating of the movie).

Plot (B) shows the categorical variable genre; Drama has the highest frequency. Plot (C) shows the categorical variable best_actor_win (whether an actor in the movie ever won an Oscar). This variable was included to consider whether the presence of actors with an Oscar from the Academy influences public opinion about the movies. Plot (D) shows the categorical variable audience_rating.

3.5 Correlation between categorical variables and the predicted variable

As discussed above, this research project considers four categorical variables as predictors, and we need to examine the possible association between these categorical variables and the predicted variable.

3.5.1: Audience score vs. mpaa_rating

boxplot(audience_score~mpaa_rating, data=SubSetMovies, main='Audience score vs. mpaa_rating',
        xlab='mpaa_rating', ylab='Audience Score')

FIGURE 6

 by(SubSetMovies$audience_score, SubSetMovies$mpaa_rating, summary)
## SubSetMovies$mpaa_rating: G
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    18.0    53.5    70.0    66.0    84.0    92.0 
## ------------------------------------------------------------ 
## SubSetMovies$mpaa_rating: NC-17
## NULL
## ------------------------------------------------------------ 
## SubSetMovies$mpaa_rating: PG
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   49.25   65.00   62.84   79.00   93.00 
## ------------------------------------------------------------ 
## SubSetMovies$mpaa_rating: PG-13
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   43.25   55.00   55.72   69.75   94.00 
## ------------------------------------------------------------ 
## SubSetMovies$mpaa_rating: R
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.00   44.00   66.00   62.28   80.00   97.00 
## ------------------------------------------------------------ 
## SubSetMovies$mpaa_rating: Unrated
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   19.00   78.00   84.50   79.92   88.25   96.00

TABLE 2

FIGURE 6 & TABLE 2: The categorical variable mpaa_rating, with the levels considered in this study (G, PG, PG-13, R, and Unrated), seems to have a reasonable association with audience score.

3.5.2: Audience score vs. genre

boxplot(audience_score~genre, data=SubSetMovies, main='Audience score vs. genre',
        xlab='genre', ylab='Audience Score')

FIGURE 7

by(SubSetMovies$audience_score, SubSetMovies$genre, summary)
## SubSetMovies$genre: Action & Adventure
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   37.25   53.00   53.79   65.25   94.00 
## ------------------------------------------------------------ 
## SubSetMovies$genre: Animation
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   55.50   63.00   59.29   67.50   88.00 
## ------------------------------------------------------------ 
## SubSetMovies$genre: Art House & International
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.00   48.00   65.50   62.50   78.75   86.00 
## ------------------------------------------------------------ 
## SubSetMovies$genre: Comedy
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   19.00   38.50   51.00   53.76   68.50   93.00 
## ------------------------------------------------------------ 
## SubSetMovies$genre: Documentary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   57.00   78.50   86.00   83.94   89.50   96.00 
## ------------------------------------------------------------ 
## SubSetMovies$genre: Drama
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   52.00   71.00   65.41   80.00   95.00 
## ------------------------------------------------------------ 
## SubSetMovies$genre: Horror
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   24.00   34.25   42.00   45.00   53.25   84.00 
## ------------------------------------------------------------ 
## SubSetMovies$genre: Musical & Performing Arts
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   55.00   73.25   79.00   78.60   87.00   95.00 
## ------------------------------------------------------------ 
## SubSetMovies$genre: Mystery & Suspense
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   40.00   53.00   55.50   70.25   97.00 
## ------------------------------------------------------------ 
## SubSetMovies$genre: Other
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   21.00   53.00   76.00   67.75   87.25   91.00 
## ------------------------------------------------------------ 
## SubSetMovies$genre: Science Fiction & Fantasy
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   26.00   42.50   69.00   61.83   83.50   85.00

TABLE 3

FIGURE 7 & TABLE 3: The categorical variable genre shows a reasonable association with the dependent variable.

3.5.3: Audience score vs. best_actor_win

boxplot(audience_score~best_actor_win, data=SubSetMovies, main='Audience score vs. best_actor_win',
        xlab='Actor ever won an Oscar', ylab='Audience Score')

FIGURE 8

by(SubSetMovies$audience_score, SubSetMovies$best_actor_win, summary)
## SubSetMovies$best_actor_win: no
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   45.50   65.00   62.17   79.50   96.00 
## ------------------------------------------------------------ 
## SubSetMovies$best_actor_win: yes
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   49.00   68.00   64.16   81.00   97.00

TABLE 4

3.5.4: Audience score vs. audience_rating

boxplot(audience_score~audience_rating, data=SubSetMovies, main='Audience score vs. audience_rating',
        xlab='Audience Rating', ylab='Audience Score')

FIGURE 9

by(SubSetMovies$audience_score, SubSetMovies$audience_rating, summary)

TABLE 5

4. Modeling

4.1 Analysis of collinearity between predictors critics_score & imdb_rating

FIGURE 4, above, shows a high correlation between two numerical predictors (critics score and IMDb rating), and it is good practice to analyze possible collinearity during regression modeling.

basic.lm<-lm(SubSetMovies$imdb_rating ~SubSetMovies$critics_score)

(prelim_plot <- ggplot(SubSetMovies, aes(x =imdb_rating, y = critics_score)) +
    geom_point() +
    geom_smooth(method = "lm"))

FIGURE 9: Approximately linear, positive relationship between the two numerical predictors

summary(basic.lm)
## 
## Call:
## lm(formula = SubSetMovies$imdb_rating ~ SubSetMovies$critics_score)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.88805 -0.39053  0.06126  0.42999  2.52186 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                4.758027   0.073307   64.91   <2e-16 ***
## SubSetMovies$critics_score 0.030028   0.001144   26.25   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.718 on 484 degrees of freedom
## Multiple R-squared:  0.5874, Adjusted R-squared:  0.5865 
## F-statistic:   689 on 1 and 484 DF,  p-value: < 2.2e-16

TABLE 5: Summary of relationship between predictors

FIGURE 9 and TABLE 5 show the collinearity (correlation) between these two predictors.
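One common way to put a number on this collinearity is the variance inflation factor (VIF), which the analysis above does not compute; the sketch below is offered only as an illustration. It relies on two standard facts: with a single predictor, R-squared equals the squared Pearson correlation, and for a pair of predictors VIF = 1 / (1 - R^2). The variable names are illustrative.

```r
# R-squared of imdb_rating ~ critics_score, copied from the summary above
r_squared <- 0.5874

# With a single predictor, R-squared is the squared Pearson correlation
r <- sqrt(r_squared)

# Variance inflation factor for a pair of predictors: 1 / (1 - R^2)
vif <- 1 / (1 - r_squared)

print(round(r, 2))
print(round(vif, 2))
```

The recovered correlation (about 0.77) agrees with the value in the correlation plot of FIGURE 4, and the VIF comes out at about 2.42.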

basic.lm1<-lm(SubSetMovies$audience_score ~SubSetMovies$critics_score)

p1a <- ggplot(SubSetMovies, aes(x = critics_score, y = audience_score)) +
  geom_point() +
  ggtitle('(A) Audience score vs critics score') +
  geom_smooth(method = "lm")

basic.lm2<-lm(SubSetMovies$audience_score ~SubSetMovies$imdb_rating)

p2a <- ggplot(SubSetMovies, aes(x = imdb_rating, y = audience_score)) +
  geom_point() +
  ggtitle('(B) Audience score vs imdb rating') +
  geom_smooth(method = "lm")

# Same two regressions refitted with data =, so the fitted objects can be passed to ggplot
fit1 <- lm(audience_score ~ critics_score, data = SubSetMovies)
fit2 <- lm(audience_score ~ imdb_rating, data = SubSetMovies)
p3a<-ggplot(data = fit1, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")+  ggtitle('(C) Residuals')

p4a<-ggplot(data = fit2, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")+  ggtitle('(D) Residuals')


grid.arrange(p1a, p2a, p3a,p4a, ncol=2)

FIGURE 10: Relationship between the predicted variable and each of the two collinear predictors (plots A and B), and the corresponding residuals (plots C and D).

summary(basic.lm1);summary(basic.lm2)
## 
## Call:
## lm(formula = SubSetMovies$audience_score ~ SubSetMovies$critics_score)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.457 -10.032   0.865   9.817  43.683 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                33.17048    1.47518   22.49   <2e-16 ***
## SubSetMovies$critics_score  0.51049    0.02302   22.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.45 on 484 degrees of freedom
## Multiple R-squared:  0.504,  Adjusted R-squared:  0.503 
## F-statistic: 491.8 on 1 and 484 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = SubSetMovies$audience_score ~ SubSetMovies$imdb_rating)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.129  -6.944   1.154   5.706  51.616 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -39.892      2.795  -14.27   <2e-16 ***
## SubSetMovies$imdb_rating   15.793      0.425   37.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.45 on 484 degrees of freedom
## Multiple R-squared:  0.7404, Adjusted R-squared:  0.7399 
## F-statistic:  1381 on 1 and 484 DF,  p-value: < 2.2e-16

TABLE 6: Summary of the relationship between each collinear predictor and the predicted variable.

Collinearity: We can deal with collinearity in different ways, including principal component analysis (PCA) and other methods, but in this project I decided, from the results presented in FIGURE 10 and TABLE 6, to drop one of the two collinear variables. The question is which one, and why. If we eliminated the variable with the lower R-squared, that would be critics_score (R2 = 0.504, TABLE 6), and we would continue with imdb_rating (R2 = 0.740, TABLE 6). BUT the residual plots (FIGURE 10, plots C and D) show that the regression of the predicted variable on imdb_rating presents heteroscedasticity (non-constant variance in the residuals), so I decided to keep the predictor with the lower R-squared, critics_score, whose residuals show homoscedasticity (constant variance, FIGURE 10 plot C). A lower R-squared, but with homoscedasticity.
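The decision above rests on a visual reading of the residual plots. A numerical check is not part of the original analysis, but as a hedged sketch, a studentized Breusch-Pagan style statistic can be computed with base R alone: regress the squared residuals on the predictor and compare n times the R-squared of that auxiliary regression with a chi-squared distribution. The data below are simulated (heteroscedastic by construction), and all names (x, y, fit, aux, bp_stat, bp_p) are illustrative.

```r
set.seed(1)  # simulated data only; not the movies data set
n <- 300
x <- runif(n, 1, 10)

# Noise whose spread grows with x, i.e. heteroscedastic by construction
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)

fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the predictor
aux <- lm(resid(fit)^2 ~ x)

# Studentized (Koenker) Breusch-Pagan statistic: n * R^2 of the auxiliary fit,
# compared with a chi-squared distribution (df = number of predictors)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(statistic = bp_stat, p_value = bp_p))
```

A small p-value points to non-constant residual variance. Running the same steps on fit1 and fit2 would give a numerical counterpart to plots C and D of FIGURE 10.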

4.2 Model selection

We will use stepwise model selection by backward elimination. We start with a full model, that is, a model with all candidate covariates or predictors included, and then drop variables one at a time until a parsimonious model is reached. I will focus on p-values and adjusted R-squared.

full_model<-lm(audience_score ~ critics_score+ runtime + imdb_num_votes+genre + best_actor_win+audience_rating+mpaa_rating, data =SubSetMovies )
summary(full_model)
## 
## Call:
## lm(formula = audience_score ~ critics_score + runtime + imdb_num_votes + 
##     genre + best_actor_win + audience_rating + mpaa_rating, data = SubSetMovies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.6512  -5.8570   0.9251   6.2862  21.3303 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.405e+01  3.821e+00   8.909  < 2e-16 ***
## critics_score                   1.936e-01  1.897e-02  10.204  < 2e-16 ***
## runtime                         2.774e-02  2.577e-02   1.076 0.282382    
## imdb_num_votes                  1.476e-05  4.400e-06   3.355 0.000858 ***
## genreAnimation                 -1.575e+00  4.190e+00  -0.376 0.707184    
## genreArt House & International  2.096e+00  2.917e+00   0.718 0.472929    
## genreComedy                     8.682e-01  1.710e+00   0.508 0.611942    
## genreDocumentary                6.923e+00  2.448e+00   2.828 0.004887 ** 
## genreDrama                      1.283e+00  1.511e+00   0.849 0.396458    
## genreHorror                    -7.970e-01  2.422e+00  -0.329 0.742218    
## genreMusical & Performing Arts  5.944e+00  3.170e+00   1.875 0.061406 .  
## genreMystery & Suspense        -8.058e-01  1.928e+00  -0.418 0.676204    
## genreOther                     -6.718e-02  2.869e+00  -0.023 0.981329    
## genreScience Fiction & Fantasy  1.014e+00  3.799e+00   0.267 0.789700    
## best_actor_winyes               7.622e-01  1.165e+00   0.654 0.513243    
## audience_ratingUpright          2.722e+01  1.030e+00  26.418  < 2e-16 ***
## mpaa_ratingPG                  -2.998e+00  2.963e+00  -1.012 0.312202    
## mpaa_ratingPG-13               -3.957e+00  3.054e+00  -1.295 0.195832    
## mpaa_ratingR                   -3.775e+00  2.946e+00  -1.281 0.200657    
## mpaa_ratingUnrated             -3.291e+00  3.383e+00  -0.973 0.331049    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.711 on 466 degrees of freedom
## Multiple R-squared:  0.8264, Adjusted R-squared:  0.8193 
## F-statistic: 116.8 on 19 and 466 DF,  p-value: < 2.2e-16

TABLE 7 Summary of full_model. From Table 7, the candidate predictor to exclude from the model is runtime (p-value = 0.28), which leads to full_model2.

full_model2<-lm(audience_score ~ critics_score+ imdb_num_votes+genre + best_actor_win+audience_rating+mpaa_rating, data =SubSetMovies )
summary(full_model2)
## 
## Call:
## lm(formula = audience_score ~ critics_score + imdb_num_votes + 
##     genre + best_actor_win + audience_rating + mpaa_rating, data = SubSetMovies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.5299  -5.7521   0.9312   6.1541  21.4881 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.649e+01  3.077e+00  11.859  < 2e-16 ***
## critics_score                   1.952e-01  1.892e-02  10.313  < 2e-16 ***
## imdb_num_votes                  1.619e-05  4.194e-06   3.862 0.000128 ***
## genreAnimation                 -1.906e+00  4.180e+00  -0.456 0.648539    
## genreArt House & International  2.150e+00  2.918e+00   0.737 0.461553    
## genreComedy                     7.039e-01  1.704e+00   0.413 0.679667    
## genreDocumentary                6.658e+00  2.436e+00   2.733 0.006514 ** 
## genreDrama                      1.436e+00  1.505e+00   0.954 0.340616    
## genreHorror                    -9.950e-01  2.415e+00  -0.412 0.680537    
## genreMusical & Performing Arts  6.308e+00  3.152e+00   2.001 0.045948 *  
## genreMystery & Suspense        -6.691e-01  1.924e+00  -0.348 0.728222    
## genreOther                      5.084e-02  2.868e+00   0.018 0.985861    
## genreScience Fiction & Fantasy  1.079e+00  3.799e+00   0.284 0.776632    
## best_actor_winyes               1.031e+00  1.138e+00   0.906 0.365528    
## audience_ratingUpright          2.728e+01  1.029e+00  26.504  < 2e-16 ***
## mpaa_ratingPG                  -2.842e+00  2.960e+00  -0.960 0.337497    
## mpaa_ratingPG-13               -3.611e+00  3.038e+00  -1.188 0.235244    
## mpaa_ratingR                   -3.577e+00  2.940e+00  -1.216 0.224435    
## mpaa_ratingUnrated             -3.052e+00  3.376e+00  -0.904 0.366407    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.712 on 467 degrees of freedom
## Multiple R-squared:  0.826,  Adjusted R-squared:  0.8193 
## F-statistic: 123.2 on 18 and 467 DF,  p-value: < 2.2e-16

TABLE 8 Summary of full_model2.

From TABLE 8, we can eliminate the entire categorical variable mpaa_rating to arrive at full_model3.

full_model3<-lm(audience_score ~ critics_score+ imdb_num_votes+genre + best_actor_win+audience_rating, data =SubSetMovies )
summary(full_model3)
## 
## Call:
## lm(formula = audience_score ~ critics_score + imdb_num_votes + 
##     genre + best_actor_win + audience_rating, data = SubSetMovies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.9970  -5.8276   0.8967   6.0744  21.5227 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.336e+01  1.418e+00  23.523  < 2e-16 ***
## critics_score                   1.997e-01  1.830e-02  10.910  < 2e-16 ***
## imdb_num_votes                  1.550e-05  4.090e-06   3.789 0.000171 ***
## genreAnimation                  6.546e-01  3.524e+00   0.186 0.852719    
## genreArt House & International  1.700e+00  2.838e+00   0.599 0.549386    
## genreComedy                     3.174e-01  1.671e+00   0.190 0.849440    
## genreDocumentary                6.320e+00  2.137e+00   2.958 0.003252 ** 
## genreDrama                      9.006e-01  1.430e+00   0.630 0.529077    
## genreHorror                    -1.519e+00  2.337e+00  -0.650 0.515974    
## genreMusical & Performing Arts  5.798e+00  3.100e+00   1.870 0.062050 .  
## genreMystery & Suspense        -1.261e+00  1.839e+00  -0.686 0.493173    
## genreOther                     -2.519e-01  2.844e+00  -0.089 0.929470    
## genreScience Fiction & Fantasy  1.294e+00  3.778e+00   0.343 0.732010    
## best_actor_winyes               1.044e+00  1.127e+00   0.926 0.354872    
## audience_ratingUpright          2.726e+01  1.024e+00  26.620  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.692 on 471 degrees of freedom
## Multiple R-squared:  0.8253, Adjusted R-squared:  0.8202 
## F-statistic:   159 on 14 and 471 DF,  p-value: < 2.2e-16

TABLE 9 Summary of full_model3. Table 9 shows that the categorical variable best_actor_win can be discarded in this research. We cannot remove only one level of a categorical variable; we need to remove the entire categorical variable.
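Because a categorical variable enters or leaves the model as a whole, a per-term test can be more convenient than reading the per-level p-values. A minimal sketch, assuming the built-in iris data as a stand-in for the movie data: drop1() from the stats package, called with test = "F", reports one F test per term, so a factor is judged on all of its levels at once. The names fit, term_tests and species_df are illustrative.

```r
# Built-in iris data, used only as a stand-in; Species is a 3-level factor
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Species, data = iris)

# drop1() refits the model without each term in turn; test = "F" gives one
# F test per term, so a factor is judged as a whole (here on 2 df)
term_tests <- drop1(fit, test = "F")
print(term_tests)

# Degrees of freedom used by the Species term (levels minus one)
species_df <- term_tests["Species", "Df"]
print(species_df)
```

Applied to the models in this section, the same call would give a single p-value each for genre and mpaa_rating instead of one per level.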

full_model4<-lm(audience_score ~ critics_score+ imdb_num_votes+genre + audience_rating, data =SubSetMovies )
summary(full_model4)
## 
## Call:
## lm(formula = audience_score ~ critics_score + imdb_num_votes + 
##     genre + audience_rating, data = SubSetMovies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.0750  -5.9142   0.8498   6.0660  21.3457 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.346e+01  1.414e+00  23.659  < 2e-16 ***
## critics_score                   2.000e-01  1.829e-02  10.932  < 2e-16 ***
## imdb_num_votes                  1.587e-05  4.070e-06   3.899 0.000111 ***
## genreAnimation                  6.875e-01  3.523e+00   0.195 0.845379    
## genreArt House & International  1.600e+00  2.835e+00   0.564 0.572913    
## genreComedy                     2.981e-01  1.671e+00   0.178 0.858426    
## genreDocumentary                6.243e+00  2.135e+00   2.925 0.003613 ** 
## genreDrama                      9.767e-01  1.427e+00   0.684 0.494059    
## genreHorror                    -1.638e+00  2.333e+00  -0.702 0.483044    
## genreMusical & Performing Arts  5.797e+00  3.099e+00   1.870 0.062067 .  
## genreMystery & Suspense        -1.064e+00  1.826e+00  -0.582 0.560530    
## genreOther                     -2.410e-01  2.844e+00  -0.085 0.932506    
## genreScience Fiction & Fantasy  1.145e+00  3.774e+00   0.303 0.761649    
## audience_ratingUpright          2.724e+01  1.024e+00  26.609  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.69 on 472 degrees of freedom
## Multiple R-squared:  0.825,  Adjusted R-squared:  0.8202 
## F-statistic: 171.2 on 13 and 472 DF,  p-value: < 2.2e-16

TABLE 10 Summary of full_model4.

At this point we need to decide between a parsimonious model and a higher R-squared, because the next candidate for elimination is an entire categorical variable in which some levels have a low p-value: genre (see Table 10). This variable has 11 levels, but only 2 of them (genreMusical & Performing Arts and genreDocumentary) have a low p-value. The idea is to eliminate the entire categorical variable and observe the change in adjusted R-squared. If the change is negligible, we keep the parsimonious model; if it is substantial, we put the variable back in the model and favor the higher adjusted R-squared.

Elimination of the categorical variable genre from the model

full_model5<-lm(audience_score ~ critics_score+ imdb_num_votes+ audience_rating, data =SubSetMovies )
summary(full_model5)
## 
## Call:
## lm(formula = audience_score ~ critics_score + imdb_num_votes + 
##     audience_rating, data = SubSetMovies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.5720  -6.0496   0.5664   6.1679  25.4470 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            3.319e+01  8.944e-01  37.112   <2e-16 ***
## critics_score          2.174e-01  1.726e-02  12.601   <2e-16 ***
## imdb_num_votes         1.133e-05  3.873e-06   2.926   0.0036 ** 
## audience_ratingUpright 2.807e+01  9.974e-01  28.143   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.753 on 482 degrees of freedom
## Multiple R-squared:  0.8187, Adjusted R-squared:  0.8176 
## F-statistic: 725.6 on 3 and 482 DF,  p-value: < 2.2e-16

TABLE 11 Summary of full_model5.

Table 11 shows only a small change in adjusted R-squared (from 0.8202 to 0.8176) and in R-squared (from 0.825 to 0.8187). In this scenario the parsimonious model wins, and the final model has only three variables: two numerical (critics_score and imdb_num_votes) and one categorical (audience_rating).
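The comparison of R-squared values can also be expressed as a partial F statistic for dropping genre. This test is not part of the original analysis; the sketch below only reuses the rounded figures printed in the two summaries, so the result is approximate. Calling anova(full_model5, full_model4) on the fitted models would give the exact value. The variable names are illustrative.

```r
# Values copied from the summaries of full_model4 (with genre) and
# full_model5 (without genre); both are rounded as printed
r2_with    <- 0.8250   # multiple R-squared, full_model4
r2_without <- 0.8187   # multiple R-squared, full_model5
df_resid   <- 472      # residual degrees of freedom, full_model4
df_dropped <- 10       # genre has 11 levels, hence 10 coefficients

# Partial F statistic for dropping the whole genre term
f_stat <- ((r2_with - r2_without) / df_dropped) / ((1 - r2_with) / df_resid)

# Upper-tail probability under F(df_dropped, df_resid)
p_value <- pf(f_stat, df_dropped, df_resid, lower.tail = FALSE)

print(round(f_stat, 2))
print(p_value)
```

With these rounded inputs the statistic comes out near 1.7 on 10 and 472 degrees of freedom.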

4.3 Goodness of Fit in the final Model

px1<-ggplot(data = full_model5, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")+  ggtitle('(A) Residuals')


px2<-ggplot(data = full_model5, aes(sample = .resid)) +
  stat_qq()+  ggtitle('(B) Sample versus theoretical')

grid.arrange(px1,px2, ncol=2)

Full_residuals = resid(full_model5)
p5a<-(hist(Full_residuals,  main="Residuals distribution for the final model (full_model5)", 
           xlab="Residuals", 
           border="black",breaks=20, xlim=c(-20,20),
           las=1))

FIGURE 11: Goodness of fit. Plot A shows approximate homoscedasticity, plot B compares the sample residual quantiles with the theoretical normal quantiles, and the histogram shows the residuals approximately normally distributed around zero.

The final model is:

audience_score = 33.2 + 0.22 * critics_score + 0.000011 * imdb_num_votes + 28.1 * audience_rating:Upright

5. Prediction

We can pick a movie from the held-out test sample, predict its audience score with the final model, and quantify the uncertainty around this prediction with an appropriate interval.

newmovie <- test %>% select(audience_score,critics_score,audience_rating, imdb_num_votes)
MovieSelected<-newmovie[sample(nrow(newmovie), 1), ]
MovieSelected
## # A tibble: 1 x 4
##   audience_score critics_score audience_rating imdb_num_votes
##            <dbl>         <dbl> <fct>                    <int>
## 1             92            82 Upright                 315051
predict(full_model5, MovieSelected)
##        1 
## 82.66101
predict(full_model5, MovieSelected, interval = "prediction", level = 0.95)
##        fit      lwr      upr
## 1 82.66101 65.33573 99.98629
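The point prediction can be reproduced by hand from the coefficients printed in the summary of full_model5. This is only a check under rounding: the coefficients are copied as printed, so the result matches predict() to about two decimals. The variable names below are illustrative.

```r
# Coefficients copied from the summary of full_model5 (rounded as printed)
b0       <- 33.19      # intercept
b_critic <- 0.2174     # critics_score
b_votes  <- 1.133e-05  # imdb_num_votes
b_up     <- 28.07      # audience_ratingUpright

# The selected movie shown above
critics_score  <- 82
imdb_num_votes <- 315051
is_upright     <- 1    # dummy variable: 1 for "Upright", 0 for the reference level

by_hand <- b0 + b_critic * critics_score + b_votes * imdb_num_votes + b_up * is_upright
print(round(by_hand, 2))
```

The hand computation gives 82.66, in line with the fitted value of 82.66101 returned by predict().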

The observed audience score for this movie (92) lies within the 95% prediction interval (65.34 to 99.99).
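A single movie says little about how the intervals behave across the whole test sample. As a hedged sketch on simulated data (not the movies data; all names are illustrative), the share of held-out observations that fall inside their 95% prediction interval, together with the root mean squared error, could be computed as follows.

```r
set.seed(42)  # simulated data only; not the movies data set
n <- 400
x <- runif(n, 0, 100)
y <- 30 + 0.5 * x + rnorm(n, sd = 9)
sim <- data.frame(x = x, y = y)

# 75% / 25% split, mirroring the split used earlier in this report
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
train_set <- sim[train_idx, ]
test_set  <- sim[-train_idx, ]

fit <- lm(y ~ x, data = train_set)

# 95% prediction interval for every held-out observation
pi95 <- predict(fit, newdata = test_set, interval = "prediction", level = 0.95)

# Share of held-out responses that fall inside their interval
inside   <- test_set$y >= pi95[, "lwr"] & test_set$y <= pi95[, "upr"]
coverage <- mean(inside)

# Root mean squared error of the point predictions
rmse <- sqrt(mean((test_set$y - pi95[, "fit"])^2))

print(c(coverage = coverage, rmse = rmse))
```

Replacing fit and test_set with full_model5 and test would apply the same check to the model in this report; a coverage close to 0.95 would support the interval shown above.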

6. Conclusion

The final model presented in this document predicted the audience score of the selected test movie within its 95% prediction interval. The study is observational, so we cannot prove causation. The results cannot be generalized because of the size of the sample.

Discussion: These results show some interesting points:
1. An actor with an Oscar from the Academy appears to have no influence on the audience score, whereas I had assumed the opposite.
2. The length of the movie, runtime, is irrelevant to how the audience scores it.
3. The critics' score, critics_score, appears to have a strong influence on, or correlation with, the public's score.

Olinto J. Linares-Perdomo