Setup

Part 1: Data

Movies data set includes information from Rotten Tomatoes and IMDB for a random sample of movies.

Sample collection methodology: Rotten Tomatoes and IMDB are review-aggregation websites that include information related to films and television programs. The rating for a particular film or television program is the result of critics and audiences reviews.

Data collection method implications: Since the sample was optained from a randomly sample selected movies in Rotten Tomatoes and IMDB, we can reneralized the results to the entire film population from 1970 to 2016 included either in Rotten Tomatoes and IMDB. In addition, it is important to point out that audience reviews comes from volunteers subjects; as volunteers are more likely to participate in the reviews, it can be a source of bias.

Scope of inference: Each film is equally likely to be selected, therefore we are dealing with a large-scale obervational study, the sample is representative of the population. As the groups are not escencially the same (due there is no random assignment), causal conclutions cannot be made.

In short, we have an observational study: not-causal-generalizable.


Part 2: Research question

Research question: We are interested in learning what attributes make a movie popular. For this time we only consider IMDB ratings. We will take the following variables from movies dataset:


Part 3: Exploratory data analysis

Most of the variables are not numerical, therefore only a scatterplot would not be ideal for vizualizing any patterns in the data. Firstly, we would be interested in find out whether a particular film genre is more likely to have higher or lower scores than others. If so genre would be representative in order to estimate IMDB rating.

Comments

Secondly, runtime of movie (in minutes) could have to do with a higher or lower scores, and also could be more likely to win an Oscar. As both IMDB and runtime are numerical variables, let use a simple scatterplot and add a smoth linear regression. We could use facet_wrap to create a plot of each decade separetely.

Note: In order to facilitate the data manipulation We will consider in 2000's category all the movies from this year to the present.

## `geom_smooth()` using formula 'y ~ x'

Comments

Now let’s check the relationship between the number of votes and the IMDB Rating. We will use a simple scatterplot.

## `geom_smooth()` using formula 'y ~ x'

Comments


Part 4: Modeling

Now we can start modeling and confirm what we already figured out in the EDA. For this we will use the backward selection approach, by starting with a full model and then systematically drop terms according to significance of the changes in fit quality.

## 
## Call:
## lm(formula = imdb_rating ~ imdb_num_votes + runtime + genre + 
##     best_pic_win + decade, data = best)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0173 -0.4088  0.0597  0.5371  2.3980 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     5.372e+00  2.499e-01  21.496  < 2e-16 ***
## imdb_num_votes                  3.624e-06  3.472e-07  10.438  < 2e-16 ***
## runtime                         5.540e-03  1.983e-03   2.794 0.005368 ** 
## genreAnimation                  1.385e-01  3.060e-01   0.453 0.650855    
## genreArt House & International  9.195e-01  2.528e-01   3.637 0.000299 ***
## genreComedy                    -1.292e-02  1.416e-01  -0.091 0.927339    
## genreDocumentary                2.050e+00  1.637e-01  12.524  < 2e-16 ***
## genreDrama                      7.552e-01  1.184e-01   6.378 3.47e-10 ***
## genreHorror                     1.973e-02  2.086e-01   0.095 0.924672    
## genreMusical & Performing Arts  1.469e+00  2.701e-01   5.440 7.62e-08 ***
## genreMystery & Suspense         5.133e-01  1.547e-01   3.319 0.000957 ***
## genreOther                      4.008e-01  2.403e-01   1.668 0.095860 .  
## genreScience Fiction & Fantasy -2.710e-01  3.048e-01  -0.889 0.374371    
## best_pic_winyes                -9.971e-02  3.477e-01  -0.287 0.774373    
## decade80's                     -4.872e-02  1.386e-01  -0.352 0.725307    
## decade90's                     -3.901e-01  1.289e-01  -3.026 0.002576 ** 
## decade2000's                   -3.718e-01  1.225e-01  -3.034 0.002510 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8537 on 633 degrees of freedom
## Multiple R-squared:  0.3961, Adjusted R-squared:  0.3808 
## F-statistic: 25.95 on 16 and 633 DF,  p-value: < 2.2e-16

With this approach we want to remove all the terms that do not result in a statistically significant detriment of the goodness of fit. In this case we will use drop1 function and then we will choose the term to remove from the model that has the leas significant effect of reducinf the goodness of fit. In other words, the term with the largest p-value for its F-test.

## Single term deletions
## 
## Model:
## imdb_rating ~ imdb_num_votes + runtime + genre + best_pic_win + 
##     decade
##                Df Sum of Sq    RSS      AIC  F value    Pr(>F)    
## <none>                      461.29 -188.919                       
## imdb_num_votes  1    79.396 540.68  -87.690 108.9515 < 2.2e-16 ***
## runtime         1     5.688 466.97 -182.953   7.8049 0.0053680 ** 
## genre          10   188.205 649.49   13.492  25.8264 < 2.2e-16 ***
## best_pic_win    1     0.060 461.35 -190.834   0.0822 0.7743726    
## decade          3    13.912 475.20 -175.605   6.3638 0.0002973 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see that removing the variable best_pic_win, regarding whether or not is an Oscar Winning Film, has the single least significant effect on reducing the goodness of fit, so we will star with update function by removing this term from best.fit. The new model will be named best.fit1. Then we will apply drop1 function one more time and repeat the process.

## Single term deletions
## 
## Model:
## imdb_rating ~ imdb_num_votes + runtime + genre + decade
##                Df Sum of Sq    RSS      AIC  F value    Pr(>F)    
## <none>                      461.35 -190.834                       
## imdb_num_votes  1    85.362 546.71  -82.487 117.3074 < 2.2e-16 ***
## runtime         1     5.629 466.98 -184.951   7.7357 0.0055750 ** 
## genre          10   188.227 649.57   11.574  25.8668 < 2.2e-16 ***
## decade          3    13.863 475.21 -177.590   6.3504 0.0003028 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Finally we will remove runtime which, even though it has somewhat small p-value, we want to follow the principle of parsimony, which establishes that the simpler model is better. Also, the addition of runtime could bring nothing new to the model. The final model is shown below as bestfit2 We will see that some film genres do have large p-values, as we see it in the Kernel Estimates plot at the beginning od EDA. It is important to recall that we cannot remove a particular film genre from the model, although some genres does not affect the rating, we must treat the variable as a whole. Something similar happens with decade variable.

## 
## Call:
## lm(formula = imdb_rating ~ imdb_num_votes + genre + decade, data = best)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0368 -0.4235  0.0674  0.5717  2.4467 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     5.936e+00  1.477e-01  40.187  < 2e-16 ***
## imdb_num_votes                  3.924e-06  3.118e-07  12.587  < 2e-16 ***
## genreAnimation                  6.037e-02  3.061e-01   0.197 0.843696    
## genreArt House & International  9.342e-01  2.538e-01   3.681 0.000252 ***
## genreComedy                    -4.029e-02  1.417e-01  -0.284 0.776208    
## genreDocumentary                2.041e+00  1.641e-01  12.441  < 2e-16 ***
## genreDrama                      7.973e-01  1.177e-01   6.774 2.85e-11 ***
## genreHorror                    -3.008e-02  2.088e-01  -0.144 0.885479    
## genreMusical & Performing Arts  1.540e+00  2.700e-01   5.704 1.79e-08 ***
## genreMystery & Suspense         5.473e-01  1.548e-01   3.536 0.000435 ***
## genreOther                      4.225e-01  2.411e-01   1.752 0.080239 .  
## genreScience Fiction & Fantasy -2.917e-01  3.061e-01  -0.953 0.340937    
## decade80's                     -3.766e-02  1.389e-01  -0.271 0.786294    
## decade90's                     -3.963e-01  1.289e-01  -3.073 0.002207 ** 
## decade2000's                   -4.007e-01  1.217e-01  -3.293 0.001046 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8576 on 635 degrees of freedom
## Multiple R-squared:  0.3886, Adjusted R-squared:  0.3752 
## F-statistic: 28.83 on 14 and 635 DF,  p-value: < 2.2e-16

Check Conditions


Part 5: Prediction

As we could see, the model does not fit the data well, but we will use it anyway and pick a recent movie and do a prediction for its IMDB rating using the model. Also we will quantify the uncertainty around this prediction using an appropriate interval. We will try to predict IMDB Rating of It Chapter:2.

##        fit      lwr      upr
## 1 5.515917 4.071357 6.960477

The real IMDB Rating is 6,6 which is include in the prediction interval, however not in all the film cases the interval will include the real rating since the model does not fit the data aqqurately. Moreover I used a very bad and predictable movie. If we use the model again but this time with another movie like The Shape of Water we see that the real IMDB rating is 7.3 which is not include in the prediction interval.

##        fit      lwr      upr
## 1 5.254198 3.761581 6.746816

Part 6: Conclusion

Here the overall goal of fitting any statistical model is to faithfully represent the data and the relationship between the variables. In general we are interested in figure out the balancing between goodness-of-fit (good reponse-predictor relationship) and complexity (how complicated the model is).

With this assignment we could confirm that selection of the appropriate variables can culminate in a suitable model which sadly was not the case this time. Even though satisfactory conclusions are not reached we have concluded that the variables used in the model, as a whole, are not significant to predict something as subjective as the rating of a film.