Modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(tidyr)
library(GGally)

## Warning: package 'GGally' was built under R version 4.0.3

Load data

load("movies.Rdata")

Part 1: Data

The data set contains a random sample of 651 movies produced and released before 2016. According to IMDb (one of the sources of the information, along with Rotten Tomatoes), movie data come from various sources such as on-screen credits, press kits, official bios, autobiographies, and interviews, among others.

Furthermore, data are submitted by people in the industry and by visitors, as well as by studios and filmmakers. In addition, data go through consistency checks to ensure they are as accurate and reliable as possible. Nevertheless, because of the sheer volume and the nature of the information listed, occasional mistakes are inevitable and, when spotted/reported, they are promptly verified and fixed.

Because the movies were randomly sampled and the sample is less than 10% of the whole population of movies out there, the observations can reasonably be treated as independent. However, the sample may not be representative of movies worldwide, because most of the movies collected are American. In addition, this is an observational study with no random assignment, so the results found in this project cannot establish causality, and they may also be affected by the data collection mistakes mentioned above.

Source: https://help.imdb.com/article/imdb/general-information/where-does-the-information-on-imdb-come-from/GGD7NGF5X3ECFKNN?ref_=helpart_nav_24#

Part 2: Research question

Some people like a movie because of its genre, others because of a specific actor or actress, and others might be influenced by the opinions of “experts”. This raises the question: what characteristics make a movie popular? In this case, the response variable is audience_score, which serves as a measure of popularity.

Part 3: Exploratory data analysis

Remove duplicates

# There was one duplicate: The movie titled "Man on Wire"
movies <- movies[!duplicated(movies),]
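
Before dropping rows, it can help to count how many exact duplicates there are. The sketch below uses a small made-up data frame (`toy`, with hypothetical titles and scores) as a stand-in for `movies`; the same two calls should apply to the real data.

```r
# Toy stand-in for `movies`; titles and scores are made up for illustration
toy <- data.frame(
  title          = c("A", "B", "B", "C"),
  audience_score = c(70, 55, 55, 81)
)

# duplicated() flags every row that repeats an earlier row exactly
n_dup <- sum(duplicated(toy))

# Keep only the first occurrence of each row
toy_unique <- toy[!duplicated(toy), ]

n_dup
nrow(toy_unique)
```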

Check relationships for categorical variables and response variable

title_type - Documentaries and TV movies tend to obtain higher audience scores than feature films

movies %>% ggplot(aes(y=audience_score, color=title_type)) + geom_boxplot()

genre - Again, documentaries tend to have higher scores as well as horror movies

movies %>% group_by(genre) %>% summarise(mean_score=mean(audience_score)) %>%
    ggplot(aes(x=reorder(genre, mean_score), y=mean_score, fill=genre)) +
    geom_bar(stat="identity") + xlab("Genre") + ylab("Mean audience score") + theme(axis.text.x=element_blank(), axis.ticks.x=element_blank())

## `summarise()` ungrouping output (override with `.groups` argument)

mpaa_rating - Movies rated for general audiences (G) show higher mean audience scores than restricted (R) and adults-only (NC-17) movies

movies %>% group_by(mpaa_rating) %>% summarise(mean_score=mean(audience_score)) %>%
    ggplot(aes(x=reorder(mpaa_rating, -mean_score), y=mean_score, fill=mpaa_rating)) + geom_bar(stat="identity") +
    xlab("MPAA rating") + ylab("Mean audience score")

## `summarise()` ungrouping output (override with `.groups` argument)

critics_rating - Movies with a better critics rating (Certified Fresh, for example) also tend to have a higher audience score

movies %>% group_by(critics_rating) %>% summarise(mean_score=mean(audience_score)) %>%
    ggplot(aes(x=reorder(critics_rating, mean_score), y=mean_score, fill=critics_rating)) +
    geom_bar(stat="identity") + xlab("Critics rating") + ylab("Mean audience score")

## `summarise()` ungrouping output (override with `.groups` argument)

audience_rating - The pattern is similar to the critics rating; however, audience rating and audience score are likely dependent because they come from the same source (the audience)

movies %>% group_by(audience_rating) %>% summarise(mean_score=mean(audience_score)) %>%
    ggplot(aes(x=audience_rating,y=mean_score)) + geom_bar(stat="identity") + ylab("Mean audience score") + xlab("Audience rating")

## `summarise()` ungrouping output (override with `.groups` argument)

Movies that won or were nominated for a best picture Oscar tend to have a higher audience score. Being part of the Top 200 Box Office list also seems to matter

movies[,c("best_pic_nom","best_pic_win","best_actor_win","best_actress_win","best_dir_win","top200_box", "audience_score")] %>%
   gather(variable,value,-audience_score) %>% group_by(variable,value) %>%
    summarise(mean_score=mean(audience_score)) %>% ggplot(aes(x=reorder(variable, mean_score), y=mean_score, fill=value)) +
    geom_bar(stat="identity", position="dodge") + xlab(element_blank())

## `summarise()` regrouping output by 'variable' (override with `.groups` argument)

Check relationships for numerical variables and response variable (audience_score) with the correlation coefficient

runtime: 0.182 - Low
imdb_rating: 0.865 - High
imdb_num_votes: 0.290 - Low
critics_score: 0.703 - High
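
The coefficients listed above can also be computed directly with `cor()`. This is a minimal sketch on a made-up data frame (`toy`, hypothetical values) that includes one missing runtime, mirroring the single missing value reported for the real data; `use = "complete.obs"` drops incomplete pairs instead of returning NA.

```r
# Toy stand-in for `movies`; all values are made up for illustration
toy <- data.frame(
  audience_score = c(73, 81, 91, 76, 27, 86),
  runtime        = c(80, 101, 84, NA, 90, 78),
  imdb_rating    = c(5.5, 7.3, 7.6, 7.2, 5.1, 7.8)
)

# Correlation of each numerical variable with the response,
# ignoring rows where either value is missing
cors <- sapply(
  toy[, c("runtime", "imdb_rating")],
  function(x) cor(toy$audience_score, x, use = "complete.obs")
)
round(cors, 3)
```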

ggpairs(movies[,c("audience_score", "runtime", "imdb_rating", "imdb_num_votes", "critics_score")])

## Warning in ggally_statistic(data = data, mapping = mapping, na.rm = na.rm, :
## Removing 1 row that contained a missing value

## Warning: Removed 1 rows containing missing values (geom_point).

## Warning: Removed 1 rows containing non-finite values (stat_density).

There is very little correlation between date-related variables and audience score

summarise(na.omit(movies), thtr_rel_year=cor(audience_score, thtr_rel_year),
          thtr_rel_month = cor(audience_score, thtr_rel_month),
          thtr_rel_day = cor(audience_score, thtr_rel_day),
          dvd_rel_year = cor(audience_score, dvd_rel_year),
          dvd_rel_month = cor(audience_score, dvd_rel_month),
          dvd_rel_day = cor(audience_score, dvd_rel_day)
          )

## # A tibble: 1 x 6
##   thtr_rel_year thtr_rel_month thtr_rel_day dvd_rel_year dvd_rel_month
##           <dbl>          <dbl>        <dbl>        <dbl>         <dbl>
## 1       -0.0766         0.0526       0.0248      -0.0693        0.0638
## # ... with 1 more variable: dvd_rel_day <dbl>

Part 4: Modeling

From the EDA, the variables that seemed to have a stronger relationship with audience score were:

title_type
genre
mpaa_rating
critics_rating
top200_box
best_pic_win
best_pic_nom
imdb_rating
critics_score

These variables will be included in the initial model. The final model is chosen with a backwards selection approach: at each step, the variable whose p-value is large compared to the rest of the explanatory variables is removed, and the removal is kept if the adjusted R squared increases.
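
The manual steps that follow could also be automated. The sketch below is a simplified variant, not the exact procedure used in this report: it chooses the variable to drop purely by adjusted R squared rather than by inspecting p-values first. The helper `backward_adj_r2` is a hypothetical name, and the built-in `mtcars` data set stands in for `movies`.

```r
# Backward elimination driven only by adjusted R squared (hypothetical helper)
backward_adj_r2 <- function(response, predictors, data) {
  # Adjusted R squared of the model that uses the given predictors
  fit_adj <- function(vars) {
    f <- as.formula(paste(response, "~", paste(vars, collapse = " + ")))
    summary(lm(f, data = data))$adj.r.squared
  }

  current <- predictors
  best <- fit_adj(current)

  while (length(current) > 1) {
    # Adjusted R squared after dropping each remaining variable in turn
    candidates <- sapply(current, function(v) fit_adj(setdiff(current, v)))
    # Stop when no single removal improves on the current model
    if (max(candidates) <= best) break
    best <- max(candidates)
    current <- setdiff(current, names(which.max(candidates)))
  }

  list(variables = current, adj_r2 = best)
}

# Example on built-in data, standing in for the movies model
start_vars <- c("wt", "hp", "qsec", "drat")
result <- backward_adj_r2("mpg", start_vars, mtcars)
result
```

With the movies data, the call would take "audience_score" as the response and the nine variables listed above as predictors.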

# First fit
mlr1 <- with(movies, lm(audience_score ~ title_type + genre + mpaa_rating +
                                critics_rating + top200_box + best_pic_win +
                                best_pic_nom + imdb_rating + critics_score))
summary(mlr1)

## 
## Call:
## lm(formula = audience_score ~ title_type + genre + mpaa_rating + 
##     critics_rating + top200_box + best_pic_win + best_pic_nom + 
##     imdb_rating + critics_score)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.724  -6.260   0.243   5.569  48.790 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -31.368346   6.083322  -5.156 3.38e-07 ***
## title_typeFeature Film           1.388917   3.679275   0.377  0.70593    
## title_typeTV Movie               2.381495   5.771791   0.413  0.68003    
## genreAnimation                   8.101270   3.863166   2.097  0.03639 *  
## genreArt House & International  -0.161602   2.983355  -0.054  0.95682    
## genreComedy                      2.330064   1.638844   1.422  0.15559    
## genreDocumentary                 2.668210   3.934725   0.678  0.49795    
## genreDrama                       0.117339   1.430868   0.082  0.93467    
## genreHorror                     -4.725357   2.447565  -1.931  0.05398 .  
## genreMusical & Performing Arts   5.279939   3.363648   1.570  0.11699    
## genreMystery & Suspense         -5.738160   1.837634  -3.123  0.00188 ** 
## genreOther                       0.950807   2.808161   0.339  0.73503    
## genreScience Fiction & Fantasy  -0.921163   3.517561  -0.262  0.79350    
## mpaa_ratingNC-17                -4.579540   7.463121  -0.614  0.53969    
## mpaa_ratingPG                    0.416980   2.709540   0.154  0.87774    
## mpaa_ratingPG-13                -1.479603   2.779837  -0.532  0.59473    
## mpaa_ratingR                    -0.852817   2.697606  -0.316  0.75200    
## mpaa_ratingUnrated               0.163155   3.083782   0.053  0.95782    
## critics_ratingFresh             -2.212394   1.178188  -1.878  0.06087 .  
## critics_ratingRotten            -5.117050   1.936749  -2.642  0.00845 ** 
## top200_boxyes                    0.650208   2.684538   0.242  0.80870    
## best_pic_winyes                 -4.116556   4.283571  -0.961  0.33692    
## best_pic_nomyes                  3.380515   2.523563   1.340  0.18087    
## imdb_rating                     14.810896   0.589491  25.125  < 2e-16 ***
## critics_score                   -0.002231   0.035794  -0.062  0.95032    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.824 on 625 degrees of freedom
## Multiple R-squared:  0.7725, Adjusted R-squared:  0.7638 
## F-statistic: 88.45 on 24 and 625 DF,  p-value: < 2.2e-16

summary(mlr1)$adj.r.squared

## [1] 0.7638077

# Second fit: critics_score removed
mlr2 <- with(movies, lm(audience_score ~ title_type + genre + mpaa_rating +
                                critics_rating + top200_box + best_pic_win +
                                best_pic_nom + imdb_rating))
summary(mlr2)

## 
## Call:
## lm(formula = audience_score ~ title_type + genre + mpaa_rating + 
##     critics_rating + top200_box + best_pic_win + best_pic_nom + 
##     imdb_rating)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.741  -6.293   0.250   5.583  48.805 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -31.4461     5.9494  -5.286 1.73e-07 ***
## title_typeFeature Film           1.4108     3.6596   0.385 0.700000    
## title_typeTV Movie               2.3949     5.7632   0.416 0.677876    
## genreAnimation                   8.1050     3.8596   2.100 0.036132 *  
## genreArt House & International  -0.1523     2.9772  -0.051 0.959222    
## genreComedy                      2.3268     1.6367   1.422 0.155628    
## genreDocumentary                 2.6802     3.9269   0.683 0.495168    
## genreDrama                       0.1099     1.4247   0.077 0.938553    
## genreHorror                     -4.7334     2.4422  -1.938 0.053053 .  
## genreMusical & Performing Arts   5.2745     3.3598   1.570 0.116953    
## genreMystery & Suspense         -5.7421     1.8351  -3.129 0.001835 ** 
## genreOther                       0.9468     2.8052   0.338 0.735840    
## genreScience Fiction & Fantasy  -0.9199     3.5147  -0.262 0.793624    
## mpaa_ratingNC-17                -4.5798     7.4572  -0.614 0.539341    
## mpaa_ratingPG                    0.4227     2.7059   0.156 0.875923    
## mpaa_ratingPG-13                -1.4647     2.7673  -0.529 0.596799    
## mpaa_ratingR                    -0.8403     2.6879  -0.313 0.754685    
## mpaa_ratingUnrated               0.1661     3.0810   0.054 0.957022    
## critics_ratingFresh             -2.1993     1.1583  -1.899 0.058057 .  
## critics_ratingRotten            -5.0279     1.3043  -3.855 0.000128 ***
## top200_boxyes                    0.6497     2.6824   0.242 0.808700    
## best_pic_winyes                 -4.1154     4.2801  -0.962 0.336663    
## best_pic_nomyes                  3.3778     2.5212   1.340 0.180808    
## imdb_rating                     14.7918     0.5029  29.411  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.817 on 626 degrees of freedom
## Multiple R-squared:  0.7725, Adjusted R-squared:  0.7642 
## F-statistic: 92.44 on 23 and 626 DF,  p-value: < 2.2e-16

summary(mlr2)$adj.r.squared

## [1] 0.7641835

# Third fit: top200_box removed
mlr3 <- with(movies, lm(audience_score ~ title_type + genre + mpaa_rating +
                                critics_rating + best_pic_win + imdb_rating +
                                best_pic_nom))
summary(mlr3)

## 
## Call:
## lm(formula = audience_score ~ title_type + genre + mpaa_rating + 
##     critics_rating + best_pic_win + imdb_rating + best_pic_nom)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.758  -6.313   0.279   5.574  48.831 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -31.36009    5.93437  -5.284 1.74e-07 ***
## title_typeFeature Film           1.42618    3.65632   0.390  0.69663    
## title_typeTV Movie               2.41402    5.75831   0.419  0.67520    
## genreAnimation                   8.00740    3.83563   2.088  0.03723 *  
## genreArt House & International  -0.19545    2.96967  -0.066  0.94755    
## genreComedy                      2.29121    1.62887   1.407  0.16003    
## genreDocumentary                 2.63328    3.91920   0.672  0.50190    
## genreDrama                       0.07322    1.41559   0.052  0.95876    
## genreHorror                     -4.76416    2.43708  -1.955  0.05104 .  
## genreMusical & Performing Arts   5.22266    3.35049   1.559  0.11955    
## genreMystery & Suspense         -5.77171    1.82964  -3.155  0.00168 ** 
## genreOther                       0.93174    2.80239   0.332  0.73964    
## genreScience Fiction & Fantasy  -0.90054    3.51116  -0.256  0.79766    
## mpaa_ratingNC-17                -4.66810    7.44267  -0.627  0.53075    
## mpaa_ratingPG                    0.37642    2.69708   0.140  0.88905    
## mpaa_ratingPG-13                -1.52220    2.75502  -0.553  0.58079    
## mpaa_ratingR                    -0.91501    2.66815  -0.343  0.73176    
## mpaa_ratingUnrated               0.09695    3.06541   0.032  0.97478    
## critics_ratingFresh             -2.23788    1.14638  -1.952  0.05137 .  
## critics_ratingRotten            -5.06711    1.29323  -3.918 9.90e-05 ***
## best_pic_winyes                 -4.07872    4.27423  -0.954  0.34032    
## imdb_rating                     14.79817    0.50186  29.487  < 2e-16 ***
## best_pic_nomyes                  3.38224    2.51921   1.343  0.17989    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.809 on 627 degrees of freedom
## Multiple R-squared:  0.7725, Adjusted R-squared:  0.7645 
## F-statistic: 96.79 on 22 and 627 DF,  p-value: < 2.2e-16

summary(mlr3)$adj.r.squared

## [1] 0.7645376

# Fourth fit: title_type removed
mlr4 <- with(movies, lm(audience_score ~  genre + critics_rating + mpaa_rating +
                                  best_pic_win + imdb_rating + best_pic_nom
                                ))
summary(mlr4)

## 
## Call:
## lm(formula = audience_score ~ genre + critics_rating + mpaa_rating + 
##     best_pic_win + imdb_rating + best_pic_nom)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.780  -6.408   0.228   5.542  48.815 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -29.86562    4.61391  -6.473 1.94e-10 ***
## genreAnimation                   8.02051    3.82988   2.094 0.036642 *  
## genreArt House & International  -0.17682    2.96279  -0.060 0.952428    
## genreComedy                      2.25268    1.62374   1.387 0.165830    
## genreDocumentary                 1.34877    2.24230   0.602 0.547718    
## genreDrama                       0.09801    1.41210   0.069 0.944688    
## genreHorror                     -4.77239    2.43290  -1.962 0.050249 .  
## genreMusical & Performing Arts   4.77925    3.15193   1.516 0.129947    
## genreMystery & Suspense         -5.76864    1.82694  -3.158 0.001667 ** 
## genreOther                       1.01108    2.78420   0.363 0.716616    
## genreScience Fiction & Fantasy  -0.90040    3.50606  -0.257 0.797407    
## critics_ratingFresh             -2.22557    1.14344  -1.946 0.052053 .  
## critics_ratingRotten            -5.04441    1.29037  -3.909 0.000103 ***
## mpaa_ratingNC-17                -4.65829    7.43126  -0.627 0.530985    
## mpaa_ratingPG                    0.37838    2.69230   0.141 0.888278    
## mpaa_ratingPG-13                -1.51165    2.74966  -0.550 0.582680    
## mpaa_ratingR                    -0.89425    2.66366  -0.336 0.737194    
## mpaa_ratingUnrated               0.07283    3.04342   0.024 0.980916    
## best_pic_winyes                 -4.05259    4.26764  -0.950 0.342676    
## imdb_rating                     14.78189    0.49967  29.583  < 2e-16 ***
## best_pic_nomyes                  3.39110    2.51526   1.348 0.178076    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.795 on 629 degrees of freedom
## Multiple R-squared:  0.7724, Adjusted R-squared:  0.7652 
## F-statistic: 106.8 on 20 and 629 DF,  p-value: < 2.2e-16

summary(mlr4)$adj.r.squared

## [1] 0.7652102

# Fifth fit: mpaa_rating removed
mlr5 <- with(movies, lm(audience_score ~  genre + critics_rating +
                                imdb_rating + best_pic_win + best_pic_nom))
summary(mlr5)

## 
## Call:
## lm(formula = audience_score ~ genre + critics_rating + imdb_rating + 
##     best_pic_win + best_pic_nom)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.510  -6.113   0.299   5.639  48.791 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -30.6559     3.8020  -8.063 3.72e-15 ***
## genreAnimation                   8.7316     3.4885   2.503   0.0126 *  
## genreArt House & International  -0.2080     2.8943  -0.072   0.9427    
## genreComedy                      2.0479     1.6085   1.273   0.2034    
## genreDocumentary                 1.7144     1.9759   0.868   0.3859    
## genreDrama                      -0.2010     1.3750  -0.146   0.8838    
## genreHorror                     -4.8999     2.3762  -2.062   0.0396 *  
## genreMusical & Performing Arts   4.8181     3.1243   1.542   0.1235    
## genreMystery & Suspense         -6.1389     1.7800  -3.449   0.0006 ***
## genreOther                       1.1866     2.7666   0.429   0.6681    
## genreScience Fiction & Fantasy  -0.7791     3.4963  -0.223   0.8237    
## critics_ratingFresh             -2.0872     1.1371  -1.836   0.0669 .  
## critics_ratingRotten            -5.0269     1.2797  -3.928 9.50e-05 ***
## imdb_rating                     14.8124     0.4970  29.805  < 2e-16 ***
## best_pic_winyes                 -3.7715     4.2558  -0.886   0.3758    
## best_pic_nomyes                  3.3267     2.5015   1.330   0.1840    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.778 on 634 degrees of freedom
## Multiple R-squared:  0.7714, Adjusted R-squared:  0.766 
## F-statistic: 142.7 on 15 and 634 DF,  p-value: < 2.2e-16

summary(mlr5)$adj.r.squared

## [1] 0.7660312

# Sixth fit: best_pic_win removed
mlr6 <- with(movies, lm(audience_score ~ genre + critics_rating +
                                imdb_rating + best_pic_nom))
summary(mlr6)

## 
## Call:
## lm(formula = audience_score ~ genre + critics_rating + imdb_rating + 
##     best_pic_nom)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.522  -6.126   0.300   5.750  48.796 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -30.6640     3.8013  -8.067 3.61e-15 ***
## genreAnimation                   8.7465     3.4878   2.508 0.012400 *  
## genreArt House & International  -0.1976     2.8938  -0.068 0.945582    
## genreComedy                      2.0126     1.6077   1.252 0.211084    
## genreDocumentary                 1.7364     1.9754   0.879 0.379718    
## genreDrama                      -0.1961     1.3748  -0.143 0.886605    
## genreHorror                     -4.9058     2.3758  -2.065 0.039336 *  
## genreMusical & Performing Arts   4.8417     3.1237   1.550 0.121638    
## genreMystery & Suspense         -6.1692     1.7793  -3.467 0.000562 ***
## genreOther                       1.3194     2.7621   0.478 0.633037    
## genreScience Fiction & Fantasy  -0.7772     3.4957  -0.222 0.824133    
## critics_ratingFresh             -1.9987     1.1325  -1.765 0.078071 .  
## critics_ratingRotten            -4.9545     1.2768  -3.880 0.000115 ***
## imdb_rating                     14.8028     0.4968  29.798  < 2e-16 ***
## best_pic_nomyes                  2.3501     2.2453   1.047 0.295641    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.776 on 635 degrees of freedom
## Multiple R-squared:  0.7712, Adjusted R-squared:  0.7661 
## F-statistic: 152.8 on 14 and 635 DF,  p-value: < 2.2e-16

summary(mlr6)$adj.r.squared

## [1] 0.7661102

# Seventh fit: best_pic_nom removed
mlr7 <- with(movies, lm(audience_score ~ genre + critics_rating + imdb_rating))
summary(mlr7)

## 
## Call:
## lm(formula = audience_score ~ genre + critics_rating + imdb_rating)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.561  -6.054   0.445   5.683  48.943 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -30.9065     3.7946  -8.145 2.01e-15 ***
## genreAnimation                   8.7103     3.4879   2.497 0.012767 *  
## genreArt House & International  -0.2558     2.8934  -0.088 0.929571    
## genreComedy                      2.0575     1.6072   1.280 0.200964    
## genreDocumentary                 1.5969     1.9710   0.810 0.418143    
## genreDrama                      -0.1319     1.3735  -0.096 0.923538    
## genreHorror                     -4.8821     2.3758  -2.055 0.040298 *  
## genreMusical & Performing Arts   4.7165     3.1216   1.511 0.131306    
## genreMystery & Suspense         -6.1172     1.7788  -3.439 0.000622 ***
## genreOther                       1.5510     2.7534   0.563 0.573422    
## genreScience Fiction & Fantasy  -0.7760     3.4960  -0.222 0.824414    
## critics_ratingFresh             -2.2182     1.1130  -1.993 0.046682 *  
## critics_ratingRotten            -5.1468     1.2636  -4.073 5.23e-05 ***
## imdb_rating                     14.8723     0.4924  30.207  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.777 on 636 degrees of freedom
## Multiple R-squared:  0.7708, Adjusted R-squared:  0.7661 
## F-statistic: 164.5 on 13 and 636 DF,  p-value: < 2.2e-16

summary(mlr7)$adj.r.squared

## [1] 0.7660751

By removing best_pic_nom the adjusted R squared decreased, so the sixth model (mlr6) is chosen. Final adjusted R squared: 0.7661102.

The final model explains about 77% of the variability in audience score, based on the R squared. It’s worth noting some slopes/coefficients of the model, each interpreted with all other variables held constant:

An increase of 1 unit in the IMDb rating is associated with an average increase of 14.8 points in the audience score
On average, movies in the “Mystery & Suspense” genre obtain an audience score 6.17 points lower than movies in the reference genre, “Action & Adventure”
If the movie was nominated for the best picture Oscar, we would expect its audience score to be higher by 2.35 points on average
On average, a movie with a “Rotten” critics rating scores 4.95 points lower than a movie rated “Certified Fresh”, the reference level
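
To give a sense of the uncertainty around the imdb_rating slope, an approximate 95% confidence interval can be worked out from the estimate, standard error and residual degrees of freedom printed in the mlr6 summary. This is a hand calculation from the rounded output, so it is only a sketch; `confint(mlr6)` would be the direct route with the fitted model.

```r
# Values copied from the summary(mlr6) output
estimate  <- 14.8028   # imdb_rating slope
std_error <- 0.4968    # its standard error
df_resid  <- 635       # residual degrees of freedom

# Critical value of the t distribution for a 95% interval
t_crit <- qt(0.975, df_resid)

# Estimate plus/minus the margin of error
ci <- estimate + c(-1, 1) * t_crit * std_error
round(ci, 2)
```

The interval comes out at roughly 13.8 to 15.8 points per IMDb rating unit.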

Diagnostics

Linearity: imdb_rating, the only numerical explanatory variable in the model, and the response variable follow a roughly linear relationship.

ggplot(data=movies,aes(x=imdb_rating, y=audience_score)) + geom_point() + geom_smooth(method="lm")

## `geom_smooth()` using formula 'y ~ x'

Nearly normal residuals centered at 0: Residuals seem to be fairly centered at zero (median 0.3), although their range from -25.5 to 48.8 points to a longer right tail

mlr6 %>% ggplot(aes(x=.resid)) + geom_histogram(binwidth = 5)
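
A normal Q-Q plot is a common complement to the histogram for judging normality. The sketch below fits a small model on the built-in `mtcars` data as a stand-in (the object `fit` is hypothetical); in this report, the residuals would instead come from `resid(mlr6)`.

```r
# Stand-in model on built-in data; replace with mlr6 in the report
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- resid(fit)

# Points close to the reference line suggest nearly normal residuals
qqnorm(res)
qqline(res)

# Numeric companion: correlation between theoretical and sample quantiles
qq <- qqnorm(res, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)
qq_cor
```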

Constant variability (predicted values vs residuals): Most residuals stay between -25 and 25 across the range of fitted values; however, the points are not scattered in a completely random manner

mlr6 %>% ggplot(aes(x=.fitted,y=.resid)) + geom_point() +
    geom_hline(yintercept = 0, linetype="dashed")

Independent explanatory variables

Model variables are: genre, critics_rating, imdb_rating and best_pic_nom

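imdb_rating seems to have a positive relationship with critics_rating (better critics ratings go along with higher IMDb ratings), so these two variables are not fully independent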

ggplot(movies, aes(x=imdb_rating, fill=critics_rating)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
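
One way to put a number on how strongly a numerical predictor depends on the other predictors is the variance inflation factor, 1 / (1 - R squared), where the R squared comes from regressing that predictor on the remaining ones. The sketch below uses the built-in `mtcars` data as a stand-in; for this report, the analogous regression would be imdb_rating on genre, critics_rating and best_pic_nom.

```r
# Regress one numerical predictor on the other predictors (stand-in data)
r2_wt <- summary(lm(wt ~ hp + factor(cyl), data = mtcars))$r.squared

# Variance inflation factor: 1 means no linear dependence,
# larger values mean stronger collinearity
vif_wt <- 1 / (1 - r2_wt)
vif_wt
```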

Part 5: Prediction

Movie: Doctor Strange
genre: Adventure, Action, Fantasy or “Action & Adventure”
critics_rating: Fresh
imdb_rating: 7.5
best_pic_nom: “no”
Real audience score: 86

Sources: https://www.rottentomatoes.com/m/doctor_strange_2016 https://en.wikipedia.org/wiki/Doctor_Strange_(2016_film)#Accolades https://www.imdb.com/title/tt1211837/

doc_strange <- data.frame(genre="Action & Adventure", critics_rating="Fresh", imdb_rating=7.5, best_pic_nom="no")

The model predicted an audience score of 78.36

predict(mlr6, doc_strange)

##        1 
## 78.35826
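
The same prediction can be reproduced by hand from the mlr6 coefficients, which shows how each input contributes. The sketch assumes that “Action & Adventure” and best_pic_nom = “no” are the reference levels (neither has its own row in the coefficient table), so they add nothing to the sum.

```r
# Coefficients copied from the summary(mlr6) output
intercept     <- -30.6640
b_fresh       <- -1.9987   # critics_ratingFresh
b_imdb_rating <- 14.8028   # imdb_rating

# Reference levels (assumed: "Action & Adventure", best_pic_nom = "no")
# contribute 0, so only the intercept, the "Fresh" effect and the
# imdb_rating term remain
pred <- intercept + b_fresh + b_imdb_rating * 7.5
round(pred, 2)  # 78.36
```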

Prediction interval

predict(mlr6, doc_strange, interval = "prediction", level=0.95)

##        fit     lwr      upr
## 1 78.35826 58.9358 97.78072

We are 95% confident that a movie like Doctor Strange, with the genre “Action & Adventure”, a “Fresh” critics rating, an imdb_rating of 7.5 and no nomination for a best picture Oscar, will have an audience score between 58.94 and 97.78. The real audience score of 86 falls within this interval.

Part 6: Conclusion

With these results I realized that some of the most relevant attributes associated with a movie’s popularity are its IMDb rating, whether the movie was nominated for a best picture Oscar or not, its critics rating on the Rotten Tomatoes website and its genre. However, I think the study done here has several limitations because of how the data were collected. As mentioned in Part 1, IMDb and Rotten Tomatoes collect data from many visitors to their websites, so the opinions recorded can be subjective or biased. For example, a movie might get a “Certified Fresh” rating when it might not deserve it.

Similarly, these results may not be generalizable, because I could have left out a relevant variable or dropped one that mattered during model selection. For future research I would like to test whether the total box office depends on the rating from websites like IMDb.