Modeling and Prediction for Movies

Setup

Load packages

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.2.3

## Warning: package 'ggplot2' was built under R version 4.2.3

## Warning: package 'tibble' was built under R version 4.2.3

## Warning: package 'tidyr' was built under R version 4.2.3

## Warning: package 'readr' was built under R version 4.2.3

## Warning: package 'purrr' was built under R version 4.2.3

## Warning: package 'dplyr' was built under R version 4.2.3

## Warning: package 'stringr' was built under R version 4.2.3

## Warning: package 'forcats' was built under R version 4.2.3

## Warning: package 'lubridate' was built under R version 4.2.3

library(statsr)

## Warning: package 'statsr' was built under R version 4.2.3

## Warning: package 'BayesFactor' was built under R version 4.2.3

## Warning: package 'coda' was built under R version 4.2.3

## Warning: package 'Matrix' was built under R version 4.2.3

library(rstatix)

## Warning: package 'rstatix' was built under R version 4.2.3

Load data

load("movies.Rdata")

Part 1: Data

The dataset is comprised of information from Rotten Tomatoes and IMBD about 651 movies produced and released before 2016. Aside from the size of this dataset, this data is generalizable due to the data-collection methods of Rotten Tomatoes and IMDB. Rotten Tomatoes employs a team of curators to read through thousands of movie and TV reviews every week and have a specific algorithm for accurately representing the data. IMDB pulls industry information through on-screen credits, press kits, interviews, and so forth while also allowing users to upload information. Their data consistently undergoes consistency checks to ensure accuracy and reliability.

Inherently, bias is possible within the analysis of data. Selection bias may occur if any genres or ratings are underrepresented in the sample (for example, there are 9 Animation movies and 305 Drama movies). Confirmation bias may be present if the reviewers only selected or positively rated movies of interest to them. Due to the user-included information of IMDB, some selections may be subject to response bias if users submitted incorrect information.

This dataset is comprehensive, randomly selected, is gathered from reliable sources, and compromises multiple decades and genres, so it is generalizable for the purposes of this project. Causality cannot be inferred since this data is observational and no random assignment was used for this data . This project is strictly for proof of concept.

Part 2: Research question

To predict what makes a movie popular, audience_score and imdb_rating (Rotten Tomatoes and IMDB, respectively) need to be considered. What variables are positively associated with audience_score?

Part 3: Exploratory Data Analysis

The main variables of interest are the response variables audience_score and imdb_rating. In addition to these, explore the relationships among these and the explanatory variables title_type, genre, runtime, thtr_rel_year (theatrical release year), and dvd_rel_year (DVD release year). These may be associated with the popularity of a given movie.

First, get a look at the structure of the dataset (651 observations (rows) over 32 variables (columns)); it is comprised of character, factor, numeric, and integer data types. A brief survey of which columns contain missing values (NAs) is helpful. The majority of missing values are found in the director, and actor1 through actor5 variables. The actor1 through actor5 are contain the names of the first, second, third, fourth, and fifth principal actors in a film. The specificity of these variables and the lack of principal actors in a film may explain these missing values. Most variables do not contain missing values, however studio, dvd_rel_year, dvd_rel_month, dvd_rel_day all contain 8 missing values. This merits a closer look.

str(movies)

## tibble [651 × 32] (S3: tbl_df/tbl/data.frame)
##  $ title           : chr [1:651] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ thtr_rel_year   : num [1:651] 2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num [1:651] 4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num [1:651] 19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num [1:651] 2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num [1:651] 7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num [1:651] 30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ director        : chr [1:651] "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr [1:651] "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr [1:651] "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr [1:651] "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr [1:651] "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr [1:651] "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr [1:651] "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr [1:651] "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...

colSums(is.na(movies))

##            title       title_type            genre          runtime 
##                0                0                0                1 
##      mpaa_rating           studio    thtr_rel_year   thtr_rel_month 
##                0                8                0                0 
##     thtr_rel_day     dvd_rel_year    dvd_rel_month      dvd_rel_day 
##                0                8                8                8 
##      imdb_rating   imdb_num_votes   critics_rating    critics_score 
##                0                0                0                0 
##  audience_rating   audience_score     best_pic_nom     best_pic_win 
##                0                0                0                0 
##   best_actor_win best_actress_win     best_dir_win       top200_box 
##                0                0                0                0 
##         director           actor1           actor2           actor3 
##                2                2                7                9 
##           actor4           actor5         imdb_url           rt_url 
##               13               15                0                0

A common mistake would be assuming that there are only eight movies with missing data since there are several variables with 8 missing observations. That is why accurate filtering is important since there are 16 observations in total. Notice that if_all allows filtering over multiple rows with a given function by searching for a string of characters in the variable names. if_all was chosen rather than if_any because if_all is a stricter condition for filtering, requiring that (as seen above in the counts of missing values per variable) all variables with “dvd” in their name must have missing values rather than at least one of them.

missing_subset <- movies %>%
    filter(
        is.na(studio) |
            if_all(contains("dvd"), is.na)
        )

missing_subset

## # A tibble: 16 × 32
##    title               title_type genre runtime mpaa_rating studio thtr_rel_year
##    <chr>               <fct>      <fct>   <dbl> <fct>       <fct>          <dbl>
##  1 Dirty Sanchez: The… Documenta… Come…      94 R           <NA>            2007
##  2 Charlie: The Life … Documenta… Docu…     132 Unrated     Warne…          2004
##  3 Streets of Gold     Feature F… Drama      95 R           Live …          1986
##  4 Caveman             Feature F… Acti…      91 PG          <NA>            1981
##  5 Attack of the 50 F… TV Movie   Other      90 R           <NA>            1993
##  6 The Squeeze         Feature F… Acti…     101 PG-13       HBO V…          1987
##  7 Oliver & Company    Feature F… Anim…      74 G           <NA>            1988
##  8 The Man Who Sued G… Feature F… Drama      97 R           <NA>            2001
##  9 Electric Dreams     Feature F… Drama      95 PG          MGM             1984
## 10 Porky's Revenge     Feature F… Art …      92 R           20th …          1985
## 11 Teen Wolf Too       Feature F… Scie…      95 PG          Param…          1987
## 12 Inserts             Feature F… Drama     117 NC-17       <NA>            1975
## 13 The Last Remake of… Feature F… Acti…      85 PG          MCA U…          1977
## 14 Let It Be           Documenta… Docu…      81 G           Unite…          1970
## 15 Inbred              Feature F… Art …      90 R           <NA>            2011
## 16 Death Line (Raw Me… Feature F… Horr…      87 R           <NA>            1972
## # ℹ 25 more variables: thtr_rel_month <dbl>, thtr_rel_day <dbl>,
## #   dvd_rel_year <dbl>, dvd_rel_month <dbl>, dvd_rel_day <dbl>,
## #   imdb_rating <dbl>, imdb_num_votes <int>, critics_rating <fct>,
## #   critics_score <dbl>, audience_rating <fct>, audience_score <dbl>,
## #   best_pic_nom <fct>, best_pic_win <fct>, best_actor_win <fct>,
## #   best_actress_win <fct>, best_dir_win <fct>, top200_box <fct>,
## #   director <chr>, actor1 <chr>, actor2 <chr>, actor3 <chr>, actor4 <chr>, …

Out of these observations, the majority were feature films either of the Drama, Action & Adventure, Art House & International, or Documentary genres. runtime spanned 74 132 minutes with an average runtime of 94.75 minutes and a median runtime of 93 minutes. audience_score (0 to 100) spanned 17 to 90 with an average score of 52.67 and a median score of 55.5. imdb_rating (0 to 10) spanned 3.1 to 8 with an average score of 5.83 and a median score of 6. imdb_num_votes (0 with in theory no maximum number of votes possible) spanned 486 to 30,886 with a mean number of votes of 5063.25 votes and a median number of votes of 3656 votes.

missing_subset %>% count(title_type) %>% rename(`Title Type` = title_type, Total = n)

## # A tibble: 3 × 2
##   `Title Type` Total
##   <fct>        <int>
## 1 Documentary      3
## 2 Feature Film    12
## 3 TV Movie         1

missing_subset %>% count(genre) %>% rename(Genre = genre, Total = n)

## # A tibble: 9 × 2
##   Genre                     Total
##   <fct>                     <int>
## 1 Action & Adventure            3
## 2 Animation                     1
## 3 Art House & International     2
## 4 Comedy                        1
## 5 Documentary                   2
## 6 Drama                         4
## 7 Horror                        1
## 8 Other                         1
## 9 Science Fiction & Fantasy     1

missing_subset %>%
  select(runtime, audience_score, imdb_rating, imdb_num_votes) %>%
  get_summary_stats()

## # A tibble: 4 × 13
##   variable       n   min   max median     q1     q3    iqr     mad   mean     sd
##   <fct>      <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
## 1 runtime       16  74     132   93   8.93e1   95.5 6.25e0    5.19 9.47e1 1.36e1
## 2 audience_…    16  17      90   55.5 3.3 e1   66.8 3.38e1   26.7  5.27e1 2.24e1
## 3 imdb_rati…    16   3.1     8    6   5.12e0    6.5 1.38e0    0.89 5.82e0 1.32e0
## 4 imdb_num_…    16 486   30886 3656   1.55e3 5253.  3.71e3 2879.   5.06e3 7.22e3
## # ℹ 2 more variables: se <dbl>, ci <dbl>

Now consider the variables audience_score, imdb_rating, imdb_num_votes. title_type, genre, runtime, thtr_rel_year (theatrical release year), and dvd_rel_year (DVD release year) for the entire dataset. Out of these variables, title_type, genre, thtr_rel_year, and dvd_rel_year are categorical variables, and audience_score, imdb_rating, imdb_num_votes, and runtime are numerical variables.

First, consider the categorical variables. Bar plots and frequency tables allow the exploration and visualization of categorical data. Among the full dataset, Feature Film was the most popular title type and nearly half of the movies were Dramas. Comedy and Action & Adventure were the second- and third-most popular genres. thr_rel_year (spanned 1970-2014) had releases primarily concentrated in 2000 onward. From dvd_rel_year(spanned 1991-2015), the early 2000s saw a majority of DVD releases. It appears to be more likely that movies had a different Theatrical and DVD release year.

movies %>% count(title_type) %>% arrange(-n) %>% rename(`Title Type` = title_type, Total = n)

## # A tibble: 3 × 2
##   `Title Type` Total
##   <fct>        <int>
## 1 Feature Film   591
## 2 Documentary     55
## 3 TV Movie         5

movies %>% count(genre) %>% arrange(-n) %>% rename(Genre = genre, Total = n)

## # A tibble: 11 × 2
##    Genre                     Total
##    <fct>                     <int>
##  1 Drama                       305
##  2 Comedy                       87
##  3 Action & Adventure           65
##  4 Mystery & Suspense           59
##  5 Documentary                  52
##  6 Horror                       23
##  7 Other                        16
##  8 Art House & International    14
##  9 Musical & Performing Arts    12
## 10 Animation                     9
## 11 Science Fiction & Fantasy     9

movies %>% count(thtr_rel_year) %>% arrange(-n) %>% rename(`Theatrical Release Year` = thtr_rel_year, Total = n)

## # A tibble: 44 × 2
##    `Theatrical Release Year` Total
##                        <dbl> <int>
##  1                      2006    33
##  2                      2007    33
##  3                      2004    28
##  4                      1996    27
##  5                      2012    27
##  6                      2011    26
##  7                      2003    25
##  8                      2002    23
##  9                      1993    22
## 10                      2008    22
## # ℹ 34 more rows

movies %>% count(dvd_rel_year) %>% arrange(-n) %>% rename(`DVD Release Year` = dvd_rel_year, Total = n)

## # A tibble: 23 × 2
##    `DVD Release Year` Total
##                 <dbl> <int>
##  1               2001    73
##  2               2002    58
##  3               2004    57
##  4               2003    51
##  5               2000    44
##  6               2005    42
##  7               2007    41
##  8               1999    37
##  9               1998    32
## 10               2013    31
## # ℹ 13 more rows

ggplot(movies) +
  geom_bar(aes(x = thtr_rel_year)) +
  scale_x_continuous(breaks = seq(1970, 2015, 5)) +
  scale_y_continuous(breaks = seq(0, 35, 5)) +
  labs(
    title = "Movies by Theatrical Release Year",
    x = "Year",
    y = "Total"
  ) +
  theme_classic() +
  theme(plot.title = element_text(h = 0.5))

ggplot(movies) +
  geom_bar(aes(x = dvd_rel_year)) +
  scale_x_continuous(breaks = seq(1990,2015, 5)) +
  scale_y_continuous(breaks = seq(0, 75, 5)) +
  labs(
    title = "Movies by DVD Release Year",
    x = "Year",
    y = "Total"
  ) +
  theme_classic() +
  theme(plot.title = element_text(h = 0.5))

## Warning: Removed 8 rows containing non-finite outside the scale range
## (`stat_count()`).

movies %>% count(thtr_rel_year == dvd_rel_year) %>% 
  rename(`Same Release Year - Theatrical and DVD` = `thtr_rel_year == dvd_rel_year`, Total = n)

## # A tibble: 3 × 2
##   `Same Release Year - Theatrical and DVD` Total
##   <lgl>                                    <int>
## 1 FALSE                                      466
## 2 TRUE                                       177
## 3 NA                                           8

Now to consider the numerical variables audience_score, imdb_rating, imdb_num_votes, and runtime. First, consider their summary statistics to get a better idea of the spread and measure of the data. audience_score spanned from 11 to 97, with an average score of 62.36 and a median score of 65. Slight left skewness is expected for audience_score since the mean is less than the median. imdb_rating spanned from 1.9 to 9 with an average rating of 6.49 and a median rating of 6.6. Slight left skewness is expected for imdb_rating. imdb_num_votes spanned from 180 votes to 893008 votes with a average number of votes of 57532.98 and a median number of votes of 15116. Right skewness is expected since the mean is greater than the median. Notice that runtime is missing one observation. runtime spanned from 39 minutes to 267 minutes with an average runtime of 105.82 minutes and a median runtime of 103 minutes. Almost no screw is expected since the mean and median values are near-identical.

movies %>%
  select(audience_score, imdb_rating, imdb_num_votes, runtime) %>%
  get_summary_stats()

## # A tibble: 4 × 13
##   variable       n   min    max median     q1     q3    iqr    mad   mean     sd
##   <fct>      <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 audience_…   651  11       97 6.5 e1   46   8   e1 3.4 e1 2.37e1 6.24e1 2.02e1
## 2 imdb_rati…   651   1.9      9 6.6 e0    5.9 7.3 e0 1.4 e0 1.04e0 6.49e0 1.08e0
## 3 imdb_num_…   651 180   893008 1.51e4 4546.  5.83e4 5.38e4 1.90e4 5.75e4 1.12e5
## 4 runtime      650  39      267 1.03e2   92   1.16e2 2.37e1 1.70e1 1.06e2 1.94e1
## # ℹ 2 more variables: se <dbl>, ci <dbl>

Take a closer look at how title type affects audience_score. This can be viewed with a boxplot to capture the range of data and information about its quartiles. It’s no surprise that a feature film would most likely have the highest number of audience_scores. Documentarie and tv-release films had significantly fewer observations, which may explain the difference in ranges. Feature Film had the largest spread in data, Documentary had the highest median, and TV Movie had the largest interquartile range (IQR). Unfortunately there is no variable available that contains the number of audience_scores on Rotten Tomatoes, just the audience_score per movie.

ggplot(movies) +
  geom_boxplot(aes(x = title_type, y = audience_score)) + 
  scale_y_continuous(breaks = seq(0, 100, 20), limits = c(0, 100)) +
  labs(
    title = "Rotten Tomatoes Audience Score by Title Type",
    x = "Title Type",
    y = "Audience Score",
    caption = "Documentary - 55 movies. Feature Film - 591 movies. TV Movie - 5 movies."
    ) +
  theme_classic() +
  theme(plot.title = element_text(h = 0.5))

To get a better idea of the above plot, it’s worth looking at the same information with imdb_rating and imdb_num_votes. Consider the IMDB ratings on a scale of 1 to 10 (1 being the worst, 10 being the best). Documentary and Feature Film each have outliers (observations outside of the expected range). For boxplots, outliers are indetified by subtracting 1.5 times the IQR from the first quartile (median of “lower half” of data) and adding 1.5 times the IQR to the third quartile (median of “upper half” of data); these two values are where the respective minimum and maximum would be expected to lie. Points below or above these ones are the outliers. Documentary has one outlier below the minimum, and Feature Film has multiple outliers below the minimum and one outlier above the maximum. Pay attention to the spread of data. Documentary and Feature film appear to be symmetric about the median (middle bar in the boxplot), suggesting little to no skew. The median of TV Movie is the same as the third quartile, suggesting this title type has significant left skew. This may be due to such a low number of observations for TV Movie.

Constructing a boxplot of the number of IMDB votes is near-impossible to interpret (included below for visualization purposes). From the boxplot produced, notice that all three title types have outliers. Very little can be interpreted from Documentary and TV Movie outliers, but Feature Film consistently had movies with a higher number of votes than expected by the range of the boxplot. Examining summary statistics rather than a boxplot would be a more approriate choice. Documentary IMDB votes spanned 180 votes to 39320 votes with an average of 5066.13 votes and a median of 1784 votes. Feature Film IMDB votes spanned 390 votes to 893008 votes with an average of 62.861.28 votes and a median of 17934 votes. TV Movie IMDB votes spanned 2289 votes to 11477 votes with an average of 4864 votes and a median of 3505 votes.

ggplot(movies) +
  geom_boxplot(aes(x = title_type, y = imdb_rating)) + 
  scale_y_continuous(breaks = seq(0, 10, 1), limits = c(0, 10)) +
  labs(
    title = "IMDB Rating by Title Type",
    x = "Title Type",
    y = "IMDB Rating",
    caption = "Documentary - 55 movies. Feature Film - 591 movies. TV Movie - 5 movies."
    ) +
  theme_classic() +
  theme(plot.title = element_text(h = 0.5))

ggplot(movies) +
  geom_boxplot(aes(x = title_type, y = imdb_num_votes)) + 
  scale_y_continuous(breaks = seq(0, 893008, 100000), limits = c(0, 893008), labels = scales::label_comma()) +
  labs(
    title = "IMDB Number of Votes by Title Type",
    x = "Title Type",
    y = "IMDB Number of Votes",
    caption = "Documentary - 55 movies. Feature Film - 591 movies. TV Movie - 5 movies."
    ) +
  theme_classic() +
  theme(plot.title = element_text(h = 0.5))

movies %>% 
  group_by(title_type) %>%
  select(title_type, imdb_num_votes) %>%
  get_summary_stats()

## # A tibble: 3 × 14
##   title_type variable     n   min    max median    q1     q3   iqr    mad   mean
##   <fct>      <fct>    <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl> <dbl>  <dbl>  <dbl>
## 1 Documenta… imdb_nu…    55   180  39320   1784  701   3750   3049  1898.  5066.
## 2 Feature F… imdb_nu…   591   390 893008  17934 6276. 66112. 59837 22335. 62861.
## 3 TV Movie   imdb_nu…     5  2289  11477   3505 2598   4451   1853  1403.  4864 
## # ℹ 3 more variables: sd <dbl>, se <dbl>, ci <dbl>

Consider the relationship between audience_score and imdb_rating and runtime. Among title types, the median appears to be concentrated around 100 minutes. Documentary and Feature Film had outliers; the Documentary outliers appeared below and above the minimum and maxium, and the Feature Film outliers appeared above the maximum. Documentary types may have the most symmetric distribution over the variable runtime.

ggplot(movies) +
  geom_boxplot(aes(x = title_type, y = runtime)) +
  scale_y_continuous(breaks = seq(0, 300, 50), limits = c(0, 300)) +
  labs(
    title = "Runtime by Title Type", 
    x = "Title Type",
    y = "Runtime",
    caption = "Documentary - 55 movies. Feature Film - 591 movies. TV Movie - 5 movies."
    ) +
  theme_classic() +
  theme(plot.title = element_text(h = 0.5))

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_boxplot()`).

Check to see if audience_score or imdb_rating are positively associated with runtime. This can be done using a linear model with the ratings as response variables and with runtime as an explanatory variable. Although the p-value was significant, it only explained about 3.3% of the variability in the model. Additionally, notice that the conditions of linear regression were violated: this is not a linear relationship (the residuals versus fitted plot curved up and down on the ends), errors didn’t have constant variance (residuals versus fitted plot shows a blob between 60 and 70 rather than a random cloud of data points dispersed about the line y = 0), and the errors were not normally distributed (the tails of the normal Q-Q plot deviate after the -1 and +1 theoretical quantiles). Note: the documentary “The End of America” is excluded from the residuals plot below due to not having a value for runtime.

ggplot(movies, aes(x = runtime, y = audience_score)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) + 
  scale_x_continuous(breaks = seq(0, 300, 25), limits = c(0, 300)) +
  scale_y_continuous(breaks = seq(0, 100, 10), limits = c(0, 100)) +
  labs(
    title = "Rotten Tomatoes Audience Score versus Runtime", 
    x = "Runtime",
    y = "Audience Score",
    caption = "1 observation from runtime is missing."
  ) + 
  theme_classic() +
  theme(plot.title = element_text(h = 0.5))

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

model_1 <- lm(audience_score ~ runtime, data = movies)
summary(model_1)

## 
## Call:
## lm(formula = audience_score ~ runtime, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.641 -15.626   3.008  17.080  34.950 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  42.4203     4.3256   9.807  < 2e-16 ***
## runtime       0.1883     0.0402   4.684 3.43e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.92 on 648 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.03275,    Adjusted R-squared:  0.03125 
## F-statistic: 21.94 on 1 and 648 DF,  p-value: 3.431e-06

plot(model_1$residuals ~ movies$runtime[movies$title != "The End of America"])

hist(model_1$residuals)

This model behaved slightly better. run_time had a significant p-value, yet it only explained about 7.2% of the variability. The conditions of linear modeling were still violated. The residuals versus fitted plot does not show a linear relationship due to the sharp decrease around 6 on the fitted values scale. The residuals still had a blob pattern rather than the desired random cloud, so constant variance was violated. The errors were normally distributed but had noticeable left skew.

ggplot(movies, aes(x = runtime, y = imdb_rating)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) + 
  scale_x_continuous(breaks = seq(0, 300, 25), limits = c(0, 300)) +
  scale_y_continuous(breaks = seq(0, 10, 1), limits = c(0, 10)) +
  labs(
    title = "IMDB Rating versus Runtime", 
    x = "Runtime",
    y = "IMDB Rating",
    caption = "1 observation from runtime is missing."
  ) + 
  theme_classic() +
  theme(plot.title = element_text(h = 0.5))

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

model_2 <- lm(imdb_rating ~ runtime, data = movies)
summary(model_2)

## 
## Call:
## lm(formula = imdb_rating ~ runtime, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3099 -0.5791  0.0714  0.7572  2.2500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.907873   0.227163  21.605  < 2e-16 ***
## runtime     0.014965   0.002111   7.088 3.56e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.046 on 648 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.07195,    Adjusted R-squared:  0.07052 
## F-statistic: 50.24 on 1 and 648 DF,  p-value: 3.564e-12

plot(model_2$residuals ~ movies$runtime[movies$title != "The End of America"])

hist(model_2$residuals)

Consider some other variables of interest: mpaa_rating, audience_rating (“Upright” - at least 60% of audience reviews were 3.5 stars out of 5 stars or more, “Spilled” - at least 60% of audience reviews were 3.5 stars out of 5 stars or higher), critics_rating (“Fresh” - less than 60% review score, “Certified Fresh” - specialized “Fresh” criteria that is harder to obtain, “Rotten” - less than 60% review score), critic_score (Rotten Tomatoes only, 0 to 100 scale), best_pic_nom (best picture nomination), best_pic_win (did the movie win best picture), and top200_box (was the movie on the Top 200 Box Office list from BoxOfficeMojo).

Most ratings were concentrated among PG, PG-13, and R. TV Movies were exclusively R or Unrated, and the only entries for NC-17 were for Feature Films. From the boxplot for MPAA Ratings, be aware that G and NC-17 are small categories (19 and 2 observations, respectively). That may explain the higher medians. Unrated had 50 observations, but several movies scored lower than expected as shown by the outliers. PG, PG-13, and rated-R movies all appear to have median scores concentrated between 50 and 60, where PG had the highest IQR and PG-13 had the largest range. All ratings aside from NC-17 appeared to have left skew to some degree.

movies %>% count(mpaa_rating) %>% rename(`MPAA Rating` = mpaa_rating, Total = n)

## # A tibble: 6 × 2
##   `MPAA Rating` Total
##   <fct>         <int>
## 1 G                19
## 2 NC-17             2
## 3 PG              118
## 4 PG-13           133
## 5 R               329
## 6 Unrated          50

ggplot(movies) +
  geom_bar(aes(x = mpaa_rating, fill = title_type), position = "dodge") +
  scale_y_continuous(breaks = seq(0, 300, 25), limits = c(0, 325)) +
  labs(
    title = "MPAA Ratings by Title Type",
    x = "MPAA Rating",
    y = "Total",
    fill = "Title Type"
  ) +
  theme_classic() +
  theme(plot.title = element_text(h = 0.5))

ggplot(movies) +
  geom_boxplot(aes(x = mpaa_rating, y = critics_score)) +
  scale_y_continuous(breaks = seq(0, 100, 20), limits = c(0, 100)) +
  labs(
    title = "Rotten Tomatoes Critic Score by MPAA Rating",
    x = "MPAA Rating",
    y = "Critic Score"
  ) +
  theme_classic() +
  theme(plot.title = element_text(h = 0.5))

Out of all movies rated by the audience, only 6 out of 11 genres had more Upright ratings than Spilled ratings. Among genres, Documentary movies had the highest percentage of Upright ratings by proportion. By nature, obtaining a Certified Fresh rating is more difficult than obtaining a Fresh rating. There are no genres in which there were more Certified Fresh ratings than Rotten ratings, but there are 3 genres in which there were more Fresh ratings than Rotten ratings. Again, Documentary movies had the highest percentage of Fresh or higher ratings by proportion.

genre_labels <- c("Action", "Animation", "Art", "Comedy", "Doc", "Drama", "Horror", "Music", "Mystery", "Other", "Sci-Fi")


ggplot(movies) +
  geom_bar(aes(x = genre, fill = audience_rating), position = "dodge") +
  scale_x_discrete(labels = genre_labels) + 
  scale_y_continuous(breaks = seq(0, 200, 20), limits = c(0, 200)) +
  labs(
    title = "Audience Rating by Genre",
    x = "Genre",
    y = "Total",
    fill = "Audience Rating"
    ) +
  theme_classic() +
  theme(plot.title = element_text(h = 0.5))

movies %>% 
  group_by(genre) %>% 
  reframe(
    `Number of Movies` = n(),
    `Percentage of Upright Audience Ratings` = round(sum(audience_rating == "Upright")/n() * 100, 2)
    ) %>%
  arrange(-`Percentage of Upright Audience Ratings`) %>%
  rename(Genre = genre)

## # A tibble: 11 × 3
##    Genre                     `Number of Movies` Percentage of Upright Audience…¹
##    <fct>                                  <int>                            <dbl>
##  1 Documentary                               52                             98.1
##  2 Musical & Performing Arts                 12                             91.7
##  3 Art House & International                 14                             71.4
##  4 Other                                     16                             68.8
##  5 Animation                                  9                             66.7
##  6 Drama                                    305                             65.6
##  7 Mystery & Suspense                        59                             40.7
##  8 Action & Adventure                        65                             38.5
##  9 Comedy                                    87                             35.6
## 10 Science Fiction & Fantasy                  9                             33.3
## 11 Horror                                    23                             17.4
## # ℹ abbreviated name: ¹`Percentage of Upright Audience Ratings`

ggplot(movies) +
  geom_bar(aes(x = genre, fill = critics_rating), position = "dodge") +
  scale_x_discrete(labels = genre_labels) + 
  scale_y_continuous(breaks = seq(0, 140, 20), limits = c(0, 140)) +
  labs(
    title = "Critic Rating by Genre",
    x = "Genre",
    y = "Total",
    fill = "Critic Rating"
    ) +
  theme_classic() +
  theme(plot.title = element_text(h = 0.5))

movies %>% 
  group_by(genre) %>% 
  reframe(
    `Number of Movies` = n(),
    `Percentage of Fresh or Higher Critic Ratings` = round(sum(critics_rating != "Rotten")/n() * 100, 2)
    ) %>%
  arrange(-`Percentage of Fresh or Higher Critic Ratings`) %>%
  rename(Genre = genre)

## # A tibble: 11 × 3
##    Genre                     `Number of Movies` Percentage of Fresh or Higher …¹
##    <fct>                                  <int>                            <dbl>
##  1 Documentary                               52                             94.2
##  2 Musical & Performing Arts                 12                             75  
##  3 Other                                     16                             68.8
##  4 Drama                                    305                             59.3
##  5 Science Fiction & Fantasy                  9                             55.6
##  6 Mystery & Suspense                        59                             50.8
##  7 Art House & International                 14                             50  
##  8 Animation                                  9                             44.4
##  9 Horror                                    23                             30.4
## 10 Comedy                                    87                             27.6
## 11 Action & Adventure                        65                             26.2
## # ℹ abbreviated name: ¹`Percentage of Fresh or Higher Critic Ratings`

The only genres to be nominated for best picture were Comedy, Drama, Mystery & Suspense, and Other. The same genres were also present on the Top 200 Box Office list. Out of those categories, only Other did not win best picture. The only genre that was on the Top 200 Box Office list and won best picture was Drama (the movie was Titanic).

movies %>% 
  group_by(genre) %>% 
  select(genre, best_pic_nom) %>% 
  count(best_pic_nom) %>% 
  arrange(genre) %>%
  rename(Genre = genre, `Best Picture Nomination` = best_pic_nom, Total = n)

## # A tibble: 15 × 3
## # Groups:   Genre [11]
##    Genre                     `Best Picture Nomination` Total
##    <fct>                     <fct>                     <int>
##  1 Action & Adventure        no                           65
##  2 Animation                 no                            9
##  3 Art House & International no                           14
##  4 Comedy                    no                           86
##  5 Comedy                    yes                           1
##  6 Documentary               no                           52
##  7 Drama                     no                          288
##  8 Drama                     yes                          17
##  9 Horror                    no                           23
## 10 Musical & Performing Arts no                           12
## 11 Mystery & Suspense        no                           57
## 12 Mystery & Suspense        yes                           2
## 13 Other                     no                           14
## 14 Other                     yes                           2
## 15 Science Fiction & Fantasy no                            9

movies %>% 
  group_by(genre) %>% 
  select(genre, best_pic_win) %>% 
  count(best_pic_win) %>% 
  arrange(genre) %>%
  rename(Genre = genre, `Did it win best picture?` = best_pic_win, Total = n)

## # A tibble: 14 × 3
## # Groups:   Genre [11]
##    Genre                     `Did it win best picture?` Total
##    <fct>                     <fct>                      <int>
##  1 Action & Adventure        no                            65
##  2 Animation                 no                             9
##  3 Art House & International no                            14
##  4 Comedy                    no                            86
##  5 Comedy                    yes                            1
##  6 Documentary               no                            52
##  7 Drama                     no                           300
##  8 Drama                     yes                            5
##  9 Horror                    no                            23
## 10 Musical & Performing Arts no                            12
## 11 Mystery & Suspense        no                            58
## 12 Mystery & Suspense        yes                            1
## 13 Other                     no                            16
## 14 Science Fiction & Fantasy no                             9

movies %>% 
  group_by(genre) %>% 
  select(genre, top200_box) %>%
  count(top200_box) %>% 
  arrange(genre) %>%
  rename(Genre = genre, `Was it on the Top 200 Box Office list?` = top200_box, Total = n)

## # A tibble: 17 × 3
## # Groups:   Genre [11]
##    Genre                     `Was it on the Top 200 Box Office list?` Total
##    <fct>                     <fct>                                    <int>
##  1 Action & Adventure        no                                          60
##  2 Action & Adventure        yes                                          5
##  3 Animation                 no                                           9
##  4 Art House & International no                                          14
##  5 Comedy                    no                                          86
##  6 Comedy                    yes                                          1
##  7 Documentary               no                                          52
##  8 Drama                     no                                         299
##  9 Drama                     yes                                          6
## 10 Horror                    no                                          23
## 11 Musical & Performing Arts no                                          12
## 12 Mystery & Suspense        no                                          58
## 13 Mystery & Suspense        yes                                          1
## 14 Other                     no                                          15
## 15 Other                     yes                                          1
## 16 Science Fiction & Fantasy no                                           8
## 17 Science Fiction & Fantasy yes                                          1

movies %>% 
  group_by(genre) %>% 
  select(genre, top200_box, best_pic_win) %>%
  count(top200_box == "yes" & best_pic_win == "yes") %>% 
  arrange(genre) %>%
  rename(
    Genre = genre, 
    `Was it on the Top 200 Box Office list and won best picture?` = `top200_box == "yes" & best_pic_win == "yes"`, 
    Total = n
    )

## # A tibble: 12 × 3
## # Groups:   Genre [11]
##    Genre                     Was it on the Top 200 Box Office list and w…¹ Total
##    <fct>                     <lgl>                                         <int>
##  1 Action & Adventure        FALSE                                            65
##  2 Animation                 FALSE                                             9
##  3 Art House & International FALSE                                            14
##  4 Comedy                    FALSE                                            87
##  5 Documentary               FALSE                                            52
##  6 Drama                     FALSE                                           304
##  7 Drama                     TRUE                                              1
##  8 Horror                    FALSE                                            23
##  9 Musical & Performing Arts FALSE                                            12
## 10 Mystery & Suspense        FALSE                                            59
## 11 Other                     FALSE                                            16
## 12 Science Fiction & Fantasy FALSE                                             9
## # ℹ abbreviated name:
## #   ¹`Was it on the Top 200 Box Office list and won best picture?`

movies %>% filter(top200_box == "yes" & best_pic_win == "yes") %>% select(title)

## # A tibble: 1 × 1
##   title  
##   <chr>  
## 1 Titanic

Part 4: Modeling

To construct a model for explaining movie popularity through audience_score and imdb_rating, it has to be decided which variables are the best for explaining the relationship. This can be done with forward selection (adding variables to the model) or backward elimination (removing variables from the model) among other techniques. Backward elimination will be done here. In backward elimination, start with a full model and remove variables based on either improved adjusted R-squared or p-value. The adjusted R-squared method requires removing each variable one at a time over and over until the adjusted R-squared does not improve. The p-value method works similarly but removing the largest p-value instead. The adjusted R-squared method results in a stronger prediction power by prioritizing the highest adjusted R-squared value, but the p-value method results in the simplest model with a slightly lower prediction power (parsimony). Here parsimony is favored, so the backward elimination p-value method is used.

It is obvious that the variables imdb_url and rt_url will not offer anything to the model since they’re just URLs to the website entry for each movie. Don’t use title for similar reasons. If absolutely all variables are included, then the adjusted R-squared will be 1 (all variability is explained by the model) but the model will be overfitted (it will look “too busy” and be hard to determine trends). Instead create a full model for audience_score using title_type, genre, runtime, mpaa_rating, thtr_rel_year, imdb_rating, imdb_num_votes, critic_score, critics_rating, audience_rating, best_pic_nom, best_pic_win, and top200_box.

full_model <- lm(audience_score ~ title_type + genre + runtime + mpaa_rating + thtr_rel_year + imdb_rating + imdb_num_votes + critics_score + critics_rating + audience_rating + best_pic_nom + best_pic_win + top200_box, data = movies)

Before running the model, the variables need to be checked for multicollinearity. This happens when independent variables are strongly correlated to each other to the extent that the model is overfitted and as a result it is difficult to interpret. When multicollinearity is present, the variances of the variables are inflated; one way to check for multicollinearity is to check the Variance Inflation Factor with the function vif. If the VIF is measured as greater than 5, there is a suffucient amount of multicollinearity present. Multicollinearity can be handled by dropping the strongly correlated variables, averaging them, or reducing dimensionality using principal component analysis (PCA). Here, the strongly correlated variables will be dropped.

Note that there are three measures present for the variables: GVIF, Df, and GVID^(1/(2*Df)). Since categorical variables are included in the model, the measure GVIF (generalized VIF) is included to calculate a VIF that would be the same regardless of the factor level. Df (degrees of freedom) represents the number of independent pieces to calculate a statistic. The final column represents the generalized standard inflation factor, takign the GVIF to the fractional power of 1/2 times the Df column for something more generalizable should the factors have more than two levels. For that reason, aGSIF will be the chosen measure here. Typically a cutoff value for GVIF is to drop variables with a GVIF of 5 or higher. When using aGSIF, take the square root of the GVIF threshold, so if aGSIF is greater than 2.2361, drop the variable. Based on that, critics_score should be dropped from the model.

car::vif(full_model)

##                      GVIF Df GVIF^(1/(2*Df))
## title_type       7.336109  2        1.645760
## genre           16.397918 10        1.150110
## runtime          1.443902  1        1.201625
## mpaa_rating      3.245946  5        1.124952
## thtr_rel_year    1.451996  1        1.204988
## imdb_rating      3.865947  1        1.966201
## imdb_num_votes   1.904348  1        1.379981
## critics_score    7.134938  1        2.671131
## critics_rating   6.064221  2        1.569256
## audience_rating  2.109643  1        1.452461
## best_pic_nom     1.437729  1        1.199053
## best_pic_win     1.379269  1        1.174423
## top200_box       1.165560  1        1.079611

sqrt(5)

## [1] 2.236068

Initially this model performs fairly well, explaining 88.53% of the total variance. First, drop the variable with the highest p-value. The Other genre has the highest p-value, but genre has levels that are significant, so genre cannot be dropped. The highest p-value is mpaa_rating, which has insignificant p-values on all levels. Drop this variable and refit the model.

model <- lm(audience_score ~ title_type + genre + runtime + mpaa_rating + thtr_rel_year + imdb_rating + imdb_num_votes + critics_rating + audience_rating + best_pic_nom + best_pic_win + top200_box, data = movies)

summary(model)

## 
## Call:
## lm(formula = audience_score ~ title_type + genre + runtime + 
##     mpaa_rating + thtr_rel_year + imdb_rating + imdb_num_votes + 
##     critics_rating + audience_rating + best_pic_nom + best_pic_win + 
##     top200_box, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.0420  -4.6598   0.4762   4.1295  24.6991 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     8.515e+01  5.925e+01   1.437   0.1512    
## title_typeFeature Film          2.250e+00  2.559e+00   0.879   0.3795    
## title_typeTV Movie              6.568e-01  4.026e+00   0.163   0.8705    
## genreAnimation                  2.801e+00  2.726e+00   1.028   0.3045    
## genreArt House & International -2.511e+00  2.097e+00  -1.198   0.2315    
## genreComedy                     1.480e+00  1.150e+00   1.287   0.1985    
## genreDocumentary                2.508e+00  2.751e+00   0.912   0.3622    
## genreDrama                     -5.175e-01  1.011e+00  -0.512   0.6089    
## genreHorror                    -1.914e+00  1.724e+00  -1.111   0.2671    
## genreMusical & Performing Arts  3.934e+00  2.366e+00   1.663   0.0968 .  
## genreMystery & Suspense        -2.899e+00  1.289e+00  -2.249   0.0249 *  
## genreOther                     -4.853e-01  1.966e+00  -0.247   0.8051    
## genreScience Fiction & Fantasy -4.718e-01  2.457e+00  -0.192   0.8478    
## runtime                        -3.369e-02  1.662e-02  -2.027   0.0430 *  
## mpaa_ratingNC-17               -8.904e-01  5.214e+00  -0.171   0.8645    
## mpaa_ratingPG                  -3.855e-02  1.903e+00  -0.020   0.9838    
## mpaa_ratingPG-13               -5.943e-01  2.000e+00  -0.297   0.7664    
## mpaa_ratingR                   -8.454e-01  1.917e+00  -0.441   0.6593    
## mpaa_ratingUnrated              5.256e-01  2.235e+00   0.235   0.8142    
## thtr_rel_year                  -4.664e-02  2.941e-02  -1.586   0.1133    
## imdb_rating                     9.443e+00  4.308e-01  21.920   <2e-16 ***
## imdb_num_votes                  4.511e-06  3.290e-06   1.371   0.1708    
## critics_ratingFresh            -1.329e-01  8.649e-01  -0.154   0.8780    
## critics_ratingRotten           -1.164e+00  9.435e-01  -1.233   0.2179    
## audience_ratingUpright          1.997e+01  7.901e-01  25.278   <2e-16 ***
## best_pic_nomyes                 3.355e+00  1.782e+00   1.883   0.0602 .  
## best_pic_winyes                -2.875e+00  3.058e+00  -0.940   0.3475    
## top200_boxyes                  -1.028e+00  1.932e+00  -0.532   0.5947    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.852 on 622 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8901, Adjusted R-squared:  0.8853 
## F-statistic: 186.6 on 27 and 622 DF,  p-value: < 2.2e-16

Doing the process again, the next variable to be dropped is critics_rating, which is insignificant on all levels and has the highest p-value on the level Fresh.

model1 <- lm(audience_score ~ title_type + genre + runtime + thtr_rel_year + imdb_rating + imdb_num_votes + critics_rating + audience_rating + best_pic_nom + best_pic_win + top200_box, data = movies)

summary(model1)

## 
## Call:
## lm(formula = audience_score ~ title_type + genre + runtime + 
##     thtr_rel_year + imdb_rating + imdb_num_votes + critics_rating + 
##     audience_rating + best_pic_nom + best_pic_win + top200_box, 
##     data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.4622  -4.6850   0.4051   4.2036  24.5438 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     8.836e+01  5.511e+01   1.603   0.1093    
## title_typeFeature Film          1.920e+00  2.533e+00   0.758   0.4486    
## title_typeTV Movie              5.919e-01  4.008e+00   0.148   0.8826    
## genreAnimation                  3.265e+00  2.467e+00   1.323   0.1862    
## genreArt House & International -2.467e+00  2.053e+00  -1.201   0.2301    
## genreComedy                     1.389e+00  1.137e+00   1.221   0.2224    
## genreDocumentary                2.733e+00  2.717e+00   1.006   0.3148    
## genreDrama                     -7.085e-01  9.897e-01  -0.716   0.4743    
## genreHorror                    -2.121e+00  1.680e+00  -1.262   0.2073    
## genreMusical & Performing Arts  3.840e+00  2.353e+00   1.632   0.1032    
## genreMystery & Suspense        -3.181e+00  1.260e+00  -2.524   0.0119 *  
## genreOther                     -4.398e-01  1.952e+00  -0.225   0.8218    
## genreScience Fiction & Fantasy -4.937e-01  2.449e+00  -0.202   0.8403    
## runtime                        -3.336e-02  1.614e-02  -2.067   0.0392 *  
## thtr_rel_year                  -4.835e-02  2.723e-02  -1.776   0.0763 .  
## imdb_rating                     9.454e+00  4.273e-01  22.124   <2e-16 ***
## imdb_num_votes                  3.968e-06  3.244e-06   1.223   0.2217    
## critics_ratingFresh            -1.010e-01  8.597e-01  -0.118   0.9065    
## critics_ratingRotten           -1.178e+00  9.341e-01  -1.261   0.2077    
## audience_ratingUpright          1.998e+01  7.863e-01  25.410   <2e-16 ***
## best_pic_nomyes                 3.393e+00  1.774e+00   1.913   0.0562 .  
## best_pic_winyes                -2.742e+00  3.044e+00  -0.901   0.3681    
## top200_boxyes                  -7.445e-01  1.905e+00  -0.391   0.6961    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.835 on 627 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8898, Adjusted R-squared:  0.8859 
## F-statistic:   230 on 22 and 627 DF,  p-value: < 2.2e-16

The next highest p-value belongs to genre, however there is one genre level with a significant p-value (Mystery & Suspense). Thus, drop title_type instead.

model2 <- lm(audience_score ~ title_type + genre + runtime + thtr_rel_year + imdb_rating + imdb_num_votes + audience_rating + best_pic_nom + best_pic_win + top200_box, data = movies)

summary(model2)

## 
## Call:
## lm(formula = audience_score ~ title_type + genre + runtime + 
##     thtr_rel_year + imdb_rating + imdb_num_votes + audience_rating + 
##     best_pic_nom + best_pic_win + top200_box, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.6136  -4.4575   0.5043   4.2266  24.7670 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     9.076e+01  5.374e+01   1.689   0.0918 .  
## title_typeFeature Film          1.680e+00  2.529e+00   0.664   0.5068    
## title_typeTV Movie              4.953e-01  4.005e+00   0.124   0.9016    
## genreAnimation                  3.453e+00  2.463e+00   1.402   0.1615    
## genreArt House & International -2.385e+00  2.053e+00  -1.162   0.2458    
## genreComedy                     1.461e+00  1.137e+00   1.286   0.1991    
## genreDocumentary                2.780e+00  2.717e+00   1.023   0.3068    
## genreDrama                     -5.360e-01  9.840e-01  -0.545   0.5862    
## genreHorror                    -2.001e+00  1.679e+00  -1.192   0.2337    
## genreMusical & Performing Arts  3.943e+00  2.353e+00   1.676   0.0943 .  
## genreMystery & Suspense        -3.024e+00  1.254e+00  -2.412   0.0162 *  
## genreOther                     -2.048e-01  1.947e+00  -0.105   0.9162    
## genreScience Fiction & Fantasy -1.285e-01  2.439e+00  -0.053   0.9580    
## runtime                        -3.536e-02  1.609e-02  -2.197   0.0284 *  
## thtr_rel_year                  -5.054e-02  2.657e-02  -1.902   0.0576 .  
## imdb_rating                     9.700e+00  3.980e-01  24.372   <2e-16 ***
## imdb_num_votes                  4.155e-06  3.164e-06   1.313   0.1896    
## audience_ratingUpright          2.013e+01  7.767e-01  25.922   <2e-16 ***
## best_pic_nomyes                 3.505e+00  1.765e+00   1.986   0.0474 *  
## best_pic_winyes                -2.677e+00  3.041e+00  -0.880   0.3790    
## top200_boxyes                  -6.104e-01  1.898e+00  -0.322   0.7479    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.838 on 629 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8893, Adjusted R-squared:  0.8858 
## F-statistic: 252.7 on 20 and 629 DF,  p-value: < 2.2e-16

Next, drop top200_box.

model3 <- lm(audience_score ~ genre + runtime + thtr_rel_year + imdb_rating + imdb_num_votes + audience_rating + best_pic_nom + best_pic_win + top200_box, data = movies)

summary(model3)

## 
## Call:
## lm(formula = audience_score ~ genre + runtime + thtr_rel_year + 
##     imdb_rating + imdb_num_votes + audience_rating + best_pic_nom + 
##     best_pic_win + top200_box, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.6550  -4.5107   0.5136   4.2513  24.7317 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     9.518e+01  5.336e+01   1.784   0.0749 .  
## genreAnimation                  3.473e+00  2.460e+00   1.411   0.1586    
## genreArt House & International -2.350e+00  2.050e+00  -1.146   0.2521    
## genreComedy                     1.428e+00  1.134e+00   1.259   0.2084    
## genreDocumentary                1.263e+00  1.470e+00   0.860   0.3903    
## genreDrama                     -5.308e-01  9.813e-01  -0.541   0.5887    
## genreHorror                    -2.001e+00  1.677e+00  -1.193   0.2333    
## genreMusical & Performing Arts  3.429e+00  2.220e+00   1.544   0.1230    
## genreMystery & Suspense        -3.015e+00  1.252e+00  -2.407   0.0164 *  
## genreOther                     -2.811e-01  1.933e+00  -0.145   0.8844    
## genreScience Fiction & Fantasy -1.404e-01  2.436e+00  -0.058   0.9541    
## runtime                        -3.527e-02  1.607e-02  -2.194   0.0286 *  
## thtr_rel_year                  -5.188e-02  2.649e-02  -1.959   0.0506 .  
## imdb_rating                     9.684e+00  3.958e-01  24.467   <2e-16 ***
## imdb_num_votes                  4.363e-06  3.149e-06   1.386   0.1663    
## audience_ratingUpright          2.011e+01  7.750e-01  25.945   <2e-16 ***
## best_pic_nomyes                 3.523e+00  1.762e+00   1.999   0.0460 *  
## best_pic_winyes                -2.722e+00  3.037e+00  -0.896   0.3704    
## top200_boxyes                  -6.212e-01  1.896e+00  -0.328   0.7433    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.83 on 631 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8892, Adjusted R-squared:  0.8861 
## F-statistic: 281.4 on 18 and 631 DF,  p-value: < 2.2e-16

Drop the variable best_pic_win.

model4 <- lm(audience_score ~ genre + runtime + thtr_rel_year + imdb_rating + imdb_num_votes + audience_rating + best_pic_nom + best_pic_win, data = movies)

summary(model4)

## 
## Call:
## lm(formula = audience_score ~ genre + runtime + thtr_rel_year + 
##     imdb_rating + imdb_num_votes + audience_rating + best_pic_nom + 
##     best_pic_win, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.6409  -4.4650   0.4914   4.2564  24.7323 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     9.333e+01  5.302e+01   1.760   0.0788 .  
## genreAnimation                  3.509e+00  2.456e+00   1.429   0.1536    
## genreArt House & International -2.321e+00  2.047e+00  -1.134   0.2573    
## genreComedy                     1.457e+00  1.129e+00   1.290   0.1975    
## genreDocumentary                1.286e+00  1.467e+00   0.877   0.3809    
## genreDrama                     -4.989e-01  9.757e-01  -0.511   0.6093    
## genreHorror                    -1.968e+00  1.673e+00  -1.177   0.2398    
## genreMusical & Performing Arts  3.465e+00  2.216e+00   1.563   0.1185    
## genreMystery & Suspense        -2.980e+00  1.247e+00  -2.390   0.0172 *  
## genreOther                     -2.545e-01  1.930e+00  -0.132   0.8951    
## genreScience Fiction & Fantasy -1.559e-01  2.434e+00  -0.064   0.9489    
## runtime                        -3.545e-02  1.605e-02  -2.209   0.0275 *  
## thtr_rel_year                  -5.096e-02  2.632e-02  -1.936   0.0533 .  
## imdb_rating                     9.687e+00  3.954e-01  24.500   <2e-16 ***
## imdb_num_votes                  4.115e-06  3.054e-06   1.347   0.1783    
## audience_ratingUpright          2.010e+01  7.739e-01  25.971   <2e-16 ***
## best_pic_nomyes                 3.531e+00  1.761e+00   2.005   0.0454 *  
## best_pic_winyes                -2.711e+00  3.035e+00  -0.893   0.3720    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.825 on 632 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8892, Adjusted R-squared:  0.8862 
## F-statistic: 298.4 on 17 and 632 DF,  p-value: < 2.2e-16

Drop imdb_num_votes next.

model5 <- lm(audience_score ~ genre + runtime + thtr_rel_year + imdb_rating + imdb_num_votes + audience_rating + best_pic_nom, data = movies)

summary(model5)

## 
## Call:
## lm(formula = audience_score ~ genre + runtime + thtr_rel_year + 
##     imdb_rating + imdb_num_votes + audience_rating + best_pic_nom, 
##     data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.6008  -4.4067   0.5665   4.2431  24.7454 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     8.919e+01  5.281e+01   1.689   0.0917 .  
## genreAnimation                  3.473e+00  2.455e+00   1.414   0.1577    
## genreArt House & International -2.374e+00  2.046e+00  -1.161   0.2462    
## genreComedy                     1.410e+00  1.128e+00   1.250   0.2118    
## genreDocumentary                1.205e+00  1.464e+00   0.823   0.4107    
## genreDrama                     -5.253e-01  9.751e-01  -0.539   0.5903    
## genreHorror                    -1.995e+00  1.672e+00  -1.193   0.2333    
## genreMusical & Performing Arts  3.417e+00  2.215e+00   1.542   0.1235    
## genreMystery & Suspense        -3.009e+00  1.246e+00  -2.415   0.0160 *  
## genreOther                     -1.452e-01  1.926e+00  -0.075   0.9399    
## genreScience Fiction & Fantasy -1.407e-01  2.433e+00  -0.058   0.9539    
## runtime                        -3.594e-02  1.604e-02  -2.241   0.0254 *  
## thtr_rel_year                  -4.887e-02  2.621e-02  -1.864   0.0627 .  
## imdb_rating                     9.698e+00  3.952e-01  24.542   <2e-16 ***
## imdb_num_votes                  3.562e-06  2.990e-06   1.191   0.2339    
## audience_ratingUpright          2.011e+01  7.737e-01  25.990   <2e-16 ***
## best_pic_nomyes                 2.896e+00  1.611e+00   1.798   0.0727 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.824 on 633 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8891, Adjusted R-squared:  0.8863 
## F-statistic: 317.1 on 16 and 633 DF,  p-value: < 2.2e-16

Now, drop thtr_rel_year.

model6 <- lm(audience_score ~ genre + runtime + thtr_rel_year + imdb_rating + audience_rating + best_pic_nom, data = movies)

summary(model6)

## 
## Call:
## lm(formula = audience_score ~ genre + runtime + thtr_rel_year + 
##     imdb_rating + audience_rating + best_pic_nom, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.3445  -4.5305   0.6003   4.3237  25.0882 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    70.06877   50.32956   1.392   0.1644    
## genreAnimation                  3.41004    2.45571   1.389   0.1654    
## genreArt House & International -2.73156    2.02426  -1.349   0.1777    
## genreComedy                     1.32740    1.12630   1.179   0.2390    
## genreDocumentary                0.67827    1.39591   0.486   0.6272    
## genreDrama                     -0.74993    0.95703  -0.784   0.4336    
## genreHorror                    -2.07738    1.67144  -1.243   0.2144    
## genreMusical & Performing Arts  2.96907    2.18374   1.360   0.1744    
## genreMystery & Suspense        -3.13053    1.24242  -2.520   0.0120 *  
## genreOther                     -0.12361    1.92638  -0.064   0.9489    
## genreScience Fiction & Fantasy -0.03079    2.43236  -0.013   0.9899    
## runtime                        -0.03122    0.01555  -2.008   0.0451 *  
## thtr_rel_year                  -0.03980    0.02509  -1.586   0.1132    
## imdb_rating                     9.82750    0.38006  25.858   <2e-16 ***
## audience_ratingUpright         20.14844    0.77320  26.058   <2e-16 ***
## best_pic_nomyes                 3.32335    1.57109   2.115   0.0348 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.826 on 634 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8888, Adjusted R-squared:  0.8862 
## F-statistic: 337.9 on 15 and 634 DF,  p-value: < 2.2e-16

Notice that runtime is no longer a significant predictor of audience_score. Drop runtime from the model.

model7 <- lm(audience_score ~ genre + runtime + imdb_rating + audience_rating + best_pic_nom, data = movies)

summary(model7)

## 
## Call:
## lm(formula = audience_score ~ genre + runtime + imdb_rating + 
##     audience_rating + best_pic_nom, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.5455  -4.5331   0.6866   4.2669  24.9626 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -9.66166    2.51065  -3.848 0.000131 ***
## genreAnimation                  3.16588    2.45381   1.290 0.197454    
## genreArt House & International -2.82708    2.02578  -1.396 0.163337    
## genreComedy                     1.27478    1.12715   1.131 0.258492    
## genreDocumentary                0.38907    1.38561   0.281 0.778961    
## genreDrama                     -0.80737    0.95748  -0.843 0.399420    
## genreHorror                    -1.98081    1.67232  -1.184 0.236671    
## genreMusical & Performing Arts  2.84670    2.18498   1.303 0.193099    
## genreMystery & Suspense        -3.21443    1.24277  -2.586 0.009917 ** 
## genreOther                      0.08709    1.92409   0.045 0.963911    
## genreScience Fiction & Fantasy  0.17518    2.43179   0.072 0.942595    
## runtime                        -0.02959    0.01553  -1.905 0.057181 .  
## imdb_rating                     9.83886    0.38045  25.861  < 2e-16 ***
## audience_ratingUpright         20.20795    0.77321  26.135  < 2e-16 ***
## best_pic_nomyes                 3.32275    1.57296   2.112 0.035041 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.835 on 635 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8884, Adjusted R-squared:  0.8859 
## F-statistic:   361 on 14 and 635 DF,  p-value: < 2.2e-16

Similarly best_pic_nom is no longer a significant predictor. Drop best_pic_nom.

model8 <- lm(audience_score ~ genre + imdb_rating + audience_rating + best_pic_nom, data = movies)

summary(model8)

## 
## Call:
## lm(formula = audience_score ~ genre + imdb_rating + audience_rating + 
##     best_pic_nom, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.6918  -4.3362   0.6878   4.3046  24.8862 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -11.92671    2.22536  -5.359 1.17e-07 ***
## genreAnimation                   3.62381    2.44920   1.480   0.1395    
## genreArt House & International  -2.71432    2.03076  -1.337   0.1818    
## genreComedy                      1.45515    1.12646   1.292   0.1969    
## genreDocumentary                 0.60282    1.36621   0.441   0.6592    
## genreDrama                      -0.90457    0.95892  -0.943   0.3459    
## genreHorror                     -1.64797    1.66794  -0.988   0.3235    
## genreMusical & Performing Arts   2.69589    2.18990   1.231   0.2188    
## genreMystery & Suspense         -3.31306    1.24533  -2.660   0.0080 ** 
## genreOther                       0.01143    1.92924   0.006   0.9953    
## genreScience Fiction & Fantasy   0.23198    2.43863   0.095   0.9242    
## imdb_rating                      9.69860    0.37378  25.947  < 2e-16 ***
## audience_ratingUpright          20.28844    0.77402  26.212  < 2e-16 ***
## best_pic_nomyes                  2.72532    1.54616   1.763   0.0784 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.854 on 637 degrees of freedom
## Multiple R-squared:  0.8874, Adjusted R-squared:  0.8851 
## F-statistic: 386.2 on 13 and 637 DF,  p-value: < 2.2e-16

There are no variables with insignificant p-values (if categorical on all levels) left. The backward elimination is complete. Notice that the adjusted R-squared is now 88.47% (this is the percentage of total variance explained by the model) is lower than the initial adjusted R-squared of 88.53%. This is the trade-off with choosing a parsimonious model.

final_model <- lm(audience_score ~ genre + imdb_rating + audience_rating, data = movies)

summary(final_model)

## 
## Call:
## lm(formula = audience_score ~ genre + imdb_rating + audience_rating, 
##     data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.636  -4.441   0.584   4.382  25.095 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -12.5871     2.1972  -5.729 1.56e-08 ***
## genreAnimation                   3.6255     2.4532   1.478  0.13995    
## genreArt House & International  -2.7917     2.0336  -1.373  0.17031    
## genreComedy                      1.5118     1.1279   1.340  0.18060    
## genreDocumentary                 0.4068     1.3639   0.298  0.76558    
## genreDrama                      -0.8352     0.9597  -0.870  0.38446    
## genreHorror                     -1.6205     1.6706  -0.970  0.33241    
## genreMusical & Performing Arts   2.5393     2.1917   1.159  0.24706    
## genreMystery & Suspense         -3.2767     1.2472  -2.627  0.00881 ** 
## genreOther                       0.2735     1.9267   0.142  0.88718    
## genreScience Fiction & Fantasy   0.2566     2.4426   0.105  0.91637    
## imdb_rating                      9.8078     0.3692  26.564  < 2e-16 ***
## audience_ratingUpright          20.3099     0.7752  26.199  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.866 on 638 degrees of freedom
## Multiple R-squared:  0.8869, Adjusted R-squared:  0.8847 
## F-statistic: 416.8 on 12 and 638 DF,  p-value: < 2.2e-16

It needs to be ensured that the conditions for multiple linear regression are met: a linear relationship between numerical variables, nearly normal residuals with mean 0, and constant variability of residuals.

The only numerical value in our final model is imdb_rating. To check for linearity, plot the residuals against imdb_rating. We see a complete random scatter around the line y = 0. The relationship is linear.

plot(
  final_model$residuals ~ movies$imdb_rating, 
  main = "Residuals of Final Model against IMDB Rating", 
  xlab = "IMDB Rating",
  ylab = "Residuals"
  )
abline(h = 0)

The residuals are normally distributed with minimal deviation from the Q-Q line. The mean of residuals is etremely close to 0. The condition of normality is satisfied.

hist(final_model$residuals)

qqnorm(final_model$residuals)
qqline(final_model$residuals)

mean(final_model$residuals)

## [1] -2.251676e-18

Note that clusters are present due to the presence of categorical variables in the model (most likely due to audience_rating since it’s a factor of two levels). Since both clusters have positive and negative residuals, this is fine–neither cluster overestimates or underestimates the result. What’s more important is that the residuals are randomly scattered about the line y = 0 with constant width. The condition of constant variance is satisfied.

plot(
  final_model$residuals ~ final_model$fitted, 
  main = "Residuals of Final Model against Fitted Values of Final Model",
  xlab = "Fitted Values",
  ylab = "Residuals"
  )
abline(h = 0)

plot(
  abs(final_model$residuals) ~ final_model$fitted, 
  main = "Absolute Value of Residuals of Final Model against \n Fitted Values of Final Model",
  xlab = "Fitted Values",
  ylab = "Absolute Value of Residuals"
  )
abline(h = 0)

Part 5: Prediction

Consider the movie Sing released in 2016. This movie had an audience rating of “Upright,” and an IMDB of 7.1, and a genre of “Animation,” “Comedy,” and “Musical & Performing Arts.” This produces three predicted audience scores based on genre including a 95% confidence interval. If the genre is Animation, then the predicted audience score is 80.98 with a possible range of 76.44 to 85.53. If the genre is Comedy, then the predicted audience score is 78.87 with a possible range of 77.20 to 80.54. If the genre is Musical & Performing Arts, then the predicted audience score is 79.90 with a possible range of 76.00 to 83.80.

pred_data <- tibble(
  audience_rating = rep("Upright", times = 3),
  imdb_rating = rep(7.1, times = 3),
  genre = c("Animation", "Comedy", "Musical & Performing Arts")
)

predict(final_model, pred_data)

##        1        2        3 
## 80.98379 78.87006 79.89759

predict(final_model, pred_data, interval = "confidence")

##        fit      lwr      upr
## 1 80.98379 76.44046 85.52712
## 2 78.87006 77.20414 80.53599
## 3 79.89759 75.99790 83.79729

Part 6: Conclusion

According to Rotten Tomatoes, the audience rating for Sing is 73%, which is slightly below the widest confidence interval (Animation). This could be due to many factors, for example the presence of primarily categorical variables in the final model or the choice of parsimony over higher adjusted R-squared. A better possibility is that the true relationship is nonlinear. audience_score is a count variable on a scale of whole numbers from 0 to 100 signifiying a given percentage, so a generalized linear model used for count variables is probably a better choice. This is not demonstrated above since the project topic is multiple linear regression.

Through the backward elimination p-value method for model selection, it has been shown that genre, IMDB rating, and Rotten Tomatoes audience rating are sufficient predictors for popular movies. Additionally, Titanic was the only film to be on the BoxOfficeMojo Top 200 Box Office list and win best picture.