Part 1: Data
The dataset is comprised of information from Rotten Tomatoes and
IMBD about 651 movies produced and released before 2016. Aside from the
size of this dataset, this data is generalizable due to the
data-collection methods of Rotten Tomatoes and IMDB. Rotten Tomatoes
employs a team of curators to read through thousands of movie and TV
reviews every week and have a specific algorithm for accurately
representing the data. IMDB pulls industry information through on-screen
credits, press kits, interviews, and so forth while also allowing users
to upload information. Their data consistently undergoes consistency
checks to ensure accuracy and reliability.
Inherently, bias is possible within the analysis of data. Selection
bias may occur if any genres or ratings are underrepresented in the
sample (for example, there are 9 Animation movies and 305 Drama movies).
Confirmation bias may be present if the reviewers only selected or
positively rated movies of interest to them. Due to the user-included
information of IMDB, some selections may be subject to response bias if
users submitted incorrect information.
This dataset is comprehensive, randomly selected, is gathered from
reliable sources, and compromises multiple decades and genres, so it is
generalizable for the purposes of this project. Causality cannot be
inferred since this data is observational and no random assignment was
used for this data . This project is strictly for proof of concept.
Part 3: Exploratory Data Analysis
The main variables of interest are the response variables
audience_score and imdb_rating. In addition to these, explore the
relationships among these and the explanatory variables title_type,
genre, runtime, thtr_rel_year (theatrical release year), and
dvd_rel_year (DVD release year). These may be associated with the
popularity of a given movie.
First, get a look at the structure of the dataset (651 observations
(rows) over 32 variables (columns)); it is comprised of character,
factor, numeric, and integer data types. A brief survey of which columns
contain missing values (NAs) is helpful. The majority of missing values
are found in the director, and actor1 through actor5 variables. The
actor1 through actor5 are contain the names of the first, second, third,
fourth, and fifth principal actors in a film. The specificity of these
variables and the lack of principal actors in a film may explain these
missing values. Most variables do not contain missing values, however
studio, dvd_rel_year, dvd_rel_month, dvd_rel_day all contain 8 missing
values. This merits a closer look.
## tibble [651 × 32] (S3: tbl_df/tbl/data.frame)
## $ title : chr [1:651] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ thtr_rel_year : num [1:651] 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num [1:651] 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num [1:651] 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num [1:651] 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num [1:651] 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num [1:651] 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ director : chr [1:651] "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr [1:651] "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr [1:651] "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr [1:651] "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr [1:651] "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr [1:651] "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr [1:651] "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr [1:651] "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
## title title_type genre runtime
## 0 0 0 1
## mpaa_rating studio thtr_rel_year thtr_rel_month
## 0 8 0 0
## thtr_rel_day dvd_rel_year dvd_rel_month dvd_rel_day
## 0 8 8 8
## imdb_rating imdb_num_votes critics_rating critics_score
## 0 0 0 0
## audience_rating audience_score best_pic_nom best_pic_win
## 0 0 0 0
## best_actor_win best_actress_win best_dir_win top200_box
## 0 0 0 0
## director actor1 actor2 actor3
## 2 2 7 9
## actor4 actor5 imdb_url rt_url
## 13 15 0 0
A common mistake would be assuming that there are only eight movies
with missing data since there are several variables with 8 missing
observations. That is why accurate filtering is important since there
are 16 observations in total. Notice that if_all allows filtering over
multiple rows with a given function by searching for a string of
characters in the variable names. if_all was chosen rather than if_any
because if_all is a stricter condition for filtering, requiring that (as
seen above in the counts of missing values per variable) all variables
with “dvd” in their name must have missing values rather than at least
one of them.
missing_subset <- movies %>%
filter(
is.na(studio) |
if_all(contains("dvd"), is.na)
)
missing_subset
## # A tibble: 16 × 32
## title title_type genre runtime mpaa_rating studio thtr_rel_year
## <chr> <fct> <fct> <dbl> <fct> <fct> <dbl>
## 1 Dirty Sanchez: The… Documenta… Come… 94 R <NA> 2007
## 2 Charlie: The Life … Documenta… Docu… 132 Unrated Warne… 2004
## 3 Streets of Gold Feature F… Drama 95 R Live … 1986
## 4 Caveman Feature F… Acti… 91 PG <NA> 1981
## 5 Attack of the 50 F… TV Movie Other 90 R <NA> 1993
## 6 The Squeeze Feature F… Acti… 101 PG-13 HBO V… 1987
## 7 Oliver & Company Feature F… Anim… 74 G <NA> 1988
## 8 The Man Who Sued G… Feature F… Drama 97 R <NA> 2001
## 9 Electric Dreams Feature F… Drama 95 PG MGM 1984
## 10 Porky's Revenge Feature F… Art … 92 R 20th … 1985
## 11 Teen Wolf Too Feature F… Scie… 95 PG Param… 1987
## 12 Inserts Feature F… Drama 117 NC-17 <NA> 1975
## 13 The Last Remake of… Feature F… Acti… 85 PG MCA U… 1977
## 14 Let It Be Documenta… Docu… 81 G Unite… 1970
## 15 Inbred Feature F… Art … 90 R <NA> 2011
## 16 Death Line (Raw Me… Feature F… Horr… 87 R <NA> 1972
## # ℹ 25 more variables: thtr_rel_month <dbl>, thtr_rel_day <dbl>,
## # dvd_rel_year <dbl>, dvd_rel_month <dbl>, dvd_rel_day <dbl>,
## # imdb_rating <dbl>, imdb_num_votes <int>, critics_rating <fct>,
## # critics_score <dbl>, audience_rating <fct>, audience_score <dbl>,
## # best_pic_nom <fct>, best_pic_win <fct>, best_actor_win <fct>,
## # best_actress_win <fct>, best_dir_win <fct>, top200_box <fct>,
## # director <chr>, actor1 <chr>, actor2 <chr>, actor3 <chr>, actor4 <chr>, …
Now consider the variables audience_score, imdb_rating,
imdb_num_votes. title_type, genre, runtime, thtr_rel_year (theatrical
release year), and dvd_rel_year (DVD release year) for the entire
dataset. Out of these variables, title_type, genre, thtr_rel_year, and
dvd_rel_year are categorical variables, and audience_score, imdb_rating,
imdb_num_votes, and runtime are numerical variables.
First, consider the categorical variables. Bar plots and frequency
tables allow the exploration and visualization of categorical data.
Among the full dataset, Feature Film was the most popular title type and
nearly half of the movies were Dramas. Comedy and Action & Adventure
were the second- and third-most popular genres. thr_rel_year (spanned
1970-2014) had releases primarily concentrated in 2000 onward. From
dvd_rel_year(spanned 1991-2015), the early 2000s saw a majority of DVD
releases. It appears to be more likely that movies had a different
Theatrical and DVD release year.
movies %>% count(title_type) %>% arrange(-n) %>% rename(`Title Type` = title_type, Total = n)
## # A tibble: 3 × 2
## `Title Type` Total
## <fct> <int>
## 1 Feature Film 591
## 2 Documentary 55
## 3 TV Movie 5
movies %>% count(genre) %>% arrange(-n) %>% rename(Genre = genre, Total = n)
## # A tibble: 11 × 2
## Genre Total
## <fct> <int>
## 1 Drama 305
## 2 Comedy 87
## 3 Action & Adventure 65
## 4 Mystery & Suspense 59
## 5 Documentary 52
## 6 Horror 23
## 7 Other 16
## 8 Art House & International 14
## 9 Musical & Performing Arts 12
## 10 Animation 9
## 11 Science Fiction & Fantasy 9
movies %>% count(thtr_rel_year) %>% arrange(-n) %>% rename(`Theatrical Release Year` = thtr_rel_year, Total = n)
## # A tibble: 44 × 2
## `Theatrical Release Year` Total
## <dbl> <int>
## 1 2006 33
## 2 2007 33
## 3 2004 28
## 4 1996 27
## 5 2012 27
## 6 2011 26
## 7 2003 25
## 8 2002 23
## 9 1993 22
## 10 2008 22
## # ℹ 34 more rows
movies %>% count(dvd_rel_year) %>% arrange(-n) %>% rename(`DVD Release Year` = dvd_rel_year, Total = n)
## # A tibble: 23 × 2
## `DVD Release Year` Total
## <dbl> <int>
## 1 2001 73
## 2 2002 58
## 3 2004 57
## 4 2003 51
## 5 2000 44
## 6 2005 42
## 7 2007 41
## 8 1999 37
## 9 1998 32
## 10 2013 31
## # ℹ 13 more rows
ggplot(movies) +
geom_bar(aes(x = thtr_rel_year)) +
scale_x_continuous(breaks = seq(1970, 2015, 5)) +
scale_y_continuous(breaks = seq(0, 35, 5)) +
labs(
title = "Movies by Theatrical Release Year",
x = "Year",
y = "Total"
) +
theme_classic() +
theme(plot.title = element_text(h = 0.5))

ggplot(movies) +
geom_bar(aes(x = dvd_rel_year)) +
scale_x_continuous(breaks = seq(1990,2015, 5)) +
scale_y_continuous(breaks = seq(0, 75, 5)) +
labs(
title = "Movies by DVD Release Year",
x = "Year",
y = "Total"
) +
theme_classic() +
theme(plot.title = element_text(h = 0.5))
## Warning: Removed 8 rows containing non-finite outside the scale range
## (`stat_count()`).

movies %>% count(thtr_rel_year == dvd_rel_year) %>%
rename(`Same Release Year - Theatrical and DVD` = `thtr_rel_year == dvd_rel_year`, Total = n)
## # A tibble: 3 × 2
## `Same Release Year - Theatrical and DVD` Total
## <lgl> <int>
## 1 FALSE 466
## 2 TRUE 177
## 3 NA 8
To get a better idea of the above plot, it’s worth looking at the
same information with imdb_rating and imdb_num_votes. Consider the IMDB
ratings on a scale of 1 to 10 (1 being the worst, 10 being the best).
Documentary and Feature Film each have outliers (observations outside of
the expected range). For boxplots, outliers are indetified by
subtracting 1.5 times the IQR from the first quartile (median of “lower
half” of data) and adding 1.5 times the IQR to the third quartile
(median of “upper half” of data); these two values are where the
respective minimum and maximum would be expected to lie. Points below or
above these ones are the outliers. Documentary has one outlier below the
minimum, and Feature Film has multiple outliers below the minimum and
one outlier above the maximum. Pay attention to the spread of data.
Documentary and Feature film appear to be symmetric about the median
(middle bar in the boxplot), suggesting little to no skew. The median of
TV Movie is the same as the third quartile, suggesting this title type
has significant left skew. This may be due to such a low number of
observations for TV Movie.
Check to see if audience_score or imdb_rating are positively
associated with runtime. This can be done using a linear model with the
ratings as response variables and with runtime as an explanatory
variable. Although the p-value was significant, it only explained about
3.3% of the variability in the model. Additionally, notice that the
conditions of linear regression were violated: this is not a linear
relationship (the residuals versus fitted plot curved up and down on the
ends), errors didn’t have constant variance (residuals versus fitted
plot shows a blob between 60 and 70 rather than a random cloud of data
points dispersed about the line y = 0), and the errors were not normally
distributed (the tails of the normal Q-Q plot deviate after the -1 and
+1 theoretical quantiles). Note: the documentary “The End of America” is
excluded from the residuals plot below due to not having a value for
runtime.
ggplot(movies, aes(x = runtime, y = audience_score)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
scale_x_continuous(breaks = seq(0, 300, 25), limits = c(0, 300)) +
scale_y_continuous(breaks = seq(0, 100, 10), limits = c(0, 100)) +
labs(
title = "Rotten Tomatoes Audience Score versus Runtime",
x = "Runtime",
y = "Audience Score",
caption = "1 observation from runtime is missing."
) +
theme_classic() +
theme(plot.title = element_text(h = 0.5))
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

model_1 <- lm(audience_score ~ runtime, data = movies)
summary(model_1)
##
## Call:
## lm(formula = audience_score ~ runtime, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.641 -15.626 3.008 17.080 34.950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.4203 4.3256 9.807 < 2e-16 ***
## runtime 0.1883 0.0402 4.684 3.43e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.92 on 648 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.03275, Adjusted R-squared: 0.03125
## F-statistic: 21.94 on 1 and 648 DF, p-value: 3.431e-06
plot(model_1$residuals ~ movies$runtime[movies$title != "The End of America"])


This model behaved slightly better. run_time had a significant
p-value, yet it only explained about 7.2% of the variability. The
conditions of linear modeling were still violated. The residuals versus
fitted plot does not show a linear relationship due to the sharp
decrease around 6 on the fitted values scale. The residuals still had a
blob pattern rather than the desired random cloud, so constant variance
was violated. The errors were normally distributed but had noticeable
left skew.
ggplot(movies, aes(x = runtime, y = imdb_rating)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
scale_x_continuous(breaks = seq(0, 300, 25), limits = c(0, 300)) +
scale_y_continuous(breaks = seq(0, 10, 1), limits = c(0, 10)) +
labs(
title = "IMDB Rating versus Runtime",
x = "Runtime",
y = "IMDB Rating",
caption = "1 observation from runtime is missing."
) +
theme_classic() +
theme(plot.title = element_text(h = 0.5))
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

model_2 <- lm(imdb_rating ~ runtime, data = movies)
summary(model_2)
##
## Call:
## lm(formula = imdb_rating ~ runtime, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3099 -0.5791 0.0714 0.7572 2.2500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.907873 0.227163 21.605 < 2e-16 ***
## runtime 0.014965 0.002111 7.088 3.56e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.046 on 648 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.07195, Adjusted R-squared: 0.07052
## F-statistic: 50.24 on 1 and 648 DF, p-value: 3.564e-12
plot(model_2$residuals ~ movies$runtime[movies$title != "The End of America"])


Consider some other variables of interest: mpaa_rating,
audience_rating (“Upright” - at least 60% of audience reviews were 3.5
stars out of 5 stars or more, “Spilled” - at least 60% of audience
reviews were 3.5 stars out of 5 stars or higher), critics_rating
(“Fresh” - less than 60% review score, “Certified Fresh” - specialized
“Fresh” criteria that is harder to obtain, “Rotten” - less than 60%
review score), critic_score (Rotten Tomatoes only, 0 to 100 scale),
best_pic_nom (best picture nomination), best_pic_win (did the movie win
best picture), and top200_box (was the movie on the Top 200 Box Office
list from BoxOfficeMojo).
Out of all movies rated by the audience, only 6 out of 11 genres had
more Upright ratings than Spilled ratings. Among genres, Documentary
movies had the highest percentage of Upright ratings by proportion. By
nature, obtaining a Certified Fresh rating is more difficult than
obtaining a Fresh rating. There are no genres in which there were more
Certified Fresh ratings than Rotten ratings, but there are 3 genres in
which there were more Fresh ratings than Rotten ratings. Again,
Documentary movies had the highest percentage of Fresh or higher ratings
by proportion.
genre_labels <- c("Action", "Animation", "Art", "Comedy", "Doc", "Drama", "Horror", "Music", "Mystery", "Other", "Sci-Fi")
ggplot(movies) +
geom_bar(aes(x = genre, fill = audience_rating), position = "dodge") +
scale_x_discrete(labels = genre_labels) +
scale_y_continuous(breaks = seq(0, 200, 20), limits = c(0, 200)) +
labs(
title = "Audience Rating by Genre",
x = "Genre",
y = "Total",
fill = "Audience Rating"
) +
theme_classic() +
theme(plot.title = element_text(h = 0.5))

movies %>%
group_by(genre) %>%
reframe(
`Number of Movies` = n(),
`Percentage of Upright Audience Ratings` = round(sum(audience_rating == "Upright")/n() * 100, 2)
) %>%
arrange(-`Percentage of Upright Audience Ratings`) %>%
rename(Genre = genre)
## # A tibble: 11 × 3
## Genre `Number of Movies` Percentage of Upright Audience…¹
## <fct> <int> <dbl>
## 1 Documentary 52 98.1
## 2 Musical & Performing Arts 12 91.7
## 3 Art House & International 14 71.4
## 4 Other 16 68.8
## 5 Animation 9 66.7
## 6 Drama 305 65.6
## 7 Mystery & Suspense 59 40.7
## 8 Action & Adventure 65 38.5
## 9 Comedy 87 35.6
## 10 Science Fiction & Fantasy 9 33.3
## 11 Horror 23 17.4
## # ℹ abbreviated name: ¹`Percentage of Upright Audience Ratings`
ggplot(movies) +
geom_bar(aes(x = genre, fill = critics_rating), position = "dodge") +
scale_x_discrete(labels = genre_labels) +
scale_y_continuous(breaks = seq(0, 140, 20), limits = c(0, 140)) +
labs(
title = "Critic Rating by Genre",
x = "Genre",
y = "Total",
fill = "Critic Rating"
) +
theme_classic() +
theme(plot.title = element_text(h = 0.5))

movies %>%
group_by(genre) %>%
reframe(
`Number of Movies` = n(),
`Percentage of Fresh or Higher Critic Ratings` = round(sum(critics_rating != "Rotten")/n() * 100, 2)
) %>%
arrange(-`Percentage of Fresh or Higher Critic Ratings`) %>%
rename(Genre = genre)
## # A tibble: 11 × 3
## Genre `Number of Movies` Percentage of Fresh or Higher …¹
## <fct> <int> <dbl>
## 1 Documentary 52 94.2
## 2 Musical & Performing Arts 12 75
## 3 Other 16 68.8
## 4 Drama 305 59.3
## 5 Science Fiction & Fantasy 9 55.6
## 6 Mystery & Suspense 59 50.8
## 7 Art House & International 14 50
## 8 Animation 9 44.4
## 9 Horror 23 30.4
## 10 Comedy 87 27.6
## 11 Action & Adventure 65 26.2
## # ℹ abbreviated name: ¹`Percentage of Fresh or Higher Critic Ratings`
The only genres to be nominated for best picture were Comedy, Drama,
Mystery & Suspense, and Other. The same genres were also present on
the Top 200 Box Office list. Out of those categories, only Other did not
win best picture. The only genre that was on the Top 200 Box Office list
and won best picture was Drama (the movie was Titanic).
movies %>%
group_by(genre) %>%
select(genre, best_pic_nom) %>%
count(best_pic_nom) %>%
arrange(genre) %>%
rename(Genre = genre, `Best Picture Nomination` = best_pic_nom, Total = n)
## # A tibble: 15 × 3
## # Groups: Genre [11]
## Genre `Best Picture Nomination` Total
## <fct> <fct> <int>
## 1 Action & Adventure no 65
## 2 Animation no 9
## 3 Art House & International no 14
## 4 Comedy no 86
## 5 Comedy yes 1
## 6 Documentary no 52
## 7 Drama no 288
## 8 Drama yes 17
## 9 Horror no 23
## 10 Musical & Performing Arts no 12
## 11 Mystery & Suspense no 57
## 12 Mystery & Suspense yes 2
## 13 Other no 14
## 14 Other yes 2
## 15 Science Fiction & Fantasy no 9
movies %>%
group_by(genre) %>%
select(genre, best_pic_win) %>%
count(best_pic_win) %>%
arrange(genre) %>%
rename(Genre = genre, `Did it win best picture?` = best_pic_win, Total = n)
## # A tibble: 14 × 3
## # Groups: Genre [11]
## Genre `Did it win best picture?` Total
## <fct> <fct> <int>
## 1 Action & Adventure no 65
## 2 Animation no 9
## 3 Art House & International no 14
## 4 Comedy no 86
## 5 Comedy yes 1
## 6 Documentary no 52
## 7 Drama no 300
## 8 Drama yes 5
## 9 Horror no 23
## 10 Musical & Performing Arts no 12
## 11 Mystery & Suspense no 58
## 12 Mystery & Suspense yes 1
## 13 Other no 16
## 14 Science Fiction & Fantasy no 9
movies %>%
group_by(genre) %>%
select(genre, top200_box) %>%
count(top200_box) %>%
arrange(genre) %>%
rename(Genre = genre, `Was it on the Top 200 Box Office list?` = top200_box, Total = n)
## # A tibble: 17 × 3
## # Groups: Genre [11]
## Genre `Was it on the Top 200 Box Office list?` Total
## <fct> <fct> <int>
## 1 Action & Adventure no 60
## 2 Action & Adventure yes 5
## 3 Animation no 9
## 4 Art House & International no 14
## 5 Comedy no 86
## 6 Comedy yes 1
## 7 Documentary no 52
## 8 Drama no 299
## 9 Drama yes 6
## 10 Horror no 23
## 11 Musical & Performing Arts no 12
## 12 Mystery & Suspense no 58
## 13 Mystery & Suspense yes 1
## 14 Other no 15
## 15 Other yes 1
## 16 Science Fiction & Fantasy no 8
## 17 Science Fiction & Fantasy yes 1
movies %>%
group_by(genre) %>%
select(genre, top200_box, best_pic_win) %>%
count(top200_box == "yes" & best_pic_win == "yes") %>%
arrange(genre) %>%
rename(
Genre = genre,
`Was it on the Top 200 Box Office list and won best picture?` = `top200_box == "yes" & best_pic_win == "yes"`,
Total = n
)
## # A tibble: 12 × 3
## # Groups: Genre [11]
## Genre Was it on the Top 200 Box Office list and w…¹ Total
## <fct> <lgl> <int>
## 1 Action & Adventure FALSE 65
## 2 Animation FALSE 9
## 3 Art House & International FALSE 14
## 4 Comedy FALSE 87
## 5 Documentary FALSE 52
## 6 Drama FALSE 304
## 7 Drama TRUE 1
## 8 Horror FALSE 23
## 9 Musical & Performing Arts FALSE 12
## 10 Mystery & Suspense FALSE 59
## 11 Other FALSE 16
## 12 Science Fiction & Fantasy FALSE 9
## # ℹ abbreviated name:
## # ¹`Was it on the Top 200 Box Office list and won best picture?`
movies %>% filter(top200_box == "yes" & best_pic_win == "yes") %>% select(title)
## # A tibble: 1 × 1
## title
## <chr>
## 1 Titanic
Part 4: Modeling
To construct a model for explaining movie popularity through
audience_score and imdb_rating, it has to be decided which variables are
the best for explaining the relationship. This can be done with forward
selection (adding variables to the model) or backward elimination
(removing variables from the model) among other techniques. Backward
elimination will be done here. In backward elimination, start with a
full model and remove variables based on either improved adjusted
R-squared or p-value. The adjusted R-squared method requires removing
each variable one at a time over and over until the adjusted R-squared
does not improve. The p-value method works similarly but removing the
largest p-value instead. The adjusted R-squared method results in a
stronger prediction power by prioritizing the highest adjusted R-squared
value, but the p-value method results in the simplest model with a
slightly lower prediction power (parsimony). Here parsimony is favored,
so the backward elimination p-value method is used.
It is obvious that the variables imdb_url and rt_url will not offer
anything to the model since they’re just URLs to the website entry for
each movie. Don’t use title for similar reasons. If absolutely all
variables are included, then the adjusted R-squared will be 1 (all
variability is explained by the model) but the model will be overfitted
(it will look “too busy” and be hard to determine trends). Instead
create a full model for audience_score using title_type, genre, runtime,
mpaa_rating, thtr_rel_year, imdb_rating, imdb_num_votes, critic_score,
critics_rating, audience_rating, best_pic_nom, best_pic_win, and
top200_box.
full_model <- lm(audience_score ~ title_type + genre + runtime + mpaa_rating + thtr_rel_year + imdb_rating + imdb_num_votes + critics_score + critics_rating + audience_rating + best_pic_nom + best_pic_win + top200_box, data = movies)
Before running the model, the variables need to be checked for
multicollinearity. This happens when independent variables are strongly
correlated to each other to the extent that the model is overfitted and
as a result it is difficult to interpret. When multicollinearity is
present, the variances of the variables are inflated; one way to check
for multicollinearity is to check the Variance Inflation Factor with the
function vif. If the VIF is measured as greater than 5, there is a
suffucient amount of multicollinearity present. Multicollinearity can be
handled by dropping the strongly correlated variables, averaging them,
or reducing dimensionality using principal component analysis (PCA).
Here, the strongly correlated variables will be dropped.
Note that there are three measures present for the variables: GVIF,
Df, and GVID^(1/(2*Df)). Since categorical variables are included in the
model, the measure GVIF (generalized VIF) is included to calculate a VIF
that would be the same regardless of the factor level. Df (degrees of
freedom) represents the number of independent pieces to calculate a
statistic. The final column represents the generalized standard
inflation factor, takign the GVIF to the fractional power of 1/2 times
the Df column for something more generalizable should the factors have
more than two levels. For that reason, aGSIF will be the chosen measure
here. Typically a cutoff value for GVIF is to drop variables with a GVIF
of 5 or higher. When using aGSIF, take the square root of the GVIF
threshold, so if aGSIF is greater than 2.2361, drop the variable. Based
on that, critics_score should be dropped from the model.
## GVIF Df GVIF^(1/(2*Df))
## title_type 7.336109 2 1.645760
## genre 16.397918 10 1.150110
## runtime 1.443902 1 1.201625
## mpaa_rating 3.245946 5 1.124952
## thtr_rel_year 1.451996 1 1.204988
## imdb_rating 3.865947 1 1.966201
## imdb_num_votes 1.904348 1 1.379981
## critics_score 7.134938 1 2.671131
## critics_rating 6.064221 2 1.569256
## audience_rating 2.109643 1 1.452461
## best_pic_nom 1.437729 1 1.199053
## best_pic_win 1.379269 1 1.174423
## top200_box 1.165560 1 1.079611
## [1] 2.236068
Doing the process again, the next variable to be dropped is
critics_rating, which is insignificant on all levels and has the highest
p-value on the level Fresh.
model1 <- lm(audience_score ~ title_type + genre + runtime + thtr_rel_year + imdb_rating + imdb_num_votes + critics_rating + audience_rating + best_pic_nom + best_pic_win + top200_box, data = movies)
summary(model1)
##
## Call:
## lm(formula = audience_score ~ title_type + genre + runtime +
## thtr_rel_year + imdb_rating + imdb_num_votes + critics_rating +
## audience_rating + best_pic_nom + best_pic_win + top200_box,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.4622 -4.6850 0.4051 4.2036 24.5438
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.836e+01 5.511e+01 1.603 0.1093
## title_typeFeature Film 1.920e+00 2.533e+00 0.758 0.4486
## title_typeTV Movie 5.919e-01 4.008e+00 0.148 0.8826
## genreAnimation 3.265e+00 2.467e+00 1.323 0.1862
## genreArt House & International -2.467e+00 2.053e+00 -1.201 0.2301
## genreComedy 1.389e+00 1.137e+00 1.221 0.2224
## genreDocumentary 2.733e+00 2.717e+00 1.006 0.3148
## genreDrama -7.085e-01 9.897e-01 -0.716 0.4743
## genreHorror -2.121e+00 1.680e+00 -1.262 0.2073
## genreMusical & Performing Arts 3.840e+00 2.353e+00 1.632 0.1032
## genreMystery & Suspense -3.181e+00 1.260e+00 -2.524 0.0119 *
## genreOther -4.398e-01 1.952e+00 -0.225 0.8218
## genreScience Fiction & Fantasy -4.937e-01 2.449e+00 -0.202 0.8403
## runtime -3.336e-02 1.614e-02 -2.067 0.0392 *
## thtr_rel_year -4.835e-02 2.723e-02 -1.776 0.0763 .
## imdb_rating 9.454e+00 4.273e-01 22.124 <2e-16 ***
## imdb_num_votes 3.968e-06 3.244e-06 1.223 0.2217
## critics_ratingFresh -1.010e-01 8.597e-01 -0.118 0.9065
## critics_ratingRotten -1.178e+00 9.341e-01 -1.261 0.2077
## audience_ratingUpright 1.998e+01 7.863e-01 25.410 <2e-16 ***
## best_pic_nomyes 3.393e+00 1.774e+00 1.913 0.0562 .
## best_pic_winyes -2.742e+00 3.044e+00 -0.901 0.3681
## top200_boxyes -7.445e-01 1.905e+00 -0.391 0.6961
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.835 on 627 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8898, Adjusted R-squared: 0.8859
## F-statistic: 230 on 22 and 627 DF, p-value: < 2.2e-16
The next highest p-value belongs to genre, however there is one
genre level with a significant p-value (Mystery & Suspense). Thus,
drop title_type instead.
model2 <- lm(audience_score ~ title_type + genre + runtime + thtr_rel_year + imdb_rating + imdb_num_votes + audience_rating + best_pic_nom + best_pic_win + top200_box, data = movies)
summary(model2)
##
## Call:
## lm(formula = audience_score ~ title_type + genre + runtime +
## thtr_rel_year + imdb_rating + imdb_num_votes + audience_rating +
## best_pic_nom + best_pic_win + top200_box, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.6136 -4.4575 0.5043 4.2266 24.7670
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.076e+01 5.374e+01 1.689 0.0918 .
## title_typeFeature Film 1.680e+00 2.529e+00 0.664 0.5068
## title_typeTV Movie 4.953e-01 4.005e+00 0.124 0.9016
## genreAnimation 3.453e+00 2.463e+00 1.402 0.1615
## genreArt House & International -2.385e+00 2.053e+00 -1.162 0.2458
## genreComedy 1.461e+00 1.137e+00 1.286 0.1991
## genreDocumentary 2.780e+00 2.717e+00 1.023 0.3068
## genreDrama -5.360e-01 9.840e-01 -0.545 0.5862
## genreHorror -2.001e+00 1.679e+00 -1.192 0.2337
## genreMusical & Performing Arts 3.943e+00 2.353e+00 1.676 0.0943 .
## genreMystery & Suspense -3.024e+00 1.254e+00 -2.412 0.0162 *
## genreOther -2.048e-01 1.947e+00 -0.105 0.9162
## genreScience Fiction & Fantasy -1.285e-01 2.439e+00 -0.053 0.9580
## runtime -3.536e-02 1.609e-02 -2.197 0.0284 *
## thtr_rel_year -5.054e-02 2.657e-02 -1.902 0.0576 .
## imdb_rating 9.700e+00 3.980e-01 24.372 <2e-16 ***
## imdb_num_votes 4.155e-06 3.164e-06 1.313 0.1896
## audience_ratingUpright 2.013e+01 7.767e-01 25.922 <2e-16 ***
## best_pic_nomyes 3.505e+00 1.765e+00 1.986 0.0474 *
## best_pic_winyes -2.677e+00 3.041e+00 -0.880 0.3790
## top200_boxyes -6.104e-01 1.898e+00 -0.322 0.7479
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.838 on 629 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8858
## F-statistic: 252.7 on 20 and 629 DF, p-value: < 2.2e-16
Next, drop top200_box.
model3 <- lm(audience_score ~ genre + runtime + thtr_rel_year + imdb_rating + imdb_num_votes + audience_rating + best_pic_nom + best_pic_win + top200_box, data = movies)
summary(model3)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + thtr_rel_year +
## imdb_rating + imdb_num_votes + audience_rating + best_pic_nom +
## best_pic_win + top200_box, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.6550 -4.5107 0.5136 4.2513 24.7317
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.518e+01 5.336e+01 1.784 0.0749 .
## genreAnimation 3.473e+00 2.460e+00 1.411 0.1586
## genreArt House & International -2.350e+00 2.050e+00 -1.146 0.2521
## genreComedy 1.428e+00 1.134e+00 1.259 0.2084
## genreDocumentary 1.263e+00 1.470e+00 0.860 0.3903
## genreDrama -5.308e-01 9.813e-01 -0.541 0.5887
## genreHorror -2.001e+00 1.677e+00 -1.193 0.2333
## genreMusical & Performing Arts 3.429e+00 2.220e+00 1.544 0.1230
## genreMystery & Suspense -3.015e+00 1.252e+00 -2.407 0.0164 *
## genreOther -2.811e-01 1.933e+00 -0.145 0.8844
## genreScience Fiction & Fantasy -1.404e-01 2.436e+00 -0.058 0.9541
## runtime -3.527e-02 1.607e-02 -2.194 0.0286 *
## thtr_rel_year -5.188e-02 2.649e-02 -1.959 0.0506 .
## imdb_rating 9.684e+00 3.958e-01 24.467 <2e-16 ***
## imdb_num_votes 4.363e-06 3.149e-06 1.386 0.1663
## audience_ratingUpright 2.011e+01 7.750e-01 25.945 <2e-16 ***
## best_pic_nomyes 3.523e+00 1.762e+00 1.999 0.0460 *
## best_pic_winyes -2.722e+00 3.037e+00 -0.896 0.3704
## top200_boxyes -6.212e-01 1.896e+00 -0.328 0.7433
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.83 on 631 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8892, Adjusted R-squared: 0.8861
## F-statistic: 281.4 on 18 and 631 DF, p-value: < 2.2e-16
Drop the variable best_pic_win.
model4 <- lm(audience_score ~ genre + runtime + thtr_rel_year + imdb_rating + imdb_num_votes + audience_rating + best_pic_nom + best_pic_win, data = movies)
summary(model4)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + thtr_rel_year +
## imdb_rating + imdb_num_votes + audience_rating + best_pic_nom +
## best_pic_win, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.6409 -4.4650 0.4914 4.2564 24.7323
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.333e+01 5.302e+01 1.760 0.0788 .
## genreAnimation 3.509e+00 2.456e+00 1.429 0.1536
## genreArt House & International -2.321e+00 2.047e+00 -1.134 0.2573
## genreComedy 1.457e+00 1.129e+00 1.290 0.1975
## genreDocumentary 1.286e+00 1.467e+00 0.877 0.3809
## genreDrama -4.989e-01 9.757e-01 -0.511 0.6093
## genreHorror -1.968e+00 1.673e+00 -1.177 0.2398
## genreMusical & Performing Arts 3.465e+00 2.216e+00 1.563 0.1185
## genreMystery & Suspense -2.980e+00 1.247e+00 -2.390 0.0172 *
## genreOther -2.545e-01 1.930e+00 -0.132 0.8951
## genreScience Fiction & Fantasy -1.559e-01 2.434e+00 -0.064 0.9489
## runtime -3.545e-02 1.605e-02 -2.209 0.0275 *
## thtr_rel_year -5.096e-02 2.632e-02 -1.936 0.0533 .
## imdb_rating 9.687e+00 3.954e-01 24.500 <2e-16 ***
## imdb_num_votes 4.115e-06 3.054e-06 1.347 0.1783
## audience_ratingUpright 2.010e+01 7.739e-01 25.971 <2e-16 ***
## best_pic_nomyes 3.531e+00 1.761e+00 2.005 0.0454 *
## best_pic_winyes -2.711e+00 3.035e+00 -0.893 0.3720
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.825 on 632 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8892, Adjusted R-squared: 0.8862
## F-statistic: 298.4 on 17 and 632 DF, p-value: < 2.2e-16
Drop imdb_num_votes next.
model5 <- lm(audience_score ~ genre + runtime + thtr_rel_year + imdb_rating + imdb_num_votes + audience_rating + best_pic_nom, data = movies)
summary(model5)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + thtr_rel_year +
## imdb_rating + imdb_num_votes + audience_rating + best_pic_nom,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.6008 -4.4067 0.5665 4.2431 24.7454
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.919e+01 5.281e+01 1.689 0.0917 .
## genreAnimation 3.473e+00 2.455e+00 1.414 0.1577
## genreArt House & International -2.374e+00 2.046e+00 -1.161 0.2462
## genreComedy 1.410e+00 1.128e+00 1.250 0.2118
## genreDocumentary 1.205e+00 1.464e+00 0.823 0.4107
## genreDrama -5.253e-01 9.751e-01 -0.539 0.5903
## genreHorror -1.995e+00 1.672e+00 -1.193 0.2333
## genreMusical & Performing Arts 3.417e+00 2.215e+00 1.542 0.1235
## genreMystery & Suspense -3.009e+00 1.246e+00 -2.415 0.0160 *
## genreOther -1.452e-01 1.926e+00 -0.075 0.9399
## genreScience Fiction & Fantasy -1.407e-01 2.433e+00 -0.058 0.9539
## runtime -3.594e-02 1.604e-02 -2.241 0.0254 *
## thtr_rel_year -4.887e-02 2.621e-02 -1.864 0.0627 .
## imdb_rating 9.698e+00 3.952e-01 24.542 <2e-16 ***
## imdb_num_votes 3.562e-06 2.990e-06 1.191 0.2339
## audience_ratingUpright 2.011e+01 7.737e-01 25.990 <2e-16 ***
## best_pic_nomyes 2.896e+00 1.611e+00 1.798 0.0727 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.824 on 633 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8891, Adjusted R-squared: 0.8863
## F-statistic: 317.1 on 16 and 633 DF, p-value: < 2.2e-16
Now, drop thtr_rel_year.
model6 <- lm(audience_score ~ genre + runtime + thtr_rel_year + imdb_rating + audience_rating + best_pic_nom, data = movies)
summary(model6)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + thtr_rel_year +
## imdb_rating + audience_rating + best_pic_nom, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.3445 -4.5305 0.6003 4.3237 25.0882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.06877 50.32956 1.392 0.1644
## genreAnimation 3.41004 2.45571 1.389 0.1654
## genreArt House & International -2.73156 2.02426 -1.349 0.1777
## genreComedy 1.32740 1.12630 1.179 0.2390
## genreDocumentary 0.67827 1.39591 0.486 0.6272
## genreDrama -0.74993 0.95703 -0.784 0.4336
## genreHorror -2.07738 1.67144 -1.243 0.2144
## genreMusical & Performing Arts 2.96907 2.18374 1.360 0.1744
## genreMystery & Suspense -3.13053 1.24242 -2.520 0.0120 *
## genreOther -0.12361 1.92638 -0.064 0.9489
## genreScience Fiction & Fantasy -0.03079 2.43236 -0.013 0.9899
## runtime -0.03122 0.01555 -2.008 0.0451 *
## thtr_rel_year -0.03980 0.02509 -1.586 0.1132
## imdb_rating 9.82750 0.38006 25.858 <2e-16 ***
## audience_ratingUpright 20.14844 0.77320 26.058 <2e-16 ***
## best_pic_nomyes 3.32335 1.57109 2.115 0.0348 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.826 on 634 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8888, Adjusted R-squared: 0.8862
## F-statistic: 337.9 on 15 and 634 DF, p-value: < 2.2e-16
Notice that runtime is no longer a significant predictor of
audience_score. Drop runtime from the model.
model7 <- lm(audience_score ~ genre + runtime + imdb_rating + audience_rating + best_pic_nom, data = movies)
summary(model7)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + imdb_rating +
## audience_rating + best_pic_nom, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.5455 -4.5331 0.6866 4.2669 24.9626
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.66166 2.51065 -3.848 0.000131 ***
## genreAnimation 3.16588 2.45381 1.290 0.197454
## genreArt House & International -2.82708 2.02578 -1.396 0.163337
## genreComedy 1.27478 1.12715 1.131 0.258492
## genreDocumentary 0.38907 1.38561 0.281 0.778961
## genreDrama -0.80737 0.95748 -0.843 0.399420
## genreHorror -1.98081 1.67232 -1.184 0.236671
## genreMusical & Performing Arts 2.84670 2.18498 1.303 0.193099
## genreMystery & Suspense -3.21443 1.24277 -2.586 0.009917 **
## genreOther 0.08709 1.92409 0.045 0.963911
## genreScience Fiction & Fantasy 0.17518 2.43179 0.072 0.942595
## runtime -0.02959 0.01553 -1.905 0.057181 .
## imdb_rating 9.83886 0.38045 25.861 < 2e-16 ***
## audience_ratingUpright 20.20795 0.77321 26.135 < 2e-16 ***
## best_pic_nomyes 3.32275 1.57296 2.112 0.035041 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.835 on 635 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8884, Adjusted R-squared: 0.8859
## F-statistic: 361 on 14 and 635 DF, p-value: < 2.2e-16
Similarly best_pic_nom is no longer a significant predictor. Drop
best_pic_nom.
model8 <- lm(audience_score ~ genre + imdb_rating + audience_rating + best_pic_nom, data = movies)
summary(model8)
##
## Call:
## lm(formula = audience_score ~ genre + imdb_rating + audience_rating +
## best_pic_nom, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.6918 -4.3362 0.6878 4.3046 24.8862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.92671 2.22536 -5.359 1.17e-07 ***
## genreAnimation 3.62381 2.44920 1.480 0.1395
## genreArt House & International -2.71432 2.03076 -1.337 0.1818
## genreComedy 1.45515 1.12646 1.292 0.1969
## genreDocumentary 0.60282 1.36621 0.441 0.6592
## genreDrama -0.90457 0.95892 -0.943 0.3459
## genreHorror -1.64797 1.66794 -0.988 0.3235
## genreMusical & Performing Arts 2.69589 2.18990 1.231 0.2188
## genreMystery & Suspense -3.31306 1.24533 -2.660 0.0080 **
## genreOther 0.01143 1.92924 0.006 0.9953
## genreScience Fiction & Fantasy 0.23198 2.43863 0.095 0.9242
## imdb_rating 9.69860 0.37378 25.947 < 2e-16 ***
## audience_ratingUpright 20.28844 0.77402 26.212 < 2e-16 ***
## best_pic_nomyes 2.72532 1.54616 1.763 0.0784 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.854 on 637 degrees of freedom
## Multiple R-squared: 0.8874, Adjusted R-squared: 0.8851
## F-statistic: 386.2 on 13 and 637 DF, p-value: < 2.2e-16
There are no variables with insignificant p-values (if categorical
on all levels) left. The backward elimination is complete. Notice that
the adjusted R-squared is now 88.47% (this is the percentage of total
variance explained by the model) is lower than the initial adjusted
R-squared of 88.53%. This is the trade-off with choosing a parsimonious
model.
final_model <- lm(audience_score ~ genre + imdb_rating + audience_rating, data = movies)
summary(final_model)
##
## Call:
## lm(formula = audience_score ~ genre + imdb_rating + audience_rating,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.636 -4.441 0.584 4.382 25.095
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.5871 2.1972 -5.729 1.56e-08 ***
## genreAnimation 3.6255 2.4532 1.478 0.13995
## genreArt House & International -2.7917 2.0336 -1.373 0.17031
## genreComedy 1.5118 1.1279 1.340 0.18060
## genreDocumentary 0.4068 1.3639 0.298 0.76558
## genreDrama -0.8352 0.9597 -0.870 0.38446
## genreHorror -1.6205 1.6706 -0.970 0.33241
## genreMusical & Performing Arts 2.5393 2.1917 1.159 0.24706
## genreMystery & Suspense -3.2767 1.2472 -2.627 0.00881 **
## genreOther 0.2735 1.9267 0.142 0.88718
## genreScience Fiction & Fantasy 0.2566 2.4426 0.105 0.91637
## imdb_rating 9.8078 0.3692 26.564 < 2e-16 ***
## audience_ratingUpright 20.3099 0.7752 26.199 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.866 on 638 degrees of freedom
## Multiple R-squared: 0.8869, Adjusted R-squared: 0.8847
## F-statistic: 416.8 on 12 and 638 DF, p-value: < 2.2e-16
It needs to be ensured that the conditions for multiple linear
regression are met: a linear relationship between numerical variables,
nearly normal residuals with mean 0, and constant variability of
residuals.
The only numerical value in our final model is imdb_rating. To check
for linearity, plot the residuals against imdb_rating. We see a complete
random scatter around the line y = 0. The relationship is linear.
plot(
final_model$residuals ~ movies$imdb_rating,
main = "Residuals of Final Model against IMDB Rating",
xlab = "IMDB Rating",
ylab = "Residuals"
)
abline(h = 0)

The residuals are normally distributed with minimal deviation from
the Q-Q line. The mean of residuals is etremely close to 0. The
condition of normality is satisfied.
hist(final_model$residuals)

qqnorm(final_model$residuals)
qqline(final_model$residuals)

mean(final_model$residuals)
## [1] -2.251676e-18
Note that clusters are present due to the presence of categorical
variables in the model (most likely due to audience_rating since it’s a
factor of two levels). Since both clusters have positive and negative
residuals, this is fine–neither cluster overestimates or underestimates
the result. What’s more important is that the residuals are randomly
scattered about the line y = 0 with constant width. The condition of
constant variance is satisfied.
plot(
final_model$residuals ~ final_model$fitted,
main = "Residuals of Final Model against Fitted Values of Final Model",
xlab = "Fitted Values",
ylab = "Residuals"
)
abline(h = 0)

plot(
abs(final_model$residuals) ~ final_model$fitted,
main = "Absolute Value of Residuals of Final Model against \n Fitted Values of Final Model",
xlab = "Fitted Values",
ylab = "Absolute Value of Residuals"
)
abline(h = 0)
