The data set comprises 651 randomly sampled movies produced and released before 2016. Any analysis is therefore generalizable to movies produced and released before 2016. Because there were no experimental groups and no random assignment, analysis of this data set cannot establish causality, only associations and correlations.
#Create new variable based on `title_type`: New variable should be called `feature_film` with levels yes (movies that are feature films) and no
summary(movies$title_type)
##  Documentary Feature Film     TV Movie
##           55          591            5
movies <- movies %>% mutate(feature_film = ifelse(as.character(title_type) == "Feature Film", "yes", "no"))
movies %>% group_by(feature_film) %>% summarise(count = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## feature_film count
## <chr> <int>
## 1 no 60
## 2 yes 591
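The recoding pattern used here, and again for the later indicator variables, can be checked on a small example. This is a minimal sketch in base R; `title_type_demo` is a made-up vector for illustration, not a sample from the movies data.

```r
# Hypothetical toy values; not taken from the movies data set.
title_type_demo <- factor(c("Documentary", "Feature Film", "TV Movie", "Feature Film"))

# Same pattern as above: compare the level label as text, then map to "yes"/"no".
feature_film_demo <- ifelse(as.character(title_type_demo) == "Feature Film", "yes", "no")

# Cross-tabulate the original levels against the new indicator to confirm the mapping.
check <- table(title_type_demo, feature_film_demo)
print(check)
```

A cross-tabulation like this makes it easy to confirm that only the intended level ends up in the "yes" column.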
#Create new variable based on `genre`: New variable should be called `drama` with levels yes (movies that are dramas) and no
summary(movies$genre)
##        Action & Adventure                 Animation Art House & International
##                        65                         9                        14
##                    Comedy               Documentary                     Drama
##                        87                        52                       305
##                    Horror Musical & Performing Arts        Mystery & Suspense
##                        23                        12                        59
##                     Other Science Fiction & Fantasy
##                        16                         9
movies <- movies %>% mutate(drama = ifelse(as.character(genre) == "Drama", "yes", "no"))
movies %>% group_by(drama) %>% summarise(count = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## drama count
## <chr> <int>
## 1 no 346
## 2 yes 305
#Create new variable based on `mpaa_rating`: New variable should be called `mpaa_rating_R` with levels yes (movies that are R rated) and no
summary(movies$mpaa_rating)
##       G   NC-17      PG   PG-13       R Unrated
##      19       2     118     133     329      50
movies <- movies %>% mutate(mpaa_rating_R = ifelse(as.character(mpaa_rating) == "R", "yes", "no"))
movies %>% group_by(mpaa_rating_R) %>% summarise(count = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## mpaa_rating_R count
## <chr> <int>
## 1 no 322
## 2 yes 329
#Create two new variables based on `thtr_rel_month`: New variable called `oscar_season` with levels yes (if movie is released in October, November, or December) and no. New variable called `summer_season` with levels yes (if movie is released in May, June, July, or August) and no
summary(movies$thtr_rel_month)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    4.00    7.00    6.74   10.00   12.00
oscar_month <- c(11, 10, 12)
movies <- movies %>% mutate(oscar_season = ifelse(thtr_rel_month %in% oscar_month, "yes", "no"))
movies %>% select(oscar_season, thtr_rel_month) %>% head(n = 10)
## # A tibble: 10 x 2
## oscar_season thtr_rel_month
## <chr> <dbl>
## 1 no 4
## 2 no 3
## 3 no 8
## 4 yes 10
## 5 no 9
## 6 no 1
## 7 no 1
## 8 yes 11
## 9 no 9
## 10 no 3
summer_months <- c(5,6,7,8)
movies <- movies %>% mutate(summer_season = ifelse(thtr_rel_month %in% summer_months, "yes", "no"))
movies %>% select(summer_season, thtr_rel_month) %>% head(n = 10)
## # A tibble: 10 x 2
## summer_season thtr_rel_month
## <chr> <dbl>
## 1 no 4
## 2 no 3
## 3 yes 8
## 4 no 10
## 5 no 9
## 6 no 1
## 7 no 1
## 8 no 11
## 9 no 9
## 10 no 3
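Both season flags rely on `%in%` to test month membership. The sketch below applies the same definitions to every month from 1 to 12 and confirms that no month is flagged for both seasons. It uses only base R; the `*_demo` names are illustrative and do not appear in the analysis.

```r
# All twelve calendar months, so every possible input value is covered.
months_demo <- 1:12

oscar_month <- c(10, 11, 12)
summer_months <- c(5, 6, 7, 8)

# Same membership test as in the mutate() calls above.
oscar_demo <- ifelse(months_demo %in% oscar_month, "yes", "no")
summer_demo <- ifelse(months_demo %in% summer_months, "yes", "no")

season_check <- data.frame(month = months_demo,
                           oscar_season = oscar_demo,
                           summer_season = summer_demo)
print(season_check)

# Count the months flagged "yes" for both seasons; the two sets are disjoint.
overlap <- sum(oscar_demo == "yes" & summer_demo == "yes")
```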
Conduct exploratory data analysis of the relationship between audience_score and the new variables constructed in the previous part
#Audience score vs feature film and title type
ggplot(data = movies, aes(x = title_type, y = audience_score, fill = feature_film)) + geom_boxplot()
Audience scores for feature films are typically lower than those for documentaries and TV movies. Even when TV movies and documentaries are combined into a single group, feature films still have lower audience scores.
#Audience score vs dramas and genres
ggplot(data = movies, aes(x = genre, y = audience_score, fill = drama)) + geom_boxplot()
Compared with the other genres, dramas score in the middle of the range: three genres have a higher median, and the rest generally score below dramas. When the genres are collapsed using the drama variable, dramas have a slightly higher median audience score than all the other genres combined.
#Audience score vs MPAA rating and specifically R-rated movies
ggplot(data = movies, aes(x = mpaa_rating, y = audience_score, fill = mpaa_rating_R)) + geom_boxplot()
When separated by rating, the movies generally fall into a similar range, except for Unrated movies, which score somewhat higher but also have some low-scoring outliers. R-rated movies have a distribution of audience scores very similar to that of PG-rated movies. When all non-R movies are combined using the mpaa_rating_R variable, the two distributions are very similar.
#Audience scores vs Oscar seasons
ggplot(data = movies, aes(x = factor(thtr_rel_month), y = audience_score, fill = oscar_season)) + geom_boxplot() + xlab("Theater Release Month")
Movies released during the Oscar season (months 10, 11, and 12) have median audience scores that are only slightly higher than the rest.
#Audience scores vs summer seasons
ggplot(data = movies, aes(x = factor(thtr_rel_month), y = audience_score, fill = summer_season)) + geom_boxplot() + xlab("Theater Release Month")
Movies released in the summer season (months 5, 6, 7, and 8) have about the same median audience scores as other movies.
* * *
Develop a Bayesian regression model to predict audience_score using the following explanatory variables: feature_film, drama, runtime, mpaa_rating_R, thtr_rel_year, oscar_season, summer_season, imdb_rating, imdb_num_votes, critics_score, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, and top200_box.
For the regression model, we start by exploring audience_score, since it is the response variable in the model.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   11.00   46.00   65.00   62.36   80.00   97.00
The median of the distribution is 65 and the third quartile is 80, so 25% of these randomly sampled movies scored at least 80 points. The distribution is left-skewed: the mean (62.36) is below the median, which means that in this data set more movies have audience scores above the mean than below it.
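The link between left skew and the share of observations above the mean can be illustrated on a small vector. This is a sketch only; `scores_demo` is a made-up set of values, not audience scores from the data.

```r
# Hypothetical left-skewed values: a long tail toward low scores.
scores_demo <- c(20, 45, 60, 65, 70, 75, 80, 85, 90)

mean_demo <- mean(scores_demo)
median_demo <- median(scores_demo)

# In a left-skewed sample the low tail pulls the mean below the median.
mean_below_median <- mean_demo < median_demo

# Share of observations above the mean; more than half when the mean sits below the median.
share_above_mean <- mean(scores_demo > mean_demo)
print(c(mean = mean_demo, median = median_demo, share_above_mean = share_above_mean))
```

The same two checks, `mean(x) < median(x)` and `mean(x > mean(x))`, could be run on `movies$audience_score` to back up the reading of the histogram.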
m_audscore_full <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + thtr_rel_year + oscar_season + summer_season + imdb_rating + imdb_num_votes + critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box, data = movies)
tidy(m_audscore_full)
## # A tibble: 17 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 124. 77.5 1.61 1.09e- 1
## 2 feature_filmyes -2.25 1.69 -1.33 1.83e- 1
## 3 dramayes 1.29 0.877 1.47 1.41e- 1
## 4 runtime -0.0561 0.0242 -2.32 2.04e- 2
## 5 mpaa_rating_Ryes -1.44 0.813 -1.78 7.60e- 2
## 6 thtr_rel_year -0.0766 0.0383 -2.00 4.63e- 2
## 7 oscar_seasonyes -0.533 0.997 -0.535 5.93e- 1
## 8 summer_seasonyes 0.911 0.949 0.959 3.38e- 1
## 9 imdb_rating 14.7 0.607 24.3 2.03e-92
## 10 imdb_num_votes 0.00000723 0.00000452 1.60 1.10e- 1
## 11 critics_score 0.0575 0.0222 2.59 9.73e- 3
## 12 best_pic_nomyes 5.32 2.63 2.02 4.33e- 2
## 13 best_pic_winyes -3.21 4.61 -0.697 4.86e- 1
## 14 best_actor_winyes -1.54 1.18 -1.31 1.91e- 1
## 15 best_actress_winyes -2.20 1.30 -1.69 9.23e- 2
## 16 best_dir_winyes -1.23 1.73 -0.713 4.76e- 1
## 17 top200_boxyes 0.848 2.78 0.305 7.61e- 1
As a quick summary of the full linear model shows, many of the coefficients are not statistically significant. We will use the Bayesian Information Criterion (BIC) as our criterion for model selection. BIC rewards model fit while penalizing the number of parameters, with a penalty per parameter that grows with the logarithm of the sample size. The BIC of the full model is:
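To make the penalty concrete: for a model with k estimated parameters, n observations, and maximized log-likelihood logLik, BIC = -2 * logLik + k * log(n), and lower values are preferred. The sketch below checks that formula against R's `BIC()`. It uses the built-in `mtcars` data purely as a stand-in, so the numbers it produces say nothing about the movies model.

```r
# Stand-in model on a built-in data set; not the movies model.
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)

ll <- logLik(fit_demo)
k <- attr(ll, "df")   # estimated parameters: 3 coefficients + 1 residual variance
n <- nobs(fit_demo)   # observations used in the fit

# BIC computed directly from its definition.
bic_by_hand <- -2 * as.numeric(ll) + k * log(n)

# Should agree with the built-in function up to floating-point error.
print(all.equal(bic_by_hand, BIC(fit_demo)))
```

Because the penalty is k * log(n), each additional parameter must improve -2 * logLik by more than log(n) to lower the BIC.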
## [1] 4934.145
Now we will remove variables one at a time and see which removals cause the BIC to decrease.
#Step 1 of model selection
m_audscore_wo_v1 <- lm(audience_score ~ drama + runtime + mpaa_rating_R + thtr_rel_year +
oscar_season + summer_season + imdb_rating + imdb_num_votes +
critics_score + best_pic_nom + best_pic_win + best_actor_win +
best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v2 <- lm(audience_score ~ feature_film + runtime + mpaa_rating_R + thtr_rel_year +
oscar_season + summer_season + imdb_rating + imdb_num_votes +
critics_score + best_pic_nom + best_pic_win + best_actor_win +
best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v3 <- lm(audience_score ~ feature_film + drama + mpaa_rating_R + thtr_rel_year +
oscar_season + summer_season + imdb_rating + imdb_num_votes +
critics_score + best_pic_nom + best_pic_win + best_actor_win +
best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v4 <- lm(audience_score ~ feature_film + drama + runtime + thtr_rel_year +
oscar_season + summer_season + imdb_rating + imdb_num_votes +
critics_score + best_pic_nom + best_pic_win + best_actor_win +
best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v5 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
oscar_season + summer_season + imdb_rating + imdb_num_votes +
critics_score + best_pic_nom + best_pic_win + best_actor_win +
best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v6 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating + imdb_num_votes +
critics_score + best_pic_nom + best_pic_win + best_actor_win +
best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v7 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + oscar_season + imdb_rating + imdb_num_votes +
critics_score + best_pic_nom + best_pic_win + best_actor_win +
best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v8 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + oscar_season + summer_season + imdb_num_votes +
critics_score + best_pic_nom + best_pic_win + best_actor_win +
best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v9 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + oscar_season + summer_season + imdb_rating +
critics_score + best_pic_nom + best_pic_win + best_actor_win +
best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v10 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + oscar_season + summer_season + imdb_rating +
imdb_num_votes + best_pic_nom + best_pic_win + best_actor_win +
best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v11 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + oscar_season + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_win + best_actor_win +
best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v12 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + oscar_season + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom + best_actor_win +
best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v13 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + oscar_season + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v14 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + oscar_season + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
best_actor_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v15 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + oscar_season + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
best_actor_win + best_actress_win + top200_box, data = movies)
m_audscore_wo_v16 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + oscar_season + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
best_actor_win + best_actress_win + best_dir_win, data = movies)
BIC(m_audscore_wo_v1) # likewise for m_audscore_wo_v2 through m_audscore_wo_v16; values follow in order
## [1] 4929.489
## [1] 4929.897
## [1] 4940.193
## [1] 4930.904
## [1] 4931.75
## [1] 4927.962
## [1] 4928.613
## [1] 5354.924
## [1] 4930.291
## [1] 4934.538
## [1] 4931.865
## [1] 4928.167
## [1] 4929.428
## [1] 4930.581
## [1] 4928.19
## [1] 4927.764
The BIC of the model without the 16th variable, top200_box, is the lowest (4927.764), so we continue stepping from that model.
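Each elimination step fits one model per remaining predictor and compares the resulting BICs. As an alternative to writing every formula by hand, a single step could be automated roughly as follows. This is a sketch of the same idea, not the code used in this analysis; it runs on the built-in `mtcars` data with an arbitrary set of predictors as a stand-in.

```r
# Stand-in "full" model; the predictors are arbitrary and only for illustration.
full_demo <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
terms_in_model <- attr(terms(full_demo), "term.labels")

# Refit the model once per predictor, each time dropping that one predictor,
# and record the BIC of the reduced model.
bic_drop <- sapply(terms_in_model, function(v) {
  BIC(update(full_demo, as.formula(paste(". ~ . -", v))))
})

# Lowest BIC first: the top entry is the candidate for removal.
print(sort(bic_drop))

best_drop <- names(which.min(bic_drop))
# Remove the variable only if doing so actually lowers the BIC.
improves <- min(bic_drop) < BIC(full_demo)
```

Repeating this with the reduced model as the new starting point, until `improves` is FALSE, reproduces the stepwise procedure. One caveat: if a predictor has missing values, models that drop it may be fit on a different number of rows, and their BICs are then not strictly comparable.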
#Eliminated `top200_box` in step 1. Step 2 of selection:
m_audscore1_wo_v1 <- lm(audience_score ~ drama + runtime + mpaa_rating_R +
thtr_rel_year + oscar_season + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v2 <- lm(audience_score ~ feature_film + runtime + mpaa_rating_R +
thtr_rel_year + oscar_season + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v3 <- lm(audience_score ~ feature_film + drama + mpaa_rating_R +
thtr_rel_year + oscar_season + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v4 <- lm(audience_score ~ feature_film + drama + runtime +
thtr_rel_year + oscar_season + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v5 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
oscar_season + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v6 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v7 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + oscar_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v8 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + oscar_season + summer_season +
imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v9 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + oscar_season + summer_season + imdb_rating +
critics_score + best_pic_nom + best_pic_win +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v10 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + oscar_season + summer_season + imdb_rating +
imdb_num_votes + best_pic_nom + best_pic_win +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v11 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + oscar_season + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_win +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v12 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + oscar_season + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v13 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + oscar_season + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v14 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + oscar_season + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
best_actor_win + best_dir_win, data = movies)
m_audscore1_wo_v15 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + oscar_season + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
best_actor_win + best_actress_win, data = movies)
BIC(m_audscore1_wo_v1) # likewise for m_audscore1_wo_v2 through m_audscore1_wo_v15; values follow in order
## [1] 4923.097
## [1] 4923.488
## [1] 4933.787
## [1] 4924.684
## [1] 4925.556
## [1] 4921.56
## [1] 4922.261
## [1] 5348.763
## [1] 4924.399
## [1] 4928.271
## [1] 4925.442
## [1] 4921.787
## [1] 4923.032
## [1] 4924.164
## [1] 4921.824
From a BIC of 4927.764 we can bring the BIC down to 4921.56 by removing oscar_season.
#Removed `oscar_season` in step 2. Step 3 of selection:
m_audscore2_wo_v1 <- lm(audience_score ~ drama + runtime + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v2 <- lm(audience_score ~ feature_film + runtime + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v3 <- lm(audience_score ~ feature_film + drama + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v4 <- lm(audience_score ~ feature_film + drama + runtime +
thtr_rel_year + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v5 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v6 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v7 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + summer_season +
imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v8 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating +
critics_score + best_pic_nom + best_pic_win +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v9 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating +
imdb_num_votes + best_pic_nom + best_pic_win +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v10 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_win +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v11 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v12 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v13 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
best_actor_win + best_dir_win, data = movies)
m_audscore2_wo_v14 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
best_actor_win + best_actress_win, data = movies)
BIC(m_audscore2_wo_v1) # likewise for m_audscore2_wo_v2 through m_audscore2_wo_v14; values follow in order
## [1] 4916.952
## [1] 4917.37
## [1] 4928.204
## [1] 4918.466
## [1] 4919.302
## [1] 4916.902
## [1] 5342.412
## [1] 4918.194
## [1] 4922.047
## [1] 4919.06
## [1] 4915.554
## [1] 4916.879
## [1] 4917.978
## [1] 4915.657
From a BIC of 4921.56, the BIC was brought down to 4915.554 by removing best_pic_win.
#Removed `best_pic_win` in step 3. Step 4 of selection:
m_audscore3_wo_v1 <- lm(audience_score ~ drama + runtime + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v2 <- lm(audience_score ~ feature_film + runtime + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v3 <- lm(audience_score ~ feature_film + drama + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v4 <- lm(audience_score ~ feature_film + drama + runtime +
thtr_rel_year + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v5 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v6 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v7 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + summer_season +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v8 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating +
critics_score + best_pic_nom +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v9 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating +
imdb_num_votes + best_pic_nom +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v10 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating +
imdb_num_votes + critics_score +
best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v11 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v12 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_dir_win, data = movies)
m_audscore3_wo_v13 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
BIC(m_audscore3_wo_v1) # likewise for m_audscore3_wo_v2 through m_audscore3_wo_v13; values follow in order
## [1] 4910.835
## [1] 4911.341
## [1] 4922.185
## [1] 4912.47
## [1] 4913.16
## [1] 4910.867
## [1] 5337.54
## [1] 4911.865
## [1] 4916.05
## [1] 4912.597
## [1] 4910.753
## [1] 4912.104
## [1] 4910.045
The biggest decrease in BIC came from removing best_dir_win, which brought the BIC to 4910.045.
#Removed `best_dir_win` in step 4. Step 5 of selection:
m_audscore4_wo_v1 <- lm(audience_score ~ drama + runtime + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v2 <- lm(audience_score ~ feature_film + runtime + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v3 <- lm(audience_score ~ feature_film + drama + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v4 <- lm(audience_score ~ feature_film + drama + runtime +
thtr_rel_year + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v5 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v6 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year+ imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v7 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + summer_season +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v8 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating +
critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v9 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating +
imdb_num_votes + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v10 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating +
imdb_num_votes + critics_score +
best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v11 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actress_win, data = movies)
m_audscore4_wo_v12 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year + summer_season + imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win, data = movies)
BIC(m_audscore4_wo_v1) # likewise for m_audscore4_wo_v2 through m_audscore4_wo_v12; values follow in order
## [1] 4905.453
## [1] 4905.915
## [1] 4917.564
## [1] 4907.106
## [1] 4907.34
## [1] 4905.283
## [1] 5331.766
## [1] 4906.101
## [1] 4910.234
## [1] 4906.911
## [1] 4905.325
## [1] 4906.64
The BIC was made lowest (4905.283) when summer_season was removed from the model.
#Removed `summer_season` in step 5. Step 6 of selection:
m_audscore5_wo_v1 <- lm(audience_score ~ drama + runtime + mpaa_rating_R +
thtr_rel_year+ imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore5_wo_v2 <- lm(audience_score ~ feature_film + runtime + mpaa_rating_R +
thtr_rel_year+ imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore5_wo_v3 <- lm(audience_score ~ feature_film + drama + mpaa_rating_R +
thtr_rel_year+ imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore5_wo_v4 <- lm(audience_score ~ feature_film + drama + runtime +
thtr_rel_year+ imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore5_wo_v5 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore5_wo_v6 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year+
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore5_wo_v7 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year+ imdb_rating +
critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore5_wo_v8 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year+ imdb_rating +
imdb_num_votes + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore5_wo_v9 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year+ imdb_rating +
imdb_num_votes + critics_score +
best_actor_win + best_actress_win, data = movies)
m_audscore5_wo_v10 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year+ imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actress_win, data = movies)
m_audscore5_wo_v11 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
thtr_rel_year+ imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win, data = movies)
BIC(m_audscore5_wo_v1) # likewise for m_audscore5_wo_v2 through m_audscore5_wo_v11; values follow in order
## [1] 4900.403
## [1] 4900.901
## [1] 4912.993
## [1] 4902.447
## [1] 4902.497
## [1] 5325.472
## [1] 4901.47
## [1] 4906.272
## [1] 4901.849
## [1] 4900.807
## [1] 4901.85
The BIC was made lowest (4900.403) when feature_film was removed from the model.
#Removed `feature_film` in step 6. Step 7 of selection:
m_audscore6_wo_v1 <- lm(audience_score ~ runtime + mpaa_rating_R +
thtr_rel_year+ imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore6_wo_v2 <- lm(audience_score ~ drama + mpaa_rating_R +
thtr_rel_year+ imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore6_wo_v3 <- lm(audience_score ~ drama + runtime +
thtr_rel_year+ imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore6_wo_v4 <- lm(audience_score ~ drama + runtime + mpaa_rating_R +
imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore6_wo_v5 <- lm(audience_score ~ drama + runtime + mpaa_rating_R +
thtr_rel_year+
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore6_wo_v6 <- lm(audience_score ~ drama + runtime + mpaa_rating_R +
thtr_rel_year+ imdb_rating +
critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore6_wo_v7 <- lm(audience_score ~ drama + runtime + mpaa_rating_R +
thtr_rel_year+ imdb_rating +
imdb_num_votes + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore6_wo_v8 <- lm(audience_score ~ drama + runtime + mpaa_rating_R +
thtr_rel_year+ imdb_rating +
imdb_num_votes + critics_score +
best_actor_win + best_actress_win, data = movies)
m_audscore6_wo_v9 <- lm(audience_score ~ drama + runtime + mpaa_rating_R +
thtr_rel_year+ imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actress_win, data = movies)
m_audscore6_wo_v10 <- lm(audience_score ~ drama + runtime + mpaa_rating_R +
thtr_rel_year+ imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win, data = movies)
BIC(m_audscore6_wo_v1) # likewise for m_audscore6_wo_v2 through m_audscore6_wo_v10; values follow in order
## [1] 4895.167
## [1] 4908.164
## [1] 4898.551
## [1] 4896.755
## [1] 5342.94
## [1] 4895.693
## [1] 4902.823
## [1] 4896.895
## [1] 4896.159
## [1] 4897.052
The BIC is made lowest (4895.167) when drama is removed from the model.
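Each elimination step here fits one model per candidate removal and compares the BIC values by hand. As a minimal sketch of the same comparison, the hypothetical helper `bic_drop_one()` below (an illustration, not part of the original analysis) refits a model once per dropped term and returns the BIC values sorted from lowest to highest. It is shown on R's built-in `mtcars` data as a stand-in, since the `movies` data set does not ship with base R.

```r
# Hypothetical helper (an illustration, not the author's code):
# BIC of every model obtained by dropping exactly one term.
bic_drop_one <- function(model) {
  term_labels <- attr(terms(model), "term.labels")
  bics <- sapply(term_labels, function(term) {
    # Refit the model without `term`, keeping everything else unchanged
    reduced <- update(model, as.formula(paste(". ~ . -", term)))
    BIC(reduced)
  })
  sort(bics)  # lowest BIC first; names are the dropped terms
}

# Stand-in example on built-in data (assumption: no missing values)
full_model <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
drop_table <- bic_drop_one(full_model)
print(drop_table)
print(BIC(full_model))
```

Applied to `m_audscore6_wo_v1`, a helper like this would be expected to reproduce the next step's comparison in a single call. One caveat: BIC values are only comparable when every candidate model is fit on the same rows, and the final model summary in this report notes one observation dropped for missingness, so candidates that exclude the affected variable may be fit on one more row than the others.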
#removed `drama` in step 7. Step 8 of selection:
m_audscore7_wo_v1 <- lm(audience_score ~ mpaa_rating_R +
thtr_rel_year+ imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore7_wo_v2 <- lm(audience_score ~ runtime +
thtr_rel_year+ imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore7_wo_v3 <- lm(audience_score ~ runtime + mpaa_rating_R +
imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore7_wo_v4 <- lm(audience_score ~ runtime + mpaa_rating_R +
thtr_rel_year+
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore7_wo_v5 <- lm(audience_score ~ runtime + mpaa_rating_R +
thtr_rel_year+ imdb_rating +
critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore7_wo_v6 <- lm(audience_score ~ runtime + mpaa_rating_R +
thtr_rel_year+ imdb_rating +
imdb_num_votes + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore7_wo_v7 <- lm(audience_score ~ runtime + mpaa_rating_R +
thtr_rel_year+ imdb_rating +
imdb_num_votes + critics_score +
best_actor_win + best_actress_win, data = movies)
m_audscore7_wo_v8 <- lm(audience_score ~ runtime + mpaa_rating_R +
thtr_rel_year+ imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actress_win, data = movies)
m_audscore7_wo_v9 <- lm(audience_score ~ runtime + mpaa_rating_R +
thtr_rel_year+ imdb_rating +
imdb_num_votes + critics_score + best_pic_nom +
best_actor_win, data = movies)
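BIC(m_audscore7_wo_v1)
## [1] 4902.088
BIC(m_audscore7_wo_v2)
## [1] 4892.677
BIC(m_audscore7_wo_v3)
## [1] 4891.464
BIC(m_audscore7_wo_v4)
## [1] 5338.34
BIC(m_audscore7_wo_v5)
## [1] 4890.199
BIC(m_audscore7_wo_v6)
## [1] 4897.984
BIC(m_audscore7_wo_v7)
## [1] 4891.817
BIC(m_audscore7_wo_v8)
## [1] 4890.824
BIC(m_audscore7_wo_v9)
## [1] 4891.487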
The BIC is made lowest (4890.199) when imdb_num_votes is removed from the model.
#removed `imdb_num_votes` in step 8. Step 9:
m_audscore8_wo_v1 <- lm(audience_score ~ mpaa_rating_R +
thtr_rel_year+ imdb_rating +
critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore8_wo_v2 <- lm(audience_score ~ runtime +
thtr_rel_year+ imdb_rating +
critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore8_wo_v3 <- lm(audience_score ~ runtime + mpaa_rating_R +
imdb_rating +
critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore8_wo_v4 <- lm(audience_score ~ runtime + mpaa_rating_R +
thtr_rel_year+
critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore8_wo_v5 <- lm(audience_score ~ runtime + mpaa_rating_R +
thtr_rel_year+ imdb_rating +
best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore8_wo_v6 <- lm(audience_score ~ runtime + mpaa_rating_R +
thtr_rel_year+ imdb_rating +
critics_score +
best_actor_win + best_actress_win, data = movies)
m_audscore8_wo_v7 <- lm(audience_score ~ runtime + mpaa_rating_R +
thtr_rel_year+ imdb_rating +
critics_score + best_pic_nom +
best_actress_win, data = movies)
m_audscore8_wo_v8 <- lm(audience_score ~ runtime + mpaa_rating_R +
thtr_rel_year+ imdb_rating +
critics_score + best_pic_nom +
best_actor_win, data = movies)
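BIC(m_audscore8_wo_v1)
## [1] 4896.011
BIC(m_audscore8_wo_v2)
## [1] 4887.453
BIC(m_audscore8_wo_v3)
## [1] 4885.766
BIC(m_audscore8_wo_v4)
## [1] 5352.361
BIC(m_audscore8_wo_v5)
## [1] 4892.618
BIC(m_audscore8_wo_v6)
## [1] 4888.214
BIC(m_audscore8_wo_v7)
## [1] 4885.954
BIC(m_audscore8_wo_v8)
## [1] 4886.425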
The BIC is made lowest (4885.766) when thtr_rel_year is removed from the model.
#removed `thtr_rel_year` in step 9. Step 10:
m_audscore9_wo_v1 <- lm(audience_score ~ mpaa_rating_R +
imdb_rating +
critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore9_wo_v2 <- lm(audience_score ~ runtime +
imdb_rating +
critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore9_wo_v3 <- lm(audience_score ~ runtime + mpaa_rating_R +
critics_score + best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore9_wo_v4 <- lm(audience_score ~ runtime + mpaa_rating_R +
imdb_rating +
best_pic_nom +
best_actor_win + best_actress_win, data = movies)
m_audscore9_wo_v5 <- lm(audience_score ~ runtime + mpaa_rating_R +
imdb_rating +
critics_score +
best_actor_win + best_actress_win, data = movies)
m_audscore9_wo_v6 <- lm(audience_score ~ runtime + mpaa_rating_R +
imdb_rating +
critics_score + best_pic_nom +
best_actress_win, data = movies)
m_audscore9_wo_v7 <- lm(audience_score ~ runtime + mpaa_rating_R +
imdb_rating +
critics_score + best_pic_nom +
best_actor_win, data = movies)
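BIC(m_audscore9_wo_v1)
## [1] 4891.111
BIC(m_audscore9_wo_v2)
## [1] 4883.072
BIC(m_audscore9_wo_v3)
## [1] 5345.964
BIC(m_audscore9_wo_v4)
## [1] 4889.055
BIC(m_audscore9_wo_v5)
## [1] 4883.82
BIC(m_audscore9_wo_v6)
## [1] 4881.39
BIC(m_audscore9_wo_v7)
## [1] 4881.941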
The BIC is made lowest (4881.39) when best_actor_win is removed from the model.
#removed `best_actor_win` in step 10. Step 11 of selection:
m_audscore10_wo_v1 <- lm(audience_score ~ mpaa_rating_R +
imdb_rating +
critics_score + best_pic_nom +
best_actress_win, data = movies)
m_audscore10_wo_v2 <- lm(audience_score ~ runtime +
imdb_rating +
critics_score + best_pic_nom +
best_actress_win, data = movies)
m_audscore10_wo_v3 <- lm(audience_score ~ runtime + mpaa_rating_R +
critics_score + best_pic_nom +
best_actress_win, data = movies)
m_audscore10_wo_v4 <- lm(audience_score ~ runtime + mpaa_rating_R +
imdb_rating +
best_pic_nom +
best_actress_win, data = movies)
m_audscore10_wo_v5 <- lm(audience_score ~ runtime + mpaa_rating_R +
imdb_rating +
critics_score +
best_actress_win, data = movies)
m_audscore10_wo_v6 <- lm(audience_score ~ runtime + mpaa_rating_R +
imdb_rating +
critics_score + best_pic_nom, data = movies)
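BIC(m_audscore10_wo_v1)
## [1] 4888.433
BIC(m_audscore10_wo_v2)
## [1] 4878.608
BIC(m_audscore10_wo_v3)
## [1] 5341.127
BIC(m_audscore10_wo_v4)
## [1] 4884.644
BIC(m_audscore10_wo_v5)
## [1] 4878.911
BIC(m_audscore10_wo_v6)
## [1] 4877.909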
The BIC is made lowest (4877.909) when best_actress_win is removed from the model.
#removed `best_actress_win` in step 11. Step 12 of selection:
m_audscore11_wo_v1 <- lm(audience_score ~ mpaa_rating_R +
imdb_rating +
critics_score + best_pic_nom, data = movies)
m_audscore11_wo_v2 <- lm(audience_score ~ runtime +
imdb_rating +
critics_score + best_pic_nom, data = movies)
m_audscore11_wo_v3 <- lm(audience_score ~ runtime + mpaa_rating_R +
critics_score + best_pic_nom, data = movies)
m_audscore11_wo_v4 <- lm(audience_score ~ runtime + mpaa_rating_R +
imdb_rating +
best_pic_nom, data = movies)
m_audscore11_wo_v5 <- lm(audience_score ~ runtime + mpaa_rating_R +
imdb_rating +
critics_score, data = movies)
BIC(m_audscore11_wo_v1)
## [1] 4886.586
BIC(m_audscore11_wo_v2)
## [1] 4874.994
BIC(m_audscore11_wo_v3)
## [1] 5336.746
BIC(m_audscore11_wo_v4)
## [1] 4881.009
BIC(m_audscore11_wo_v5)
## [1] 4874.484
The BIC is made lowest (4874.484) when best_pic_nom is removed from the model.
#removed `best_pic_nom` in step 12. Step 13 of selection:
m_audscore12_wo_v1 <- lm(audience_score ~ mpaa_rating_R +
imdb_rating +
critics_score, data = movies)
m_audscore12_wo_v2 <- lm(audience_score ~ runtime +
imdb_rating +
critics_score, data = movies)
m_audscore12_wo_v3 <- lm(audience_score ~ runtime + mpaa_rating_R +
critics_score, data = movies)
m_audscore12_wo_v4 <- lm(audience_score ~ runtime + mpaa_rating_R +
imdb_rating, data = movies)
BIC(m_audscore12_wo_v1)
## [1] 4881.401
BIC(m_audscore12_wo_v2)
## [1] 4871.623
BIC(m_audscore12_wo_v3)
## [1] 5335.434
BIC(m_audscore12_wo_v4)
## [1] 4878.238
The BIC is made lowest (4871.623) when mpaa_rating_R is removed from the model.
#removed `mpaa_rating_R` in step 13. Step 14:
m_audscore13_wo_v1 <- lm(audience_score ~ imdb_rating +
critics_score, data = movies)
m_audscore13_wo_v2 <- lm(audience_score ~ runtime +
critics_score, data = movies)
m_audscore13_wo_v3 <- lm(audience_score ~ runtime +
imdb_rating, data = movies)
BIC(m_audscore13_wo_v1)
## [1] 4878.542
BIC(m_audscore13_wo_v2)
## [1] 5329.265
BIC(m_audscore13_wo_v3)
## [1] 4875.773
None of these removals lowers the BIC below the current value of 4871.623, so elimination stops here and the final model includes runtime, imdb_rating, and critics_score.
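The manual procedure above is backward elimination with BIC as the criterion. Base R's `step()` can run a comparable search automatically when its penalty `k` is set to `log(n)`. The sketch below uses the built-in `mtcars` data as a stand-in and is not a rerun of the analysis above. `step()` ranks models with `extractAIC()`, which for `lm` fits on the same rows differs from `BIC()` only by a constant, so the ordering of candidates is expected to match.

```r
# Stand-in data: drop incomplete rows first so every candidate model
# is fit on the same observations (step() stops if the row count changes).
vars <- c("mpg", "wt", "hp", "qsec", "drat", "disp")
dat <- na.omit(mtcars[, vars])

full_model <- lm(mpg ~ wt + hp + qsec + drat + disp, data = dat)

# k = log(n) turns the AIC-style penalty into the BIC penalty
n <- nobs(full_model)
selected <- step(full_model, direction = "backward", k = log(n), trace = 0)

print(formula(selected))
print(c(full = BIC(full_model), selected = BIC(selected)))
```

Setting `trace = 1` instead would print each elimination step, which is close to the per-step BIC listings produced by hand in this section.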
#final model
m_audscore_final <- lm(audience_score ~ runtime + imdb_rating + critics_score, data = movies)
BIC(m_audscore_final)
## [1] 4871.623
summary(m_audscore_final)
##
## Call:
## lm(formula = audience_score ~ runtime + imdb_rating + critics_score,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.998 -6.565 0.557 5.475 52.448
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -33.28321 3.21939 -10.338 < 2e-16 ***
## runtime -0.05362 0.02107 -2.545 0.01117 *
## imdb_rating 14.98076 0.57735 25.947 < 2e-16 ***
## critics_score 0.07036 0.02156 3.263 0.00116 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.04 on 646 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.7549, Adjusted R-squared: 0.7538
## F-statistic: 663.3 on 3 and 646 DF, p-value: < 2.2e-16
The coefficients of these predictor variables indicate a few things, each holding the other predictors constant:
- for every 1-minute increase in runtime, we expect the audience score to decrease by about 0.05 points on average
- for every 1-point increase in imdb_rating, we expect the audience score to increase by about 15 points on average
- for every 1-point increase in critics_score, we expect the audience score to increase by about 0.07 points on average
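To gauge how precise these slope estimates are, approximate 95% confidence intervals can be computed from the estimates and standard errors printed in the summary table. The sketch below types those rounded values in by hand (so the results are approximate) and uses the 646 residual degrees of freedom reported above; the data frame name `coef_table` is just an illustrative choice.

```r
# Estimates and standard errors copied (rounded) from the summary output
coef_table <- data.frame(
  term      = c("runtime", "imdb_rating", "critics_score"),
  estimate  = c(-0.05362, 14.98076, 0.07036),
  std_error = c(0.02107, 0.57735, 0.02156)
)

# Critical t value for a two-sided 95% interval with 646 residual df
t_crit <- qt(0.975, df = 646)

coef_table$lower <- coef_table$estimate - t_crit * coef_table$std_error
coef_table$upper <- coef_table$estimate + t_crit * coef_table$std_error
print(coef_table)
```

With the fitted model object available, `confint(m_audscore_final)` should give the same intervals without retyping any numbers.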
Now we will run some model diagnostics on the final model built from the variables we’ve deemed to be decent predictors.
m_audscore_final_aug <- augment(m_audscore_final)
#Linearity and constant variance
ggplot(data = m_audscore_final_aug, aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.6) +
geom_hline(yintercept = 0, linetype = "dashed") +
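labs(x = "Fitted values", y = "Residuals")
#Normality
ggplot(data = m_audscore_final_aug, aes(x = .resid)) +
geom_histogram(binwidth = 5) +
xlab("Residuals")
The residuals plot shows a fan shape, which points to non-constant variance across the fitted values and suggests the model may not be accounting for all the relationships between the variables. The residuals do seem mostly normally distributed, although the largest positive residual (52.4) is about twice the size of the largest negative one (-27.0), which hints at a longer right tail.
* * *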
#Pick a movie from 2016 (a new movie that is not in the sample) and predict its audience score using the model developed above and the `predict` function in R.
movies %>% filter(title == 'Train to Busan')
## # A tibble: 0 x 37
## # … with 37 variables: title <chr>, title_type <fct>, genre <fct>,
## # runtime <dbl>, mpaa_rating <fct>, studio <fct>, thtr_rel_year <dbl>,
## # thtr_rel_month <dbl>, thtr_rel_day <dbl>, dvd_rel_year <dbl>,
## # dvd_rel_month <dbl>, dvd_rel_day <dbl>, imdb_rating <dbl>,
## # imdb_num_votes <int>, critics_rating <fct>, critics_score <dbl>,
## # audience_rating <fct>, audience_score <dbl>, best_pic_nom <fct>,
## # best_pic_win <fct>, best_actor_win <fct>, best_actress_win <fct>,
## # best_dir_win <fct>, top200_box <fct>, director <chr>, actor1 <chr>,
## # actor2 <chr>, actor3 <chr>, actor4 <chr>, actor5 <chr>, imdb_url <chr>,
## # rt_url <chr>, feature_film <chr>, drama <chr>, mpaa_rating_R <chr>,
## # oscar_season <chr>, summer_season <chr>
#the movie isn't in the data set already, so we can do a prediction
#Data for this movie came from the IMDB and Rotten Tomatoes websites
busan <- data.frame(runtime = 118, imdb_rating = 7.5, critics_score = 94)
predict(m_audscore_final, busan)
## 1
## 79.35946
predict(m_audscore_final, busan, interval = "prediction")
## fit lwr upr
## 1 79.35946 59.60028 99.11864
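The point prediction can be reproduced by hand from the coefficient estimates in the model summary. The sketch below copies the rounded estimates, so it matches the `predict()` output only to about three decimal places; the vector names `coefs` and `busan_values` are illustrative.

```r
# Rounded coefficient estimates copied from the model summary
coefs <- c(
  intercept     = -33.28321,
  runtime       = -0.05362,
  imdb_rating   = 14.98076,
  critics_score = 0.07036
)

# Predictor values used for Train to Busan in this report
busan_values <- c(runtime = 118, imdb_rating = 7.5, critics_score = 94)

# Linear predictor: intercept + sum of (coefficient * value)
predicted <- unname(coefs["intercept"] + sum(coefs[names(busan_values)] * busan_values))
print(round(predicted, 2))
```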
The actual audience score on Rotten Tomatoes is 88.
* * *
Using the variables given to us and the ones generated from the data, we found that the best model for predicting a movie’s audience score on Rotten Tomatoes depends mainly on three variables: the movie’s runtime, the IMDB rating, and the critics score on Rotten Tomatoes. Of the variables explored in the EDA section, the ones that didn’t show much difference visually were all, understandably, eliminated from the model. The model was selected by backward elimination, removing at each step the variable whose removal lowered the BIC the most until no removal lowered it further. In the prediction above, the model isn’t perfect but is in the ballpark: the point prediction is about 79, while the actual score is 88. The 95% prediction interval is quite wide, though, placing the score anywhere between about 59.6 and 99.1. One shortcoming of the predictor variables that we chose from is that many had multiple levels, which could make prediction difficult.
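An interval this wide is what one would expect from a prediction interval for a single new movie, which adds the residual scatter of individual movies to the uncertainty in the fitted mean. A confidence interval for the mean response would be much narrower. The sketch below illustrates the difference on the built-in `mtcars` data as a stand-in; the model and the `new_car` values are hypothetical and unrelated to the movies analysis.

```r
# Stand-in model on built-in data
fit <- lm(mpg ~ wt + hp, data = mtcars)
new_car <- data.frame(wt = 3, hp = 120)  # hypothetical new observation

# Interval for the mean response at these predictor values
conf_int <- predict(fit, new_car, interval = "confidence", level = 0.95)
# Interval for a single new observation at these predictor values
pred_int <- predict(fit, new_car, interval = "prediction", level = 0.95)

widths <- c(
  confidence = conf_int[1, "upr"] - conf_int[1, "lwr"],
  prediction = pred_int[1, "upr"] - pred_int[1, "lwr"]
)
print(rbind(confidence = conf_int[1, ], prediction = pred_int[1, ]))
print(widths)
```

Both intervals share the same point estimate; only the width differs, which is why the interval type matters when comparing a predicted score with one observed movie.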