This is the capstone project by Kristen Phan for Duke University’s Linear Regression and Modeling course (Course URL).
The data set comprises 651 randomly sampled movies produced and released between 1972 and 2014. It records how much audiences and critics like each movie, along with numerous other variables about the movies. The dataset is provided below, and it includes information from Rotten Tomatoes and IMDb for a random sample of movies.
More information on the dataset’s codebook can be found here.
Because the movies were sampled randomly, the findings of this study can be generalized to movies that were produced and released before 2016.
Because this is an observational study, its findings only imply association and not causation.
What attributes are associated with popular movies?
The purpose of this study is to explore attributes associated with a movie’s popularity and build a regression model to predict a movie’s popularity given its attributes. Keep in mind that the attributes we are about to analyze are not the cause of a movie’s popularity.
First, we take a peek at the dataset.
## title title_type genre runtime
## Length:619 Documentary : 42 Drama :298 Min. : 65.0
## Class :character Feature Film:573 Comedy : 86 1st Qu.: 93.0
## Mode :character TV Movie : 4 Action & Adventure: 62 Median :103.0
## Mystery & Suspense: 56 Mean :106.5
## Documentary : 40 3rd Qu.:116.0
## Horror : 22 Max. :267.0
## (Other) : 55
## mpaa_rating studio thtr_rel_year
## G : 16 Paramount Pictures : 37 Min. :1972
## NC-17 : 1 Warner Bros. Pictures : 30 1st Qu.:1991
## PG :111 Sony Pictures Home Entertainment: 27 Median :2000
## PG-13 :131 Universal Pictures : 23 Mean :1998
## R :319 Warner Home Video : 19 3rd Qu.:2007
## Unrated: 41 Miramax Films : 18 Max. :2014
## (Other) :465
## thtr_rel_month thtr_rel_day dvd_rel_year dvd_rel_month
## Min. : 1.000 Min. : 1.00 Min. :1991 Min. : 1.000
## 1st Qu.: 4.000 1st Qu.: 7.00 1st Qu.:2001 1st Qu.: 3.000
## Median : 7.000 Median :15.00 Median :2004 Median : 6.000
## Mean : 6.733 Mean :14.43 Mean :2004 Mean : 6.346
## 3rd Qu.:10.000 3rd Qu.:22.00 3rd Qu.:2008 3rd Qu.: 9.000
## Max. :12.000 Max. :31.00 Max. :2015 Max. :12.000
##
## dvd_rel_day imdb_rating imdb_num_votes critics_rating
## Min. : 1.00 Min. :1.900 Min. : 183 Certified Fresh:131
## 1st Qu.: 7.00 1st Qu.:5.900 1st Qu.: 5026 Fresh :195
## Median :15.00 Median :6.600 Median : 16480 Rotten :293
## Mean :15.08 Mean :6.486 Mean : 60014
## 3rd Qu.:23.00 3rd Qu.:7.300 3rd Qu.: 62507
## Max. :31.00 Max. :9.000 Max. :893008
##
## critics_score audience_rating audience_score best_pic_nom best_pic_win
## Min. : 1.00 Spilled:264 Min. :11.00 no :597 no :612
## 1st Qu.: 33.00 Upright:355 1st Qu.:46.00 yes: 22 yes: 7
## Median : 61.00 Median :65.00
## Mean : 57.43 Mean :62.21
## 3rd Qu.: 82.50 3rd Qu.:80.00
## Max. :100.00 Max. :97.00
##
## best_actor_win best_actress_win best_dir_win top200_box director
## no :528 no :548 no :576 no :604 Length:619
## yes: 91 yes: 71 yes: 43 yes: 15 Class :character
## Mode :character
##
##
##
##
## actor1 actor2 actor3 actor4
## Length:619 Length:619 Length:619 Length:619
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## actor5 imdb_url rt_url
## Length:619 Length:619 Length:619
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
First, it’s worth noting that a movie’s type (documentary, feature film, or TV movie) might affect its popularity depending on audience taste, so we will focus only on feature films in this study (sample size = 573 feature films).
## [1] 573
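The filtering step itself is not shown above. A minimal sketch of how it might look follows, in base R. It assumes incomplete rows are dropped before filtering (an assumption, suggested only by the summary reporting 619 rows rather than 651), and it uses a small made-up data frame, `movies_toy`, so that the snippet runs on its own; with the real data, `movies` would take its place.

```r
# Toy stand-in for the real `movies` data (hypothetical values).
movies_toy <- data.frame(
  title_type = factor(c("Feature Film", "Documentary", "Feature Film", "TV Movie")),
  audience_score = c(73, 81, NA, 55)
)

# Assumption: rows with missing values are removed first.
complete_rows <- na.omit(movies_toy)

# Keep feature films only.
feature_film_toy <- subset(complete_rows, title_type == "Feature Film")

# Number of feature films left for the analysis.
nrow(feature_film_toy)
```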
Second, in this analysis, we will use Rotten Tomatoes’ audience score to measure a movie’s popularity.
Next we will kick things off by analyzing a few attributes which are likely to influence a movie’s popularity:
* genre: Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other)
* thtr_rel_month: Month the movie is released in theaters
* director: Director of the movie
* studio: Studio that produced the movie
* mpaa_rating: MPAA rating of the movie (G, PG, PG-13, R, Unrated)
Let’s examine movie popularity across genres.
ggplot(data = feature_film, aes(genre, audience_score)) +
geom_col(fill='#E69F00') +
labs(title="Movie Popularity in Relation to Genre") +
coord_flip()
feature_film %>% group_by(genre) %>% summarise(count = sum(imdb_num_votes)) %>% arrange(desc(count))
## # A tibble: 11 x 2
## genre count
## <fct> <int>
## 1 Drama 18971579
## 2 Action & Adventure 5164252
## 3 Mystery & Suspense 4625314
## 4 Comedy 3858154
## 5 Other 1949012
## 6 Science Fiction & Fantasy 763731
## 7 Horror 618755
## 8 Animation 480972
## 9 Musical & Performing Arts 287272
## 10 Art House & International 114019
## 11 Documentary 38015
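Drama movies seem to attract the most audience engagement as measured by total IMDb votes, although Drama is also the largest genre in the sample, so part of the gap reflects the number of films rather than votes per film. Next, we take a look at popularity among movies in relation to the release month and the movie genre.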
ggplot(data = feature_film, aes(x = thtr_rel_month, y = audience_score)) +
geom_bar(stat="identity", fill='#E69F00') +
theme(axis.text.x = element_text(size=5)) +
labs(title="Movie Popularity in Relation to Theater Release Month and Genre", x = "Release Month", y="Audience Score") +
facet_wrap(~genre)
Judging from the above plot, release month and audience score seem to be associated for Drama movies. Audience score is higher for drama movies released during the holiday season (December and January) and in the summer than for drama movies released during the rest of the year. Because the bars stack the scores of all movies released in a month, taller bars also reflect a larger number of releases.
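Since `geom_bar(stat = "identity")` stacks one segment per movie, bar height mixes how many movies were released with how well they scored. One way to separate the two is to tabulate the mean audience score and the number of movies per genre and release month. The sketch below does this in base R on a small made-up data frame (`toy` and its values are hypothetical); with the real data, `feature_film` would take its place.

```r
# Toy stand-in for `feature_film` (hypothetical values).
toy <- data.frame(
  genre = c("Drama", "Drama", "Drama", "Comedy", "Comedy"),
  thtr_rel_month = c(12, 12, 6, 6, 6),
  audience_score = c(80, 90, 60, 50, 70)
)

# Mean audience score per genre and release month.
means <- aggregate(audience_score ~ genre + thtr_rel_month, data = toy, FUN = mean)
names(means)[3] <- "mean_score"

# Number of movies per genre and release month.
counts <- aggregate(audience_score ~ genre + thtr_rel_month, data = toy, FUN = length)
names(counts)[3] <- "n_movies"

# One row per genre-month combination with both summaries side by side.
monthly <- merge(means, counts, by = c("genre", "thtr_rel_month"))
monthly
```

Reading `mean_score` next to `n_movies` makes it easier to tell whether a tall bar comes from well-liked movies or simply from many releases.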
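Before we fit a multiple linear regression (MLR) model, we need to check whether the data meet all four conditions for MLR: linear relationships between the numerical explanatory variables and the response, nearly normal residuals, constant variability of the residuals, and independent residuals.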
In our model, we will exclude the following variables:
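- “actor1” through “actor5”: names of the main cast members. Whether the cast includes a best actor or best actress Oscar winner is already captured by best_actor_win and best_actress_win, so these columns add no further value to the prediction of a movie’s popularity.
- “imdb_url” and “rt_url”: links to the movie pages, which carry no information about popularity.
- “title”, “director”, “studio”, “title_type”: title, director, and studio take many distinct values with only a few movies each, and title_type is constant once we keep only feature films, so these variables are excluded.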
Now we are going to build a model with all attributes except those mentioned above and check whether the explanatory variables meet the conditions.
Numerical explanatory variables include: runtime, imdb_rating, imdb_num_votes, and critics_score.
# condition 1: linear relationship between numerical, explanatory variables and response variable (audience_score)
m_full <- lm(audience_score ~ genre + mpaa_rating +
thtr_rel_year + thtr_rel_month + thtr_rel_day +
dvd_rel_year + dvd_rel_month + dvd_rel_day +
best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box +
critics_rating + audience_rating +
runtime + imdb_rating + imdb_num_votes + critics_score,
data = feature_film)
plot(m_full$residuals ~ feature_film$runtime)
# condition 3: constant variability of residuals --> randomly scattered in a band with constant width around 0, no fan shape
plot(m_full$residuals ~ m_full$fitted.values)
From the above scatter plots, critics_score is the most linearly related to the response variable, while runtime, imdb_rating, and imdb_num_votes are not. For that reason, we will exclude runtime, imdb_rating, and imdb_num_votes from our model and refit it.
# condition 1: linear relationship between numerical, explanatory variables and response variable (audience_score)
m_full <- lm(audience_score ~ genre + mpaa_rating +
thtr_rel_year + thtr_rel_month + thtr_rel_day +
dvd_rel_year + dvd_rel_month + dvd_rel_day +
best_pic_nom + best_pic_win + best_actor_win +
best_actress_win + best_dir_win + top200_box +
critics_rating + audience_rating +
critics_score,
data = feature_film)
plot(m_full$residuals ~ feature_film$critics_score)
# condition 3: constant variability of residuals --> randomly scattered in a band with constant width around 0, no fan shape
plot(m_full$residuals ~ m_full$fitted.values)
Based on the above visuals, our model seems to meet all conditions except for condition #3, constant variability of the residuals. However, because we have a large sample, this might not be an important violation.
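The plots above cover linearity and constant variability. A hedged sketch of how the remaining conditions (nearly normal residuals and independent residuals) could be inspected is given below. It uses simulated data and a model called `fit`, both of which are stand-ins introduced here for illustration; with the real analysis, `m_full` would take the place of `fit`.

```r
set.seed(1)

# Simulated stand-in data (hypothetical); not the movies data.
n <- 200
x <- runif(n, 0, 100)
y <- 30 + 0.5 * x + rnorm(n, sd = 9)
fit <- lm(y ~ x)

res <- resid(fit)

# Condition 2: nearly normal residuals centred at 0.
hist(res, main = "Histogram of residuals", xlab = "Residual")
qqnorm(res)
qqline(res)

# Condition 3: constant variability; absolute residuals should show no trend.
plot(abs(res) ~ fitted(fit), xlab = "Fitted value", ylab = "|Residual|")

# Condition 4: independent residuals; no pattern when plotted in row order.
plot(res, xlab = "Row order", ylab = "Residual")
```

Plotting the absolute residuals against the fitted values is a common way to make a fan shape easier to spot than in the raw residual plot.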
In this section, we further fine-tune the model using backward elimination based on p-values. Although using adjusted R-squared might yield a more reliable model, using p-values is less computationally intensive, and the resulting model should be broadly similar to the one selected by adjusted R-squared. The model will be used later to predict a movie’s popularity given its attributes, with popularity measured by the Rotten Tomatoes audience score.
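The elimination in this report is carried out by hand, one refit at a time. As an illustration only, the same idea can be automated with `drop1()` and `update()` from base R. Note the difference from the manual procedure: `drop1(test = "F")` tests each term as a whole (so a multi-level factor such as genre gets a single p-value), whereas the summaries in this report show one t-test per coefficient. The helper name `backward_p` and the simulated data frame `d` are hypothetical and introduced here for the sketch.

```r
# Hypothetical helper: backward elimination by p-value (a sketch, not the
# procedure used to produce the results in this report).
backward_p <- function(fit, alpha = 0.05) {
  repeat {
    # Stop if no explanatory terms remain.
    if (length(attr(terms(fit), "term.labels")) == 0) break
    tests <- drop1(fit, test = "F")       # one F-test per term
    tests <- tests[-1, , drop = FALSE]    # remove the "<none>" row
    p <- tests[["Pr(>F)"]]
    worst <- which.max(p)
    if (p[worst] <= alpha) break          # all remaining terms are significant
    term <- rownames(tests)[worst]
    fit <- update(fit, as.formula(paste(". ~ . -", term)))
  }
  fit
}

# Simulated stand-in data (hypothetical): only x1 truly affects y.
set.seed(42)
n <- 300
d <- data.frame(
  x1 = rnorm(n),
  noise = rnorm(n),
  g = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
d$y <- 5 + 3 * d$x1 + rnorm(n)

full <- lm(y ~ x1 + noise + g, data = d)
reduced <- backward_p(full)
formula(reduced)
```

For terms with a single coefficient the F-test and the t-test in `summary()` give the same p-value, so the two approaches only differ for multi-level factors.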
Below is the summary of the current model
##
## Call:
## lm(formula = audience_score ~ genre + mpaa_rating + thtr_rel_year +
## thtr_rel_month + thtr_rel_day + dvd_rel_year + dvd_rel_month +
## dvd_rel_day + best_pic_nom + best_pic_win + best_actor_win +
## best_actress_win + best_dir_win + top200_box + critics_rating +
## audience_rating + critics_score, data = feature_film)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.4983 -6.2141 0.6138 6.1294 21.3554
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 270.783365 180.686391 1.499 0.1346
## genreAnimation -1.612296 3.741046 -0.431 0.6667
## genreArt House & International 2.457320 2.976627 0.826 0.4094
## genreComedy 0.080045 1.516017 0.053 0.9579
## genreDocumentary 11.361346 5.363629 2.118 0.0346 *
## genreDrama 0.675057 1.345883 0.502 0.6162
## genreHorror -1.205044 2.293313 -0.525 0.5995
## genreMusical & Performing Arts 7.616848 3.402323 2.239 0.0256 *
## genreMystery & Suspense -0.652780 1.735907 -0.376 0.7070
## genreOther 1.650694 2.648408 0.623 0.5334
## genreScience Fiction & Fantasy -0.910663 3.377609 -0.270 0.7876
## mpaa_ratingNC-17 -13.657150 9.363779 -1.459 0.1453
## mpaa_ratingPG -3.936499 2.756223 -1.428 0.1538
## mpaa_ratingPG-13 -4.431838 2.863201 -1.548 0.1222
## mpaa_ratingR -4.711622 2.780709 -1.694 0.0908 .
## mpaa_ratingUnrated -5.992864 3.737399 -1.603 0.1094
## thtr_rel_year 0.065641 0.052074 1.261 0.2080
## thtr_rel_month -0.033367 0.111430 -0.299 0.7647
## thtr_rel_day -0.025636 0.043390 -0.591 0.5549
## dvd_rel_year -0.182948 0.112961 -1.620 0.1059
## dvd_rel_month 0.166015 0.113289 1.465 0.1434
## dvd_rel_day -0.006103 0.042821 -0.143 0.8867
## best_pic_nomyes 6.096324 2.364742 2.578 0.0102 *
## best_pic_winyes -1.336816 4.094849 -0.326 0.7442
## best_actor_winyes 0.499262 1.077095 0.464 0.6432
## best_actress_winyes -0.886144 1.201489 -0.738 0.4611
## best_dir_winyes 1.944774 1.531451 1.270 0.2047
## top200_boxyes -0.211540 2.460796 -0.086 0.9315
## critics_ratingFresh -0.810431 1.267853 -0.639 0.5230
## critics_ratingRotten 2.180468 1.885150 1.157 0.2479
## audience_ratingUpright 27.090926 0.944536 28.682 < 2e-16 ***
## critics_score 0.238003 0.030144 7.896 1.62e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.879 on 541 degrees of freedom
## Multiple R-squared: 0.8102, Adjusted R-squared: 0.7993
## F-statistic: 74.48 on 31 and 541 DF, p-value: < 2.2e-16
There are a few statistical points worth noting.
1. P-value: Because the p-value of the overall F-test is less than 2.2e-16, well below 0.05, the data provide sufficient evidence that the set of explanatory variables in the model and the response variable (audience score, our proxy for a movie’s popularity) are associated.
2. Multiple R-squared of 0.8102: 81.02% of the variation in the response variable is currently explained by the model.
3. Estimate of best_pic_nomyes = 6.096: All else held constant, the audience score of movies nominated for best picture is on average about 6.1 points higher than that of movies without a nomination.
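Figures like these can be read directly from the fitted object instead of being copied from the printed summary, which helps keep the text in step with the model when it is refit. The sketch below shows the relevant accessors on simulated data; the data frame `toy`, its values, and the model `fit` are hypothetical stand-ins, and with the real analysis `m_full` would take the place of `fit`.

```r
set.seed(7)

# Simulated stand-in data (hypothetical values).
n <- 120
toy <- data.frame(
  critics_score = runif(n, 0, 100),
  best_pic_nom = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.8, 0.2)))
)
toy$audience_score <- 30 + 0.4 * toy$critics_score +
  6 * (toy$best_pic_nom == "yes") + rnorm(n, sd = 8)

fit <- lm(audience_score ~ critics_score + best_pic_nom, data = toy)
s <- summary(fit)

# Share of variation explained, with and without the penalty for model size.
r2 <- s$r.squared
adj_r2 <- s$adj.r.squared

# p-value of the overall F-test, rebuilt from the stored F statistic.
f <- s$fstatistic
p_overall <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# Estimated difference in audience score for nominated movies,
# holding the other variables constant.
slope_nom <- coef(fit)["best_pic_nomyes"]

c(r2 = r2, adj_r2 = adj_r2)
```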
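Next, we will drop the one variable with the highest p-value that is greater than our chosen significance level of 5%. For a categorical variable with several levels, such as genre, we keep the variable as long as at least one of its levels is significant. This time, we will drop top200_box, with a p-value of 0.9315.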
m1 <- lm(audience_score ~ genre + mpaa_rating +
thtr_rel_year + thtr_rel_month + thtr_rel_day +
dvd_rel_year + dvd_rel_month + dvd_rel_day +
best_pic_nom + best_pic_win + best_actor_win +
best_actress_win + best_dir_win +
critics_rating + audience_rating +
critics_score,
data = feature_film)
summary(m1)
##
## Call:
## lm(formula = audience_score ~ genre + mpaa_rating + thtr_rel_year +
## thtr_rel_month + thtr_rel_day + dvd_rel_year + dvd_rel_month +
## dvd_rel_day + best_pic_nom + best_pic_win + best_actor_win +
## best_actress_win + best_dir_win + critics_rating + audience_rating +
## critics_score, data = feature_film)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.503 -6.202 0.613 6.098 21.346
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 270.64382 180.51358 1.499 0.1344
## genreAnimation -1.57676 3.71473 -0.424 0.6714
## genreArt House & International 2.47137 2.96942 0.832 0.4056
## genreComedy 0.09294 1.50719 0.062 0.9509
## genreDocumentary 11.36723 5.35828 2.121 0.0343 *
## genreDrama 0.68764 1.33667 0.514 0.6071
## genreHorror -1.19590 2.28875 -0.523 0.6015
## genreMusical & Performing Arts 7.63674 3.39133 2.252 0.0247 *
## genreMystery & Suspense -0.64375 1.73114 -0.372 0.7101
## genreOther 1.65588 2.64530 0.626 0.5316
## genreScience Fiction & Fantasy -0.91807 3.37342 -0.272 0.7856
## mpaa_ratingNC-17 -13.61190 9.34041 -1.457 0.1456
## mpaa_ratingPG -3.91803 2.74532 -1.427 0.1541
## mpaa_ratingPG-13 -4.41083 2.85014 -1.548 0.1223
## mpaa_ratingR -4.68367 2.75911 -1.698 0.0902 .
## mpaa_ratingUnrated -5.96166 3.71632 -1.604 0.1093
## thtr_rel_year 0.06584 0.05198 1.267 0.2058
## thtr_rel_month -0.03424 0.11086 -0.309 0.7575
## thtr_rel_day -0.02556 0.04334 -0.590 0.5556
## dvd_rel_year -0.18310 0.11284 -1.623 0.1053
## dvd_rel_month 0.16576 0.11315 1.465 0.1435
## dvd_rel_day -0.00610 0.04278 -0.143 0.8867
## best_pic_nomyes 6.10208 2.36163 2.584 0.0100 *
## best_pic_winyes -1.34954 4.08842 -0.330 0.7415
## best_actor_winyes 0.49602 1.07545 0.461 0.6448
## best_actress_winyes -0.89180 1.19859 -0.744 0.4572
## best_dir_winyes 1.94645 1.52992 1.272 0.2038
## critics_ratingFresh -0.79402 1.25225 -0.634 0.5263
## critics_ratingRotten 2.19643 1.87426 1.172 0.2418
## audience_ratingUpright 27.08699 0.94256 28.738 < 2e-16 ***
## critics_score 0.23800 0.03012 7.903 1.53e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.871 on 542 degrees of freedom
## Multiple R-squared: 0.8102, Adjusted R-squared: 0.7997
## F-statistic: 77.11 on 30 and 542 DF, p-value: < 2.2e-16
We keep repeating this step, each time dropping the variable with the highest p-value above 5%.
m2 <- lm(audience_score ~ genre + mpaa_rating +
thtr_rel_year + thtr_rel_month + thtr_rel_day +
dvd_rel_year + dvd_rel_month +
best_pic_nom + best_pic_win + best_actor_win +
best_actress_win + best_dir_win +
critics_rating + audience_rating +
critics_score,
data = feature_film)
summary(m2)
##
## Call:
## lm(formula = audience_score ~ genre + mpaa_rating + thtr_rel_year +
## thtr_rel_month + thtr_rel_day + dvd_rel_year + dvd_rel_month +
## best_pic_nom + best_pic_win + best_actor_win + best_actress_win +
## best_dir_win + critics_rating + audience_rating + critics_score,
## data = feature_film)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.4409 -6.1865 0.6426 6.1152 21.4281
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 268.10742 179.47289 1.494 0.1358
## genreAnimation -1.59935 3.70800 -0.431 0.6664
## genreArt House & International 2.47559 2.96659 0.834 0.4044
## genreComedy 0.08906 1.50558 0.059 0.9529
## genreDocumentary 11.38401 5.35215 2.127 0.0339 *
## genreDrama 0.67345 1.33175 0.506 0.6133
## genreHorror -1.19410 2.28665 -0.522 0.6017
## genreMusical & Performing Arts 7.60868 3.38256 2.249 0.0249 *
## genreMystery & Suspense -0.65348 1.72823 -0.378 0.7055
## genreOther 1.63023 2.63679 0.618 0.5367
## genreScience Fiction & Fantasy -0.94427 3.36537 -0.281 0.7791
## mpaa_ratingNC-17 -13.64122 9.32972 -1.462 0.1443
## mpaa_ratingPG -3.93287 2.74086 -1.435 0.1519
## mpaa_ratingPG-13 -4.43486 2.84259 -1.560 0.1193
## mpaa_ratingR -4.70241 2.75349 -1.708 0.0882 .
## mpaa_ratingUnrated -5.99319 3.70639 -1.617 0.1065
## thtr_rel_year 0.06551 0.05188 1.263 0.2072
## thtr_rel_month -0.03431 0.11076 -0.310 0.7568
## thtr_rel_day -0.02560 0.04330 -0.591 0.5546
## dvd_rel_year -0.18155 0.11222 -1.618 0.1063
## dvd_rel_month 0.16640 0.11296 1.473 0.1413
## best_pic_nomyes 6.09242 2.35853 2.583 0.0101 *
## best_pic_winyes -1.36850 4.08257 -0.335 0.7376
## best_actor_winyes 0.49915 1.07425 0.465 0.6424
## best_actress_winyes -0.89099 1.19749 -0.744 0.4572
## best_dir_winyes 1.94037 1.52795 1.270 0.2047
## critics_ratingFresh -0.79674 1.25098 -0.637 0.5245
## critics_ratingRotten 2.19739 1.87256 1.173 0.2411
## audience_ratingUpright 27.07630 0.93873 28.844 < 2e-16 ***
## critics_score 0.23832 0.03001 7.941 1.16e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.863 on 543 degrees of freedom
## Multiple R-squared: 0.8102, Adjusted R-squared: 0.8
## F-statistic: 79.91 on 29 and 543 DF, p-value: < 2.2e-16
m3 <- lm(audience_score ~ genre + mpaa_rating +
thtr_rel_year + thtr_rel_day +
dvd_rel_year + dvd_rel_month +
best_pic_nom + best_pic_win + best_actor_win +
best_actress_win + best_dir_win +
critics_rating + audience_rating +
critics_score,
data = feature_film)
summary(m3)
##
## Call:
## lm(formula = audience_score ~ genre + mpaa_rating + thtr_rel_year +
## thtr_rel_day + dvd_rel_year + dvd_rel_month + best_pic_nom +
## best_pic_win + best_actor_win + best_actress_win + best_dir_win +
## critics_rating + audience_rating + critics_score, data = feature_film)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.5557 -6.2992 0.6969 6.1080 21.5953
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 267.81028 179.32114 1.493 0.1359
## genreAnimation -1.60310 3.70489 -0.433 0.6654
## genreArt House & International 2.48171 2.96406 0.837 0.4028
## genreComedy 0.08493 1.50427 0.056 0.9550
## genreDocumentary 11.42839 5.34579 2.138 0.0330 *
## genreDrama 0.68591 1.33004 0.516 0.6063
## genreHorror -1.18487 2.28455 -0.519 0.6042
## genreMusical & Performing Arts 7.58594 3.37895 2.245 0.0252 *
## genreMystery & Suspense -0.61498 1.72232 -0.357 0.7212
## genreOther 1.68955 2.62765 0.643 0.5205
## genreScience Fiction & Fantasy -0.93178 3.36233 -0.277 0.7818
## mpaa_ratingNC-17 -13.76368 9.31359 -1.478 0.1400
## mpaa_ratingPG -3.95710 2.73747 -1.446 0.1489
## mpaa_ratingPG-13 -4.43674 2.84022 -1.562 0.1188
## mpaa_ratingR -4.72984 2.74977 -1.720 0.0860 .
## mpaa_ratingUnrated -6.01771 3.70247 -1.625 0.1047
## thtr_rel_year 0.06500 0.05181 1.255 0.2101
## thtr_rel_day -0.02709 0.04300 -0.630 0.5290
## dvd_rel_year -0.18099 0.11211 -1.614 0.1070
## dvd_rel_month 0.17195 0.11144 1.543 0.1234
## best_pic_nomyes 5.97099 2.32379 2.570 0.0104 *
## best_pic_winyes -1.29676 4.07261 -0.318 0.7503
## best_actor_winyes 0.47434 1.07037 0.443 0.6578
## best_actress_winyes -0.89801 1.19628 -0.751 0.4532
## best_dir_winyes 1.91218 1.52397 1.255 0.2101
## critics_ratingFresh -0.79816 1.24993 -0.639 0.5234
## critics_ratingRotten 2.18873 1.87080 1.170 0.2425
## audience_ratingUpright 27.07535 0.93794 28.867 < 2e-16 ***
## critics_score 0.23814 0.02998 7.943 1.14e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.855 on 544 degrees of freedom
## Multiple R-squared: 0.8101, Adjusted R-squared: 0.8004
## F-statistic: 82.9 on 28 and 544 DF, p-value: < 2.2e-16
m4 <- lm(audience_score ~ genre + mpaa_rating +
thtr_rel_year + thtr_rel_day +
dvd_rel_year + dvd_rel_month +
best_pic_nom + best_actor_win +
best_actress_win + best_dir_win +
critics_rating + audience_rating +
critics_score,
data = feature_film)
summary(m4)
##
## Call:
## lm(formula = audience_score ~ genre + mpaa_rating + thtr_rel_year +
## thtr_rel_day + dvd_rel_year + dvd_rel_month + best_pic_nom +
## best_actor_win + best_actress_win + best_dir_win + critics_rating +
## audience_rating + critics_score, data = feature_film)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.5640 -6.0803 0.7117 6.1036 21.5949
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 268.76694 179.14809 1.500 0.13413
## genreAnimation -1.60085 3.70183 -0.432 0.66559
## genreArt House & International 2.48297 2.96161 0.838 0.40218
## genreComedy 0.07237 1.50252 0.048 0.96160
## genreDocumentary 11.40972 5.34106 2.136 0.03311 *
## genreDrama 0.68630 1.32894 0.516 0.60577
## genreHorror -1.18024 2.28262 -0.517 0.60533
## genreMusical & Performing Arts 7.60198 3.37579 2.252 0.02473 *
## genreMystery & Suspense -0.62550 1.72058 -0.364 0.71634
## genreOther 1.73630 2.62138 0.662 0.50802
## genreScience Fiction & Fantasy -0.92079 3.35938 -0.274 0.78412
## mpaa_ratingNC-17 -13.75163 9.30583 -1.478 0.14005
## mpaa_ratingPG -3.96285 2.73515 -1.449 0.14795
## mpaa_ratingPG-13 -4.42932 2.83778 -1.561 0.11914
## mpaa_ratingR -4.73159 2.74750 -1.722 0.08561 .
## mpaa_ratingUnrated -6.02911 3.69924 -1.630 0.10372
## thtr_rel_year 0.06583 0.05170 1.273 0.20343
## thtr_rel_day -0.02743 0.04295 -0.639 0.52337
## dvd_rel_year -0.18232 0.11194 -1.629 0.10395
## dvd_rel_month 0.17392 0.11117 1.564 0.11830
## best_pic_nomyes 5.66158 2.10913 2.684 0.00749 **
## best_actor_winyes 0.49911 1.06666 0.468 0.64003
## best_actress_winyes -0.91829 1.19360 -0.769 0.44202
## best_dir_winyes 1.77044 1.45631 1.216 0.22462
## critics_ratingFresh -0.75505 1.24155 -0.608 0.54334
## critics_ratingRotten 2.22099 1.86651 1.190 0.23460
## audience_ratingUpright 27.07608 0.93717 28.891 < 2e-16 ***
## critics_score 0.23830 0.02995 7.956 1.03e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.848 on 545 degrees of freedom
## Multiple R-squared: 0.8101, Adjusted R-squared: 0.8007
## F-statistic: 86.11 on 27 and 545 DF, p-value: < 2.2e-16
m5 <- lm(audience_score ~ genre + mpaa_rating +
thtr_rel_year + thtr_rel_day +
dvd_rel_year + dvd_rel_month +
best_pic_nom +
best_actress_win + best_dir_win +
critics_rating + audience_rating +
critics_score,
data = feature_film)
summary(m5)
##
## Call:
## lm(formula = audience_score ~ genre + mpaa_rating + thtr_rel_year +
## thtr_rel_day + dvd_rel_year + dvd_rel_month + best_pic_nom +
## best_actress_win + best_dir_win + critics_rating + audience_rating +
## critics_score, data = feature_film)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.6419 -6.0921 0.7699 6.0847 21.5376
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 277.33699 178.08189 1.557 0.11997
## genreAnimation -1.57074 3.69862 -0.425 0.67124
## genreArt House & International 2.44022 2.95808 0.825 0.40977
## genreComedy 0.05111 1.50075 0.034 0.97284
## genreDocumentary 11.39185 5.33710 2.134 0.03325 *
## genreDrama 0.70693 1.32726 0.533 0.59451
## genreHorror -1.22940 2.27857 -0.540 0.58973
## genreMusical & Performing Arts 7.59775 3.37336 2.252 0.02470 *
## genreMystery & Suspense -0.55223 1.71222 -0.323 0.74718
## genreOther 1.75252 2.61927 0.669 0.50372
## genreScience Fiction & Fantasy -0.96690 3.35553 -0.288 0.77334
## mpaa_ratingNC-17 -13.78165 9.29895 -1.482 0.13890
## mpaa_ratingPG -3.90189 2.73009 -1.429 0.15351
## mpaa_ratingPG-13 -4.38114 2.83388 -1.546 0.12269
## mpaa_ratingR -4.69835 2.74462 -1.712 0.08749 .
## mpaa_ratingUnrated -5.98004 3.69511 -1.618 0.10616
## thtr_rel_year 0.06671 0.05163 1.292 0.19689
## thtr_rel_day -0.02700 0.04291 -0.629 0.52950
## dvd_rel_year -0.18746 0.11132 -1.684 0.09275 .
## dvd_rel_month 0.16919 0.11063 1.529 0.12677
## best_pic_nomyes 5.78540 2.09096 2.767 0.00585 **
## best_actress_winyes -0.87766 1.18959 -0.738 0.46096
## best_dir_winyes 1.80116 1.45378 1.239 0.21590
## critics_ratingFresh -0.74177 1.24034 -0.598 0.55006
## critics_ratingRotten 2.23907 1.86477 1.201 0.23038
## audience_ratingUpright 27.05451 0.93536 28.924 < 2e-16 ***
## critics_score 0.23880 0.02991 7.984 8.44e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.842 on 546 degrees of freedom
## Multiple R-squared: 0.81, Adjusted R-squared: 0.801
## F-statistic: 89.54 on 26 and 546 DF, p-value: < 2.2e-16
m6 <- lm(audience_score ~ genre + mpaa_rating +
thtr_rel_year + thtr_rel_day +
dvd_rel_year + dvd_rel_month +
best_pic_nom +
best_actress_win + best_dir_win +
audience_rating +
critics_score,
data = feature_film)
summary(m6)
##
## Call:
## lm(formula = audience_score ~ genre + mpaa_rating + thtr_rel_year +
## thtr_rel_day + dvd_rel_year + dvd_rel_month + best_pic_nom +
## best_actress_win + best_dir_win + audience_rating + critics_score,
## data = feature_film)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.9289 -5.8651 0.3948 6.4309 21.3742
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 287.79212 177.37559 1.623 0.10527
## genreAnimation -1.95621 3.70019 -0.529 0.59724
## genreArt House & International 2.26757 2.96245 0.765 0.44434
## genreComedy -0.02711 1.50330 -0.018 0.98562
## genreDocumentary 11.08484 5.33872 2.076 0.03833 *
## genreDrama 0.66127 1.32675 0.498 0.61839
## genreHorror -1.16127 2.28283 -0.509 0.61117
## genreMusical & Performing Arts 7.81261 3.37853 2.312 0.02112 *
## genreMystery & Suspense -0.76191 1.70466 -0.447 0.65508
## genreOther 1.55178 2.62259 0.592 0.55430
## genreScience Fiction & Fantasy -1.31888 3.35472 -0.393 0.69437
## mpaa_ratingNC-17 -13.76982 9.29370 -1.482 0.13901
## mpaa_ratingPG -4.11419 2.73131 -1.506 0.13257
## mpaa_ratingPG-13 -4.72767 2.83234 -1.669 0.09565 .
## mpaa_ratingR -5.03037 2.74454 -1.833 0.06736 .
## mpaa_ratingUnrated -6.44485 3.69515 -1.744 0.08170 .
## thtr_rel_year 0.07801 0.05006 1.559 0.11968
## thtr_rel_day -0.02326 0.04274 -0.544 0.58654
## dvd_rel_year -0.20224 0.11126 -1.818 0.06965 .
## dvd_rel_month 0.17425 0.11077 1.573 0.11629
## best_pic_nomyes 6.14388 2.05328 2.992 0.00289 **
## best_actress_winyes -0.71399 1.18734 -0.601 0.54786
## best_dir_winyes 1.76966 1.45611 1.215 0.22476
## audience_ratingUpright 27.12246 0.92924 29.188 < 2e-16 ***
## critics_score 0.19738 0.01745 11.311 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.859 on 548 degrees of freedom
## Multiple R-squared: 0.8086, Adjusted R-squared: 0.8002
## F-statistic: 96.43 on 24 and 548 DF, p-value: < 2.2e-16
m7 <- lm(audience_score ~ genre + mpaa_rating +
thtr_rel_year +
dvd_rel_year + dvd_rel_month +
best_pic_nom +
best_actress_win + best_dir_win +
audience_rating +
critics_score,
data = feature_film)
summary(m7)
##
## Call:
## lm(formula = audience_score ~ genre + mpaa_rating + thtr_rel_year +
## dvd_rel_year + dvd_rel_month + best_pic_nom + best_actress_win +
## best_dir_win + audience_rating + critics_score, data = feature_film)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.124 -5.835 0.547 6.368 21.488
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.876e+02 1.773e+02 1.622 0.10527
## genreAnimation -1.930e+00 3.697e+00 -0.522 0.60194
## genreArt House & International 2.166e+00 2.955e+00 0.733 0.46379
## genreComedy -9.287e-04 1.502e+00 -0.001 0.99951
## genreDocumentary 1.111e+01 5.335e+00 2.082 0.03784 *
## genreDrama 6.506e-01 1.326e+00 0.491 0.62380
## genreHorror -1.232e+00 2.278e+00 -0.541 0.58893
## genreMusical & Performing Arts 7.763e+00 3.375e+00 2.300 0.02182 *
## genreMystery & Suspense -7.861e-01 1.703e+00 -0.462 0.64453
## genreOther 1.615e+00 2.618e+00 0.617 0.53761
## genreScience Fiction & Fantasy -1.203e+00 3.346e+00 -0.360 0.71922
## mpaa_ratingNC-17 -1.399e+01 9.279e+00 -1.508 0.13214
## mpaa_ratingPG -4.117e+00 2.730e+00 -1.508 0.13208
## mpaa_ratingPG-13 -4.751e+00 2.830e+00 -1.679 0.09380 .
## mpaa_ratingR -5.003e+00 2.742e+00 -1.825 0.06862 .
## mpaa_ratingUnrated -6.406e+00 3.692e+00 -1.735 0.08329 .
## thtr_rel_year 7.532e-02 4.978e-02 1.513 0.13083
## dvd_rel_year -1.996e-01 1.111e-01 -1.797 0.07287 .
## dvd_rel_month 1.758e-01 1.107e-01 1.588 0.11281
## best_pic_nomyes 6.132e+00 2.052e+00 2.989 0.00293 **
## best_actress_winyes -7.317e-01 1.186e+00 -0.617 0.53758
## best_dir_winyes 1.773e+00 1.455e+00 1.219 0.22355
## audience_ratingUpright 2.711e+01 9.284e-01 29.202 < 2e-16 ***
## critics_score 1.972e-01 1.744e-02 11.310 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.854 on 549 degrees of freedom
## Multiple R-squared: 0.8084, Adjusted R-squared: 0.8004
## F-statistic: 100.7 on 23 and 549 DF, p-value: < 2.2e-16
m8 <- lm(audience_score ~ genre + mpaa_rating +
thtr_rel_year +
dvd_rel_year + dvd_rel_month +
best_pic_nom +
best_dir_win +
audience_rating +
critics_score,
data = feature_film)
summary(m8)
##
## Call:
## lm(formula = audience_score ~ genre + mpaa_rating + thtr_rel_year +
## dvd_rel_year + dvd_rel_month + best_pic_nom + best_dir_win +
## audience_rating + critics_score, data = feature_film)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.024 -5.842 0.570 6.364 21.571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 280.98879 176.83690 1.589 0.11264
## genreAnimation -2.03650 3.69137 -0.552 0.58138
## genreArt House & International 2.06015 2.94802 0.699 0.48496
## genreComedy -0.07417 1.49602 -0.050 0.96048
## genreDocumentary 11.05413 5.33149 2.073 0.03860 *
## genreDrama 0.54179 1.31322 0.413 0.68009
## genreHorror -1.25482 2.27612 -0.551 0.58165
## genreMusical & Performing Arts 7.76051 3.37322 2.301 0.02179 *
## genreMystery & Suspense -0.91620 1.68893 -0.542 0.58771
## genreOther 1.54260 2.61422 0.590 0.55538
## genreScience Fiction & Fantasy -1.20498 3.34397 -0.360 0.71873
## mpaa_ratingNC-17 -13.88213 9.27183 -1.497 0.13491
## mpaa_ratingPG -4.14523 2.72763 -1.520 0.12916
## mpaa_ratingPG-13 -4.77961 2.82822 -1.690 0.09160 .
## mpaa_ratingR -5.00022 2.74078 -1.824 0.06864 .
## mpaa_ratingUnrated -6.34568 3.68872 -1.720 0.08594 .
## thtr_rel_year 0.07457 0.04974 1.499 0.13435
## dvd_rel_year -0.19557 0.11083 -1.765 0.07818 .
## dvd_rel_month 0.17567 0.11060 1.588 0.11279
## best_pic_nomyes 5.91029 2.01894 2.927 0.00356 **
## best_dir_winyes 1.73869 1.45327 1.196 0.23206
## audience_ratingUpright 27.12920 0.92740 29.253 < 2e-16 ***
## critics_score 0.19688 0.01742 11.303 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.849 on 550 degrees of freedom
## Multiple R-squared: 0.8083, Adjusted R-squared: 0.8006
## F-statistic: 105.4 on 22 and 550 DF, p-value: < 2.2e-16
m9 <- lm(audience_score ~ genre + mpaa_rating +
thtr_rel_year +
dvd_rel_year + dvd_rel_month +
best_pic_nom +
audience_rating +
critics_score,
data = feature_film)
summary(m9)
##
## Call:
## lm(formula = audience_score ~ genre + mpaa_rating + thtr_rel_year +
## dvd_rel_year + dvd_rel_month + best_pic_nom + audience_rating +
## critics_score, data = feature_film)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.0403 -6.3031 0.5443 6.3357 21.5306
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 292.02450 176.66528 1.653 0.09890 .
## genreAnimation -2.03192 3.69281 -0.550 0.58238
## genreArt House & International 1.93855 2.94742 0.658 0.51100
## genreComedy -0.12382 1.49603 -0.083 0.93407
## genreDocumentary 11.03230 5.33355 2.068 0.03906 *
## genreDrama 0.49185 1.31307 0.375 0.70812
## genreHorror -1.30926 2.27655 -0.575 0.56546
## genreMusical & Performing Arts 7.76205 3.37454 2.300 0.02181 *
## genreMystery & Suspense -0.90825 1.68958 -0.538 0.59110
## genreOther 1.41034 2.61290 0.540 0.58958
## genreScience Fiction & Fantasy -1.13438 3.34476 -0.339 0.73463
## mpaa_ratingNC-17 -13.86782 9.27545 -1.495 0.13546
## mpaa_ratingPG -3.94669 2.72364 -1.449 0.14789
## mpaa_ratingPG-13 -4.56296 2.82352 -1.616 0.10666
## mpaa_ratingR -4.78269 2.73582 -1.748 0.08099 .
## mpaa_ratingUnrated -6.22045 3.68868 -1.686 0.09229 .
## thtr_rel_year 0.07156 0.04969 1.440 0.15041
## dvd_rel_year -0.19813 0.11085 -1.787 0.07442 .
## dvd_rel_month 0.16598 0.11035 1.504 0.13311
## best_pic_nomyes 6.17857 2.00723 3.078 0.00219 **
## audience_ratingUpright 27.09866 0.92741 29.220 < 2e-16 ***
## critics_score 0.19977 0.01726 11.576 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.852 on 551 degrees of freedom
## Multiple R-squared: 0.8078, Adjusted R-squared: 0.8005
## F-statistic: 110.3 on 21 and 551 DF, p-value: < 2.2e-16
m10 <- lm(audience_score ~ genre + mpaa_rating +
dvd_rel_year + dvd_rel_month +
best_pic_nom +
audience_rating +
critics_score,
data = feature_film)
summary(m10)
##
## Call:
## lm(formula = audience_score ~ genre + mpaa_rating + dvd_rel_year +
## dvd_rel_month + best_pic_nom + audience_rating + critics_score,
## data = feature_film)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.8066 -6.1768 0.4649 6.5124 20.9130
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 231.87315 171.82337 1.349 0.17773
## genreAnimation -1.39475 3.66977 -0.380 0.70404
## genreArt House & International 1.59819 2.94079 0.543 0.58704
## genreComedy -0.13578 1.49746 -0.091 0.92779
## genreDocumentary 11.03158 5.33873 2.066 0.03926 *
## genreDrama 0.40104 1.31283 0.305 0.76012
## genreHorror -1.66204 2.26553 -0.734 0.46349
## genreMusical & Performing Arts 7.70931 3.37762 2.282 0.02284 *
## genreMystery & Suspense -1.01801 1.68950 -0.603 0.54705
## genreOther 1.04432 2.60304 0.401 0.68843
## genreScience Fiction & Fantasy -1.41034 3.34251 -0.422 0.67323
## mpaa_ratingNC-17 -13.50588 9.28106 -1.455 0.14618
## mpaa_ratingPG -3.81621 2.72478 -1.401 0.16191
## mpaa_ratingPG-13 -3.87571 2.78561 -1.391 0.16468
## mpaa_ratingR -4.17940 2.70618 -1.544 0.12307
## mpaa_ratingUnrated -5.31591 3.63834 -1.461 0.14456
## dvd_rel_year -0.09696 0.08582 -1.130 0.25908
## dvd_rel_month 0.17355 0.11033 1.573 0.11630
## best_pic_nomyes 6.11497 2.00869 3.044 0.00244 **
## audience_ratingUpright 27.15481 0.92749 29.278 < 2e-16 ***
## critics_score 0.19643 0.01712 11.476 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.861 on 552 degrees of freedom
## Multiple R-squared: 0.8071, Adjusted R-squared: 0.8001
## F-statistic: 115.5 on 20 and 552 DF, p-value: < 2.2e-16
m11 <- lm(audience_score ~ genre + mpaa_rating +
dvd_rel_month +
best_pic_nom +
audience_rating +
critics_score,
data = feature_film)
summary(m11)
##
## Call:
## lm(formula = audience_score ~ genre + mpaa_rating + dvd_rel_month +
## best_pic_nom + audience_rating + critics_score, data = feature_film)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.4645 -5.9354 0.3496 6.4574 21.0573
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.78487 2.89991 13.030 < 2e-16 ***
## genreAnimation -1.89955 3.64338 -0.521 0.60232
## genreArt House & International 1.57910 2.94148 0.537 0.59160
## genreComedy -0.10487 1.49759 -0.070 0.94420
## genreDocumentary 10.89601 5.33872 2.041 0.04173 *
## genreDrama 0.40356 1.31316 0.307 0.75872
## genreHorror -1.68100 2.26604 -0.742 0.45851
## genreMusical & Performing Arts 7.77517 3.37796 2.302 0.02172 *
## genreMystery & Suspense -1.05985 1.68951 -0.627 0.53072
## genreOther 1.09740 2.60327 0.422 0.67352
## genreScience Fiction & Fantasy -1.38865 3.34329 -0.415 0.67804
## mpaa_ratingNC-17 -13.23366 9.28025 -1.426 0.15443
## mpaa_ratingPG -3.98303 2.72145 -1.464 0.14388
## mpaa_ratingPG-13 -4.23376 2.76821 -1.529 0.12673
## mpaa_ratingR -4.43763 2.69718 -1.645 0.10048
## mpaa_ratingUnrated -6.19086 3.55584 -1.741 0.08223 .
## dvd_rel_month 0.17217 0.11035 1.560 0.11928
## best_pic_nomyes 6.03954 2.00809 3.008 0.00275 **
## audience_ratingUpright 27.27326 0.92178 29.588 < 2e-16 ***
## critics_score 0.19626 0.01712 11.463 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.863 on 553 degrees of freedom
## Multiple R-squared: 0.8066, Adjusted R-squared: 0.8
## F-statistic: 121.4 on 19 and 553 DF, p-value: < 2.2e-16
m12 <- lm(audience_score ~ genre + mpaa_rating +
best_pic_nom +
audience_rating +
critics_score,
data = feature_film)
summary(m12)
##
## Call:
## lm(formula = audience_score ~ genre + mpaa_rating + best_pic_nom +
## audience_rating + critics_score, data = feature_film)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.2756 -6.1726 0.4954 6.6826 21.0643
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.82065 2.82655 13.734 < 2e-16 ***
## genreAnimation -2.12146 3.64531 -0.582 0.56082
## genreArt House & International 1.55043 2.94522 0.526 0.59881
## genreComedy -0.24183 1.49695 -0.162 0.87172
## genreDocumentary 11.45243 5.33368 2.147 0.03221 *
## genreDrama 0.26224 1.31173 0.200 0.84162
## genreHorror -1.72419 2.26880 -0.760 0.44760
## genreMusical & Performing Arts 7.45515 3.37609 2.208 0.02764 *
## genreMystery & Suspense -1.19925 1.68933 -0.710 0.47807
## genreOther 0.96024 2.60515 0.369 0.71257
## genreScience Fiction & Fantasy -1.19316 3.34527 -0.357 0.72147
## mpaa_ratingNC-17 -12.32387 9.27389 -1.329 0.18444
## mpaa_ratingPG -3.88810 2.72429 -1.427 0.15409
## mpaa_ratingPG-13 -4.13563 2.77108 -1.492 0.13616
## mpaa_ratingR -4.31268 2.69948 -1.598 0.11070
## mpaa_ratingUnrated -6.02956 3.55894 -1.694 0.09079 .
## best_pic_nomyes 5.96819 2.01016 2.969 0.00312 **
## audience_ratingUpright 27.31076 0.92266 29.600 < 2e-16 ***
## critics_score 0.19686 0.01714 11.487 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.875 on 554 degrees of freedom
## Multiple R-squared: 0.8058, Adjusted R-squared: 0.7995
## F-statistic: 127.7 on 18 and 554 DF, p-value: < 2.2e-16
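At this point, we stop and take m12 as the final model. The variables genre, best_pic_nom, audience_rating, and critics_score each have at least one coefficient with a p-value below the 5% significance level. mpaa_rating has no individual level below 5% (its smallest p-value is 0.09, for Unrated) but is retained in the final model.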
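The p-values in `summary()` compare each factor level with its baseline level, so a multi-level variable such as genre or mpaa_rating gets several p-values rather than one. One common way to judge a whole variable in a single test is an F-test with `drop1()`. The following is only a minimal sketch on simulated data, not a rerun of the analysis above: the data frame `sim` and its columns `grp`, `x`, `noise_var`, and `y` are hypothetical stand-ins.

```r
# Simulated stand-in data; `sim`, `grp`, `x`, `noise_var`, and `y` are
# hypothetical names, not columns of the movies dataset.
set.seed(42)
n <- 200
sim <- data.frame(
  grp = factor(sample(c("a", "b", "c", "d"), n, replace = TRUE)),
  x = rnorm(n),
  noise_var = rnorm(n)  # unrelated to the response by construction
)
# The response depends on x and on one level of grp only.
sim$y <- 50 + 5 * sim$x + ifelse(sim$grp == "d", 8, 0) + rnorm(n, sd = 3)

fit <- lm(y ~ grp + x + noise_var, data = sim)

# drop1() refits the model without each term in turn and reports one
# F-test per variable, so a factor gets a single p-value for all its levels.
tab <- drop1(fit, test = "F")
print(tab)

# Candidate for removal in one backward-elimination step:
# the term with the largest p-value.
p_vals <- tab[["Pr(>F)"]][-1]  # the first row is "<none>"
names(p_vals) <- rownames(tab)[-1]
worst <- names(which.max(p_vals))
print(worst)
```

Applied to a model like m12, the same call would give one p-value for genre and one for mpaa_rating, which makes the decision to keep or drop a factor less dependent on the choice of baseline level.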
In this section, we will attempt to predict the Rotten Tomatoes audience score (audience_score, the response variable of m12) for the 2016 movie La La Land.
According to the sources listed below, the movie has an audience score of 81 and the following attributes:
- genre = “Drama”
- mpaa_rating = “PG-13”
- best_pic_nom = “yes”
- audience_rating = “Upright”
- critics_score = 91
Sources: https://www.imdb.com/title/tt3783958/?ref_=tt_rt https://www.rottentomatoes.com/m/la_la_land
lalaland <- data.frame(genre = "Drama",
mpaa_rating = "PG-13",
best_pic_nom = "yes",
audience_rating = "Upright",
critics_score = 91)
round(predict(m12, lalaland, interval = 'prediction', level = 0.95), digits = 0)
##   fit lwr upr
## 1  86  68 104
At the 95% level, the true audience score of 81 falls within the prediction interval of (68, 104) and is 5 points below the predicted value of 86. Because an audience score cannot exceed 100, the upper bound is effectively 100.
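The `predict()` call above uses `interval = 'prediction'`, which describes where the score of a single new movie is likely to fall. This differs from `interval = 'confidence'`, which describes the mean score of all movies with the same attributes and is therefore narrower. The sketch below illustrates the difference on simulated data; the data frame `sim` and its columns `critics` and `score` are hypothetical and do not come from the movies dataset.

```r
# Simulated stand-in data; `sim`, `critics`, and `score` are hypothetical names.
set.seed(1)
sim <- data.frame(critics = runif(150, min = 0, max = 100))
sim$score <- 30 + 0.5 * sim$critics + rnorm(150, sd = 9)

fit <- lm(score ~ critics, data = sim)
new_movie <- data.frame(critics = 91)

# Interval for the mean score of all movies with critics = 91.
conf_int <- predict(fit, new_movie, interval = "confidence", level = 0.95)

# Interval for the score of one individual movie with critics = 91.
pred_int <- predict(fit, new_movie, interval = "prediction", level = 0.95)

print(round(conf_int, 1))
print(round(pred_int, 1))

# Scores are bounded, so for reporting the interval can be truncated
# to the valid 0-100 range.
pred_int_clipped <- pmin(pmax(pred_int, 0), 100)
print(round(pred_int_clipped, 1))
```

Both intervals share the same fitted value; only the width changes, because the prediction interval also accounts for the scatter of individual observations around the regression line.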
In this analysis, we perform EDA on a dataset of 600+ randomly sampled movies produced and released before 2016, with attributes such as number of IMDb votes, IMDb rating, genre, and runtime. The objective is to explore variables associated with popular movies.
We measure a movie’s popularity using its Rotten Tomatoes audience score. We then build a multiple linear regression model using backward elimination based on p-values. In the final model, we use genre, MPAA rating, Oscar best picture nomination, Rotten Tomatoes audience rating, and critics score to predict the audience score.
We then use the model to predict the audience score for the movie La La Land (2016). The true audience score of 81 falls within the 95% prediction interval of (68, 104) and is 5 points below the predicted value of 86.
Last but not least, through our EDA we notice that the majority of movies in the dataset, produced between 1972 and 2014, are dramas. Additionally, drama movies released during the holidays (December) and the summer are more popular than drama movies released during the rest of the year.