Setup

This is the capstone project by Kristen Phan for Duke University’s Linear Regression and Modeling course (Course URL).

Load data

Make sure your data and R Markdown files are in the same directory. When loaded, your data file will be called movies. Delete this note before you submit your work.


Part 1: Data

Background information:

The data set comprises 651 randomly sampled movies produced and released between 1972 and 2014. It records how much audiences and critics liked each movie, along with numerous other variables about the movies, and it combines information from Rotten Tomatoes and IMDb.

More information on the dataset’s codebook can be found here.

Generalization:

Because the movies were sampled randomly, the findings of this study can be generalized to movies that were produced and released before 2016.

Causation:

Because this is an observational study, its findings only imply association and not causation.


Part 2: Research question

What attributes are associated with popular movies?
The purpose of this study is to explore attributes associated with a movie’s popularity and to build a regression model that predicts a movie’s popularity from its attributes. Keep in mind that the attributes we are about to analyze cannot be interpreted as causes of a movie’s popularity.


Part 3: Exploratory data analysis

First, we take a peek at the dataset.

##     title                  title_type                 genre        runtime     
##  Length:619         Documentary : 42   Drama             :298   Min.   : 65.0  
##  Class :character   Feature Film:573   Comedy            : 86   1st Qu.: 93.0  
##  Mode  :character   TV Movie    :  4   Action & Adventure: 62   Median :103.0  
##                                        Mystery & Suspense: 56   Mean   :106.5  
##                                        Documentary       : 40   3rd Qu.:116.0  
##                                        Horror            : 22   Max.   :267.0  
##                                        (Other)           : 55                  
##   mpaa_rating                               studio    thtr_rel_year 
##  G      : 16   Paramount Pictures              : 37   Min.   :1972  
##  NC-17  :  1   Warner Bros. Pictures           : 30   1st Qu.:1991  
##  PG     :111   Sony Pictures Home Entertainment: 27   Median :2000  
##  PG-13  :131   Universal Pictures              : 23   Mean   :1998  
##  R      :319   Warner Home Video               : 19   3rd Qu.:2007  
##  Unrated: 41   Miramax Films                   : 18   Max.   :2014  
##                (Other)                         :465                 
##  thtr_rel_month    thtr_rel_day    dvd_rel_year  dvd_rel_month   
##  Min.   : 1.000   Min.   : 1.00   Min.   :1991   Min.   : 1.000  
##  1st Qu.: 4.000   1st Qu.: 7.00   1st Qu.:2001   1st Qu.: 3.000  
##  Median : 7.000   Median :15.00   Median :2004   Median : 6.000  
##  Mean   : 6.733   Mean   :14.43   Mean   :2004   Mean   : 6.346  
##  3rd Qu.:10.000   3rd Qu.:22.00   3rd Qu.:2008   3rd Qu.: 9.000  
##  Max.   :12.000   Max.   :31.00   Max.   :2015   Max.   :12.000  
##                                                                  
##   dvd_rel_day     imdb_rating    imdb_num_votes           critics_rating
##  Min.   : 1.00   Min.   :1.900   Min.   :   183   Certified Fresh:131   
##  1st Qu.: 7.00   1st Qu.:5.900   1st Qu.:  5026   Fresh          :195   
##  Median :15.00   Median :6.600   Median : 16480   Rotten         :293   
##  Mean   :15.08   Mean   :6.486   Mean   : 60014                         
##  3rd Qu.:23.00   3rd Qu.:7.300   3rd Qu.: 62507                         
##  Max.   :31.00   Max.   :9.000   Max.   :893008                         
##                                                                         
##  critics_score    audience_rating audience_score  best_pic_nom best_pic_win
##  Min.   :  1.00   Spilled:264     Min.   :11.00   no :597      no :612     
##  1st Qu.: 33.00   Upright:355     1st Qu.:46.00   yes: 22      yes:  7     
##  Median : 61.00                   Median :65.00                            
##  Mean   : 57.43                   Mean   :62.21                            
##  3rd Qu.: 82.50                   3rd Qu.:80.00                            
##  Max.   :100.00                   Max.   :97.00                            
##                                                                            
##  best_actor_win best_actress_win best_dir_win top200_box   director        
##  no :528        no :548          no :576      no :604    Length:619        
##  yes: 91        yes: 71          yes: 43      yes: 15    Class :character  
##                                                          Mode  :character  
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##     actor1             actor2             actor3             actor4         
##  Length:619         Length:619         Length:619         Length:619        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     actor5            imdb_url            rt_url         
##  Length:619         Length:619         Length:619        
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
## 

First, it’s worth noting that a movie’s type (documentary, feature film, or TV movie) might affect its popularity depending on audience taste, so we will focus only on feature films in this study (sample size = 573 feature films).

## [1] 573
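
The filtering step can be sketched as follows. This is a minimal illustration on a made-up data frame (`toy_movies` and its values are hypothetical). In the report the same subset would be taken from `movies`, and the result is named `feature_film` here only to match the model calls shown later.

```r
# Hypothetical stand-in for the movies data; only the columns needed here.
toy_movies <- data.frame(
  title          = c("A", "B", "C", "D", "E"),
  title_type     = factor(c("Feature Film", "Documentary", "Feature Film",
                            "TV Movie", "Feature Film")),
  audience_score = c(73, 81, 55, 40, 66)
)

# Keep only feature films and drop the factor levels that are no longer used.
feature_film <- droplevels(subset(toy_movies, title_type == "Feature Film"))

# Number of feature films left in the toy data.
nrow(feature_film)
```

Dropping unused levels keeps `title_type` from showing empty categories in later summaries.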

Second, in this analysis, we will use Rotten Tomatoes’ audience score to measure a movie’s popularity.


Next, we will kick things off by analyzing a few attributes which are likely to be associated with a movie’s popularity:
* genre: Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other)
* thtr_rel_month: Month the movie is released in theaters
* director: Director of the movie
* studio: Studio that produced the movie
* mpaa_rating: MPAA rating of the movie (G, NC-17, PG, PG-13, R, Unrated)

Let’s examine movies’ popularity across different genres.

## # A tibble: 11 x 2
##    genre                        count
##    <fct>                        <int>
##  1 Drama                     18971579
##  2 Action & Adventure         5164252
##  3 Mystery & Suspense         4625314
##  4 Comedy                     3858154
##  5 Other                      1949012
##  6 Science Fiction & Fantasy   763731
##  7 Horror                      618755
##  8 Animation                   480972
##  9 Musical & Performing Arts   287272
## 10 Art House & International   114019
## 11 Documentary                  38015
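
A table of this shape could be produced roughly as below. The text does not say which variable was summed into `count`; this sketch assumes it is the total of `imdb_num_votes` per genre, which is a guess. It uses base R on a hypothetical toy data frame rather than the tibble pipeline that produced the output above.

```r
# Hypothetical toy data; the real analysis would use the feature-film subset.
toy <- data.frame(
  genre          = factor(c("Drama", "Comedy", "Drama", "Horror", "Comedy")),
  imdb_num_votes = c(1200, 300, 800, 150, 450)
)

# Total votes per genre (assumed meaning of "count"), sorted from most to least.
by_genre <- aggregate(imdb_num_votes ~ genre, data = toy, FUN = sum)
names(by_genre)[2] <- "count"
by_genre <- by_genre[order(-by_genre$count), ]
by_genre
```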

Drama movies seem to evoke the most audience engagement. Next, we take a look at movie popularity in relation to the release month and the movie genre.

Judging from the plot above, release time and audience score seem to be associated for drama movies. Audience scores are higher for drama movies released during the holiday season (December and January) and in the summer than for drama movies released during the rest of the year.
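
A numeric companion to the visual comparison is the mean audience score per release month within one genre. The sketch below uses hypothetical toy values, not the study data, and only shows the mechanics.

```r
# Hypothetical toy data with the three columns involved in the comparison.
toy <- data.frame(
  genre          = c("Drama", "Drama", "Drama", "Drama", "Comedy", "Comedy"),
  thtr_rel_month = c(12, 12, 6, 3, 12, 6),
  audience_score = c(80, 90, 70, 50, 60, 65)
)

drama <- subset(toy, genre == "Drama")

# Mean audience score for each release month; names are the month numbers.
month_means <- tapply(drama$audience_score, drama$thtr_rel_month, mean)
month_means
```

With the real data, months with few drama releases would give noisy means, so the group sizes are worth checking alongside the averages.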


Part 4: Modeling

Model diagnostics

Before we discuss a multiple linear regression (MLR) model, we need to check whether the dataset meets all four conditions for MLR.

In our model, we will exclude the following variables:
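- “actor1” through “actor5”: these are character columns naming the main cast members. Whether the cast includes a best actor or best actress Oscar winner is already captured by best_actor_win and best_actress_win, so the names add no value to the prediction of a movie’s popularity.
- “imdb_url” and “rt_url”: these are page links and carry no information about the movies themselves.
- “title”, “director”, “studio”, “title_type”: the first three contain mostly unique values (i.e., close to one level per movie) and should be excluded; title_type no longer varies because we kept only feature films.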
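
The exclusions above can be applied by name before fitting. This is a minimal sketch on a hypothetical one-row data frame; `toy`, `drop_cols`, and `model_data` are illustrative names, not objects from the report.

```r
# Hypothetical one-row data frame carrying a few of the column names above.
toy <- data.frame(
  title = "A", director = "X", studio = "S", title_type = "Feature Film",
  actor1 = "P", imdb_url = "u1", rt_url = "u2",
  critics_score = 70, audience_score = 75
)

# Columns excluded from the model, as listed in the text.
drop_cols <- c(paste0("actor", 1:5), "imdb_url", "rt_url",
               "title", "director", "studio", "title_type")

# setdiff() keeps this safe even when some listed columns are absent.
model_data <- toy[, setdiff(names(toy), drop_cols), drop = FALSE]
names(model_data)
```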

Now we are going to build a model with all attributes except those mentioned above and check whether all explanatory variables meet the conditions.

Numerical explanatory variables include:
1. runtime
2. imdb_rating
3. imdb_num_votes
4. critics_score

From the scatter plots above, critics_score is the most linearly related to the response variable, while runtime, imdb_rating, and imdb_num_votes are not. For that reason, we will exclude runtime, imdb_rating, and imdb_num_votes from our model and recompute the model.
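
A rough numeric complement to reading linearity off scatter plots is the correlation of each numeric predictor with the response. The sketch below runs on simulated data (all values are made up, and the simulated relationships are assumptions chosen for illustration), so it shows the mechanics only and says nothing about the real variables.

```r
set.seed(42)
n <- 300

# Simulated predictors; names mirror two of the variables in the text.
toy <- data.frame(
  critics_score = runif(n, 0, 100),
  runtime       = rnorm(n, mean = 105, sd = 20)
)

# Simulated response: linear in critics_score, unrelated to runtime.
toy$audience_score <- 35 + 0.45 * toy$critics_score + rnorm(n, sd = 10)

# Pearson correlation of each numeric predictor with the response.
cors <- sapply(toy[c("critics_score", "runtime")],
               function(x) cor(x, toy$audience_score))
round(cors, 2)
```

A low correlation can also come from a strong but curved relationship, so this check supports the scatter plots rather than replacing them.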

Based on the visuals above, our model seems to meet all conditions except for condition #3 - constant variability of the residuals. However, because we have a large sample, this might not be an important violation.
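
The usual residual checks behind these conditions can be sketched as follows. The model is fit on simulated data (an assumption for illustration), and the plotting calls are wrapped in `interactive()` so the snippet also runs in a non-interactive session.

```r
set.seed(7)
n <- 300

# Simulated data standing in for the feature-film sample.
critics_score  <- runif(n, 0, 100)
audience_score <- 35 + 0.45 * critics_score + rnorm(n, sd = 10)

fit <- lm(audience_score ~ critics_score)
res <- resid(fit)

if (interactive()) {
  # Residuals vs fitted: look for a fan shape (non-constant variability).
  plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)

  # Histogram and normal Q-Q plot: nearly normal residuals.
  hist(res, main = "Residuals")
  qqnorm(res)
  qqline(res)

  # Residuals in data order: look for patterns suggesting dependence.
  plot(res, type = "b", xlab = "Order in data", ylab = "Residuals")
}

# Least-squares residuals average to zero when the model has an intercept.
mean(res)
```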

Model finetuning

In this section, we will further finetune the model using backward elimination with p-val. Although using adjusted R squared might yield more reliable predictions, it’s less computationally intensive to use p-val, and the resulting model will be relatively similar to the one selected by adjusted R squared. The model will be used later to predict a movie’s popularity given its attributes, with popularity measured by the Rotten Tomatoes audience score.
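
Backward elimination by p-value can be automated roughly as below. This is a sketch, not the procedure used to produce the outputs in this report: `backward_p` is a hypothetical helper, the data are simulated, and it relies on `drop1(..., test = "F")`, which tests each term as a whole so that a categorical variable is kept or dropped with all of its levels.

```r
set.seed(1)
n <- 200

# Simulated data: only critics_score truly drives the response.
toy <- data.frame(
  critics_score = runif(n, 0, 100),
  runtime       = rnorm(n, mean = 105, sd = 15),
  noise         = rnorm(n)
)
toy$audience_score <- 30 + 0.5 * toy$critics_score + rnorm(n, sd = 5)

# Hypothetical helper: repeatedly drop the term with the largest p-value
# until every remaining term has a p-value at or below alpha.
backward_p <- function(fit, alpha = 0.05) {
  repeat {
    tab <- drop1(fit, test = "F")
    tab <- tab[rownames(tab) != "<none>", , drop = FALSE]
    if (nrow(tab) == 0) break                      # nothing left to drop
    worst <- which.max(tab[["Pr(>F)"]])
    if (tab[["Pr(>F)"]][worst] <= alpha) break     # all terms significant
    fit <- update(fit, as.formula(paste(". ~ . -", rownames(tab)[worst])))
  }
  fit
}

full  <- lm(audience_score ~ critics_score + runtime + noise, data = toy)
final <- backward_p(full)

# Terms that survived the elimination.
kept <- attr(terms(final), "term.labels")
kept
```

Refitting after each single removal matters because p-values of the remaining terms change once a variable leaves the model.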

Below is the summary of the current model

## 
## Call:
## lm(formula = audience_score ~ genre + mpaa_rating + thtr_rel_year + 
##     thtr_rel_month + thtr_rel_day + dvd_rel_year + dvd_rel_month + 
##     dvd_rel_day + best_pic_nom + best_pic_win + best_actor_win + 
##     best_actress_win + best_dir_win + top200_box + critics_rating + 
##     audience_rating + critics_score, data = feature_film)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.4983  -6.2141   0.6138   6.1294  21.3554 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    270.783365 180.686391   1.499   0.1346    
## genreAnimation                  -1.612296   3.741046  -0.431   0.6667    
## genreArt House & International   2.457320   2.976627   0.826   0.4094    
## genreComedy                      0.080045   1.516017   0.053   0.9579    
## genreDocumentary                11.361346   5.363629   2.118   0.0346 *  
## genreDrama                       0.675057   1.345883   0.502   0.6162    
## genreHorror                     -1.205044   2.293313  -0.525   0.5995    
## genreMusical & Performing Arts   7.616848   3.402323   2.239   0.0256 *  
## genreMystery & Suspense         -0.652780   1.735907  -0.376   0.7070    
## genreOther                       1.650694   2.648408   0.623   0.5334    
## genreScience Fiction & Fantasy  -0.910663   3.377609  -0.270   0.7876    
## mpaa_ratingNC-17               -13.657150   9.363779  -1.459   0.1453    
## mpaa_ratingPG                   -3.936499   2.756223  -1.428   0.1538    
## mpaa_ratingPG-13                -4.431838   2.863201  -1.548   0.1222    
## mpaa_ratingR                    -4.711622   2.780709  -1.694   0.0908 .  
## mpaa_ratingUnrated              -5.992864   3.737399  -1.603   0.1094    
## thtr_rel_year                    0.065641   0.052074   1.261   0.2080    
## thtr_rel_month                  -0.033367   0.111430  -0.299   0.7647    
## thtr_rel_day                    -0.025636   0.043390  -0.591   0.5549    
## dvd_rel_year                    -0.182948   0.112961  -1.620   0.1059    
## dvd_rel_month                    0.166015   0.113289   1.465   0.1434    
## dvd_rel_day                     -0.006103   0.042821  -0.143   0.8867    
## best_pic_nomyes                  6.096324   2.364742   2.578   0.0102 *  
## best_pic_winyes                 -1.336816   4.094849  -0.326   0.7442    
## best_actor_winyes                0.499262   1.077095   0.464   0.6432    
## best_actress_winyes             -0.886144   1.201489  -0.738   0.4611    
## best_dir_winyes                  1.944774   1.531451   1.270   0.2047    
## top200_boxyes                   -0.211540   2.460796  -0.086   0.9315    
## critics_ratingFresh             -0.810431   1.267853  -0.639   0.5230    
## critics_ratingRotten             2.180468   1.885150   1.157   0.2479    
## audience_ratingUpright          27.090926   0.944536  28.682  < 2e-16 ***
## critics_score                    0.238003   0.030144   7.896 1.62e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.879 on 541 degrees of freedom
## Multiple R-squared:  0.8102, Adjusted R-squared:  0.7993 
## F-statistic: 74.48 on 31 and 541 DF,  p-value: < 2.2e-16

There are a few statistical points worth noting.

  1. P-val: Because the p-val of the overall F-test is < 2.2e-16, which is below the 5% significance level, the data provide sufficient evidence that the set of explanatory variables included in the model and the response variable (proxy of a movie’s popularity) are associated.

  2. Multiple R-squared of 0.8102: 81.02% of the variation in the response variable is currently explained by the model.

  3. Estimate of best_pic_nomyes = 6.096: All else held constant, the audience score of movies nominated for best picture is expected to be, on average, about 6.1 points higher than that of movies without a nomination.
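
The three quantities discussed above can be read directly from a fitted model object. The sketch below uses simulated data with made-up effect sizes, so the numbers it produces are unrelated to the report's results; only the extraction calls are the point.

```r
set.seed(3)
n <- 200

# Simulated data with one numeric and one yes/no predictor.
toy <- data.frame(
  critics_score = runif(n, 0, 100),
  best_pic_nom  = factor(rep(c("no", "yes"), times = c(160, 40)))
)
toy$audience_score <- 35 + 0.45 * toy$critics_score +
  6 * (toy$best_pic_nom == "yes") + rnorm(n, sd = 9)

fit <- lm(audience_score ~ critics_score + best_pic_nom, data = toy)
s   <- summary(fit)

# Share of variation in the response explained by the model.
r2 <- s$r.squared

# p-value of the overall F-test, computed from the stored F statistic.
f <- s$fstatistic   # named vector: value, numdf, dendf
p_overall <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

# Estimated difference in audience score for nominated vs non-nominated movies,
# holding critics_score constant.
nom_effect <- coef(fit)[["best_pic_nomyes"]]

c(r2 = r2, p_overall = p_overall, nom_effect = nom_effect)
```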

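Next, we will drop the one variable with the highest p-val that is greater than our chosen significance level of 5%. This time, we will drop top200_box, with a p-val of 0.9315. (genreComedy has a higher p-val, but it is a single level of genre, and we do not drop individual levels of a categorical variable.)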

## 
## Call:
## lm(formula = audience_score ~ genre + mpaa_rating + thtr_rel_year + 
##     thtr_rel_month + thtr_rel_day + dvd_rel_year + dvd_rel_month + 
##     dvd_rel_day + best_pic_nom + best_pic_win + best_actor_win + 
##     best_actress_win + best_dir_win + critics_rating + audience_rating + 
##     critics_score, data = feature_film)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.503  -6.202   0.613   6.098  21.346 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    270.64382  180.51358   1.499   0.1344    
## genreAnimation                  -1.57676    3.71473  -0.424   0.6714    
## genreArt House & International   2.47137    2.96942   0.832   0.4056    
## genreComedy                      0.09294    1.50719   0.062   0.9509    
## genreDocumentary                11.36723    5.35828   2.121   0.0343 *  
## genreDrama                       0.68764    1.33667   0.514   0.6071    
## genreHorror                     -1.19590    2.28875  -0.523   0.6015    
## genreMusical & Performing Arts   7.63674    3.39133   2.252   0.0247 *  
## genreMystery & Suspense         -0.64375    1.73114  -0.372   0.7101    
## genreOther                       1.65588    2.64530   0.626   0.5316    
## genreScience Fiction & Fantasy  -0.91807    3.37342  -0.272   0.7856    
## mpaa_ratingNC-17               -13.61190    9.34041  -1.457   0.1456    
## mpaa_ratingPG                   -3.91803    2.74532  -1.427   0.1541    
## mpaa_ratingPG-13                -4.41083    2.85014  -1.548   0.1223    
## mpaa_ratingR                    -4.68367    2.75911  -1.698   0.0902 .  
## mpaa_ratingUnrated              -5.96166    3.71632  -1.604   0.1093    
## thtr_rel_year                    0.06584    0.05198   1.267   0.2058    
## thtr_rel_month                  -0.03424    0.11086  -0.309   0.7575    
## thtr_rel_day                    -0.02556    0.04334  -0.590   0.5556    
## dvd_rel_year                    -0.18310    0.11284  -1.623   0.1053    
## dvd_rel_month                    0.16576    0.11315   1.465   0.1435    
## dvd_rel_day                     -0.00610    0.04278  -0.143   0.8867    
## best_pic_nomyes                  6.10208    2.36163   2.584   0.0100 *  
## best_pic_winyes                 -1.34954    4.08842  -0.330   0.7415    
## best_actor_winyes                0.49602    1.07545   0.461   0.6448    
## best_actress_winyes             -0.89180    1.19859  -0.744   0.4572    
## best_dir_winyes                  1.94645    1.52992   1.272   0.2038    
## critics_ratingFresh             -0.79402    1.25225  -0.634   0.5263    
## critics_ratingRotten             2.19643    1.87426   1.172   0.2418    
## audience_ratingUpright          27.08699    0.94256  28.738  < 2e-16 ***
## critics_score                    0.23800    0.03012   7.903 1.53e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.871 on 542 degrees of freedom
## Multiple R-squared:  0.8102, Adjusted R-squared:  0.7997 
## F-statistic: 77.11 on 30 and 542 DF,  p-value: < 2.2e-16

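We keep repeating this step, refitting the model after each removal. The outputs below drop, in order, dvd_rel_day, thtr_rel_month, best_pic_win, best_actor_win, critics_rating, and thtr_rel_day.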

## 
## Call:
## lm(formula = audience_score ~ genre + mpaa_rating + thtr_rel_year + 
##     thtr_rel_month + thtr_rel_day + dvd_rel_year + dvd_rel_month + 
##     best_pic_nom + best_pic_win + best_actor_win + best_actress_win + 
##     best_dir_win + critics_rating + audience_rating + critics_score, 
##     data = feature_film)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.4409  -6.1865   0.6426   6.1152  21.4281 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    268.10742  179.47289   1.494   0.1358    
## genreAnimation                  -1.59935    3.70800  -0.431   0.6664    
## genreArt House & International   2.47559    2.96659   0.834   0.4044    
## genreComedy                      0.08906    1.50558   0.059   0.9529    
## genreDocumentary                11.38401    5.35215   2.127   0.0339 *  
## genreDrama                       0.67345    1.33175   0.506   0.6133    
## genreHorror                     -1.19410    2.28665  -0.522   0.6017    
## genreMusical & Performing Arts   7.60868    3.38256   2.249   0.0249 *  
## genreMystery & Suspense         -0.65348    1.72823  -0.378   0.7055    
## genreOther                       1.63023    2.63679   0.618   0.5367    
## genreScience Fiction & Fantasy  -0.94427    3.36537  -0.281   0.7791    
## mpaa_ratingNC-17               -13.64122    9.32972  -1.462   0.1443    
## mpaa_ratingPG                   -3.93287    2.74086  -1.435   0.1519    
## mpaa_ratingPG-13                -4.43486    2.84259  -1.560   0.1193    
## mpaa_ratingR                    -4.70241    2.75349  -1.708   0.0882 .  
## mpaa_ratingUnrated              -5.99319    3.70639  -1.617   0.1065    
## thtr_rel_year                    0.06551    0.05188   1.263   0.2072    
## thtr_rel_month                  -0.03431    0.11076  -0.310   0.7568    
## thtr_rel_day                    -0.02560    0.04330  -0.591   0.5546    
## dvd_rel_year                    -0.18155    0.11222  -1.618   0.1063    
## dvd_rel_month                    0.16640    0.11296   1.473   0.1413    
## best_pic_nomyes                  6.09242    2.35853   2.583   0.0101 *  
## best_pic_winyes                 -1.36850    4.08257  -0.335   0.7376    
## best_actor_winyes                0.49915    1.07425   0.465   0.6424    
## best_actress_winyes             -0.89099    1.19749  -0.744   0.4572    
## best_dir_winyes                  1.94037    1.52795   1.270   0.2047    
## critics_ratingFresh             -0.79674    1.25098  -0.637   0.5245    
## critics_ratingRotten             2.19739    1.87256   1.173   0.2411    
## audience_ratingUpright          27.07630    0.93873  28.844  < 2e-16 ***
## critics_score                    0.23832    0.03001   7.941 1.16e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.863 on 543 degrees of freedom
## Multiple R-squared:  0.8102, Adjusted R-squared:    0.8 
## F-statistic: 79.91 on 29 and 543 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = audience_score ~ genre + mpaa_rating + thtr_rel_year + 
##     thtr_rel_day + dvd_rel_year + dvd_rel_month + best_pic_nom + 
##     best_pic_win + best_actor_win + best_actress_win + best_dir_win + 
##     critics_rating + audience_rating + critics_score, data = feature_film)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.5557  -6.2992   0.6969   6.1080  21.5953 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    267.81028  179.32114   1.493   0.1359    
## genreAnimation                  -1.60310    3.70489  -0.433   0.6654    
## genreArt House & International   2.48171    2.96406   0.837   0.4028    
## genreComedy                      0.08493    1.50427   0.056   0.9550    
## genreDocumentary                11.42839    5.34579   2.138   0.0330 *  
## genreDrama                       0.68591    1.33004   0.516   0.6063    
## genreHorror                     -1.18487    2.28455  -0.519   0.6042    
## genreMusical & Performing Arts   7.58594    3.37895   2.245   0.0252 *  
## genreMystery & Suspense         -0.61498    1.72232  -0.357   0.7212    
## genreOther                       1.68955    2.62765   0.643   0.5205    
## genreScience Fiction & Fantasy  -0.93178    3.36233  -0.277   0.7818    
## mpaa_ratingNC-17               -13.76368    9.31359  -1.478   0.1400    
## mpaa_ratingPG                   -3.95710    2.73747  -1.446   0.1489    
## mpaa_ratingPG-13                -4.43674    2.84022  -1.562   0.1188    
## mpaa_ratingR                    -4.72984    2.74977  -1.720   0.0860 .  
## mpaa_ratingUnrated              -6.01771    3.70247  -1.625   0.1047    
## thtr_rel_year                    0.06500    0.05181   1.255   0.2101    
## thtr_rel_day                    -0.02709    0.04300  -0.630   0.5290    
## dvd_rel_year                    -0.18099    0.11211  -1.614   0.1070    
## dvd_rel_month                    0.17195    0.11144   1.543   0.1234    
## best_pic_nomyes                  5.97099    2.32379   2.570   0.0104 *  
## best_pic_winyes                 -1.29676    4.07261  -0.318   0.7503    
## best_actor_winyes                0.47434    1.07037   0.443   0.6578    
## best_actress_winyes             -0.89801    1.19628  -0.751   0.4532    
## best_dir_winyes                  1.91218    1.52397   1.255   0.2101    
## critics_ratingFresh             -0.79816    1.24993  -0.639   0.5234    
## critics_ratingRotten             2.18873    1.87080   1.170   0.2425    
## audience_ratingUpright          27.07535    0.93794  28.867  < 2e-16 ***
## critics_score                    0.23814    0.02998   7.943 1.14e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.855 on 544 degrees of freedom
## Multiple R-squared:  0.8101, Adjusted R-squared:  0.8004 
## F-statistic:  82.9 on 28 and 544 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = audience_score ~ genre + mpaa_rating + thtr_rel_year + 
##     thtr_rel_day + dvd_rel_year + dvd_rel_month + best_pic_nom + 
##     best_actor_win + best_actress_win + best_dir_win + critics_rating + 
##     audience_rating + critics_score, data = feature_film)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.5640  -6.0803   0.7117   6.1036  21.5949 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    268.76694  179.14809   1.500  0.13413    
## genreAnimation                  -1.60085    3.70183  -0.432  0.66559    
## genreArt House & International   2.48297    2.96161   0.838  0.40218    
## genreComedy                      0.07237    1.50252   0.048  0.96160    
## genreDocumentary                11.40972    5.34106   2.136  0.03311 *  
## genreDrama                       0.68630    1.32894   0.516  0.60577    
## genreHorror                     -1.18024    2.28262  -0.517  0.60533    
## genreMusical & Performing Arts   7.60198    3.37579   2.252  0.02473 *  
## genreMystery & Suspense         -0.62550    1.72058  -0.364  0.71634    
## genreOther                       1.73630    2.62138   0.662  0.50802    
## genreScience Fiction & Fantasy  -0.92079    3.35938  -0.274  0.78412    
## mpaa_ratingNC-17               -13.75163    9.30583  -1.478  0.14005    
## mpaa_ratingPG                   -3.96285    2.73515  -1.449  0.14795    
## mpaa_ratingPG-13                -4.42932    2.83778  -1.561  0.11914    
## mpaa_ratingR                    -4.73159    2.74750  -1.722  0.08561 .  
## mpaa_ratingUnrated              -6.02911    3.69924  -1.630  0.10372    
## thtr_rel_year                    0.06583    0.05170   1.273  0.20343    
## thtr_rel_day                    -0.02743    0.04295  -0.639  0.52337    
## dvd_rel_year                    -0.18232    0.11194  -1.629  0.10395    
## dvd_rel_month                    0.17392    0.11117   1.564  0.11830    
## best_pic_nomyes                  5.66158    2.10913   2.684  0.00749 ** 
## best_actor_winyes                0.49911    1.06666   0.468  0.64003    
## best_actress_winyes             -0.91829    1.19360  -0.769  0.44202    
## best_dir_winyes                  1.77044    1.45631   1.216  0.22462    
## critics_ratingFresh             -0.75505    1.24155  -0.608  0.54334    
## critics_ratingRotten             2.22099    1.86651   1.190  0.23460    
## audience_ratingUpright          27.07608    0.93717  28.891  < 2e-16 ***
## critics_score                    0.23830    0.02995   7.956 1.03e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.848 on 545 degrees of freedom
## Multiple R-squared:  0.8101, Adjusted R-squared:  0.8007 
## F-statistic: 86.11 on 27 and 545 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = audience_score ~ genre + mpaa_rating + thtr_rel_year + 
##     thtr_rel_day + dvd_rel_year + dvd_rel_month + best_pic_nom + 
##     best_actress_win + best_dir_win + critics_rating + audience_rating + 
##     critics_score, data = feature_film)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.6419  -6.0921   0.7699   6.0847  21.5376 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    277.33699  178.08189   1.557  0.11997    
## genreAnimation                  -1.57074    3.69862  -0.425  0.67124    
## genreArt House & International   2.44022    2.95808   0.825  0.40977    
## genreComedy                      0.05111    1.50075   0.034  0.97284    
## genreDocumentary                11.39185    5.33710   2.134  0.03325 *  
## genreDrama                       0.70693    1.32726   0.533  0.59451    
## genreHorror                     -1.22940    2.27857  -0.540  0.58973    
## genreMusical & Performing Arts   7.59775    3.37336   2.252  0.02470 *  
## genreMystery & Suspense         -0.55223    1.71222  -0.323  0.74718    
## genreOther                       1.75252    2.61927   0.669  0.50372    
## genreScience Fiction & Fantasy  -0.96690    3.35553  -0.288  0.77334    
## mpaa_ratingNC-17               -13.78165    9.29895  -1.482  0.13890    
## mpaa_ratingPG                   -3.90189    2.73009  -1.429  0.15351    
## mpaa_ratingPG-13                -4.38114    2.83388  -1.546  0.12269    
## mpaa_ratingR                    -4.69835    2.74462  -1.712  0.08749 .  
## mpaa_ratingUnrated              -5.98004    3.69511  -1.618  0.10616    
## thtr_rel_year                    0.06671    0.05163   1.292  0.19689    
## thtr_rel_day                    -0.02700    0.04291  -0.629  0.52950    
## dvd_rel_year                    -0.18746    0.11132  -1.684  0.09275 .  
## dvd_rel_month                    0.16919    0.11063   1.529  0.12677    
## best_pic_nomyes                  5.78540    2.09096   2.767  0.00585 ** 
## best_actress_winyes             -0.87766    1.18959  -0.738  0.46096    
## best_dir_winyes                  1.80116    1.45378   1.239  0.21590    
## critics_ratingFresh             -0.74177    1.24034  -0.598  0.55006    
## critics_ratingRotten             2.23907    1.86477   1.201  0.23038    
## audience_ratingUpright          27.05451    0.93536  28.924  < 2e-16 ***
## critics_score                    0.23880    0.02991   7.984 8.44e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.842 on 546 degrees of freedom
## Multiple R-squared:   0.81,  Adjusted R-squared:  0.801 
## F-statistic: 89.54 on 26 and 546 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = audience_score ~ genre + mpaa_rating + thtr_rel_year + 
##     thtr_rel_day + dvd_rel_year + dvd_rel_month + best_pic_nom + 
##     best_actress_win + best_dir_win + audience_rating + critics_score, 
##     data = feature_film)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.9289  -5.8651   0.3948   6.4309  21.3742 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    287.79212  177.37559   1.623  0.10527    
## genreAnimation                  -1.95621    3.70019  -0.529  0.59724    
## genreArt House & International   2.26757    2.96245   0.765  0.44434    
## genreComedy                     -0.02711    1.50330  -0.018  0.98562    
## genreDocumentary                11.08484    5.33872   2.076  0.03833 *  
## genreDrama                       0.66127    1.32675   0.498  0.61839    
## genreHorror                     -1.16127    2.28283  -0.509  0.61117    
## genreMusical & Performing Arts   7.81261    3.37853   2.312  0.02112 *  
## genreMystery & Suspense         -0.76191    1.70466  -0.447  0.65508    
## genreOther                       1.55178    2.62259   0.592  0.55430    
## genreScience Fiction & Fantasy  -1.31888    3.35472  -0.393  0.69437    
## mpaa_ratingNC-17               -13.76982    9.29370  -1.482  0.13901    
## mpaa_ratingPG                   -4.11419    2.73131  -1.506  0.13257    
## mpaa_ratingPG-13                -4.72767    2.83234  -1.669  0.09565 .  
## mpaa_ratingR                    -5.03037    2.74454  -1.833  0.06736 .  
## mpaa_ratingUnrated              -6.44485    3.69515  -1.744  0.08170 .  
## thtr_rel_year                    0.07801    0.05006   1.559  0.11968    
## thtr_rel_day                    -0.02326    0.04274  -0.544  0.58654    
## dvd_rel_year                    -0.20224    0.11126  -1.818  0.06965 .  
## dvd_rel_month                    0.17425    0.11077   1.573  0.11629    
## best_pic_nomyes                  6.14388    2.05328   2.992  0.00289 ** 
## best_actress_winyes             -0.71399    1.18734  -0.601  0.54786    
## best_dir_winyes                  1.76966    1.45611   1.215  0.22476    
## audience_ratingUpright          27.12246    0.92924  29.188  < 2e-16 ***
## critics_score                    0.19738    0.01745  11.311  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.859 on 548 degrees of freedom
## Multiple R-squared:  0.8086, Adjusted R-squared:  0.8002 
## F-statistic: 96.43 on 24 and 548 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = audience_score ~ genre + mpaa_rating + thtr_rel_year + 
##     dvd_rel_year + dvd_rel_month + best_pic_nom + best_actress_win + 
##     best_dir_win + audience_rating + critics_score, data = feature_film)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.124  -5.835   0.547   6.368  21.488 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     2.876e+02  1.773e+02   1.622  0.10527    
## genreAnimation                 -1.930e+00  3.697e+00  -0.522  0.60194    
## genreArt House & International  2.166e+00  2.955e+00   0.733  0.46379    
## genreComedy                    -9.287e-04  1.502e+00  -0.001  0.99951    
## genreDocumentary                1.111e+01  5.335e+00   2.082  0.03784 *  
## genreDrama                      6.506e-01  1.326e+00   0.491  0.62380    
## genreHorror                    -1.232e+00  2.278e+00  -0.541  0.58893    
## genreMusical & Performing Arts  7.763e+00  3.375e+00   2.300  0.02182 *  
## genreMystery & Suspense        -7.861e-01  1.703e+00  -0.462  0.64453    
## genreOther                      1.615e+00  2.618e+00   0.617  0.53761    
## genreScience Fiction & Fantasy -1.203e+00  3.346e+00  -0.360  0.71922    
## mpaa_ratingNC-17               -1.399e+01  9.279e+00  -1.508  0.13214    
## mpaa_ratingPG                  -4.117e+00  2.730e+00  -1.508  0.13208    
## mpaa_ratingPG-13               -4.751e+00  2.830e+00  -1.679  0.09380 .  
## mpaa_ratingR                   -5.003e+00  2.742e+00  -1.825  0.06862 .  
## mpaa_ratingUnrated             -6.406e+00  3.692e+00  -1.735  0.08329 .  
## thtr_rel_year                   7.532e-02  4.978e-02   1.513  0.13083    
## dvd_rel_year                   -1.996e-01  1.111e-01  -1.797  0.07287 .  
## dvd_rel_month                   1.758e-01  1.107e-01   1.588  0.11281    
## best_pic_nomyes                 6.132e+00  2.052e+00   2.989  0.00293 ** 
## best_actress_winyes            -7.317e-01  1.186e+00  -0.617  0.53758    
## best_dir_winyes                 1.773e+00  1.455e+00   1.219  0.22355    
## audience_ratingUpright          2.711e+01  9.284e-01  29.202  < 2e-16 ***
## critics_score                   1.972e-01  1.744e-02  11.310  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.854 on 549 degrees of freedom
## Multiple R-squared:  0.8084, Adjusted R-squared:  0.8004 
## F-statistic: 100.7 on 23 and 549 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = audience_score ~ genre + mpaa_rating + thtr_rel_year + 
##     dvd_rel_year + dvd_rel_month + best_pic_nom + best_dir_win + 
##     audience_rating + critics_score, data = feature_film)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.024  -5.842   0.570   6.364  21.571 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    280.98879  176.83690   1.589  0.11264    
## genreAnimation                  -2.03650    3.69137  -0.552  0.58138    
## genreArt House & International   2.06015    2.94802   0.699  0.48496    
## genreComedy                     -0.07417    1.49602  -0.050  0.96048    
## genreDocumentary                11.05413    5.33149   2.073  0.03860 *  
## genreDrama                       0.54179    1.31322   0.413  0.68009    
## genreHorror                     -1.25482    2.27612  -0.551  0.58165    
## genreMusical & Performing Arts   7.76051    3.37322   2.301  0.02179 *  
## genreMystery & Suspense         -0.91620    1.68893  -0.542  0.58771    
## genreOther                       1.54260    2.61422   0.590  0.55538    
## genreScience Fiction & Fantasy  -1.20498    3.34397  -0.360  0.71873    
## mpaa_ratingNC-17               -13.88213    9.27183  -1.497  0.13491    
## mpaa_ratingPG                   -4.14523    2.72763  -1.520  0.12916    
## mpaa_ratingPG-13                -4.77961    2.82822  -1.690  0.09160 .  
## mpaa_ratingR                    -5.00022    2.74078  -1.824  0.06864 .  
## mpaa_ratingUnrated              -6.34568    3.68872  -1.720  0.08594 .  
## thtr_rel_year                    0.07457    0.04974   1.499  0.13435    
## dvd_rel_year                    -0.19557    0.11083  -1.765  0.07818 .  
## dvd_rel_month                    0.17567    0.11060   1.588  0.11279    
## best_pic_nomyes                  5.91029    2.01894   2.927  0.00356 ** 
## best_dir_winyes                  1.73869    1.45327   1.196  0.23206    
## audience_ratingUpright          27.12920    0.92740  29.253  < 2e-16 ***
## critics_score                    0.19688    0.01742  11.303  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.849 on 550 degrees of freedom
## Multiple R-squared:  0.8083, Adjusted R-squared:  0.8006 
## F-statistic: 105.4 on 22 and 550 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = audience_score ~ genre + mpaa_rating + thtr_rel_year + 
##     dvd_rel_year + dvd_rel_month + best_pic_nom + audience_rating + 
##     critics_score, data = feature_film)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.0403  -6.3031   0.5443   6.3357  21.5306 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    292.02450  176.66528   1.653  0.09890 .  
## genreAnimation                  -2.03192    3.69281  -0.550  0.58238    
## genreArt House & International   1.93855    2.94742   0.658  0.51100    
## genreComedy                     -0.12382    1.49603  -0.083  0.93407    
## genreDocumentary                11.03230    5.33355   2.068  0.03906 *  
## genreDrama                       0.49185    1.31307   0.375  0.70812    
## genreHorror                     -1.30926    2.27655  -0.575  0.56546    
## genreMusical & Performing Arts   7.76205    3.37454   2.300  0.02181 *  
## genreMystery & Suspense         -0.90825    1.68958  -0.538  0.59110    
## genreOther                       1.41034    2.61290   0.540  0.58958    
## genreScience Fiction & Fantasy  -1.13438    3.34476  -0.339  0.73463    
## mpaa_ratingNC-17               -13.86782    9.27545  -1.495  0.13546    
## mpaa_ratingPG                   -3.94669    2.72364  -1.449  0.14789    
## mpaa_ratingPG-13                -4.56296    2.82352  -1.616  0.10666    
## mpaa_ratingR                    -4.78269    2.73582  -1.748  0.08099 .  
## mpaa_ratingUnrated              -6.22045    3.68868  -1.686  0.09229 .  
## thtr_rel_year                    0.07156    0.04969   1.440  0.15041    
## dvd_rel_year                    -0.19813    0.11085  -1.787  0.07442 .  
## dvd_rel_month                    0.16598    0.11035   1.504  0.13311    
## best_pic_nomyes                  6.17857    2.00723   3.078  0.00219 ** 
## audience_ratingUpright          27.09866    0.92741  29.220  < 2e-16 ***
## critics_score                    0.19977    0.01726  11.576  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.852 on 551 degrees of freedom
## Multiple R-squared:  0.8078, Adjusted R-squared:  0.8005 
## F-statistic: 110.3 on 21 and 551 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = audience_score ~ genre + mpaa_rating + dvd_rel_year + 
##     dvd_rel_month + best_pic_nom + audience_rating + critics_score, 
##     data = feature_film)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.8066  -6.1768   0.4649   6.5124  20.9130 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    231.87315  171.82337   1.349  0.17773    
## genreAnimation                  -1.39475    3.66977  -0.380  0.70404    
## genreArt House & International   1.59819    2.94079   0.543  0.58704    
## genreComedy                     -0.13578    1.49746  -0.091  0.92779    
## genreDocumentary                11.03158    5.33873   2.066  0.03926 *  
## genreDrama                       0.40104    1.31283   0.305  0.76012    
## genreHorror                     -1.66204    2.26553  -0.734  0.46349    
## genreMusical & Performing Arts   7.70931    3.37762   2.282  0.02284 *  
## genreMystery & Suspense         -1.01801    1.68950  -0.603  0.54705    
## genreOther                       1.04432    2.60304   0.401  0.68843    
## genreScience Fiction & Fantasy  -1.41034    3.34251  -0.422  0.67323    
## mpaa_ratingNC-17               -13.50588    9.28106  -1.455  0.14618    
## mpaa_ratingPG                   -3.81621    2.72478  -1.401  0.16191    
## mpaa_ratingPG-13                -3.87571    2.78561  -1.391  0.16468    
## mpaa_ratingR                    -4.17940    2.70618  -1.544  0.12307    
## mpaa_ratingUnrated              -5.31591    3.63834  -1.461  0.14456    
## dvd_rel_year                    -0.09696    0.08582  -1.130  0.25908    
## dvd_rel_month                    0.17355    0.11033   1.573  0.11630    
## best_pic_nomyes                  6.11497    2.00869   3.044  0.00244 ** 
## audience_ratingUpright          27.15481    0.92749  29.278  < 2e-16 ***
## critics_score                    0.19643    0.01712  11.476  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.861 on 552 degrees of freedom
## Multiple R-squared:  0.8071, Adjusted R-squared:  0.8001 
## F-statistic: 115.5 on 20 and 552 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = audience_score ~ genre + mpaa_rating + dvd_rel_month + 
##     best_pic_nom + audience_rating + critics_score, data = feature_film)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.4645  -5.9354   0.3496   6.4574  21.0573 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     37.78487    2.89991  13.030  < 2e-16 ***
## genreAnimation                  -1.89955    3.64338  -0.521  0.60232    
## genreArt House & International   1.57910    2.94148   0.537  0.59160    
## genreComedy                     -0.10487    1.49759  -0.070  0.94420    
## genreDocumentary                10.89601    5.33872   2.041  0.04173 *  
## genreDrama                       0.40356    1.31316   0.307  0.75872    
## genreHorror                     -1.68100    2.26604  -0.742  0.45851    
## genreMusical & Performing Arts   7.77517    3.37796   2.302  0.02172 *  
## genreMystery & Suspense         -1.05985    1.68951  -0.627  0.53072    
## genreOther                       1.09740    2.60327   0.422  0.67352    
## genreScience Fiction & Fantasy  -1.38865    3.34329  -0.415  0.67804    
## mpaa_ratingNC-17               -13.23366    9.28025  -1.426  0.15443    
## mpaa_ratingPG                   -3.98303    2.72145  -1.464  0.14388    
## mpaa_ratingPG-13                -4.23376    2.76821  -1.529  0.12673    
## mpaa_ratingR                    -4.43763    2.69718  -1.645  0.10048    
## mpaa_ratingUnrated              -6.19086    3.55584  -1.741  0.08223 .  
## dvd_rel_month                    0.17217    0.11035   1.560  0.11928    
## best_pic_nomyes                  6.03954    2.00809   3.008  0.00275 ** 
## audience_ratingUpright          27.27326    0.92178  29.588  < 2e-16 ***
## critics_score                    0.19626    0.01712  11.463  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.863 on 553 degrees of freedom
## Multiple R-squared:  0.8066, Adjusted R-squared:    0.8 
## F-statistic: 121.4 on 19 and 553 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = audience_score ~ genre + mpaa_rating + best_pic_nom + 
##     audience_rating + critics_score, data = feature_film)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.2756  -6.1726   0.4954   6.6826  21.0643 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     38.82065    2.82655  13.734  < 2e-16 ***
## genreAnimation                  -2.12146    3.64531  -0.582  0.56082    
## genreArt House & International   1.55043    2.94522   0.526  0.59881    
## genreComedy                     -0.24183    1.49695  -0.162  0.87172    
## genreDocumentary                11.45243    5.33368   2.147  0.03221 *  
## genreDrama                       0.26224    1.31173   0.200  0.84162    
## genreHorror                     -1.72419    2.26880  -0.760  0.44760    
## genreMusical & Performing Arts   7.45515    3.37609   2.208  0.02764 *  
## genreMystery & Suspense         -1.19925    1.68933  -0.710  0.47807    
## genreOther                       0.96024    2.60515   0.369  0.71257    
## genreScience Fiction & Fantasy  -1.19316    3.34527  -0.357  0.72147    
## mpaa_ratingNC-17               -12.32387    9.27389  -1.329  0.18444    
## mpaa_ratingPG                   -3.88810    2.72429  -1.427  0.15409    
## mpaa_ratingPG-13                -4.13563    2.77108  -1.492  0.13616    
## mpaa_ratingR                    -4.31268    2.69948  -1.598  0.11070    
## mpaa_ratingUnrated              -6.02956    3.55894  -1.694  0.09079 .  
## best_pic_nomyes                  5.96819    2.01016   2.969  0.00312 ** 
## audience_ratingUpright          27.31076    0.92266  29.600  < 2e-16 ***
## critics_score                    0.19686    0.01714  11.487  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.875 on 554 degrees of freedom
## Multiple R-squared:  0.8058, Adjusted R-squared:  0.7995 
## F-statistic: 127.7 on 18 and 554 DF,  p-value: < 2.2e-16

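At this point, we stop eliminating and take this as the final model. The predictors best_pic_nom, audience_rating and critics_score all have p-values below the 5% significance level, and genre has at least one significant level (Documentary and Musical & Performing Arts). Note that no individual mpaa_rating level falls below 5%; its smallest p-value is 0.09, for Unrated.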
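
The elimination procedure used above can be written as a loop. What follows is a minimal sketch under stated assumptions, not the code that produced the output in this report: it runs on R's built-in mtcars data rather than feature_film, the helper name backward_eliminate is hypothetical, and it tests whole terms with drop1(..., test = "F") instead of reading per-coefficient p-values from summary(), so a factor such as genre would be kept or dropped as a unit.

```r
# Backward elimination by p-value (sketch).
# Repeatedly drop the term with the largest p-value until every
# remaining term is below the significance level alpha.
backward_eliminate <- function(model, alpha = 0.05) {
  repeat {
    # Stop if only the intercept is left.
    if (length(attr(terms(model), "term.labels")) == 0) break

    # Term-level F-tests; the first row is "<none>" with an NA p-value.
    tests <- drop1(model, test = "F")
    pvals <- tests[["Pr(>F)"]][-1]
    names(pvals) <- rownames(tests)[-1]

    # Stop once every remaining term is significant.
    if (max(pvals) < alpha) break

    # Otherwise remove the least significant term and refit.
    worst <- names(pvals)[which.max(pvals)]
    model <- update(model, as.formula(paste(". ~ . -", worst)))
  }
  model
}

# Stand-in example on built-in data (assumption: not the movie model).
full  <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
final <- backward_eliminate(full)
formula(final)
```

Testing whole terms avoids the ambiguity of a factor where some levels are significant and others are not, which is the situation genre and mpaa_rating are in here.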


Part 5: Prediction

In this section, we use the final model to predict the Rotten Tomatoes audience score for the 2016 movie La La Land.

From the sources listed below, the movie has an audience score of 81 and the following attributes:
- genre = “Drama”
- mpaa_rating = “PG-13”
- best_pic_nom = “yes”
- audience_rating = “Upright”
- critics_score = 91


Sources: https://www.imdb.com/title/tt3783958/?ref_=tt_rt https://www.rottentomatoes.com/m/la_la_land

##   fit lwr upr
## 1  86  68 104
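
The fitted value can be checked by hand from the final model's coefficient table. The sketch below plugs La La Land's attributes into those estimates; the variable names (coefs, critics_slope, fit) are illustrative and do not come from the original analysis. Action & Adventure and G are the reference levels for genre and mpaa_rating, so only the Drama and PG-13 offsets enter the sum.

```r
# Estimates copied from the final model's coefficient table.
coefs <- c(
  intercept              = 38.82065,
  genreDrama             =  0.26224,
  mpaa_ratingPG13        = -4.13563,
  best_pic_nomyes        =  5.96819,
  audience_ratingUpright = 27.31076
)
critics_slope <- 0.19686  # estimate for critics_score
critics_score <- 91       # La La Land's critics score

# Linear predictor: intercept + level offsets + slope * critics_score.
fit <- sum(coefs) + critics_slope * critics_score
round(fit, 2)
# 86.14
```

This agrees with the rounded fit of 86 reported above.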

At the 95% level, the true audience score of 81 falls within the interval (68, 104) and is 5 points below the predicted value of 86.
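
The interval (68, 104) extends about 18 points either side of the fit, roughly twice the residual standard error of 8.875. That width is what a prediction interval for a single movie would look like; a confidence interval for the mean response would be narrower. Below is a minimal sketch of the difference using predict(), run on R's built-in mtcars data as a stand-in because the movies data is not reproduced here. The object names are illustrative.

```r
# Stand-in model on built-in data (assumption: not the movie model).
fit <- lm(mpg ~ wt + hp, data = mtcars)
new_car <- data.frame(wt = 3, hp = 120)

# Interval for one new observation: includes residual noise.
pred_int <- predict(fit, newdata = new_car,
                    interval = "prediction", level = 0.95)

# Interval for the mean response at these predictor values.
conf_int <- predict(fit, newdata = new_car,
                    interval = "confidence", level = 0.95)

# Compare the widths of the two intervals.
width <- function(m) unname(m[, "upr"] - m[, "lwr"])
c(prediction = width(pred_int), confidence = width(conf_int))
```

For the movie model, the equivalent call would pass the final lm object and a one-row data frame containing genre, mpaa_rating, best_pic_nom, audience_rating and critics_score.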


Part 6: Conclusion

In this analysis, we perform EDA on a dataset of 600+ randomly sampled movies produced and released before 2016, together with attributes such as number of IMDb votes, IMDb rating, genre and runtime. The objective is to explore variables associated with popular movies.

We measure a movie’s popularity using the number of IMDb votes. We then build a multiple linear regression model using backward elimination on p-values. In this model, the response is the movie’s Rotten Tomatoes audience score, and the predictors are genre, MPAA rating, Oscar best picture nomination, Rotten Tomatoes audience rating and critics score.

We then use the model to predict the audience score for the movie La La Land (2016). The true audience score of 81 falls within the 95% interval of (68, 104) and is 5 points below the predicted value of 86.

Last but not least, through our EDA we notice that the majority of movies in the dataset, produced between 1972 and 2014, are dramas. Additionally, drama movies released during the holidays (December) and the summer are more popular than drama movies released during the rest of the year.