Modeling and prediction for movies

1. Introduction
2. Setup
- 2.1 Load packages
- 2.2 Load data
3. Exploratory data analysis
4. Modeling
5. Prediction
6. Conclusion

1. Introduction

Is there an association between a movie's audience score, its critics score, and its IMDB rating? This project examines how much audiences and critics like movies and identifies other movie attributes associated with the audience score. The dataset is a random sample of movies collected from Rotten Tomatoes and IMDB in the US. Because the data are observational, the analysis can describe associations but cannot establish causal relationships.

Variables:

imdb_rating: Rating on IMDB
imdb_num_votes: Number of votes on IMDB
genre: Genre of movie (Action & Adventure, Animation, Art House & International, Comedy, Documentary, Drama, Horror, Musical & Performing Arts, Mystery & Suspense, Other, Science Fiction & Fantasy)
audience_score: Audience score on Rotten Tomatoes
audience_rating: Categorical variable for audience rating on Rotten Tomatoes (Spilled, Upright)
critics_score: Critics score on Rotten Tomatoes
critics_rating: Categorical variable for critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten)

Movie production and advertising companies will be interested in this information. They want to know whether audience scores and critics scores differ, and what kinds of movies attract a larger audience.

2. Setup

2.1 Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(GGally)

2.2 Load data

load("movies.Rdata")

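Scope of Inference: This dataset includes information from Rotten Tomatoes and IMDB for a random sample of movies. Because the sample is random, the results can be generalized to the population of movies it was drawn from. Because the study is observational, it can show association but not causation.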

3. Exploratory data analysis

dim(movies)

## [1] 651  32

The movies data has 651 observations and 32 variables. We will look at str(movies) and head(movies) for more detail.

str(movies)

## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  32 variables:
##  $ title           : chr  "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ thtr_rel_year   : num  2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num  4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num  19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num  2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num  7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num  30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num  73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ director        : chr  "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr  "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr  "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr  "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr  "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr  "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr  "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr  "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...

head(movies)

## # A tibble: 6 x 32
##   title title_type genre runtime mpaa_rating studio thtr_rel_year
##   <chr> <fct>      <fct>   <dbl> <fct>       <fct>          <dbl>
## 1 Fill~ Feature F~ Drama      80 R           Indom~          2013
## 2 The ~ Feature F~ Drama     101 PG-13       Warne~          2001
## 3 Wait~ Feature F~ Come~      84 R           Sony ~          1996
## 4 The ~ Feature F~ Drama     139 PG          Colum~          1993
## 5 Male~ Feature F~ Horr~      90 R           Ancho~          2004
## 6 Old ~ Documenta~ Docu~      78 Unrated     Shcal~          2009
## # ... with 25 more variables: thtr_rel_month <dbl>, thtr_rel_day <dbl>,
## #   dvd_rel_year <dbl>, dvd_rel_month <dbl>, dvd_rel_day <dbl>,
## #   imdb_rating <dbl>, imdb_num_votes <int>, critics_rating <fct>,
## #   critics_score <dbl>, audience_rating <fct>, audience_score <dbl>,
## #   best_pic_nom <fct>, best_pic_win <fct>, best_actor_win <fct>,
## #   best_actress_win <fct>, best_dir_win <fct>, top200_box <fct>,
## #   director <chr>, actor1 <chr>, actor2 <chr>, actor3 <chr>,
## #   actor4 <chr>, actor5 <chr>, imdb_url <chr>, rt_url <chr>

To simplify the EDA process, we will create a function called “complete” that selects the given columns and keeps only the observations with no missing value in any of them.

complete <- function(...) {
  study <- movies %>%
    select(...)
  
  return(study[complete.cases(study),])
}
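As a hedged illustration of what this helper does, the sketch below applies the same select-then-filter idea to a small made-up data frame. The data frame toy, its values, and the function name complete_from are assumptions introduced for illustration only. Unlike complete, this variant takes the data frame and the column names as arguments instead of reading movies from the global environment.

```r
# Toy data standing in for `movies` (values are made up for illustration).
toy <- data.frame(
  audience_score = c(73, 81, NA, 76),
  critics_score  = c(45, 96, 91, NA),
  imdb_rating    = c(5.5, 7.3, 7.6, 7.2)
)

# Hypothetical variant of complete(): select columns by name, then keep
# only the rows that have a value in every selected column.
complete_from <- function(df, cols) {
  study <- df[, cols, drop = FALSE]
  study[complete.cases(study), , drop = FALSE]
}

kept <- complete_from(toy, c("audience_score", "critics_score"))
nrow(kept)  # rows 3 and 4 each contain an NA and are dropped
```

Passing the data frame explicitly makes the helper reusable on other data sets, at the cost of slightly more typing than the original version.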

Prepare data for analysis

data <- complete(audience_score, critics_score, imdb_rating, audience_rating)

head(data)

## # A tibble: 6 x 4
##   audience_score critics_score imdb_rating audience_rating
##            <dbl>         <dbl>       <dbl> <fct>          
## 1             73            45         5.5 Upright        
## 2             81            96         7.3 Upright        
## 3             91            91         7.6 Upright        
## 4             76            80         7.2 Upright        
## 5             27            33         5.1 Spilled        
## 6             86            91         7.8 Upright

In these first six rows, the audience and critics scores for the same movie can differ considerably (for example, 73 versus 45 for the first movie), which suggests that critics and audiences do not always score movies the same way.

First, we will focus on audience_score and critics_score to see whether these variables are associated. We will summarize this relationship with the following plot and correlation.

ggplot(data, aes(x = critics_score, y = audience_score)) +
  geom_point()

data %>%
  summarize(cor(audience_score, critics_score))

## # A tibble: 1 x 1
##   `cor(audience_score, critics_score)`
##                                  <dbl>
## 1                                0.704

The correlation between audience score and critics score is 0.704, which indicates a fairly strong positive linear association.
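To make the correlation coefficient more concrete, here is a minimal sketch that computes Pearson's r by hand and compares it with cor(). The vectors x and y are simulated and are assumptions for illustration, not the movie data. The sketch also checks a standard identity: in a simple linear regression, the squared correlation equals R-squared.

```r
# Simulated data (an assumption for illustration; not the movies data).
set.seed(1)
x <- rnorm(100)
y <- 0.7 * x + rnorm(100)

# Pearson correlation from its definition:
# sum of cross-deviations over the product of the deviation norms.
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_builtin <- cor(x, y)

# In a one-predictor linear model, R-squared is the squared correlation.
r_squared <- summary(lm(y ~ x))$r.squared

all.equal(r_manual, r_builtin)
all.equal(r_builtin^2, r_squared)
```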

We will add a fitted linear model to the plot to explore the relationship between audience_score and critics_score

ggplot(data, aes(x = critics_score, y = audience_score)) +
  geom_point() +
  stat_smooth(method = 'lm', se = FALSE)

Next, we will visualize the relationship between audience_score and imdb_rating

ggplot(data, aes(x = imdb_rating, y = audience_score)) +
  geom_point()

data %>%
  summarize(cor(audience_score, imdb_rating))

## # A tibble: 1 x 1
##   `cor(audience_score, imdb_rating)`
##                                <dbl>
## 1                              0.865

We will add a linear model to the plot

ggplot(data, aes(x = imdb_rating, y = audience_score)) +
  geom_point() +
  stat_smooth(method = 'lm', se = FALSE)

The plot shows a strong positive relationship between audience_score and imdb_rating, consistent with the correlation of 0.865.

To check whether critics_score and imdb_rating are associated, we will plot these two variables against each other

ggplot(data, aes(x = imdb_rating, y = critics_score)) +
  geom_point()

data %>%
  summarize(cor(imdb_rating, critics_score))

## # A tibble: 1 x 1
##   `cor(imdb_rating, critics_score)`
##                               <dbl>
## 1                             0.765

Finally, we will plot the pairwise relationships among the IMDB, audience and critics variables to get an overview of how they relate

ggpairs(movies, columns = 13:18)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can see strong associations between audience_score and both imdb_rating and critics_score. The correlation coefficient between audience_score and imdb_rating is 0.865, between audience_score and critics_score it is 0.704, and between critics_score and imdb_rating it is 0.765. Because critics_score and imdb_rating are themselves strongly correlated, including both as predictors would contribute largely redundant information and complicate model estimation, so we do not expect to keep critics_score as a predictor.
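One common way to quantify this kind of redundancy among predictors is the variance inflation factor (VIF), which is 1 / (1 - R²) from regressing one predictor on the others. The sketch below is a minimal base-R illustration on simulated data. The variables imdb_like, critics_like and other, and the helper vif_of, are hypothetical names introduced here and are not part of the original analysis.

```r
# Simulated predictors (assumptions for illustration; not the movies data).
set.seed(42)
n <- 200
imdb_like    <- rnorm(n)
critics_like <- 0.8 * imdb_like + rnorm(n, sd = 0.6)  # correlated with imdb_like
other        <- rnorm(n)                              # unrelated predictor

# Hypothetical helper: VIF of `target` given a data frame of other predictors.
vif_of <- function(target, others) {
  d  <- data.frame(target = target, others)
  r2 <- summary(lm(target ~ ., data = d))$r.squared
  1 / (1 - r2)
}

vif_critics <- vif_of(critics_like, data.frame(imdb_like, other))
vif_other   <- vif_of(other, data.frame(imdb_like, critics_like))

c(critics_like = vif_critics, other = vif_other)
```

A VIF close to 1 means a predictor is nearly independent of the others; larger values mean its coefficient is estimated less precisely because of the overlap.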

4. Modeling

Based on the research question, we first focus on the relationship between audience_score and imdb_rating. The plots above suggest that a linear model is appropriate for this relationship. However, we have not yet considered other variables that may affect audience_score. In this section, we use backward elimination to pick predictors, starting from the full model. The full model uses title_type, genre, runtime, mpaa_rating, thtr_rel_year, thtr_rel_month, imdb_rating, imdb_num_votes, critics_score, critics_rating and audience_rating as predictors and audience_score as the response variable

modeling <- complete(title_type, genre, runtime, 
         mpaa_rating, thtr_rel_year, thtr_rel_month, 
         imdb_rating, imdb_num_votes, critics_score,
         critics_rating, audience_rating, audience_score)

fit0 <- lm(audience_score ~ title_type + genre + runtime + 
         mpaa_rating + thtr_rel_year + thtr_rel_month + 
         imdb_rating + imdb_num_votes + critics_score +
         critics_rating + audience_rating, data = modeling)
summary(fit0)

## 
## Call:
## lm(formula = audience_score ~ title_type + genre + runtime + 
##     mpaa_rating + thtr_rel_year + thtr_rel_month + imdb_rating + 
##     imdb_num_votes + critics_score + critics_rating + audience_rating, 
##     data = modeling)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.5232  -4.6138   0.5353   4.1824  24.6331 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     8.011e+01  5.911e+01   1.355   0.1758    
## title_typeFeature Film          2.309e+00  2.571e+00   0.898   0.3695    
## title_typeTV Movie              4.797e-01  4.031e+00   0.119   0.9053    
## genreAnimation                  3.136e+00  2.715e+00   1.155   0.2486    
## genreArt House & International -2.414e+00  2.099e+00  -1.150   0.2506    
## genreComedy                     1.647e+00  1.146e+00   1.437   0.1513    
## genreDocumentary                2.603e+00  2.753e+00   0.945   0.3448    
## genreDrama                     -4.092e-01  1.008e+00  -0.406   0.6849    
## genreHorror                    -1.767e+00  1.723e+00  -1.025   0.3056    
## genreMusical & Performing Arts  3.959e+00  2.364e+00   1.675   0.0945 .  
## genreMystery & Suspense        -2.920e+00  1.288e+00  -2.267   0.0237 *  
## genreOther                     -2.399e-01  1.958e+00  -0.123   0.9025    
## genreScience Fiction & Fantasy -5.009e-01  2.457e+00  -0.204   0.8385    
## runtime                        -2.550e-02  1.683e-02  -1.516   0.1302    
## mpaa_ratingNC-17               -1.045e+00  5.208e+00  -0.201   0.8411    
## mpaa_ratingPG                   1.333e-01  1.898e+00   0.070   0.9440    
## mpaa_ratingPG-13               -4.960e-01  1.994e+00  -0.249   0.8036    
## mpaa_ratingR                   -6.981e-01  1.906e+00  -0.366   0.7142    
## mpaa_ratingUnrated              4.893e-01  2.228e+00   0.220   0.8263    
## thtr_rel_year                  -4.444e-02  2.930e-02  -1.516   0.1299    
## thtr_rel_month                 -1.266e-01  7.871e-02  -1.609   0.1082    
## imdb_rating                     9.463e+00  4.868e-01  19.437   <2e-16 ***
## imdb_num_votes                  4.608e-06  3.122e-06   1.476   0.1404    
## critics_score                   4.970e-03  2.528e-02   0.197   0.8442    
## critics_ratingFresh            -2.325e-01  8.783e-01  -0.265   0.7913    
## critics_ratingRotten           -1.073e+00  1.406e+00  -0.763   0.4456    
## audience_ratingUpright          1.995e+01  7.900e-01  25.258   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.854 on 623 degrees of freedom
## Multiple R-squared:  0.8899, Adjusted R-squared:  0.8853 
## F-statistic: 193.6 on 26 and 623 DF,  p-value: < 2.2e-16

Although the previous section argued against using critics_score, we still include it in fit0 to see whether it is significant; its p-value of 0.8442 indicates that it is not. Next, we refit the model with each predictor dropped in turn and compare the adjusted R-squared of each reduced model with the full model's 0.8853. A predictor whose removal lowers the adjusted R-squared noticeably is worth keeping.

# Drop title_type
fit1 <- lm(audience_score ~ genre + runtime + 
         mpaa_rating + thtr_rel_year + thtr_rel_month + 
         imdb_rating + imdb_num_votes + critics_score +
         critics_rating + audience_rating, data = modeling)
summary(fit1)$adj.r.squared

## [1] 0.8854254

#Drop genre
fit2 <- lm(audience_score ~ title_type + runtime + 
         mpaa_rating + thtr_rel_year + thtr_rel_month + 
         imdb_rating + imdb_num_votes + critics_score +
         critics_rating + audience_rating, data = modeling)
summary(fit2)$adj.r.squared

## [1] 0.8831919

#Drop runtime
fit3 <- lm(audience_score ~ title_type + genre + 
         mpaa_rating + thtr_rel_year + thtr_rel_month + 
         imdb_rating + imdb_num_votes + critics_score +
         critics_rating + audience_rating, data = modeling)
summary(fit3)$adj.r.squared

## [1] 0.8850253

#Drop mpaa_rating
fit4 <- lm(audience_score ~ title_type + genre + runtime + 
         thtr_rel_year + thtr_rel_month + 
         imdb_rating + imdb_num_votes + critics_score +
         critics_rating + audience_rating, data = modeling)
summary(fit4)$adj.r.squared

## [1] 0.8858737

#Drop thtr_rel_year
fit5 <- lm(audience_score ~ title_type + genre + runtime + 
         mpaa_rating + thtr_rel_month + 
         imdb_rating + imdb_num_votes + critics_score +
         critics_rating + audience_rating, data = modeling)
summary(fit5)$adj.r.squared

## [1] 0.8850248

#Drop thtr_rel_month
fit6 <- lm(audience_score ~ title_type + genre + runtime + 
         mpaa_rating + thtr_rel_year + 
         imdb_rating + imdb_num_votes + critics_score +
         critics_rating + audience_rating, data = modeling)
summary(fit6)$adj.r.squared

## [1] 0.8849719

#Drop imdb_rating
fit7 <- lm(audience_score ~ title_type + genre + runtime + 
         mpaa_rating + thtr_rel_year + thtr_rel_month + 
         imdb_num_votes + critics_score +
         critics_rating + audience_rating, data = modeling)
summary(fit7)$adj.r.squared

## [1] 0.8159814

#Drop imdb_num_votes
fit8 <- lm(audience_score ~ title_type + genre + runtime + 
         mpaa_rating + thtr_rel_year + thtr_rel_month + 
         imdb_rating + critics_score +
         critics_rating + audience_rating, data = modeling)
summary(fit8)$adj.r.squared

## [1] 0.8850469

#Drop critics_score
fit9 <- lm(audience_score ~ title_type + genre + runtime + 
         mpaa_rating + thtr_rel_year + thtr_rel_month + 
         imdb_rating + imdb_num_votes + 
         critics_rating + audience_rating, data = modeling)
summary(fit9)$adj.r.squared

## [1] 0.8854406

#Drop critics_rating
fit10 <- lm(audience_score ~ title_type + genre + runtime + 
         mpaa_rating + thtr_rel_year + thtr_rel_month + 
         imdb_rating + imdb_num_votes + critics_score +
         audience_rating, data = modeling)
summary(fit10)$adj.r.squared

## [1] 0.8855139

#Drop audience_rating
fit11 <- lm(audience_score ~ title_type + genre + runtime + 
         mpaa_rating + thtr_rel_year + thtr_rel_month + 
         imdb_rating + imdb_num_votes + critics_score +
         critics_rating, data = modeling)
summary(fit11)$adj.r.squared

## [1] 0.7681425
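The drop-one comparisons above repeat the same pattern eleven times, and that pattern can be sketched as a loop. The example below is a hedged illustration on R's built-in mtcars data with an arbitrary set of predictors, both of which are assumptions chosen so the code runs without movies.Rdata. It uses update() to remove one term at a time and collects the adjusted R-squared of each reduced model.

```r
# Full model on a built-in data set (mtcars and these predictors are
# stand-ins chosen for illustration; they are not the movies model).
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Names of the terms in the full model.
predictors <- attr(terms(full), "term.labels")

# Refit with each term removed in turn and record adjusted R-squared.
adj_r2_dropped <- sapply(predictors, function(p) {
  reduced <- update(full, as.formula(paste(". ~ . -", p)))
  summary(reduced)$adj.r.squared
})

# Compare every reduced model against the full model.
result <- c(full = summary(full)$adj.r.squared, adj_r2_dropped)
round(result, 4)
```

Terms whose removal leaves the adjusted R-squared unchanged or higher are candidates for elimination. Base R's drop1() offers a related built-in comparison, although it reports AIC by default and not adjusted R-squared.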

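Dropping imdb_rating or audience_rating lowers the adjusted R-squared sharply (to 0.8160 and 0.7681), and dropping genre lowers it to 0.8832. Dropping any other single predictor changes it by less than 0.001. We therefore keep imdb_rating, audience_rating and genre as predictors and look at each of them in turn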

fit_imdb <- lm(audience_score ~ imdb_rating, data = modeling)
summary(fit_imdb)

## 
## Call:
## lm(formula = audience_score ~ imdb_rating, data = modeling)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.805  -6.550   0.676   5.676  52.912 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -42.3748     2.4205  -17.51   <2e-16 ***
## imdb_rating  16.1321     0.3678   43.87   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.16 on 648 degrees of freedom
## Multiple R-squared:  0.7481, Adjusted R-squared:  0.7477 
## F-statistic:  1924 on 1 and 648 DF,  p-value: < 2.2e-16

fit_rating <- lm(audience_score ~ audience_rating, data = modeling)
summary(fit_rating)

## 
## Call:
## lm(formula = audience_score ~ audience_rating, data = modeling)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.9345  -7.3173   0.6827   8.6827  19.6827 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             41.9345     0.6136   68.34   <2e-16 ***
## audience_ratingUpright  35.3828     0.8079   43.80   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.18 on 648 degrees of freedom
## Multiple R-squared:  0.7475, Adjusted R-squared:  0.7471 
## F-statistic:  1918 on 1 and 648 DF,  p-value: < 2.2e-16

fit_gen <- lm(audience_score ~ genre, data = modeling)
summary(fit_gen)

## 
## Call:
## lm(formula = audience_score ~ genre, data = modeling)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.348 -13.348   1.215  13.652  41.051 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      53.785      2.270  23.690  < 2e-16 ***
## genreAnimation                    8.660      6.510   1.330   0.1839    
## genreArt House & International   10.215      5.393   1.894   0.0587 .  
## genreComedy                      -1.279      3.001  -0.426   0.6701    
## genreDocumentary                 29.176      3.424   8.521  < 2e-16 ***
## genreDrama                       11.563      2.501   4.624 4.56e-06 ***
## genreHorror                      -7.959      4.441  -1.792   0.0736 .  
## genreMusical & Performing Arts   26.382      5.751   4.587 5.40e-06 ***
## genreMystery & Suspense           2.165      3.291   0.658   0.5110    
## genreOther                       12.903      5.108   2.526   0.0118 *  
## genreScience Fiction & Fantasy   -2.896      6.510  -0.445   0.6566    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.3 on 639 degrees of freedom
## Multiple R-squared:  0.1943, Adjusted R-squared:  0.1817 
## F-statistic: 15.41 on 10 and 639 DF,  p-value: < 2.2e-16

For genre as the only predictor, the Documentary, Drama and Musical & Performing Arts coefficients are the most significant; each is measured relative to the reference genre, Action & Adventure. We will rewrite the linear model for the response variable as follows:

fit_all <- lm(audience_score ~ genre + imdb_rating + audience_rating, data = modeling)
summary(fit_all)

## 
## Call:
## lm(formula = audience_score ~ genre + imdb_rating + audience_rating, 
##     data = modeling)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.6395  -4.4288   0.5889   4.2970  25.0845 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -12.5605     2.1955  -5.721 1.63e-08 ***
## genreAnimation                   3.6228     2.4513   1.478  0.13991    
## genreArt House & International  -2.7912     2.0320  -1.374  0.17005    
## genreComedy                      1.5109     1.1269   1.341  0.18050    
## genreDocumentary                 0.6003     1.3696   0.438  0.66130    
## genreDrama                      -0.8339     0.9589  -0.870  0.38481    
## genreHorror                     -1.6199     1.6693  -0.970  0.33222    
## genreMusical & Performing Arts   2.5416     2.1899   1.161  0.24625    
## genreMystery & Suspense         -3.2744     1.2462  -2.627  0.00881 ** 
## genreOther                       0.2743     1.9251   0.142  0.88675    
## genreScience Fiction & Fantasy   0.2559     2.4406   0.105  0.91652    
## imdb_rating                      9.8028     0.3689  26.571  < 2e-16 ***
## audience_ratingUpright          20.3180     0.7746  26.231  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.86 on 637 degrees of freedom
## Multiple R-squared:  0.8872, Adjusted R-squared:  0.8851 
## F-statistic: 417.5 on 12 and 637 DF,  p-value: < 2.2e-16

anova(fit_all)

## Analysis of Variance Table
## 
## Response: audience_score
##                  Df Sum Sq Mean Sq F value    Pr(>F)    
## genre            10  51633    5163  109.72 < 2.2e-16 ***
## imdb_rating       1 151738  151738 3224.33 < 2.2e-16 ***
## audience_rating   1  32379   32379  688.04 < 2.2e-16 ***
## Residuals       637  29977      47                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

At the beginning, we thought that critics_score had a very strong relationship with audience_score, but the model comparison shows that it adds little once imdb_rating, audience_rating and genre are in the model, and the ANOVA table shows that each of these three terms is significant. Next, we will look at the regression coefficients of the model

coefficients(fit_all)

##                    (Intercept)                 genreAnimation 
##                    -12.5605354                      3.6228430 
## genreArt House & International                    genreComedy 
##                     -2.7911586                      1.5108717 
##               genreDocumentary                     genreDrama 
##                      0.6003104                     -0.8339436 
##                    genreHorror genreMusical & Performing Arts 
##                     -1.6198585                      2.5415754 
##        genreMystery & Suspense                     genreOther 
##                     -3.2743845                      0.2742760 
## genreScience Fiction & Fantasy                    imdb_rating 
##                      0.2559299                      9.8028449 
##         audience_ratingUpright 
##                     20.3180279

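Intercept: -12.56053 is the estimated audience_score for a movie in the reference genre (Action & Adventure) with audience_rating Spilled and imdb_rating 0. An imdb_rating of 0 is outside the range of the data, so the intercept only anchors the model and has no practical interpretation.
imdb_rating: holding genre and audience_rating constant, the estimated audience_score increases by 9.8028449 points for each 1-point increase in imdb_rating.
genre: this is a tricky variable. Each genre coefficient is the estimated difference in audience_score between that genre and the reference genre, Action & Adventure, holding the other predictors constant, so the score goes up or down depending on the genre.
audience_rating: this variable is categorical, so holding the other predictors constant, movies rated Upright have an estimated audience_score 20.3180279 points higher than movies rated Spilled.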
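To show why one genre never appears among the coefficients, here is a small sketch of how R encodes a factor in a linear model. The factor below is a made-up four-movie example, an assumption for illustration. Its first level is absorbed into the intercept, and every other level gets an indicator column whose coefficient is read relative to that first level.

```r
# A made-up genre factor (illustration only; not the movies data).
genre <- factor(c("Action & Adventure", "Comedy", "Drama", "Comedy"))

# The first level is the reference level used by lm().
reference <- levels(genre)[1]

# Design matrix: an intercept plus one 0/1 column per non-reference level.
X <- model.matrix(~ genre)
colnames(X)

# relevel() changes which level serves as the reference.
genre_drama_ref <- relevel(genre, ref = "Drama")
colnames(model.matrix(~ genre_drama_ref))
```

Changing the reference level changes the individual genre coefficients and the intercept, but not the fitted values of the model.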

Lastly, we will perform diagnostics for the multiple linear regression (MLR) model

fit_all <- lm(audience_score ~ genre + imdb_rating + audience_rating, data = modeling)

#Check linearity
ggplot(fit_all, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 'dashed') +
  xlab('Fitted Values') +
  ylab('Residuals')

#Check normal distribution via histogram
ggplot(fit_all, aes(x = .resid)) +
  geom_histogram(bins = 30) +
  xlab('Residuals')

#or via QQ-plot
ggplot(fit_all, aes(sample = .resid)) +
  stat_qq()

5. Prediction

We build a test case for the movie “Aquaman (2018)” using data gathered from IMDB (imdb_rating = 7.5) and the Rotten Tomatoes website (audience_score = 80). The test case is stored in the data frame newdata, and the rounded prediction is stored in the variable aquaman, using the following code

newdata <- data.frame(imdb_rating = 7.5, audience_score = 80, audience_rating = 'Upright', genre = "Science Fiction & Fantasy")

aquaman <- round(predict(fit_all, newdata), digits = 0)

c(aquaman , newdata$audience_score)

##  1    
## 82 80
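The same prediction can be reproduced by hand from the coefficients printed by coefficients(fit_all), which shows how each term contributes. The sketch below hard-codes the four relevant coefficients from that output; the names in the vector b are labels chosen here for readability and are not names used by the fitted model.

```r
# Coefficients copied from the coefficients(fit_all) output.
# The element names are labels introduced for this sketch.
b <- c(
  intercept   = -12.5605354,
  genre_scifi =   0.2559299,  # genreScience Fiction & Fantasy
  imdb_rating =   9.8028449,
  upright     =  20.3180279   # audience_ratingUpright
)

# Aquaman: imdb_rating = 7.5, genre Science Fiction & Fantasy, Upright.
# Indicator terms contribute their coefficient once; imdb_rating scales.
pred <- b[["intercept"]] + b[["genre_scifi"]] +
  b[["imdb_rating"]] * 7.5 + b[["upright"]]

round(pred, 2)
```

The result agrees with the value returned by predict(), which rounds to 82.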

The prediction (82) is slightly higher than the actual audience_score (80). We also construct a prediction interval around this prediction to quantify its uncertainty

predict(fit_all, newdata, interval = 'prediction', level = 0.95)

##        fit      lwr      upr
## 1 81.53476 67.30074 95.76878

We are 95% confident that the audience_score of Aquaman falls between 67.30 and 95.77. The actual score of 80 lies within this interval.

6. Conclusion

This project uses the movies dataset to determine whether there is an association between audience_score and critics_score, and the answer is yes (correlation 0.704). However, the exploratory data analysis and modeling show that genre, audience_rating and imdb_rating are the predictors most strongly associated with audience_score, while critics_score adds little once they are included.