The data set comprises 651 randomly sampled movies produced and released before 2016. It is provided below and includes information from Rotten Tomatoes and IMDB for each movie in the sample.
The study is purely observational: with no random assignment, no causal claims are possible. Even so, the exercise can be interesting and reveal some aspects of the world of cinema.
The codebook contains 32 variables, ranging from the film's title to Oscar nominations and scores from the review sites mentioned above.
Below is the output of the 'str' function, showing each codebook variable and its type.
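The code that produced this output is not shown in the original; a minimal sketch, assuming the data ships as 'movies.Rdata' with a data frame named 'movies' (as in the course materials), and using the packages named in the knit warnings:
#Load packages and data, then inspect the structure of the codebook variables
library(ggplot2)
library(dplyr)
library(GGally)
load("movies.Rdata")
str(movies)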
## tibble [651 x 32] (S3: tbl_df/tbl/data.frame)
## $ title : chr [1:651] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ thtr_rel_year : num [1:651] 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num [1:651] 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num [1:651] 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num [1:651] 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num [1:651] 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num [1:651] 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ director : chr [1:651] "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr [1:651] "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr [1:651] "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr [1:651] "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr [1:651] "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr [1:651] "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr [1:651] "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr [1:651] "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
Random sampling was used, and the fairly large number of films in the sample suggests the analysis generalizes to some degree, although, as mentioned above, this project makes no claim of causality.
Our research question concerns what is associated with the variable 'critics_score'. Several of the columns shown by 'str' above are not really explanatory variables, such as the Rotten Tomatoes URLs or the actors' names. A function can be seen as a relation between two sets in which each value of y depends on a value of x; this dependency relationship is the main characteristic of a function, and it is what we want to model. We chose 'critics_score' as y; the candidate predictors are discussed below, and some columns were not even considered because they are not really variables.
So our research problem is to find out which variables in the sample are best associated with critics_score in a multiple linear regression model.
NOTE: we think it necessary, before running the regressions, to justify the exclusions below.
Variables not considered:
-'title', 'director', 'actor1' through 'actor5': too many distinct values (almost one level per film);
-'title_type': risk of collinearity with 'genre';
-'imdb_url' and 'rt_url': do not apply;
-Oscar win variables ('best_pic_win', 'best_actor_win', 'best_actress_win', 'best_dir_win'): excluded because there are few 'yes' cases; we kept only the nomination variable 'best_pic_nom'.
We also left out the following variables:
-thtr_rel_year: Year the movie is released in theaters;
-thtr_rel_month: Month the movie is released in theaters;
-thtr_rel_day: Day of the month the movie is released in theaters;
-dvd_rel_year: Year the movie is released on DVD;
-dvd_rel_month: Month the movie is released on DVD;
-dvd_rel_day: Day of the month the movie is released on DVD; and
-mpaa_rating: MPAA rating of the movie (G, PG, PG-13, R, Unrated)
We judged the release year, month, and day-of-month variables to be essentially random with respect to critics' scores, so we thought it better to exclude them; given the nature of the exercise, we preferred to do something leaner.
The variables included, in addition to y = ‘critics_score’, were:
'audience_score', 'imdb_rating', 'audience_rating', 'genre', 'studio', 'best_pic_nom', and 'top200_box'.
To simplify the EDA process, we create a function called "complete" that keeps only observations with a value for every variable.
We then prepare the data for analysis, creating q1 with the variables that can be analyzed, and run a summary:
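The preparation code is not shown in the original; a minimal sketch, assuming the 'complete' helper described above keeps rows with no missing values:
#Keep only complete cases for the selected variables
complete <- function(df) df[complete.cases(df), ]
q1 <- movies %>%
  select(audience_score, critics_score, imdb_rating, audience_rating,
         genre, studio, best_pic_nom, top200_box) %>%
  complete()
summary(q1)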
## audience_score critics_score imdb_rating audience_rating
## Min. :11.00 Min. : 1.00 Min. :1.900 Spilled:269
## 1st Qu.:46.00 1st Qu.: 33.00 1st Qu.:5.900 Upright:374
## Median :65.00 Median : 61.00 Median :6.600
## Mean :62.51 Mean : 57.82 Mean :6.501
## 3rd Qu.:80.00 3rd Qu.: 83.00 3rd Qu.:7.300
## Max. :97.00 Max. :100.00 Max. :9.000
##
## genre studio best_pic_nom
## Drama :303 Paramount Pictures : 37 no :621
## Comedy : 86 Warner Bros. Pictures : 30 yes: 22
## Action & Adventure: 64 Sony Pictures Home Entertainment: 27
## Mystery & Suspense: 59 Universal Pictures : 23
## Documentary : 52 Warner Home Video : 19
## Horror : 22 20th Century Fox : 18
## (Other) : 57 (Other) :489
## top200_box
## no :628
## yes: 15
First, we make a histogram of our response variable, critics_score.
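The histogram code is not shown; a possible version (the binwidth is an assumption):
#Histogram of the response variable
ggplot(q1, aes(x = critics_score)) +
  geom_histogram(binwidth = 5) +
  xlab("Critics score")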
Next, we focus on audience_score and critics_score to see whether there is any association between these variables. We summarize this relationship with a plot and the correlation below.
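A sketch of this step (the exact calls are assumptions, though the correlation column name below matches an unnamed summarise):
#Scatter plot and correlation of the two scores
ggplot(q1, aes(x = audience_score, y = critics_score)) +
  geom_point()
q1 %>%
  summarise(cor(critics_score, audience_score))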
## # A tibble: 1 x 1
## `cor(critics_score, audience_score)`
## <dbl>
## 1 0.707
That shows a fairly high correlation (0.707). Next, we visualize the relationship between critics_score and imdb_rating.
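Again a sketch, mirroring the previous check:
#Scatter plot and correlation with the IMDB rating
ggplot(q1, aes(x = imdb_rating, y = critics_score)) +
  geom_point()
q1 %>%
  summarise(cor(critics_score, imdb_rating))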
## # A tibble: 1 x 1
## `cor(critics_score, imdb_rating)`
## <dbl>
## 1 0.766
We suspect collinearity between these two variables, so we carry out the same verification tests suggested in the lab about baseball.
To do this, we fit a simple linear regression between the two variables (called 'parsimon') and run the three checks: (1) linearity, (2) nearly normal residuals, and (3) constant variability.
#create 'parsimon' and made LINEARITY test
parsimon <- lm(critics_score ~ imdb_rating, data = q1)
ggplot(data = parsimon, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")Observing this first test, we do not see the clearly random distribution as required. The other tests can confirm our predilection for excluding ‘imdb_rating’ for redundancy with y. Next Step is (2) nearly normal residuals
#Step 2
ggplot(data = parsimon, aes(x = .resid)) +
geom_histogram(binwidth = 25) +
xlab("Residuals")In this test is Ok Next Step 3:
Since it is OK in the remaining tests, we decided to keep this variable and move on to the next stage: modeling.
The other candidate variables are categorical; we analyze the relationship of each of them with y. See more details below.
Remember: critics_score is y.
#Possible categorical x's:
#$ audience_rating: Factor w/ 2 levels
#$ genre : Factor w/ 11 levels
#$ studio : Factor w/ 211 levels
#$ best_pic_nom : Factor w/ 2 levels "no","yes"
#$ top200_box : Factor w/ 2 levels "no","yes"

We start with 'studio'.
Earlier, when we ran summary(q1), we saw that keeping only the top five studios by number of films would reduce this variable from 211 levels to 5:
Paramount Pictures (n = 37), Warner Bros. Pictures (n = 30), Sony Pictures Home Entertainment (n = 27), Universal Pictures (n = 23) and Warner Home Video (n = 19).
But given the dizzying drop in sample size this would cause, we opted to leave 'studio' out of the analysis.
The plot below confirms this excessive dispersion of cases across levels, so we do not take 'studio' into account from now on.
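The plot call is not shown; a minimal sketch (suppressing the 211 unreadable axis labels is an assumption):
#Bar plot of the 211 studio levels
ggplot(q1, aes(x = studio)) +
  geom_bar() +
  theme(axis.text.x = element_blank())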
Next, we analyse ‘audience_rating’.
According to the data and research on the web, Rotten Tomatoes has another audience classifier, the two-level variable 'audience_rating'. The website also has a measure for audience reviews: an upright container of popcorn indicates that 50% or more liked the film, and a spilled container indicates that fewer than 50% liked it. (from: encurtador.com.br/imKNW)
test_aud_rating <- q1 %>%
select(critics_score, audience_rating)
ggplot(data = test_aud_rating) +
geom_bar(aes(critics_score, fill = audience_rating), position = position_dodge(), width = .75)

As we saw at the beginning of part 3, most critics' scores are high, so the larger number of cases among the better-rated 'Upright' films is to be expected.
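The call behind the output below is not shown; presumably something like:
#Summary of critics_score by audience_rating
by(test_aud_rating$critics_score, test_aud_rating$audience_rating, summary)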
## test_aud_rating$audience_rating: Spilled
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 17.00 36.00 37.97 56.00 100.00
## ------------------------------------------------------------
## test_aud_rating$audience_rating: Upright
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 59.25 79.00 72.09 90.75 100.00
The data above confirm what the plot showed. We chose to keep this variable, because there seems to be an association between the critics and Rotten Tomatoes users. Next, we do the same with the variable 'best_pic_nom'.
test_nomi <- q1 %>%
select(critics_score, best_pic_nom)
ggplot(data = test_nomi) +
geom_bar(aes(critics_score, fill = best_pic_nom), position = position_dodge(), width = .75)

A greater concentration of nominated films (in green) is clearly visible at the high scores. The non-nominated films (in red) are scattered across the whole sample, as the 'by' function will show below:
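Again, the call behind the output is presumably:
#Summary of critics_score by best_pic_nom
by(test_nomi$critics_score, test_nomi$best_pic_nom, summary)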
## test_nomi$best_pic_nom: no
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 33.00 60.00 56.77 81.00 100.00
## ------------------------------------------------------------
## test_nomi$best_pic_nom: yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 31.00 85.75 91.00 87.45 94.00 98.00
The differences in the means between 'yes' and 'no' above are enough to suggest a possible association.
Next, the top200_box variable is tested in the same way as the two that preceded it. First, the plot:
test_top200 <- q1 %>%
select(critics_score, top200_box)
ggplot(data = test_top200) +
geom_bar(aes(critics_score, fill = top200_box), position = position_dodge(), width = .75)

This variable records whether the film is in the Top 200 box office selection. Visually, there does not seem to be a strong association between the critics' evaluation and being in the Box Office Top 200. Let us see what the 'by' function has to tell us in this case:
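As before, the call is presumably:
#Summary of critics_score by top200_box
by(test_top200$critics_score, test_top200$top200_box, summary)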
## test_top200$top200_box: no
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 33.00 61.00 57.39 82.25 100.00
## ------------------------------------------------------------
## test_top200$top200_box: yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 31.00 70.50 83.00 75.53 88.50 94.00
Although it was not so visible in the plot, the data above show something worth noting, especially the differences in the 1st quartile, mean, and median.
In short, we have chosen so far to proceed to part 4 (Modeling) with this variable, together with the categorical variables 'best_pic_nom' and 'audience_rating' and the numeric variables 'audience_score' and 'imdb_rating'; we have chosen to leave out the 'studio' variable. Next, we explore our last variable, 'genre'.
NOTE: given the high number of levels in 'genre', we go straight to the 'by' function and skip the plot (which would be unwieldy).
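The call is presumably:
#Summary of critics_score by genre
by(q1$critics_score, q1$genre, summary)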
## q1$genre: Action & Adventure
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 18.00 35.00 41.70 60.25 98.00
## ------------------------------------------------------------
## q1$genre: Animation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 23.25 58.00 51.12 76.00 92.00
## ------------------------------------------------------------
## q1$genre: Art House & International
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 27.00 64.00 54.62 84.00 94.00
## ------------------------------------------------------------
## q1$genre: Comedy
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 15.00 36.00 40.93 66.00 98.00
## ------------------------------------------------------------
## q1$genre: Documentary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.00 80.75 91.50 86.35 96.00 100.00
## ------------------------------------------------------------
## q1$genre: Drama
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 42.50 67.00 62.13 83.50 100.00
## ------------------------------------------------------------
## q1$genre: Horror
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 19.00 38.00 42.05 59.75 95.00
## ------------------------------------------------------------
## q1$genre: Musical & Performing Arts
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.00 70.25 89.00 76.67 92.75 97.00
## ------------------------------------------------------------
## q1$genre: Mystery & Suspense
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 31.00 60.00 54.93 77.50 98.00
## ------------------------------------------------------------
## q1$genre: Other
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 57.50 72.00 67.73 89.00 100.00
## ------------------------------------------------------------
## q1$genre: Science Fiction & Fantasy
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7 24 67 50 72 87
The values above differ considerably across the film genres, so we chose to keep 'genre'.
Next, we begin the modeling process, starting from the full model with q2 as a base.
We create 'q2' by excluding 'studio', and start with a complete ggpairs of 'q2':
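A sketch of this step (assumed):
#Drop 'studio' and plot all pairwise relationships
q2 <- q1 %>% select(-studio)
ggpairs(q2)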
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The next step is the full model, with all the selected variables associated with 'critics_score':
md_full <- lm(critics_score ~ audience_score + imdb_rating + audience_rating + genre + best_pic_nom + top200_box, data = q1)
summary(md_full)
##
## Call:
## lm(formula = critics_score ~ audience_score + imdb_rating + audience_rating +
## genre + best_pic_nom + top200_box, data = q1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.515 -11.975 2.108 12.291 49.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -60.0553 6.0051 -10.001 < 2e-16 ***
## audience_score 0.1451 0.1048 1.384 0.166803
## imdb_rating 15.4482 1.4150 10.918 < 2e-16 ***
## audience_ratingUpright 2.9598 2.9609 1.000 0.317888
## genreAnimation 10.6803 6.7957 1.572 0.116542
## genreArt House & International -1.0237 5.5235 -0.185 0.853029
## genreComedy 3.5769 2.9912 1.196 0.232214
## genreDocumentary 13.3426 3.6337 3.672 0.000261 ***
## genreDrama 7.2328 2.5568 2.829 0.004821 **
## genreHorror 6.0741 4.4658 1.360 0.174277
## genreMusical & Performing Arts 9.6048 5.7747 1.663 0.096758 .
## genreMystery & Suspense 5.2952 3.3086 1.600 0.110005
## genreOther 9.0770 5.2099 1.742 0.081954 .
## genreScience Fiction & Fantasy 12.0636 6.3970 1.886 0.059780 .
## best_pic_nomyes 4.8094 4.0683 1.182 0.237589
## top200_boxyes 6.3734 4.8082 1.326 0.185481
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.96 on 627 degrees of freedom
## Multiple R-squared: 0.6097, Adjusted R-squared: 0.6004
## F-statistic: 65.3 on 15 and 627 DF, p-value: < 2.2e-16
We selected the backwards elimination method.
Here, we start with a full model, that is, a model with all candidate covariates or predictors included, and then drop variables one at a time until a parsimonious model is reached.
Since many predictors have high p-values, we decided to run backwards elimination both with p-values and with adjusted R-squared. We will start with the p-value method; these will be the 'w' models. After completing this step, we will build the 't' models with the stepwise backwards adjusted-R-squared method.
At the end, we will use the four diagnostic tests indicated for multiple linear regression and choose between the two models selected from 'w' and 't'.
We begin this step by removing the explanatory variable with the highest p-value: 'audience_rating'. Although one of the genre levels has a higher p-value, we do not remove 'genre', because at least one of its levels has p < 0.05. We will also save the adjusted R-squared of each model, in case we need it later in the adjusted-R-squared step, thus saving time. We start with model 'w1':
#Start with Stepwise Backwards with p-value
w1 <- lm(critics_score ~ audience_score + imdb_rating + genre + best_pic_nom + top200_box, data = q1)
summary(w1)
##
## Call:
## lm(formula = critics_score ~ audience_score + imdb_rating + genre +
## best_pic_nom + top200_box, data = q1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.48 -12.07 2.26 12.62 47.96
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -61.16964 5.90073 -10.366 < 2e-16 ***
## audience_score 0.22085 0.07244 3.049 0.002395 **
## imdb_rating 15.14443 1.38197 10.959 < 2e-16 ***
## genreAnimation 10.68144 6.79569 1.572 0.116501
## genreArt House & International -0.64463 5.51051 -0.117 0.906912
## genreComedy 3.50986 2.99042 1.174 0.240960
## genreDocumentary 13.41471 3.63298 3.692 0.000241 ***
## genreDrama 7.38086 2.55249 2.892 0.003965 **
## genreHorror 5.99561 4.46514 1.343 0.179835
## genreMusical & Performing Arts 9.57718 5.77458 1.659 0.097715 .
## genreMystery & Suspense 5.34677 3.30820 1.616 0.106549
## genreOther 9.16345 5.20919 1.759 0.079049 .
## genreScience Fiction & Fantasy 12.03995 6.39696 1.882 0.060280 .
## best_pic_nomyes 4.63608 4.06458 1.141 0.254470
## top200_boxyes 6.53659 4.80543 1.360 0.174239
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.96 on 628 degrees of freedom
## Multiple R-squared: 0.6091, Adjusted R-squared: 0.6004
## F-statistic: 69.9 on 14 and 628 DF, p-value: < 2.2e-16
Next we introduce model 'w2', excluding 'best_pic_nom'.
#Model w2 without best_pic_nomyes
w2 <- lm(critics_score ~ audience_score + imdb_rating + genre + top200_box, data = q1)
summary(w2)
##
## Call:
## lm(formula = critics_score ~ audience_score + imdb_rating + genre +
## top200_box, data = q1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.613 -11.945 2.467 12.511 47.997
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -62.13317 5.84134 -10.637 < 2e-16 ***
## audience_score 0.22569 0.07233 3.120 0.001890 **
## imdb_rating 15.25812 1.37870 11.067 < 2e-16 ***
## genreAnimation 10.68356 6.79731 1.572 0.116515
## genreArt House & International -0.76749 5.51078 -0.139 0.889281
## genreComedy 3.61825 2.98963 1.210 0.226631
## genreDocumentary 13.10806 3.62389 3.617 0.000322 ***
## genreDrama 7.52291 2.55006 2.950 0.003295 **
## genreHorror 6.08314 4.46555 1.362 0.173610
## genreMusical & Performing Arts 9.32261 5.77165 1.615 0.106760
## genreMystery & Suspense 5.45470 3.30764 1.649 0.099622 .
## genreOther 9.61172 5.19559 1.850 0.064786 .
## genreScience Fiction & Fantasy 12.06873 6.39844 1.886 0.059729 .
## top200_boxyes 6.84258 4.79908 1.426 0.154419
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.96 on 629 degrees of freedom
## Multiple R-squared: 0.6083, Adjusted R-squared: 0.6002
## F-statistic: 75.14 on 13 and 629 DF, p-value: < 2.2e-16
Let's go to the last model of this step. Only 'top200_box' remains with a high p-value; imdb_rating, audience_score, and genre all have good p-values.
It is worth noting that in model 'w1', and again in 'w2', audience_score has a good p-value, which was not the case in the full model.
It is also important to note that within 'genre', the Documentary and Drama levels are significant and have large slopes. Our last p-value model, 'w3', follows:
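The chunk is not shown in the original, but the formula can be reconstructed from the Call below:
#Model w3 without top200_box
w3 <- lm(critics_score ~ audience_score + imdb_rating + genre, data = q1)
summary(w3)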
##
## Call:
## lm(formula = critics_score ~ audience_score + imdb_rating + genre,
## data = q1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.715 -11.988 2.432 12.628 47.472
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -62.33486 5.84442 -10.666 < 2e-16 ***
## audience_score 0.22899 0.07235 3.165 0.001626 **
## imdb_rating 15.35163 1.37827 11.138 < 2e-16 ***
## genreAnimation 10.13692 6.79205 1.492 0.136077
## genreArt House & International -1.41398 5.49659 -0.257 0.797072
## genreComedy 3.19028 2.97696 1.072 0.284284
## genreDocumentary 12.32133 3.58457 3.437 0.000626 ***
## genreDrama 7.01997 2.52761 2.777 0.005644 **
## genreHorror 5.59554 4.45608 1.256 0.209687
## genreMusical & Performing Arts 8.57696 5.75262 1.491 0.136471
## genreMystery & Suspense 4.98169 3.29365 1.513 0.130905
## genreOther 9.40158 5.19775 1.809 0.070962 .
## genreScience Fiction & Fantasy 12.32446 6.40117 1.925 0.054636 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.97 on 630 degrees of freedom
## Multiple R-squared: 0.607, Adjusted R-squared: 0.5995
## F-statistic: 81.1 on 12 and 630 DF, p-value: < 2.2e-16
The 'w3' model presents a good R Adjusted of 0.5995, barely below the full model's, and every variable remaining in the model has at least one level with sig < 0.05. So far, 'w3' is the model that stands out. Next, however, we will test other models using the stepwise backwards method with Adjusted R-squared.
Based on the models already tested, we have so far:
-Full Model with audience_score + imdb_rating + audience_rating + genre + best_pic_nom + top200_box with R Adjusted = 0.6004.
-Model w1 with audience_score + imdb_rating + genre + best_pic_nom + top200_box with R Adjusted = 0.6004.
-Model w2 with audience_score + imdb_rating + genre + top200_box with R Adjusted = 0.6002.
-Model w3 with audience_score + imdb_rating + genre with R Adjusted = 0.5995.
The first step is to test models with one variable fewer than the full model. We already did this with the variable audience_rating, which generated model 'w1', with the same R Adjusted as the full model. Next we build model 't2', including all x variables except 'audience_score':
t2 <- lm(critics_score ~ audience_rating + imdb_rating + genre + best_pic_nom + top200_box, data = q1)
summary(t2)$adj.r.squared
## [1] 0.5998061
So far, 'w1' remains the best choice under the method adopted here. Next, 't3' excludes 'imdb_rating':
t3 <- lm(critics_score ~ audience_rating + audience_score + genre + best_pic_nom + top200_box, data = q1)
summary(t3)$adj.r.squared
## [1] 0.5251795
This is the worst Adjusted R-squared so far, the model with the weakest predictive ability. Continuing step 1, we exclude 'genre' in model 't4' below:
t4 <- lm(critics_score ~ audience_rating + audience_score + imdb_rating + best_pic_nom + top200_box, data = q1)
summary(t4)$adj.r.squared
## [1] 0.5940828
The resulting value remains below 0.6004.
Next, models 't5' and 't6' exclude the variables still remaining in step 1 of the stepwise backwards Adjusted R-squared method:
t5 <- lm(critics_score ~ audience_rating + audience_score + imdb_rating + genre + top200_box, data = q1)
t6 <- lm(critics_score ~ audience_rating + audience_score + imdb_rating + genre + best_pic_nom, data = q1)
summary(t5)$adj.r.squared
## [1] 0.6001359
summary(t6)$adj.r.squared
## [1] 0.5999072
By the logic of the method adopted here, the model with the highest Adjusted R Squared is 'w1', and we proceed to step 2 with it. In short, of all the models with five explanatory variables, 'w1' has the highest Adjusted R Squared.
Recall that model 'w1' contains: 'audience_score + imdb_rating + genre + best_pic_nom + top200_box'.
For the variable 'best_pic_nom', this step is not necessary, since it was already carried out when model 'w2' was built; there, the R Adjusted is 0.6002.
The next task in this second step is to create models removing each of the other variables:
#Below the values of the best model so far 'w1' and all with 4 variables
t21 <- lm(critics_score ~ best_pic_nom + imdb_rating + genre + top200_box, data = q1)
t22 <- lm(critics_score ~ best_pic_nom + audience_score + genre + top200_box, data = q1)
t23 <- lm(critics_score ~ best_pic_nom + imdb_rating + audience_score + top200_box, data = q1)
t24 <- lm(critics_score ~ best_pic_nom + imdb_rating + genre + audience_score, data = q1)
R_adju_w1
## [1] 0.6004
## [1] 0.6002
## [1] 0.5951196
## [1] 0.5247298
## [1] 0.5939886
## [1] 0.5998492
As seen above, ‘w1’ remains the best model according to the criteria adopted here (R adjusted Square).
Next, we generate models with 3 variables.
Part of this step was already covered by model 'w3' (audience_score + imdb_rating + genre), whose value is recalled below.
#Below the values of the best model so far 'w1' and all with 3 variables
t31 <- lm(critics_score ~ top200_box + imdb_rating + genre, data = q1)
t32 <- lm(critics_score ~ top200_box + audience_score + genre, data = q1)
t33 <- lm(critics_score ~ top200_box + imdb_rating + audience_score, data = q1)
R_adju_w1
## [1] 0.6004
## [1] 0.5995
## [1] 0.5946545
## [1] 0.5231067
## [1] 0.5939932
As seen above, ‘w1’ remains the best model according to the criteria adopted here (R adjusted Square).
Considering only the 3-variable models, 'w3' is the best, as expected, since we reached it via the backwards p-value method.
Next, we generate models with 2 variables.
#Below the values of the best model so far 'w1' and all with 2 variables
t41 <- lm(critics_score ~ top200_box + imdb_rating, data = q1)
t42 <- lm(critics_score ~ top200_box + genre, data = q1)
t43 <- lm(critics_score ~ top200_box + audience_score, data = q1)
t44 <- lm(critics_score ~ genre + imdb_rating, data = q1)
t45 <- lm(critics_score ~ genre + audience_score, data = q1)
t46 <- lm(critics_score ~ audience_score + imdb_rating, data = q1)
R_adju_w1
## [1] 0.6004
## [1] 0.586896
## [1] 0.1983559
## [1] 0.499364
## [1] 0.5938205
## [1] 0.5214421
## [1] 0.5940279
Conclusion: under this method, 'w1' is the best choice.
So which is the best model: 'w1' or 'w3'? See below.
The two methods adopted generated the models 'w1' and 'w3'. Since their R Adjusted values are very close (0.6004 versus 0.5995), we opted for 'w3', because in that model all variables have good p-values. However, to be sure, it is necessary to perform the diagnostic tests; if 'w3' passes them, it will be our choice.
#Step 1 - LINEAR RELATIONSHIPS BETWEEN X AND Y
cog_final <- lm(critics_score ~ audience_score + imdb_rating + genre, data = q1)
plot(cog_final$residuals ~ q1$imdb_rating + q1$audience_score)

In this first step, only the numerical explanatory variables are tested: in our case, 'imdb_rating' and 'audience_score'. The residuals appear randomly scattered against each of them. We will see what the next steps show.
Now step 2, nearly normal residuals:
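The chunk is not shown; a sketch using the base hist() function, in keeping with the base plot() calls around it:
#Step 2 - nearly normal residuals
hist(cog_final$residuals)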
The distribution of residuals seems OK. Now, steps 3 and 4.
#Step 3 - Constant variability of residuals and Step 4 - INDEPENDENCE OF RESIDUALS
plot(cog_final$residuals ~ cog_final$fitted)

Conclusion: the residuals show roughly constant variability and no obvious pattern against the fitted values, so we decided the best model explains 'critics_score' with 'audience_score', 'imdb_rating', and 'genre'.
One of our concerns about the chosen model 'w3' is that, in all the previous diagnostic tests, the variable 'imdb_rating' sat on the borderline between acceptable and not. So we will make the prediction with both 'w3' and 't45' (which is 'w3' without 'imdb_rating').
In short, we will make two predictions:
1- Prediction with main model ‘w3’
2- Prediction with alternative model ‘t45’
Before starting the predictions, see the summary of 't45' below. NOTE: the summary of 'w3' can be found above in this report.
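Reconstructed from the Call below:
summary(t45)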
##
## Call:
## lm(formula = critics_score ~ genre + audience_score, data = q1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.479 -13.588 2.729 14.527 50.917
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.02068 3.36851 -2.084 0.037543 *
## genreAnimation 1.61984 7.37769 0.220 0.826286
## genreArt House & International 1.19776 6.00326 0.200 0.841921
## genreComedy 0.60588 3.24443 0.187 0.851922
## genreDocumentary 18.37663 3.87322 4.745 2.59e-06 ***
## genreDrama 9.88611 2.74876 3.597 0.000348 ***
## genreHorror 7.46212 4.86782 1.533 0.125790
## genreMusical & Performing Arts 11.03823 6.28397 1.757 0.079476 .
## genreMystery & Suspense 11.25031 3.54758 3.171 0.001591 **
## genreOther 11.55986 5.67809 2.036 0.042181 *
## genreScience Fiction & Fantasy 10.90385 6.99620 1.559 0.119607
## audience_score 0.90623 0.04288 21.136 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.65 on 631 degrees of freedom
## Multiple R-squared: 0.5296, Adjusted R-squared: 0.5214
## F-statistic: 64.59 on 11 and 631 DF, p-value: < 2.2e-16
Interestingly, in this alternative model 't45' more than two film genres proved to be significant.
Next, we start our prediction with the main model.
Project rubric says to: “Pick a movie from 2016 (a new movie that is not in the sample) and do a prediction for this movie using your the model you developed and the predict function in R. Also quantify the uncertainty around this prediction using an appropriate interval.”
So I chose "Joker (2019)" to test my main model. Using data gathered from IMDB (imdb_rating = 8.5) and the Rotten Tomatoes website (audience_score = 88 and critics_score = 68), I stored the test case in a variable named 'joker' using the following code:
newdata <- data.frame(imdb_rating = 8.5, audience_score = 88, critics_score = 68, genre = "Mystery & Suspense")
joker <- round(predict(w3, newdata), digits = 0)
c(joker, newdata$critics_score)
## 1
## 93 68
The real critics score is 68, so the prediction is higher than expected. We also construct a prediction interval around this prediction, which quantifies its uncertainty.
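The interval output below presumably comes from:
predict(w3, newdata, interval = "prediction")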
## fit lwr upr
## 1 93.28714 57.56897 129.0053
We are 95% confident that the Joker movie will have a critics_score between 57.57 and 129.01; since critics_score cannot exceed 100, the upper bound is capped at 100 in practice.
newdata <- data.frame(audience_score = 88, critics_score = 68, genre = "Mystery & Suspense")
joker <- round(predict(t45, newdata), digits = 0)
c(joker, newdata$critics_score)
## 1
## 84 68
In the alternative model, the values are slightly different. Let's see the prediction interval.
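Presumably:
predict(t45, newdata, interval = "prediction")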
## fit lwr upr
## 1 83.97751 44.97271 122.9823
We are 95% confident that the Joker movie will have a critics_score between 44.97 and 122.98 (again capped at 100 in practice).
Exploratory data analysis and modeling helped us establish that genre, audience_score, and imdb_rating are significant predictors associated with critics_score.