The data set comprises 651 randomly sampled movies produced and released before 2016. It is provided below and includes information from Rotten Tomatoes and IMDB for each movie in the sample.
The study is purely observational: with no random assignment, no causal claims are possible. Even so, the exercise can be interesting and reveal some aspects of the world of cinema.
The codebook contains 32 variables, ranging from the film's title to Oscar nominations and scores from the review sites mentioned above.
Below is the output of the 'str' function, showing each codebook variable and its type.
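The code that produced this output is not shown in the original; a minimal sketch, assuming the data ships as 'movies.Rdata' with a data frame named 'movies' (as in the course materials), and using the packages named in the knit warnings:
#Load packages and data, then inspect the structure of the codebook variables
library(ggplot2)
library(dplyr)
library(GGally)
load("movies.Rdata")
str(movies)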
## tibble [651 x 32] (S3: tbl_df/tbl/data.frame)
## $ title : chr [1:651] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ thtr_rel_year : num [1:651] 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num [1:651] 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num [1:651] 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num [1:651] 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num [1:651] 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num [1:651] 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ director : chr [1:651] "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr [1:651] "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr [1:651] "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr [1:651] "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr [1:651] "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr [1:651] "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr [1:651] "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr [1:651] "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
Random sampling was used, and the fairly large number of films in the sample suggests the analysis generalizes to some degree, although, as mentioned above, this project makes no claim of causality.
Our research question concerns what is associated with the variable 'critics_score'. Several of the columns shown by 'str' above are not really explanatory variables, such as the Rotten Tomatoes URLs or the actors' names. A function can be seen as a relation between two sets in which each value of y depends on a value of x; this dependency relationship is the main characteristic of a function, and it is what we want to model. We chose 'critics_score' as y; the candidate predictors are discussed below, and some columns were not even considered because they are not really variables.
So our research problem is to find out which variables in the sample are best associated with critics_score in a multiple linear regression model.
NOTE: we think it necessary, before running the regressions, to justify the exclusions below.
Variables not considered:
-'title', 'director', 'actor1' through 'actor5': too many distinct values (almost one level per film);
-'title_type': risk of collinearity with 'genre';
-'imdb_url' and 'rt_url': do not apply;
-Oscar win variables ('best_pic_win', 'best_actor_win', 'best_actress_win', 'best_dir_win'): excluded because there are few 'yes' cases; we kept only the nomination variable 'best_pic_nom'.
We also left out the following variables:
-thtr_rel_year: Year the movie is released in theaters;
-thtr_rel_month: Month the movie is released in theaters;
-thtr_rel_day: Day of the month the movie is released in theaters;
-dvd_rel_year: Year the movie is released on DVD;
-dvd_rel_month: Month the movie is released on DVD;
-dvd_rel_day: Day of the month the movie is released on DVD; and
-mpaa_rating: MPAA rating of the movie (G, PG, PG-13, R, Unrated)
We judged the release year, month, and day-of-month variables to be essentially random with respect to critics' scores, so we thought it better to exclude them; given the nature of the exercise, we preferred to do something leaner.
The variables included, in addition to y = ‘critics_score’, were:
'audience_score', 'imdb_rating', 'audience_rating', 'genre', 'studio', 'best_pic_nom', and 'top200_box'.
To simplify the EDA process, we create a function called "complete" that keeps only observations with a value for every variable.
We then prepare the data for analysis, creating q1 with the variables that can be analyzed, and run a summary:
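The preparation code is not shown in the original; a minimal sketch, assuming the 'complete' helper described above keeps rows with no missing values:
#Keep only complete cases for the selected variables
complete <- function(df) df[complete.cases(df), ]
q1 <- movies %>%
  select(audience_score, critics_score, imdb_rating, audience_rating,
         genre, studio, best_pic_nom, top200_box) %>%
  complete()
summary(q1)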
## audience_score critics_score imdb_rating audience_rating
## Min. :11.00 Min. : 1.00 Min. :1.900 Spilled:269
## 1st Qu.:46.00 1st Qu.: 33.00 1st Qu.:5.900 Upright:374
## Median :65.00 Median : 61.00 Median :6.600
## Mean :62.51 Mean : 57.82 Mean :6.501
## 3rd Qu.:80.00 3rd Qu.: 83.00 3rd Qu.:7.300
## Max. :97.00 Max. :100.00 Max. :9.000
##
## genre studio best_pic_nom
## Drama :303 Paramount Pictures : 37 no :621
## Comedy : 86 Warner Bros. Pictures : 30 yes: 22
## Action & Adventure: 64 Sony Pictures Home Entertainment: 27
## Mystery & Suspense: 59 Universal Pictures : 23
## Documentary : 52 Warner Home Video : 19
## Horror : 22 20th Century Fox : 18
## (Other) : 57 (Other) :489
## top200_box
## no :628
## yes: 15
First, we make a histogram of our response variable, critics_score.
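The histogram code is not shown; a possible version (the binwidth is an assumption):
#Histogram of the response variable
ggplot(q1, aes(x = critics_score)) +
  geom_histogram(binwidth = 5) +
  xlab("Critics score")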
Next, we focus on audience_score and critics_score to see whether there is any association between these variables. We summarize this relationship with a plot and the correlation below.
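A sketch of this step (the exact calls are assumptions, though the correlation column name below matches an unnamed summarise):
#Scatter plot and correlation of the two scores
ggplot(q1, aes(x = audience_score, y = critics_score)) +
  geom_point()
q1 %>%
  summarise(cor(critics_score, audience_score))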
## # A tibble: 1 x 1
## `cor(critics_score, audience_score)`
## <dbl>
## 1 0.707
That shows a fairly high correlation (0.707). Next, we visualize the relationship between critics_score and imdb_rating.
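Again a sketch, mirroring the previous check:
#Scatter plot and correlation with the IMDB rating
ggplot(q1, aes(x = imdb_rating, y = critics_score)) +
  geom_point()
q1 %>%
  summarise(cor(critics_score, imdb_rating))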
## # A tibble: 1 x 1
## `cor(critics_score, imdb_rating)`
## <dbl>
## 1 0.766
We suspect collinearity between these two variables, so we carry out the same verification tests suggested in the lab about baseball.
To do this, we fit a simple linear regression between the two variables (called 'parsimon') and run the three checks: (1) linearity, (2) nearly normal residuals, and (3) constant variability.
#create 'parsimon' and made LINEARITY test
parsimon <- lm(critics_score ~ imdb_rating, data = q1)
ggplot(data = parsimon, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")Observing this first test, we do not see the clearly random distribution as required. The other tests can confirm our predilection for excluding ‘imdb_rating’ for redundancy with y. Next Step is (2) nearly normal residuals
#Step 2
ggplot(data = parsimon, aes(x = .resid)) +
geom_histogram(binwidth = 25) +
xlab("Residuals")In this test is Ok Next Step 3:
Since it is OK in the remaining tests, we decided to keep this variable and move on to the next stage: modeling.
The other candidate variables are categorical; we analyze the relationship of each of them with y. See more details below.
Remember: critics_score is y.
#Possible categorical x's:
#$ audience_rating: Factor w/ 2 levels
#$ genre : Factor w/ 11 levels
#$ studio : Factor w/ 211 levels
#$ best_pic_nom : Factor w/ 2 levels "no","yes"
#$ top200_box : Factor w/ 2 levels "no","yes"

We start with 'studio'.
Earlier, when we ran summary(q1), we saw that keeping only the top five studios by number of films would reduce this variable from 211 levels to 5:
Paramount Pictures (n = 37), Warner Bros. Pictures (n = 30), Sony Pictures Home Entertainment (n = 27), Universal Pictures (n = 23) and Warner Home Video (n = 19).
But given the dizzying drop in sample size this would cause, we opted to leave 'studio' out of the analysis.
The plot below confirms this excessive dispersion of cases across levels, so we do not take 'studio' into account from now on.
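The plot call is not shown; a minimal sketch (suppressing the 211 unreadable axis labels is an assumption):
#Bar plot of the 211 studio levels
ggplot(q1, aes(x = studio)) +
  geom_bar() +
  theme(axis.text.x = element_blank())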
Next, we analyse ‘audience_rating’.
According to the data and research on the web, Rotten Tomatoes has another audience classifier, the two-level variable 'audience_rating'. The website also has a measure for audience reviews: an upright container of popcorn indicates that 50% or more liked the film, and a spilled container indicates that fewer than 50% liked it. (from: encurtador.com.br/imKNW)
test_aud_rating <- q1 %>%
select(critics_score, audience_rating)
ggplot(data = test_aud_rating) +
geom_bar(aes(critics_score, fill = audience_rating), position = position_dodge(), width = .75)

As we saw at the beginning of part 3, most critics' scores are high, so the larger number of cases among the better-rated 'Upright' films is to be expected.
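The call behind the output below is not shown; presumably something like:
#Summary of critics_score by audience_rating
by(test_aud_rating$critics_score, test_aud_rating$audience_rating, summary)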
## test_aud_rating$audience_rating: Spilled
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 17.00 36.00 37.97 56.00 100.00
## ------------------------------------------------------------
## test_aud_rating$audience_rating: Upright
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 59.25 79.00 72.09 90.75 100.00
The data above confirm what the plot showed. We chose to keep this variable, because there seems to be an association between the critics and Rotten Tomatoes users. Next, we do the same with the variable 'best_pic_nom'.
test_nomi <- q1 %>%
select(critics_score, best_pic_nom)
ggplot(data = test_nomi) +
geom_bar(aes(critics_score, fill = best_pic_nom), position = position_dodge(), width = .75)

A greater concentration of nominated films (in green) is clearly visible at the high scores. The non-nominated films (in red) are scattered across the whole sample, as the 'by' function will show below:
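Again, the call behind the output is presumably:
#Summary of critics_score by best_pic_nom
by(test_nomi$critics_score, test_nomi$best_pic_nom, summary)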
## test_nomi$best_pic_nom: no
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 33.00 60.00 56.77 81.00 100.00
## ------------------------------------------------------------
## test_nomi$best_pic_nom: yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 31.00 85.75 91.00 87.45 94.00 98.00
The differences in the means between 'yes' and 'no' above are enough to suggest a possible association.
Next, the top200_box variable is tested in the same way as the two that preceded it. First, the plot:
test_top200 <- q1 %>%
select(critics_score, top200_box)
ggplot(data = test_top200) +
geom_bar(aes(critics_score, fill = top200_box), position = position_dodge(), width = .75)

This variable records whether the film is in the Top 200 box office selection. Visually, there does not seem to be a strong association between the critics' evaluation and being in the Box Office Top 200. Let us see what the 'by' function has to tell us in this case:
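As before, the call is presumably:
#Summary of critics_score by top200_box
by(test_top200$critics_score, test_top200$top200_box, summary)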
## test_top200$top200_box: no
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 33.00 61.00 57.39 82.25 100.00
## ------------------------------------------------------------
## test_top200$top200_box: yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 31.00 70.50 83.00 75.53 88.50 94.00
Although it was not so visible in the plot, the data above show something worth noting, especially the differences in the 1st quartile, mean, and median.
In short, we have chosen so far to proceed to part 4 (Modeling) with this variable, together with the categorical variables 'best_pic_nom' and 'audience_rating' and the numeric variables 'audience_score' and 'imdb_rating'; we have chosen to leave out the 'studio' variable. Next, we explore our last variable, 'genre'.
NOTE: given the high number of levels in 'genre', we go straight to the 'by' function and skip the plot (which would be unwieldy).
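The call is presumably:
#Summary of critics_score by genre
by(q1$critics_score, q1$genre, summary)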
## q1$genre: Action & Adventure
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 18.00 35.00 41.70 60.25 98.00
## ------------------------------------------------------------
## q1$genre: Animation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 23.25 58.00 51.12 76.00 92.00
## ------------------------------------------------------------
## q1$genre: Art House & International
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 27.00 64.00 54.62 84.00 94.00
## ------------------------------------------------------------
## q1$genre: Comedy
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 15.00 36.00 40.93 66.00 98.00
## ------------------------------------------------------------
## q1$genre: Documentary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.00 80.75 91.50 86.35 96.00 100.00
## ------------------------------------------------------------
## q1$genre: Drama
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 42.50 67.00 62.13 83.50 100.00
## ------------------------------------------------------------
## q1$genre: Horror
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 19.00 38.00 42.05 59.75 95.00
## ------------------------------------------------------------
## q1$genre: Musical & Performing Arts
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.00 70.25 89.00 76.67 92.75 97.00
## ------------------------------------------------------------
## q1$genre: Mystery & Suspense
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 31.00 60.00 54.93 77.50 98.00
## ------------------------------------------------------------
## q1$genre: Other
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 57.50 72.00 67.73 89.00 100.00
## ------------------------------------------------------------
## q1$genre: Science Fiction & Fantasy
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7 24 67 50 72 87
The values above differ considerably across the film genres, so we chose to keep 'genre'.
Next, we begin the modeling process, starting from the full model with q2 as a base.
We create 'q2' by excluding 'studio', and start with a complete ggpairs of 'q2':
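A sketch of this step (assumed):
#Drop 'studio' and plot all pairwise relationships
q2 <- q1 %>% select(-studio)
ggpairs(q2)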
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The next step is the full model, with all the selected variables associated with 'critics_score':
md_full <- lm(critics_score ~ audience_score + imdb_rating + audience_rating + genre + best_pic_nom + top200_box, data = q1)
summary(md_full)
##
## Call:
## lm(formula = critics_score ~ audience_score + imdb_rating + audience_rating +
## genre + best_pic_nom + top200_box, data = q1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.515 -11.975 2.108 12.291 49.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -60.0553 6.0051 -10.001 < 2e-16 ***
## audience_score 0.1451 0.1048 1.384 0.166803
## imdb_rating 15.4482 1.4150 10.918 < 2e-16 ***
## audience_ratingUpright 2.9598 2.9609 1.000 0.317888
## genreAnimation 10.6803 6.7957 1.572 0.116542
## genreArt House & International -1.0237 5.5235 -0.185 0.853029
## genreComedy 3.5769 2.9912 1.196 0.232214
## genreDocumentary 13.3426 3.6337 3.672 0.000261 ***
## genreDrama 7.2328 2.5568 2.829 0.004821 **
## genreHorror 6.0741 4.4658 1.360 0.174277
## genreMusical & Performing Arts 9.6048 5.7747 1.663 0.096758 .
## genreMystery & Suspense 5.2952 3.3086 1.600 0.110005
## genreOther 9.0770 5.2099 1.742 0.081954 .
## genreScience Fiction & Fantasy 12.0636 6.3970 1.886 0.059780 .
## best_pic_nomyes 4.8094 4.0683 1.182 0.237589
## top200_boxyes 6.3734 4.8082 1.326 0.185481
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.96 on 627 degrees of freedom
## Multiple R-squared: 0.6097, Adjusted R-squared: 0.6004
## F-statistic: 65.3 on 15 and 627 DF, p-value: < 2.2e-16
We selected the backwards elimination method.
Here, we start with a full model, that is, a model with all candidate covariates or predictors included, and then drop variables one at a time until a parsimonious model is reached.
Since many predictors have high p-values, we decided to run backwards elimination both with p-values and with adjusted R-squared. We will start with the p-value method; these will be the 'w' models. After completing this step, we will build the 't' models with the stepwise backwards adjusted-R-squared method.
At the end, we will use the four diagnostic tests indicated for multiple linear regression and choose between the two models selected from 'w' and 't'.
We begin this step by removing the explanatory variable with the highest p-value: 'audience_rating'. Although one of the genre levels has a higher p-value, we do not remove 'genre', because at least one of its levels has p < 0.05. We will also save the adjusted R-squared of each model, in case we need it later in the adjusted-R-squared step, thus saving time. We start with model 'w1':
#Start with Stepwise Backwards with p-value
w1 <- lm(critics_score ~ audience_score + imdb_rating + genre + best_pic_nom + top200_box, data = q1)
summary(w1)
##
## Call:
## lm(formula = critics_score ~ audience_score + imdb_rating + genre +
## best_pic_nom + top200_box, data = q1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.48 -12.07 2.26 12.62 47.96
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -61.16964 5.90073 -10.366 < 2e-16 ***
## audience_score 0.22085 0.07244 3.049 0.002395 **
## imdb_rating 15.14443 1.38197 10.959 < 2e-16 ***
## genreAnimation 10.68144 6.79569 1.572 0.116501
## genreArt House & International -0.64463 5.51051 -0.117 0.906912
## genreComedy 3.50986 2.99042 1.174 0.240960
## genreDocumentary 13.41471 3.63298 3.692 0.000241 ***
## genreDrama 7.38086 2.55249 2.892 0.003965 **
## genreHorror 5.99561 4.46514 1.343 0.179835
## genreMusical & Performing Arts 9.57718 5.77458 1.659 0.097715 .
## genreMystery & Suspense 5.34677 3.30820 1.616 0.106549
## genreOther 9.16345 5.20919 1.759 0.079049 .
## genreScience Fiction & Fantasy 12.03995 6.39696 1.882 0.060280 .
## best_pic_nomyes 4.63608 4.06458 1.141 0.254470
## top200_boxyes 6.53659 4.80543 1.360 0.174239
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.96 on 628 degrees of freedom
## Multiple R-squared: 0.6091, Adjusted R-squared: 0.6004
## F-statistic: 69.9 on 14 and 628 DF, p-value: < 2.2e-16
Next we introduce model 'w2', excluding 'best_pic_nom'.
#Model w2 without best_pic_nomyes
w2 <- lm(critics_score ~ audience_score + imdb_rating + genre + top200_box, data = q1)
summary(w2)
##
## Call:
## lm(formula = critics_score ~ audience_score + imdb_rating + genre +
## top200_box, data = q1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.613 -11.945 2.467 12.511 47.997
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -62.13317 5.84134 -10.637 < 2e-16 ***
## audience_score 0.22569 0.07233 3.120 0.001890 **
## imdb_rating 15.25812 1.37870 11.067 < 2e-16 ***
## genreAnimation 10.68356 6.79731 1.572 0.116515
## genreArt House & International -0.76749 5.51078 -0.139 0.889281
## genreComedy 3.61825 2.98963 1.210 0.226631
## genreDocumentary 13.10806 3.62389 3.617 0.000322 ***
## genreDrama 7.52291 2.55006 2.950 0.003295 **
## genreHorror 6.08314 4.46555 1.362 0.173610
## genreMusical & Performing Arts 9.32261 5.77165 1.615 0.106760
## genreMystery & Suspense 5.45470 3.30764 1.649 0.099622 .
## genreOther 9.61172 5.19559 1.850 0.064786 .
## genreScience Fiction & Fantasy 12.06873 6.39844 1.886 0.059729 .
## top200_boxyes 6.84258 4.79908 1.426 0.154419
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.96 on 629 degrees of freedom
## Multiple R-squared: 0.6083, Adjusted R-squared: 0.6002
## F-statistic: 75.14 on 13 and 629 DF, p-value: < 2.2e-16
Let's go to the last model of this step. Only 'top200_box' remains with a high p-value; imdb_rating, audience_score, and genre all have good p-values.
It is worth noting that in model 'w1', and again in 'w2', audience_score has a good p-value, which was not the case in the full model.
It is also important to note that within 'genre', the Documentary and Drama levels are significant and have large slopes. Our last p-value model, 'w3', follows:
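The chunk is not shown in the original, but the formula can be reconstructed from the Call below:
#Model w3 without top200_box
w3 <- lm(critics_score ~ audience_score + imdb_rating + genre, data = q1)
summary(w3)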
##
## Call:
## lm(formula = critics_score ~ audience_score + imdb_rating + genre,
## data = q1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.715 -11.988 2.432 12.628 47.472
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -62.33486 5.84442 -10.666 < 2e-16 ***
## audience_score 0.22899 0.07235 3.165 0.001626 **
## imdb_rating 15.35163 1.37827 11.138 < 2e-16 ***
## genreAnimation 10.13692 6.79205 1.492 0.136077
## genreArt House & International -1.41398 5.49659 -0.257 0.797072
## genreComedy 3.19028 2.97696 1.072 0.284284
## genreDocumentary 12.32133 3.58457 3.437 0.000626 ***
## genreDrama 7.01997 2.52761 2.777 0.005644 **
## genreHorror 5.59554 4.45608 1.256 0.209687
## genreMusical & Performing Arts 8.57696 5.75262 1.491 0.136471
## genreMystery & Suspense 4.98169 3.29365 1.513 0.130905
## genreOther 9.40158 5.19775 1.809 0.070962 .
## genreScience Fiction & Fantasy 12.32446 6.40117 1.925 0.054636 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.97 on 630 degrees of freedom
## Multiple R-squared: 0.607, Adjusted R-squared: 0.5995
## F-statistic: 81.1 on 12 and 630 DF, p-value: < 2.2e-16
The 'w3' model presents a good R Adjusted of 0.5995, barely below the full model's, and every variable remaining in the model has at least one level with sig < 0.05. So far, 'w3' is the model that stands out. Next, however, we will test other models using the stepwise backwards method with Adjusted R-squared.
Based on the models already tested, we have so far:
-Full Model with audience_score + imdb_rating + audience_rating + genre + best_pic_nom + top200_box with R Adjusted = 0.6004.
-Model w1 with audience_score + imdb_rating + genre + best_pic_nom + top200_box with R Adjusted = 0.6004.
-Model w2 with audience_score + imdb_rating + genre + top200_box with R Adjusted = 0.6002.
-Model w3 with audience_score + imdb_rating + genre with R Adjusted = 0.5995.
The first step is to test models with one variable fewer than the full model. We already did this with the variable audience_rating, which generated model 'w1', with the same R Adjusted as the full model. Next we build model 't2', including all x variables except 'audience_score':
t2 <- lm(critics_score ~ audience_rating + imdb_rating + genre + best_pic_nom + top200_box, data = q1)
summary(t2)$adj.r.squared
## [1] 0.5998061
So far, 'w1' remains the best choice under the method adopted here. Next, 't3' excludes 'imdb_rating':
t3 <- lm(critics_score ~ audience_rating + audience_score + genre + best_pic_nom + top200_box, data = q1)
summary(t3)$adj.r.squared
## [1] 0.5251795
This is the worst Adjusted R-squared so far, the model with the weakest predictive ability. Continuing step 1, we exclude 'genre' in model 't4' below:
t4 <- lm(critics_score ~ audience_rating + audience_score + imdb_rating + best_pic_nom + top200_box, data = q1)
summary(t4)$adj.r.squared
## [1] 0.5940828
The resulting value remains below 0.6004.
Next, models 't5' and 't6' exclude the variables still remaining in step 1 of the stepwise backwards Adjusted R-squared method:
t5 <- lm(critics_score ~ audience_rating + audience_score + imdb_rating + genre + top200_box, data = q1)
t6 <- lm(critics_score ~ audience_rating + audience_score + imdb_rating + genre + best_pic_nom, data = q1)
summary(t5)$adj.r.squared
## [1] 0.6001359
summary(t6)$adj.r.squared
## [1] 0.5999072
By the logic of the method adopted here, the model with the highest Adjusted R Squared is 'w1', and we proceed to step 2 with it. In short, of all the models with five explanatory variables, 'w1' has the highest Adjusted R Squared.
Recall that model 'w1' contains: 'audience_score + imdb_rating + genre + best_pic_nom + top200_box'.
For the variable 'best_pic_nom', this step is not necessary, since it was already carried out when model 'w2' was built; there, the R Adjusted is 0.6002.
The next task in this second step is to create models removing each of the other variables:
#Below the values of the best model so far 'w1' and all with 4 variables
t21 <- lm(critics_score ~ best_pic_nom + imdb_rating + genre + top200_box, data = q1)
t22 <- lm(critics_score ~ best_pic_nom + audience_score + genre + top200_box, data = q1)
t23 <- lm(critics_score ~ best_pic_nom + imdb_rating + audience_score + top200_box, data = q1)
t24 <- lm(critics_score ~ best_pic_nom + imdb_rating + genre + audience_score, data = q1)
R_adju_w1
## [1] 0.6004
## [1] 0.6002
## [1] 0.5951196
## [1] 0.5247298
## [1] 0.5939886
## [1] 0.5998492
As seen above, ‘w1’ remains the best model according to the criteria adopted here (R adjusted Square).
Next, we generate models with 3 variables.
Part of this step was already covered by model 'w3' (audience_score + imdb_rating + genre), whose value is recalled below.
#Below the values of the best model so far 'w1' and all with 3 variables
t31 <- lm(critics_score ~ top200_box + imdb_rating + genre, data = q1)
t32 <- lm(critics_score ~ top200_box + audience_score + genre, data = q1)
t33 <- lm(critics_score ~ top200_box + imdb_rating + audience_score, data = q1)
R_adju_w1
## [1] 0.6004
## [1] 0.5995
## [1] 0.5946545
## [1] 0.5231067
## [1] 0.5939932
As seen above, ‘w1’ remains the best model according to the criteria adopted here (R adjusted Square).
Considering only the 3-variable models, 'w3' is the best, as expected, since we reached it via the backwards p-value method.
Next, we generate models with 2 variables.
#Below the values of the best model so far 'w1' and all with 2 variables
t41 <- lm(critics_score ~ top200_box + imdb_rating, data = q1)
t42 <- lm(critics_score ~ top200_box + genre, data = q1)
t43 <- lm(critics_score ~ top200_box + audience_score, data = q1)
t44 <- lm(critics_score ~ genre + imdb_rating, data = q1)
t45 <- lm(critics_score ~ genre + audience_score, data = q1)
t46 <- lm(critics_score ~ audience_score + imdb_rating, data = q1)
R_adju_w1
## [1] 0.6004
## [1] 0.586896
## [1] 0.1983559
## [1] 0.499364
## [1] 0.5938205
## [1] 0.5214421
## [1] 0.5940279
Conclusion: under this method, 'w1' is the best choice.
So which is the best model: 'w1' or 'w3'? See below.
The two methods adopted generated the models 'w1' and 'w3'. Since their R Adjusted values are very close (0.6004 versus 0.5995), we opted for 'w3', because in that model all variables have good p-values. However, to be sure, it is necessary to perform the diagnostic tests; if 'w3' passes them, it will be our choice.
#Step 1 - LINEAR RELATIONSHIPS BETWEEN X AND Y
cog_final <- lm(critics_score ~ audience_score + imdb_rating + genre, data = q1)
plot(cog_final$residuals ~ q1$imdb_rating + q1$audience_score)

In this first step, only the numerical explanatory variables are tested: in our case, 'imdb_rating' and 'audience_score'. The residuals appear randomly scattered against each of them. We will see what the next steps show.
Now step 2, nearly normal residuals:
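The chunk is not shown; a sketch using the base hist() function, in keeping with the base plot() calls around it:
#Step 2 - nearly normal residuals
hist(cog_final$residuals)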
The distribution of residuals seems OK. Now, steps 3 and 4.
#Step 3 - Constant variability of residuals and Step 4 - INDEPENDENCE OF RESIDUALS
plot(cog_final$residuals ~ cog_final$fitted)

Conclusion: the residuals show roughly constant variability and no obvious pattern against the fitted values, so we decided the best model explains 'critics_score' with 'audience_score', 'imdb_rating', and 'genre'.
One of our concerns about the chosen model 'w3' is that, in all the previous diagnostic tests, the variable 'imdb_rating' sat on the borderline between acceptable and not. So we will make the prediction with both 'w3' and 't45' (which is 'w3' without 'imdb_rating').
In short, we will make two predictions:
1- Prediction with main model ‘w3’
2- Prediction with alternative model ‘t45’
Before starting the predictions, see the summary of 't45' below. NOTE: the summary of 'w3' can be found above in this report.
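Reconstructed from the Call below:
summary(t45)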
##
## Call:
## lm(formula = critics_score ~ genre + audience_score, data = q1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.479 -13.588 2.729 14.527 50.917
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.02068 3.36851 -2.084 0.037543 *
## genreAnimation 1.61984 7.37769 0.220 0.826286
## genreArt House & International 1.19776 6.00326 0.200 0.841921
## genreComedy 0.60588 3.24443 0.187 0.851922
## genreDocumentary 18.37663 3.87322 4.745 2.59e-06 ***
## genreDrama 9.88611 2.74876 3.597 0.000348 ***
## genreHorror 7.46212 4.86782 1.533 0.125790
## genreMusical & Performing Arts 11.03823 6.28397 1.757 0.079476 .
## genreMystery & Suspense 11.25031 3.54758 3.171 0.001591 **
## genreOther 11.55986 5.67809 2.036 0.042181 *
## genreScience Fiction & Fantasy 10.90385 6.99620 1.559 0.119607
## audience_score 0.90623 0.04288 21.136 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.65 on 631 degrees of freedom
## Multiple R-squared: 0.5296, Adjusted R-squared: 0.5214
## F-statistic: 64.59 on 11 and 631 DF, p-value: < 2.2e-16
Interestingly, in this alternative model 't45' more than two film genres proved to be significant.
Next, we start our prediction with the main model.
Project rubric says to: “Pick a movie from 2016 (a new movie that is not in the sample) and do a prediction for this movie using your the model you developed and the predict function in R. Also quantify the uncertainty around this prediction using an appropriate interval.”
So I chose "Joker (2019)" to test my main model. Using data gathered from IMDB (imdb_rating = 8.5) and the Rotten Tomatoes website (audience_score = 88 and critics_score = 68), I stored the test case in a variable named 'joker' using the following code:
newdata <- data.frame(imdb_rating = 8.5, audience_score = 88, critics_score = 68, genre = "Mystery & Suspense")
joker <- round(predict(w3, newdata), digits = 0)
c(joker, newdata$critics_score)
## 1
## 93 68
The real critics score is 68, so the prediction is higher than expected. We also construct a prediction interval around this prediction, which quantifies its uncertainty.
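The interval output below presumably comes from:
predict(w3, newdata, interval = "prediction")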
## fit lwr upr
## 1 93.28714 57.56897 129.0053
We are 95% confident that the Joker movie will have a critics_score between 57.57 and 129.01; since critics_score cannot exceed 100, the upper bound is capped at 100 in practice.
newdata <- data.frame(audience_score = 88, critics_score = 68, genre = "Mystery & Suspense")
joker <- round(predict(t45, newdata), digits = 0)
c(joker, newdata$critics_score)
## 1
## 84 68
In the alternative model, the values are slightly different. Let's see the prediction interval.
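Presumably:
predict(t45, newdata, interval = "prediction")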
## fit lwr upr
## 1 83.97751 44.97271 122.9823
We are 95% confident that the Joker movie will have a critics_score between 44.97 and 122.98 (again capped at 100 in practice).
Exploratory data analysis and modeling helped us establish that genre, audience_score, and imdb_rating are significant predictors associated with critics_score.