0.1 Part 1: Setup
This project details my analysis of the movie dataset that contains information from IMDb and Rotten Tomatoes for a random sample of movies between 1970 and 2014. The purpose of this project is to develop a multiple linear regression model to understand what attributes make a film popular, and I want to learn some knowledge about the movie reviews.
0.1.1 Load packages
Loading R packages used besides base R.
library(knitr)
library(kableExtra)
library(tidyverse)
library(patchwork) # stitch ggplots together
library(lubridate)
library(GGally)
library(car)The package Hmisc will be used in this article, but it will not be loaded here to eliminate the conflicts.
0.1.2 Background about the original dataset
Figure1.1 logos of MPAA, IMDb and Rotten Tomatoes
The MPAA(Motion Picture Association of America) rating of the movie with five levels. They are G, PG, PG-13, R, NC-17. In this dataset, value Unrated was added beside the original MPAA’s ratings.
IMDb(Internet Movie Database) offers a rating scale that allows the website’s users to rate films on a scale of one to ten.
Rotten Tomatoes is an American review-aggregation website for film and television. There are three levels from the professional film critics, and they are Certified Fresh, Fresh
to Rotten
. In a way, Rotten Tomatoes uses professional evaluation from the critics as the mainly score, general audience’s assessment as a supplement.
Coursera offered the data set. It is comprised of 651 randomly sampled movies produced and released in the theaters before 2015. Each row in the dataset contains a movie and some characteristic/ratings of it.
For this research, it is essential to state:
- The original dataset about each movie’s score was taken from both the IMDb and Rotten Tomatoes database. Characteristics of the site user community influence trends in specific movie scores.
- This article was an observational study. There is no causation can be established because a random assignment is not used in this study.
- In this study, I will assume some voting results about a film on website A could be used to predict the same movie’s score on website B. However, in the real world the score for the same movie occurs at the same time on two sites A and B.
0.2 Part 2: Data Import
At this step, the raw data which was downloaded from the Coursera webpage will be loaded from the local disk.
load("movies.Rdata")str(movies)## Classes 'tbl_df', 'tbl' and 'data.frame': 651 obs. of 32 variables:
## $ title : chr "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ thtr_rel_year : num 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ director : chr "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
For the raw dataset movies, there are 651 observations, it is the total number of the movies. There are 32 variables in the raw dataset, they are mixed with numerical and categorical ones. Some variables will be ignored in further research.
There might be some duplicate rows in the raw dataset. It is essential to have a check.
movies.dup <- movies %>% select(title) %>%
table() %>%
data.frame() %>%
filter(Freq >1)
movies.dup %>% rename(title='.') %>%
inner_join(movies,by="title") %>%
select(title, studio, thtr_rel_year) %>%
arrange(title, thtr_rel_year) %>%
kable(align = "c", caption ="TABLE2.1 Movies with duplicated title") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"),full_width = TRUE) | title | studio | thtr_rel_year |
|---|---|---|
| Hairspray | New Line Home Entertainment | 1988 |
| Hairspray | New Line Cinema | 2007 |
| Man on Wire | Magnolia Pictures | 2008 |
| Man on Wire | Magnolia Pictures | 2008 |
| Maniac | Analysis | 1980 |
| Maniac | IFC Midnight | 2013 |
| Where the Heart Is | Touchstone Pictures | 1990 |
| Where the Heart Is | Twentieth Century Fox Home Entertainment | 2000 |
There are four pair movies with the same title. Some one could be considered as the remake or duplicate title ones because the release year is different. It is evident that the movie Man on Wire produced by Magnolia Pictures on 2008 had a duplicated record. One row of them will be removed to keep the unique ones.
movies <- movies[!duplicated(movies),]
dim(movies)## [1] 650 32
One duplicated row is removed.
Now start to remove some redundant variables.In this study the below list shows 20 variables are selected to start the analysis in EDA stage of the research.
Table2.2 EDA focused 20 variables and the description in the raw dataset
| Variable Name | Description |
|---|---|
| title | Title of movie |
| title_type | Type of movie (Documentary, Feature Film, TV Movie) |
| genre | Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other) |
| runtime | Runtime of movie (in minutes) |
| mpaa_rating | MPAA rating of the movie (G, PG, PG-13, R, NC-17, Unrated) |
| studio | The Studio produced the movie |
| thtr_rel_year | Year the movie is released in theaters |
| thtr_rel_month | Month the movie is released in theaters |
| thtr_rel_day | Day of the month the movie is released in theaters |
| imdb_rating | Rating on IMDB |
| imdb_num_votes | Number of votes on IMDB |
| critics_rating | Categorical variable for critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten) |
| critics_score | Critics score on Rotten Tomatoes |
| audience_rating | Categorical variable for audience rating on Rotten Tomatoes (Spilled, Upright) |
| audience_score | Audience score on Rotten Tomatoes |
| best_pic_nom | Whether or not the movie was nominated for a best picture Oscar (no, yes) |
| best_pic_win | Whether or not the movie won a best picture Oscar (no, yes) |
| best_actor_win | Whether or not one of the main actors in the movie ever won an Oscar (no, yes) ?C note that this is not necessarily whether the actor won an Oscar for their role in the given movie |
| best_actress win | Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) ?C note that this is not necessarily whether the actresses won an Oscar for their role in the given movie |
| best_dir_win | Whether or not the director of the movie ever won an Oscar (no, yes) ?C note that this is not necessarily whether the director won an Oscar for the given movie |
movies.tb0 <- movies %>%
select(title:best_dir_win) %>%
select(-contains("DVD"))Some variables are ignored, such as top200_box, director, actor1~5, imdb_url, rt_url and information about the movie released on DVD. Now the movies.tb0 is a simplified dataset compared with the raw one.
dim(movies.tb0)## [1] 650 20
For the dataset movies.tb0, there are 650 movies in the dataset, 20 variables reserved.
Here shows some information about the movies.
movies.tb0 %>%
select(c("mpaa_rating","title_type","imdb_rating","imdb_num_votes",
"critics_rating","audience_rating","runtime")) %>%
Hmisc::describe()## .
##
## 7 Variables 650 Observations
## ---------------------------------------------------------------------------
## mpaa_rating
## n missing distinct
## 650 0 6
##
## Value G NC-17 PG PG-13 R Unrated
## Frequency 19 2 118 132 329 50
## Proportion 0.029 0.003 0.182 0.203 0.506 0.077
## ---------------------------------------------------------------------------
## title_type
## n missing distinct
## 650 0 3
##
## Value Documentary Feature Film TV Movie
## Frequency 54 591 5
## Proportion 0.083 0.909 0.008
## ---------------------------------------------------------------------------
## imdb_rating
## n missing distinct Info Mean Gmd .05 .10
## 650 0 56 0.999 6.491 1.187 4.40 5.10
## .25 .50 .75 .90 .95
## 5.90 6.60 7.30 7.71 7.90
##
## lowest : 1.9 2.3 2.4 2.7 2.8, highest: 8.2 8.3 8.4 8.5 9.0
## ---------------------------------------------------------------------------
## imdb_num_votes
## n missing distinct Info Mean Gmd .05 .10
## 650 0 644 1 57563 82785 1286 1993
## .25 .50 .75 .90 .95
## 4543 15070 58484 151962 263069
##
## lowest : 180 183 281 285 318, highest: 753592 756602 797101 806911 893008
## ---------------------------------------------------------------------------
## critics_rating
## n missing distinct
## 650 0 3
##
## Value Certified Fresh Fresh Rotten
## Frequency 134 209 307
## Proportion 0.206 0.322 0.472
## ---------------------------------------------------------------------------
## audience_rating
## n missing distinct
## 650 0 2
##
## Value Spilled Upright
## Frequency 275 375
## Proportion 0.423 0.577
## ---------------------------------------------------------------------------
## runtime
## n missing distinct Info Mean Gmd .05 .10
## 649 1 89 1 105.8 20.12 83.4 86.8
## .25 .50 .75 .90 .95
## 92.0 103.0 116.0 129.0 137.6
##
## lowest : 39 40 65 68 74, highest: 178 180 194 202 267
## ---------------------------------------------------------------------------
From the above description of the dataset, here we could say:
- The MPAA rating has a total of 6 levels, as level is arranged in alphabetical order and needs to be rearranged;
- Feature films account for 90.9% of all films;
- More than 50 of the film IMDb score is more than 6.6 points;
- There are some huge number of IMDb votes, they are bigger than 750,000. The medium vote number is 15070.
- Rotten Tomatoes critics rated higher (Certified Fresh) accounted for 20.6% of the film; Fresh ones are slightly more than Rotten one in overall;
- Rotten Tomatoes audience rated good or bad is half to half.
-
One missing value was found in
runtimeand it is necessary to check other ones.
movies.tb0 %>% sapply(is.na) %>% colSums()## title title_type genre runtime
## 0 0 0 1
## mpaa_rating studio thtr_rel_year thtr_rel_month
## 0 8 0 0
## thtr_rel_day imdb_rating imdb_num_votes critics_rating
## 0 0 0 0
## critics_score audience_rating audience_score best_pic_nom
## 0 0 0 0
## best_pic_win best_actor_win best_actress_win best_dir_win
## 0 0 0 0
The result shows eight films are missing studio record, one losing runtime record. One observation missed runtime record will be removed; we can keep the missed studio records because this factor will not affect this research.
movies.tb0 <- movies.tb0 %>% filter(!is.na(runtime))
dim(movies.tb0)## [1] 649 20
Here the dataset will be releveled on the mpaa_rating. The thtr_quat variable will be added for considering the quarterly factors for the film releasing. The osca_opt will be valued one if there was any Yes in the variables of best_*.
movies.tb1 <- movies.tb0 %>%
mutate(mpaa_rating = fct_relevel(mpaa_rating,"Unrated","NC-17","R","PG-13","PG","G"),
thtr_date = make_date(thtr_rel_year,thtr_rel_month,thtr_rel_day),
thtr_quat = quarter(thtr_date),
osca_opt = ifelse(best_pic_nom=="yes"|best_pic_win=="yes"|best_actor_win=="yes"|
best_actress_win=="yes"|best_dir_win=="yes",1,0))The new variables and other ones will be plotted in the next EDA stage.
0.3 Part 3: Research question
Is there a relationship between the ratings that a movie gets on the IMDb website and the grades that the same film comes on the Rotten Tomatoes?
Through the score of a movie on the Rotten Tomatoes, as well as the length of the show, the release season and other factors, can it better speculate its rating on the IMDb website? This article assumes here that users of two sites involved in this data are independent of each other.
Usually, the drama type of the film and the characteristics of the audience group will have a positive impact on the audience’s score. The same is true even for professional film critics. In the EDA part of this article, the film will be further classified and studied in terms of each MPAA rating and genre characteristics. The variable imdb_rating will be considered as a response one to initiate the research.
0.4 Part 4: Exploratory data analysis
0.4.1 Movie type and feature predictors selection
In the EDA stage, first I analysis the movies spread in each quarter with different MPAA ratings the category of the movie will be explored.
temp1 <- movies.tb1 %>%
group_by(thtr_quat,mpaa_rating) %>%
summarise(num_quat=n()) %>%
mutate(label_y = cumsum(num_quat) - 0.5*num_quat)
ggplot(temp1,aes(x=thtr_quat,y=num_quat,fill=fct_rev(mpaa_rating)))+
geom_col(colour="black")+
labs(fill="MPAA_Rating")+ ylab("Films Number") + xlab("Quarter") +
geom_text(aes(y=label_y,label=num_quat)) +
scale_fill_brewer(palette = "Accent") + theme_bw()Figure4.1: Number of releases of different MPAA rating films in each quarter
Discussion for Fig4.1: It shows that more films released in the fourth quarter of each year in the provided dataset. R-rating movies make up a more significant proportion of films released in each quarter.
temp2 <- movies.tb1 %>%
group_by(mpaa_rating,genre) %>%
summarise(num_genre=n()) %>%
ungroup()
ggplot(temp2,aes(x=fct_rev(mpaa_rating),y=num_genre,fill=genre))+
geom_col(color="black")+
ylab("Films Number group by Genre") + xlab("MPAA rating") +
scale_fill_brewer(palette = "Paired") + theme_bw()Figure4.2: Number of films of various genres in different MPAA ratings
Discussion for Fig4.2: In all of MPAA’s rating films, R-rating ones reaches the highest number. Drama ones make up a more significant proportion of all.
Different types of films, the audience will produce matched viewing feelings after appreciation, such as a horror film or a musical, which will have an impact on the film scoring.
MPAA’s rating of the film is based on the age of the audience, the audience to enjoy a child-oriented G-class film and an adult-oriented R-rated film to get the viewing experience is also different, the same will affect the film score.
In this article, the multivariate regression analysis will be proceeded based on the original data R-class feature film data.
dra_rfs <- movies.tb1 %>%
filter(genre=="Drama",mpaa_rating=="R")
dim(dra_rfs)## [1] 179 23
Now a new dataset was created for the modeling, it has 179 observations and 23 variables.
It is necessary to simplify the variables for the further work. All movies are R-rating and Drama ones so the mpaa_rating and genre will be removed because they are fixed value in this dataset. The best_* variables will be removed, and some potential factors will be moved to the start of the data frame.
dra_rfs <- dra_rfs %>%
select(-genre,-mpaa_rating) %>%
select(-contains("best_")) %>%
select(imdb_rating, runtime, imdb_num_votes, critics_score, audience_score,
osca_opt, thtr_quat, everything())
dim(dra_rfs)## [1] 179 16
The variable number of dataset for modeling has been cut down to 16.
ggpairs(dra_rfs, columns = 1:7)Figure4.3: Scatterplot matrices with correlation values
Discussion for Fig4.3:
-
The figure created by GGally::ggpairs is applied for the response variable
imdb_ratingand focused predictors in future modeling. -
On the top row of the figure, the correlation values about
imdb_ratingand other ones are displayed. It showsaudience-scorereaches the highest one 0.897,thtr_quatreaches the lowest value 0.0442.thtr_quathas a very weak correlation with the response variable. -
On the left side of the figure4.3, the scatterplot shows the linear relationship between
imdb_ratingand other ones. In the diagonal it shows a continuous approximation of the distribution of the respective variable. -
There are two predictors
critics_scoreandaudience_scoreare correlated at 0.652 (colinearity). Therefore, one of them might be removed from the model, how to deal with this will be further discussed in the modeling stage.
0.4.2 The feature about the numerical variables
p1 <- ggplot(dra_rfs, aes(x=runtime)) +
geom_histogram(aes(y=100*(..count..)/sum(..count..)), color='black', fill='lightblue', binwidth = 5) +
ylab('Percentage') + ggtitle('Run Time') + theme_bw()
p2 <- ggplot(dra_rfs, aes(x=log10(imdb_num_votes))) +
geom_histogram(aes(y=100*(..count..)/sum(..count..)), color='black', fill='lightblue') +
ylab('Percentage') + ggtitle('log(IMDB number of votes)')+ theme_bw()
p3 <- ggplot(dra_rfs, aes(x=critics_score)) +
geom_histogram(aes(y=100*(..count..)/sum(..count..)), color='black', fill='lightblue', binwidth = 5) +
ylab('Percentage') + ggtitle('Critics Score')+ theme_bw()
p4 <- ggplot(dra_rfs, aes(x=audience_score)) +
geom_histogram(aes(y=100*(..count..)/sum(..count..)), color='black', fill='lightblue', binwidth = 5) +
ylab('Percentage') + ggtitle('Audience Score') + theme_bw()
p1 + p2 + p3 + p4 + plot_layout(ncol = 2)Figure4.4: Histogram plottings for the numerical variables
Discussion for Fig4.4: The figure shows the four predictors have reasonable distributions and they are suitable to do the regression analysis. The imdb_num_votes was applied log10 to get a proper visualization due to the existence of some large numbers of votes.
0.4.3 The feature about the categorical variables
p1 <- ggplot(dra_rfs, aes(x=as.factor(osca_opt),y=imdb_rating)) +
geom_boxplot(color='black', fill='lightblue',show.legend = FALSE) +
xlab('Oscar options') + ggtitle('IMDB ratings vs Oscar options in the movie') + theme_bw()
p2 <- ggplot(dra_rfs, aes(x=as.factor(thtr_quat),y=imdb_rating)) +
geom_boxplot(color='black', fill='lightblue',show.legend = FALSE) +
xlab('Quarter') + ggtitle('IMDB ratings vs Quarter') + theme_bw()
p1 + p2 + plot_layout(ncol = 2)Figure4.5: Box-plot plottings for the categorical variables
-
The categorical variable
oscs_optseems to have a reasonable significant correlation withimdb_rating. -
The variable
thtr_quatis not a significant correlation withimdb_rating, and it will be evaluated in the modeling stage.
0.5 Part 5: Modeling
0.5.1 Model searching
-
Considering that the number of users voting on a movie at the IMDb website and the IMDb score is obtained at the same time as the IMDB site, as a predictive model for
imdb_ratingthis article will not useimdb_num_votesas an explanatory variable. -
At the EDA stage, it was found that the variable
critics_scoreandaudience_scorehad the colinearity. But these two variables are different types of scores from rotten tomato sites that can be interpreted separately. We will evaluate the modeling situation to determine the extent to which the variable is removed, with the retention or abandonment of the critics’ scores. - We will start with the full model, do the backwards elimination for the model selection.
0.5.1.1 Way#1: backwards elimination with p-value and adjusted \(R^2\)
Step 1. establish the full model
fit.full <- lm( imdb_rating ~ critics_score + audience_score + runtime + osca_opt + thtr_quat,
data = dra_rfs)
summary(fit.full)##
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score + runtime +
## osca_opt + thtr_quat, data = dra_rfs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.25624 -0.15579 0.02893 0.20870 0.77029
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.562864 0.187543 18.998 < 2e-16 ***
## critics_score 0.007095 0.001369 5.183 6.04e-07 ***
## audience_score 0.033657 0.001830 18.390 < 2e-16 ***
## runtime 0.004099 0.001765 2.322 0.0214 *
## osca_opt 0.145031 0.059096 2.454 0.0151 *
## thtr_quat -0.004506 0.022682 -0.199 0.8428
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3385 on 173 degrees of freedom
## Multiple R-squared: 0.844, Adjusted R-squared: 0.8395
## F-statistic: 187.1 on 5 and 173 DF, p-value: < 2.2e-16
The coefficient for thtr_quat isn’t significantly different from zero (p = 0.8428) suggesting that which quarter the movie released and it’s rating on IMDb site is not linearly related when controlling for the other predictor variables. Generally, the predictor variables in this model account for 84.4% of the variance in IMDb ratings of the Drama movies with R-rating.
Step 2. remove thtr_quat from the full model
fit01 <- update(fit.full, . ~ . - thtr_quat)
summary(fit01)##
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score + runtime +
## osca_opt, data = dra_rfs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.25432 -0.15076 0.03392 0.20826 0.77221
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.556458 0.184239 19.303 < 2e-16 ***
## critics_score 0.007087 0.001364 5.194 5.71e-07 ***
## audience_score 0.033669 0.001824 18.457 < 2e-16 ***
## runtime 0.004051 0.001744 2.323 0.0213 *
## osca_opt 0.143748 0.058579 2.454 0.0151 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3375 on 174 degrees of freedom
## Multiple R-squared: 0.8439, Adjusted R-squared: 0.8403
## F-statistic: 235.2 on 4 and 174 DF, p-value: < 2.2e-16
Step 3. remove thtr_quat and osca_opt from the full model
fit02 <- update(fit01, . ~ . - osca_opt)
summary(fit02)##
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score + runtime,
## data = dra_rfs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2919 -0.1433 0.0337 0.2100 0.7288
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.454413 0.182042 18.976 < 2e-16 ***
## critics_score 0.006998 0.001383 5.058 1.06e-06 ***
## audience_score 0.033367 0.001846 18.075 < 2e-16 ***
## runtime 0.005632 0.001643 3.427 0.00076 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3424 on 175 degrees of freedom
## Multiple R-squared: 0.8385, Adjusted R-squared: 0.8358
## F-statistic: 302.9 on 3 and 175 DF, p-value: < 2.2e-16
Now the model fit02’s adjusted \(R^2\) value is 0.8357597 and it is less than the one of fit01(0.8403411). So we could confirm the final model would be fit01 with the lm formula imdb_rating ~ critics_score + audience_score + runtime + osca_opt.
Step 4. evaluating multicolinearity Here we use the variance inflation factor(VIF) to detect the multicolinearity. So the colinearity about critics_score and audience_score could be evaluated. The VIF values could be provided by the car::vif() function.
sqrt(vif(fit01)) > 2## critics_score audience_score runtime osca_opt
## FALSE FALSE FALSE FALSE
The results indicate there is no multicolinearity issue with these predictor variables. The variables critics_score and audience_score will be reserved in modeling.
0.5.1.2 Way#2: use AIC to perform a model search and cross check
Here is the way to use Akaike Information Criterion(AIC) to perform a model search with function step() .
fit.final <- step(fit.full, direction = "backward") ## Start: AIC=-381.92
## imdb_rating ~ critics_score + audience_score + runtime + osca_opt +
## thtr_quat
##
## Df Sum of Sq RSS AIC
## - thtr_quat 1 0.005 19.825 -383.88
## <none> 19.820 -381.92
## - runtime 1 0.618 20.438 -378.43
## - osca_opt 1 0.690 20.511 -377.79
## - critics_score 1 3.078 22.898 -358.08
## - audience_score 1 38.748 58.568 -189.98
##
## Step: AIC=-383.88
## imdb_rating ~ critics_score + audience_score + runtime + osca_opt
##
## Df Sum of Sq RSS AIC
## <none> 19.825 -383.88
## - runtime 1 0.615 20.440 -380.41
## - osca_opt 1 0.686 20.511 -379.79
## - critics_score 1 3.073 22.898 -360.08
## - audience_score 1 38.812 58.637 -191.77
summary(fit.final)##
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score + runtime +
## osca_opt, data = dra_rfs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.25432 -0.15076 0.03392 0.20826 0.77221
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.556458 0.184239 19.303 < 2e-16 ***
## critics_score 0.007087 0.001364 5.194 5.71e-07 ***
## audience_score 0.033669 0.001824 18.457 < 2e-16 ***
## runtime 0.004051 0.001744 2.323 0.0213 *
## osca_opt 0.143748 0.058579 2.454 0.0151 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3375 on 174 degrees of freedom
## Multiple R-squared: 0.8439, Adjusted R-squared: 0.8403
## F-statistic: 235.2 on 4 and 174 DF, p-value: < 2.2e-16
Compared with the results of summary(fit01), we can say all of the coefficients and \(R^2\),\(adjusted R^2\) are the same. So the model fit.final and fit01 is the same one.
0.5.2 Interpret the final model
\[\hat{IMDb\_rating}\ = 3.556 + 0.007*critics\_score + 0.034*audience\_score + 0.004*runtime + 0.144*osca\_opt\]
Intercept(3.556) is the estimated IMDb rating for a movie with critics score and audience score from Rotten Tomatoes site, runtime, the status about Oscar Awards related to the film, for the Drama movies with R-rating. It does not provide any meaningful interpretation here.
critics_scorecoefficient(0.007): All else hold constant, for every one unit increase incritics_score, the model predicts a 0.007 increase inimdb_ratingon average.audience_scorecoefficient(0.034): All else hold constant, for every one unit increase inaudience_score, the model predicts a 0.034 increase inimdb_ratingon average.runtimecoefficient(0.004): All else hold constant, for every one unit increase inruntime, the model predicts a 0.004 increase inimdb_ratingon average.osca_optcoefficient(0.144): All else hold constant, if the movie won the best picture Oscar or was nominated or one of the main actors/actresses won an Oscar or the director won an Oscar, the model predicts a 0.144 increase inimdb_ratingon average.- \(R^2\)(0.844): 84.4% of the variability in IMDb rating can be explained by the model.
0.5.3 Regression diagnostics
0.5.3.1 Linearity check
If the dependent variable is linearly related to the independent variables, there should be no systematic relationship between the residuals and the fitted values. Here the below code chunk to check it:
crPlots(fit01)Figure5.1 Component plus residual plots for the regression of final model on movie characteristics
Discussion for Fig5.1: The component plus residual plots confirm that we’ve met the linearity assumption. The form of the linear model seems to be appropriate for this dataset.
0.5.3.2 Normality check
The qqPlot() function provides a more accurate method of assessing the normality assumption than that provided by the plot() function in the base package.
dra_rfs <- dra_rfs %>%
mutate(title_bak = title) %>%
column_to_rownames(var="title_bak")
qqPlot(fit01, labels=row.names(dra_rfs),simulate=TRUE, envelope=.95)Figure5.2 Q-Q plot for studentized residuals
## Filly Brown The Room
## 1 39
Discussion for Fig5.2: There are big negative residuals for the records 1(Filly Brown) and 39(The Room). Except them, all the points fall close to the line and are within the confidence envelope, suggesting that we met the normality assumption fairly well.
We will have a look at the records because the model gave the excessive estimation for the records Filly Brown and The Room.
dra_rfs %>% slice(c(1,39)) %>%
select(title,thtr_rel_year, studio,
imdb_rating, imdb_num_votes, critics_score, critics_rating,
audience_score, audience_rating, runtime, osca_opt) %>%
kable(align = "c", caption ="TABLE5.1 Movie-records about Filly Brown and The Room") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"),
font_size = 12,full_width = TRUE) %>%
scroll_box(width = "900px")| title | thtr_rel_year | studio | imdb_rating | imdb_num_votes | critics_score | critics_rating | audience_score | audience_rating | runtime | osca_opt |
|---|---|---|---|---|---|---|---|---|---|---|
| Filly Brown | 2013 | Indomina Media Inc. | 5.5 | 899 | 45 | Rotten | 73 | Upright | 80 | 0 |
| The Room | 2003 | Chloe Productions | 3.5 | 27769 | 35 | Rotten | 46 | Spilled | 99 | 0 |
In my opinion, the reason about the predictive deviation for the two movies is beyond the modeling. It’s a big possibility that this result was related to the content and expressive techniques of the films. If we could have more movies to analysis, maybe we could find the reasonable explanation.
residplot(fit01) # Function residplot() is referenced from Page189,R in Action 2nd Edition.Figure5.3 Distribution of studentized residuals
Discussion for Fig5.3: It shows the issue about the normal distribution. It was obviously found that the center residuals is more dense than the theoretical normal distribution curve. It was caused by the positive residuals around 0 which could be observed from Figure5.2.
0.5.3.3 Variability(Homoscedasticity) check
ncvTest(fit01)## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 33.55001, Df = 1, p = 6.9455e-09
Here we got a significant score(p-value=\(6.9455*10^{-9}\)),it means the final model does not meet the constant variance assumption for the residuals.
spreadLevelPlot(fit01)Figure5.4 Spread-level plot for assessing constant error variance
##
## Suggested power transformation: 2.967555
Discussion for Fig5.4: The result from function car::spreadLevelPlot() suggested power transformation 3, means the suggest modeling would be: \(y^3= \beta_0 + \beta_1*x_1 + \beta_2*x_2 + ...\). The transformations of the response variable can help in situations of heteroscedasticity, but for this practice it might be difficult to explain the meaning of \(IMDb\_rating^3\) in the real world.
0.5.3.4 Independancy check
As we knew each record of the movie in the dataset is the independent values. Now we check the independence of the residuals.
fit.res <- as_tibble(data.frame(index = seq(1:length(fit01$residuals)),
residuals = fit01$residuals,
title = dra_rfs$title))
fit.res <- fit.res %>%
mutate(plotname = ifelse( abs(residuals) >1,as.character(title), ""))
ggplot(fit.res,aes(x=index,y=residuals)) + geom_point() + geom_text(aes(x=index,label=plotname),vjust=-0.8)+ theme_bw()Figure5.5 Independany Conditions
Discussion for Fig5.5: The residual values concentrat around 0 except the two outliers Filly Brown and The Room. They were discussed by Fig5.2.
durbinWatsonTest(fit01) # detect the serially correlated errors in package car## lag Autocorrelation D-W Statistic p-value
## 1 0.02003874 1.891553 0.484
## Alternative hypothesis: rho != 0
The p-value(0.444) is not significant, this means a lack of autocorrelation.
0.6 Part 6: Prediction
0.6.1 Overview
We will use the previous MLR model fit01 to create a prediction algorithym. The response variable imdb_rating will be computed by the below formula: \[IMDb\_rating\ = 3.556 + 0.007*critics\_score + 0.034*audience\_score + 0.004*runtime + 0.144*osca\_opt\]
0.6.2 New movie
Here the movie Hacksaw Ridge(2016) is picked to do a prediction for the model.
The movie Hacksaw Ridge, which adapted from a real story, was released on Nov 4, 2016. It was rated by R for intense, prolonged realistically graphic sequences of war violence including grisly bloody images. The genre is Action & Adventure, Drama. Summit Entertainment produced it, and the runtime is 139 minutes. IMDb scored it with 8.1. This movie’s director Mel Columcille Gerard Gibson won Oscar with Braveheart on 1995.
The observations about the movie were acquired from Rotten Tomatoes and IMDb websites.
movie.new <- data.frame(title="Hacksaw Ridge", title_type="", genre="Drama", runtime=139,
mpaa_rating="R",studio="Summit Entertainment", imdb_rating=8.1,
imdb_num_votes=363769, critics_rating="Certified Fresh",
audience_rating="Upright",critics_score=86, audience_score=91,
osca_opt=1,thtr_quat=4)0.6.3 Predict Result
predict(fit01, movie.new, interval = "prediction", level = 0.95)## fit lwr upr
## 1 7.936636 7.258669 8.614602
The actual audience score for this movie is 8.1 and the prediction interval contains this value.
0.7 Part 7: Conclusion
Generally, we could use the model based on one movie’s score on the Rotten Tomatoes site and the runtime, some information about the movie’s Oscar award to predict the score on the IMDb site, the movies’ type is limited to Drama and R-rating.
The modeling is based on the Drama and R-rating movies in the provided dataset, some limitations about the data source might cause the issues for the modeling. If there were some external information about the movie’s cost and gross, it might be more accurate to do the regression analysis of the popularity of the films.
0.8 Part 8: Bibliography(Main ones only, the textbooks not included)
- Luis Torgo(2016). Data Mining with R: LEARNING WITH CASE STUDIES,2nd EDITION. Chapman & Hall/CRC.
- Winston Change(2018). R Graphics Cookbook:Pratical Recipes for Visualizing Data,2nd Edition. O’REILLY
- Robert I. Kabacoff(2016). R in Action,Data Analysis and Graphics with R 2nd Edition. Manning Publications Co.
- MPAA film ratings
- IMDb
- Rotten Tomatoes