Modeling and prediction for movies

0.1 Part 1: Setup

This project details my analysis of the movie dataset that contains information from IMDb and Rotten Tomatoes for a random sample of movies between 1970 and 2014. The purpose of this project is to develop a multiple linear regression model to understand what attributes make a film popular, and I want to learn some knowledge about the movie reviews.

0.1.1 Load packages

Loading R packages used besides base R.

library(knitr)
library(kableExtra)
library(tidyverse)
library(patchwork)      # stitch ggplots together
library(lubridate)
library(GGally)
library(car)

The package Hmisc will be used in this article, but it will not be loaded here to eliminate the conflicts.

0.1.2 Background about the original dataset

MPAA logo IMDB logo Rotten Tomatoes logo

Figure1.1 logos of MPAA, IMDb and Rotten Tomatoes

The MPAA(Motion Picture Association of America) rating of the movie with five levels. They are G, PG, PG-13, R, NC-17. In this dataset, value Unrated was added beside the original MPAA’s ratings.

IMDb(Internet Movie Database) offers a rating scale that allows the website’s users to rate films on a scale of one to ten.

Rotten Tomatoes is an American review-aggregation website for film and television. There are three levels from the professional film critics, and they are Certified Fresh, Fresh to Rotten. In a way, Rotten Tomatoes uses professional evaluation from the critics as the mainly score, general audience’s assessment as a supplement.

Coursera offered the data set. It is comprised of 651 randomly sampled movies produced and released in the theaters before 2015. Each row in the dataset contains a movie and some characteristic/ratings of it.

For this research, it is essential to state:

The original dataset about each movie’s score was taken from both the IMDb and Rotten Tomatoes database. Characteristics of the site user community influence trends in specific movie scores.
This article was an observational study. There is no causation can be established because a random assignment is not used in this study.
In this study, I will assume some voting results about a film on website A could be used to predict the same movie’s score on website B. However, in the real world the score for the same movie occurs at the same time on two sites A and B.

0.2 Part 2: Data Import

At this step, the raw data which was downloaded from the Coursera webpage will be loaded from the local disk.

load("movies.Rdata")

str(movies)

## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  32 variables:
##  $ title           : chr  "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ thtr_rel_year   : num  2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num  4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num  19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num  2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num  7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num  30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num  73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ director        : chr  "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr  "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr  "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr  "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr  "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr  "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr  "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr  "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...

For the raw dataset movies, there are 651 observations, it is the total number of the movies. There are 32 variables in the raw dataset, they are mixed with numerical and categorical ones. Some variables will be ignored in further research.

There might be some duplicate rows in the raw dataset. It is essential to have a check.

movies.dup <- movies %>% select(title) %>% 
  table() %>%
  data.frame() %>%
  filter(Freq >1)
movies.dup %>% rename(title='.') %>%
  inner_join(movies,by="title") %>%
  select(title, studio, thtr_rel_year) %>%
  arrange(title, thtr_rel_year) %>%
  kable(align = "c", caption ="TABLE2.1 Movies with duplicated title")  %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"),full_width = TRUE)

TABLE2.1 Movies with duplicated title
title	studio	thtr_rel_year
Hairspray	New Line Home Entertainment	1988
Hairspray	New Line Cinema	2007
Man on Wire	Magnolia Pictures	2008
Man on Wire	Magnolia Pictures	2008
Maniac	Analysis	1980
Maniac	IFC Midnight	2013
Where the Heart Is	Touchstone Pictures	1990
Where the Heart Is	Twentieth Century Fox Home Entertainment	2000

There are four pair movies with the same title. Some one could be considered as the remake or duplicate title ones because the release year is different. It is evident that the movie Man on Wire produced by Magnolia Pictures on 2008 had a duplicated record. One row of them will be removed to keep the unique ones.

movies <- movies[!duplicated(movies),]
dim(movies)

## [1] 650  32

One duplicated row is removed.

Now start to remove some redundant variables.In this study the below list shows 20 variables are selected to start the analysis in EDA stage of the research.

Table2.2 EDA focused 20 variables and the description in the raw dataset

Variable Name	Description
title	Title of movie
title_type	Type of movie (Documentary, Feature Film, TV Movie)
genre	Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other)
runtime	Runtime of movie (in minutes)
mpaa_rating	MPAA rating of the movie (G, PG, PG-13, R, NC-17, Unrated)
studio	The Studio produced the movie
thtr_rel_year	Year the movie is released in theaters
thtr_rel_month	Month the movie is released in theaters
thtr_rel_day	Day of the month the movie is released in theaters
imdb_rating	Rating on IMDB
imdb_num_votes	Number of votes on IMDB
critics_rating	Categorical variable for critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten)
critics_score	Critics score on Rotten Tomatoes
audience_rating	Categorical variable for audience rating on Rotten Tomatoes (Spilled, Upright)
audience_score	Audience score on Rotten Tomatoes
best_pic_nom	Whether or not the movie was nominated for a best picture Oscar (no, yes)
best_pic_win	Whether or not the movie won a best picture Oscar (no, yes)
best_actor_win	Whether or not one of the main actors in the movie ever won an Oscar (no, yes) ?C note that this is not necessarily whether the actor won an Oscar for their role in the given movie
best_actress win	Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) ?C note that this is not necessarily whether the actresses won an Oscar for their role in the given movie
best_dir_win	Whether or not the director of the movie ever won an Oscar (no, yes) ?C note that this is not necessarily whether the director won an Oscar for the given movie

movies.tb0 <- movies %>%
  select(title:best_dir_win) %>%
  select(-contains("DVD"))

Some variables are ignored, such as top200_box, director, actor1~5, imdb_url, rt_url and information about the movie released on DVD. Now the movies.tb0 is a simplified dataset compared with the raw one.

dim(movies.tb0)

## [1] 650  20

For the dataset movies.tb0, there are 650 movies in the dataset, 20 variables reserved.

Here shows some information about the movies.

movies.tb0 %>% 
  select(c("mpaa_rating","title_type","imdb_rating","imdb_num_votes",
           "critics_rating","audience_rating","runtime")) %>%
  Hmisc::describe()

## . 
## 
##  7  Variables      650  Observations
## ---------------------------------------------------------------------------
## mpaa_rating 
##        n  missing distinct 
##      650        0        6 
##                                                           
## Value            G   NC-17      PG   PG-13       R Unrated
## Frequency       19       2     118     132     329      50
## Proportion   0.029   0.003   0.182   0.203   0.506   0.077
## ---------------------------------------------------------------------------
## title_type 
##        n  missing distinct 
##      650        0        3 
##                                                  
## Value       Documentary Feature Film     TV Movie
## Frequency            54          591            5
## Proportion        0.083        0.909        0.008
## ---------------------------------------------------------------------------
## imdb_rating 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      650        0       56    0.999    6.491    1.187     4.40     5.10 
##      .25      .50      .75      .90      .95 
##     5.90     6.60     7.30     7.71     7.90 
## 
## lowest : 1.9 2.3 2.4 2.7 2.8, highest: 8.2 8.3 8.4 8.5 9.0
## ---------------------------------------------------------------------------
## imdb_num_votes 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      650        0      644        1    57563    82785     1286     1993 
##      .25      .50      .75      .90      .95 
##     4543    15070    58484   151962   263069 
## 
## lowest :    180    183    281    285    318, highest: 753592 756602 797101 806911 893008
## ---------------------------------------------------------------------------
## critics_rating 
##        n  missing distinct 
##      650        0        3 
##                                                           
## Value      Certified Fresh           Fresh          Rotten
## Frequency              134             209             307
## Proportion           0.206           0.322           0.472
## ---------------------------------------------------------------------------
## audience_rating 
##        n  missing distinct 
##      650        0        2 
##                           
## Value      Spilled Upright
## Frequency      275     375
## Proportion   0.423   0.577
## ---------------------------------------------------------------------------
## runtime 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      649        1       89        1    105.8    20.12     83.4     86.8 
##      .25      .50      .75      .90      .95 
##     92.0    103.0    116.0    129.0    137.6 
## 
## lowest :  39  40  65  68  74, highest: 178 180 194 202 267
## ---------------------------------------------------------------------------

From the above description of the dataset, here we could say:

The MPAA rating has a total of 6 levels, as level is arranged in alphabetical order and needs to be rearranged;
Feature films account for 90.9% of all films;
More than 50 of the film IMDb score is more than 6.6 points;
There are some huge number of IMDb votes, they are bigger than 750,000. The medium vote number is 15070.
Rotten Tomatoes critics rated higher (Certified Fresh) accounted for 20.6% of the film; Fresh ones are slightly more than Rotten one in overall;
Rotten Tomatoes audience rated good or bad is half to half.
One missing value was found in runtime and it is necessary to check other ones.

movies.tb0 %>% sapply(is.na) %>% colSums()

##            title       title_type            genre          runtime 
##                0                0                0                1 
##      mpaa_rating           studio    thtr_rel_year   thtr_rel_month 
##                0                8                0                0 
##     thtr_rel_day      imdb_rating   imdb_num_votes   critics_rating 
##                0                0                0                0 
##    critics_score  audience_rating   audience_score     best_pic_nom 
##                0                0                0                0 
##     best_pic_win   best_actor_win best_actress_win     best_dir_win 
##                0                0                0                0

The result shows eight films are missing studio record, one losing runtime record. One observation missed runtime record will be removed; we can keep the missed studio records because this factor will not affect this research.

movies.tb0 <- movies.tb0 %>% filter(!is.na(runtime))
dim(movies.tb0)

## [1] 649  20

Here the dataset will be releveled on the mpaa_rating. The thtr_quat variable will be added for considering the quarterly factors for the film releasing. The osca_opt will be valued one if there was any Yes in the variables of best_*.

movies.tb1 <- movies.tb0 %>%
  mutate(mpaa_rating = fct_relevel(mpaa_rating,"Unrated","NC-17","R","PG-13","PG","G"),
         thtr_date = make_date(thtr_rel_year,thtr_rel_month,thtr_rel_day),
         thtr_quat = quarter(thtr_date),
         osca_opt  = ifelse(best_pic_nom=="yes"|best_pic_win=="yes"|best_actor_win=="yes"|
                              best_actress_win=="yes"|best_dir_win=="yes",1,0))

The new variables and other ones will be plotted in the next EDA stage.

0.3 Part 3: Research question

Is there a relationship between the ratings that a movie gets on the IMDb website and the grades that the same film comes on the Rotten Tomatoes?

Through the score of a movie on the Rotten Tomatoes, as well as the length of the show, the release season and other factors, can it better speculate its rating on the IMDb website? This article assumes here that users of two sites involved in this data are independent of each other.

Usually, the drama type of the film and the characteristics of the audience group will have a positive impact on the audience’s score. The same is true even for professional film critics. In the EDA part of this article, the film will be further classified and studied in terms of each MPAA rating and genre characteristics. The variable imdb_rating will be considered as a response one to initiate the research.

0.4 Part 4: Exploratory data analysis

0.4.1 Movie type and feature predictors selection

In the EDA stage, first I analysis the movies spread in each quarter with different MPAA ratings the category of the movie will be explored.

temp1 <- movies.tb1 %>%
  group_by(thtr_quat,mpaa_rating) %>%
  summarise(num_quat=n()) %>%
  mutate(label_y = cumsum(num_quat) - 0.5*num_quat)
ggplot(temp1,aes(x=thtr_quat,y=num_quat,fill=fct_rev(mpaa_rating)))+
  geom_col(colour="black")+
  labs(fill="MPAA_Rating")+ ylab("Films Number") + xlab("Quarter") +
  geom_text(aes(y=label_y,label=num_quat)) +
  scale_fill_brewer(palette = "Accent") + theme_bw()

Figure4.1: Number of releases of different MPAA rating films in each quarter

Discussion for Fig4.1: It shows that more films released in the fourth quarter of each year in the provided dataset. R-rating movies make up a more significant proportion of films released in each quarter.

temp2 <- movies.tb1 %>% 
  group_by(mpaa_rating,genre) %>%
  summarise(num_genre=n()) %>%
  ungroup()
ggplot(temp2,aes(x=fct_rev(mpaa_rating),y=num_genre,fill=genre))+
  geom_col(color="black")+ 
  ylab("Films Number group by Genre") + xlab("MPAA rating") +
  scale_fill_brewer(palette = "Paired") + theme_bw()

Figure4.2: Number of films of various genres in different MPAA ratings

Discussion for Fig4.2: In all of MPAA’s rating films, R-rating ones reaches the highest number. Drama ones make up a more significant proportion of all.

Different types of films, the audience will produce matched viewing feelings after appreciation, such as a horror film or a musical, which will have an impact on the film scoring.

MPAA’s rating of the film is based on the age of the audience, the audience to enjoy a child-oriented G-class film and an adult-oriented R-rated film to get the viewing experience is also different, the same will affect the film score.

In this article, the multivariate regression analysis will be proceeded based on the original data R-class feature film data.

dra_rfs <- movies.tb1 %>%
  filter(genre=="Drama",mpaa_rating=="R")
dim(dra_rfs)

## [1] 179  23

Now a new dataset was created for the modeling, it has 179 observations and 23 variables.

It is necessary to simplify the variables for the further work. All movies are R-rating and Drama ones so the mpaa_rating and genre will be removed because they are fixed value in this dataset. The best_* variables will be removed, and some potential factors will be moved to the start of the data frame.

dra_rfs <- dra_rfs %>% 
  select(-genre,-mpaa_rating) %>%
  select(-contains("best_")) %>%
  select(imdb_rating, runtime, imdb_num_votes, critics_score, audience_score, 
         osca_opt, thtr_quat, everything())
dim(dra_rfs)

## [1] 179  16

The variable number of dataset for modeling has been cut down to 16.

ggpairs(dra_rfs, columns = 1:7)

Figure4.3: Scatterplot matrices with correlation values

Discussion for Fig4.3:

The figure created by GGally::ggpairs is applied for the response variable imdb_rating and focused predictors in future modeling.
On the top row of the figure, the correlation values about imdb_rating and other ones are displayed. It shows audience-score reaches the highest one 0.897, thtr_quat reaches the lowest value 0.0442. thtr_quat has a very weak correlation with the response variable.
On the left side of the figure4.3, the scatterplot shows the linear relationship between imdb_rating and other ones. In the diagonal it shows a continuous approximation of the distribution of the respective variable.
There are two predictors critics_score and audience_score are correlated at 0.652 (colinearity). Therefore, one of them might be removed from the model, how to deal with this will be further discussed in the modeling stage.

0.4.2 The feature about the numerical variables

p1 <- ggplot(dra_rfs, aes(x=runtime)) + 
  geom_histogram(aes(y=100*(..count..)/sum(..count..)), color='black', fill='lightblue', binwidth = 5) +
  ylab('Percentage') + ggtitle('Run Time') + theme_bw()

p2 <- ggplot(dra_rfs, aes(x=log10(imdb_num_votes))) +
  geom_histogram(aes(y=100*(..count..)/sum(..count..)), color='black', fill='lightblue') +
  ylab('Percentage') + ggtitle('log(IMDB number of votes)')+ theme_bw()

p3 <- ggplot(dra_rfs, aes(x=critics_score)) +
  geom_histogram(aes(y=100*(..count..)/sum(..count..)), color='black', fill='lightblue', binwidth = 5) +
  ylab('Percentage') + ggtitle('Critics Score')+ theme_bw()

p4 <- ggplot(dra_rfs, aes(x=audience_score)) + 
  geom_histogram(aes(y=100*(..count..)/sum(..count..)), color='black', fill='lightblue', binwidth = 5) +
  ylab('Percentage') + ggtitle('Audience Score') + theme_bw()

p1 + p2 + p3 + p4 + plot_layout(ncol = 2)

Figure4.4: Histogram plottings for the numerical variables

Discussion for Fig4.4: The figure shows the four predictors have reasonable distributions and they are suitable to do the regression analysis. The imdb_num_votes was applied log10 to get a proper visualization due to the existence of some large numbers of votes.

0.4.3 The feature about the categorical variables

p1 <- ggplot(dra_rfs, aes(x=as.factor(osca_opt),y=imdb_rating)) +
  geom_boxplot(color='black', fill='lightblue',show.legend = FALSE) +
  xlab('Oscar options') + ggtitle('IMDB ratings vs Oscar options in the movie') + theme_bw()

p2 <- ggplot(dra_rfs, aes(x=as.factor(thtr_quat),y=imdb_rating)) +
  geom_boxplot(color='black', fill='lightblue',show.legend = FALSE) +
  xlab('Quarter') + ggtitle('IMDB ratings vs Quarter') + theme_bw()

p1 + p2 + plot_layout(ncol = 2)

Figure4.5: Box-plot plottings for the categorical variables

Discussion for Fig4.5:

The categorical variable oscs_opt seems to have a reasonable significant correlation with imdb_rating.
The variable thtr_quat is not a significant correlation with imdb_rating, and it will be evaluated in the modeling stage.

0.5 Part 5: Modeling

0.5.1 Model searching

Considering that the number of users voting on a movie at the IMDb website and the IMDb score is obtained at the same time as the IMDB site, as a predictive model for imdb_rating this article will not use imdb_num_votes as an explanatory variable.
At the EDA stage, it was found that the variable critics_score and audience_score had the colinearity. But these two variables are different types of scores from rotten tomato sites that can be interpreted separately. We will evaluate the modeling situation to determine the extent to which the variable is removed, with the retention or abandonment of the critics’ scores.
We will start with the full model, do the backwards elimination for the model selection.

0.5.1.1 Way#1: backwards elimination with p-value and adjusted \(R^2\)

Step 1. establish the full model

fit.full <- lm( imdb_rating ~ critics_score + audience_score + runtime + osca_opt + thtr_quat, 
                data = dra_rfs)
summary(fit.full)

## 
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score + runtime + 
##     osca_opt + thtr_quat, data = dra_rfs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.25624 -0.15579  0.02893  0.20870  0.77029 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.562864   0.187543  18.998  < 2e-16 ***
## critics_score   0.007095   0.001369   5.183 6.04e-07 ***
## audience_score  0.033657   0.001830  18.390  < 2e-16 ***
## runtime         0.004099   0.001765   2.322   0.0214 *  
## osca_opt        0.145031   0.059096   2.454   0.0151 *  
## thtr_quat      -0.004506   0.022682  -0.199   0.8428    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3385 on 173 degrees of freedom
## Multiple R-squared:  0.844,  Adjusted R-squared:  0.8395 
## F-statistic: 187.1 on 5 and 173 DF,  p-value: < 2.2e-16

The coefficient for thtr_quat isn’t significantly different from zero (p = 0.8428) suggesting that which quarter the movie released and it’s rating on IMDb site is not linearly related when controlling for the other predictor variables. Generally, the predictor variables in this model account for 84.4% of the variance in IMDb ratings of the Drama movies with R-rating.

Step 2. remove thtr_quat from the full model

fit01 <- update(fit.full, . ~ . - thtr_quat)
summary(fit01)

## 
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score + runtime + 
##     osca_opt, data = dra_rfs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.25432 -0.15076  0.03392  0.20826  0.77221 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.556458   0.184239  19.303  < 2e-16 ***
## critics_score  0.007087   0.001364   5.194 5.71e-07 ***
## audience_score 0.033669   0.001824  18.457  < 2e-16 ***
## runtime        0.004051   0.001744   2.323   0.0213 *  
## osca_opt       0.143748   0.058579   2.454   0.0151 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3375 on 174 degrees of freedom
## Multiple R-squared:  0.8439, Adjusted R-squared:  0.8403 
## F-statistic: 235.2 on 4 and 174 DF,  p-value: < 2.2e-16

Step 3. remove thtr_quat and osca_opt from the full model

fit02 <- update(fit01, . ~ . - osca_opt)
summary(fit02)

## 
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score + runtime, 
##     data = dra_rfs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2919 -0.1433  0.0337  0.2100  0.7288 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.454413   0.182042  18.976  < 2e-16 ***
## critics_score  0.006998   0.001383   5.058 1.06e-06 ***
## audience_score 0.033367   0.001846  18.075  < 2e-16 ***
## runtime        0.005632   0.001643   3.427  0.00076 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3424 on 175 degrees of freedom
## Multiple R-squared:  0.8385, Adjusted R-squared:  0.8358 
## F-statistic: 302.9 on 3 and 175 DF,  p-value: < 2.2e-16

Now the model fit02’s adjusted \(R^2\) value is 0.8357597 and it is less than the one of fit01(0.8403411). So we could confirm the final model would be fit01 with the lm formula imdb_rating ~ critics_score + audience_score + runtime + osca_opt.

Step 4. evaluating multicolinearity Here we use the variance inflation factor(VIF) to detect the multicolinearity. So the colinearity about critics_score and audience_score could be evaluated. The VIF values could be provided by the car::vif() function.

sqrt(vif(fit01)) > 2

##  critics_score audience_score        runtime       osca_opt 
##          FALSE          FALSE          FALSE          FALSE

The results indicate there is no multicolinearity issue with these predictor variables. The variables critics_score and audience_score will be reserved in modeling.

0.5.1.2 Way#2: use AIC to perform a model search and cross check

Here is the way to use Akaike Information Criterion(AIC) to perform a model search with function step() .

fit.final <- step(fit.full, direction = "backward")

## Start:  AIC=-381.92
## imdb_rating ~ critics_score + audience_score + runtime + osca_opt + 
##     thtr_quat
## 
##                  Df Sum of Sq    RSS     AIC
## - thtr_quat       1     0.005 19.825 -383.88
## <none>                        19.820 -381.92
## - runtime         1     0.618 20.438 -378.43
## - osca_opt        1     0.690 20.511 -377.79
## - critics_score   1     3.078 22.898 -358.08
## - audience_score  1    38.748 58.568 -189.98
## 
## Step:  AIC=-383.88
## imdb_rating ~ critics_score + audience_score + runtime + osca_opt
## 
##                  Df Sum of Sq    RSS     AIC
## <none>                        19.825 -383.88
## - runtime         1     0.615 20.440 -380.41
## - osca_opt        1     0.686 20.511 -379.79
## - critics_score   1     3.073 22.898 -360.08
## - audience_score  1    38.812 58.637 -191.77

summary(fit.final)

## 
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score + runtime + 
##     osca_opt, data = dra_rfs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.25432 -0.15076  0.03392  0.20826  0.77221 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.556458   0.184239  19.303  < 2e-16 ***
## critics_score  0.007087   0.001364   5.194 5.71e-07 ***
## audience_score 0.033669   0.001824  18.457  < 2e-16 ***
## runtime        0.004051   0.001744   2.323   0.0213 *  
## osca_opt       0.143748   0.058579   2.454   0.0151 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3375 on 174 degrees of freedom
## Multiple R-squared:  0.8439, Adjusted R-squared:  0.8403 
## F-statistic: 235.2 on 4 and 174 DF,  p-value: < 2.2e-16

Compared with the results of summary(fit01), we can say all of the coefficients and \(R^2\),\(adjusted R^2\) are the same. So the model fit.final and fit01 is the same one.

0.5.2 Interpret the final model

\[\hat{IMDb\_rating}\ = 3.556 + 0.007*critics\_score + 0.034*audience\_score + 0.004*runtime + 0.144*osca\_opt\]

Intercept(3.556) is the estimated IMDb rating for a movie with critics score and audience score from Rotten Tomatoes site, runtime, the status about Oscar Awards related to the film, for the Drama movies with R-rating. It does not provide any meaningful interpretation here.
critics_score coefficient(0.007): All else hold constant, for every one unit increase in critics_score, the model predicts a 0.007 increase in imdb_rating on average.
audience_score coefficient(0.034): All else hold constant, for every one unit increase in audience_score, the model predicts a 0.034 increase in imdb_rating on average.
runtime coefficient(0.004): All else hold constant, for every one unit increase in runtime, the model predicts a 0.004 increase in imdb_rating on average.
osca_opt coefficient(0.144): All else hold constant, if the movie won the best picture Oscar or was nominated or one of the main actors/actresses won an Oscar or the director won an Oscar, the model predicts a 0.144 increase in imdb_rating on average.
\(R^2\)(0.844): 84.4% of the variability in IMDb rating can be explained by the model.

0.5.3 Regression diagnostics

0.5.3.1 Linearity check

If the dependent variable is linearly related to the independent variables, there should be no systematic relationship between the residuals and the fitted values. Here the below code chunk to check it:

crPlots(fit01)

Figure5.1 Component plus residual plots for the regression of final model on movie characteristics

Discussion for Fig5.1: The component plus residual plots confirm that we’ve met the linearity assumption. The form of the linear model seems to be appropriate for this dataset.

0.5.3.2 Normality check

The qqPlot() function provides a more accurate method of assessing the normality assumption than that provided by the plot() function in the base package.

dra_rfs <- dra_rfs  %>%
  mutate(title_bak = title)  %>%
  column_to_rownames(var="title_bak")
qqPlot(fit01, labels=row.names(dra_rfs),simulate=TRUE, envelope=.95)

Figure5.2 Q-Q plot for studentized residuals

## Filly Brown    The Room 
##           1          39

Discussion for Fig5.2: There are big negative residuals for the records 1(Filly Brown) and 39(The Room). Except them, all the points fall close to the line and are within the confidence envelope, suggesting that we met the normality assumption fairly well.

We will have a look at the records because the model gave the excessive estimation for the records Filly Brown and The Room.

dra_rfs %>% slice(c(1,39))  %>%
select(title,thtr_rel_year, studio, 
       imdb_rating, imdb_num_votes, critics_score, critics_rating,
       audience_score, audience_rating, runtime, osca_opt)  %>%
kable(align = "c", caption ="TABLE5.1 Movie-records about Filly Brown and The Room")  %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                font_size = 12,full_width = TRUE)  %>% 
    scroll_box(width = "900px")

TABLE5.1 Movie-records about Filly Brown and The Room
title	thtr_rel_year	studio	imdb_rating	imdb_num_votes	critics_score	critics_rating	audience_score	audience_rating	runtime	osca_opt
Filly Brown	2013	Indomina Media Inc.	5.5	899	45	Rotten	73	Upright	80	0
The Room	2003	Chloe Productions	3.5	27769	35	Rotten	46	Spilled	99	0

In my opinion, the reason about the predictive deviation for the two movies is beyond the modeling. It’s a big possibility that this result was related to the content and expressive techniques of the films. If we could have more movies to analysis, maybe we could find the reasonable explanation.

residplot(fit01)  # Function residplot() is referenced from Page189,R in Action 2nd Edition.

Figure5.3 Distribution of studentized residuals

Discussion for Fig5.3: It shows the issue about the normal distribution. It was obviously found that the center residuals is more dense than the theoretical normal distribution curve. It was caused by the positive residuals around 0 which could be observed from Figure5.2.

0.5.3.3 Variability(Homoscedasticity) check

ncvTest(fit01)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 33.55001, Df = 1, p = 6.9455e-09

Here we got a significant score(p-value=\(6.9455*10^{-9}\)),it means the final model does not meet the constant variance assumption for the residuals.

spreadLevelPlot(fit01)

Figure5.4 Spread-level plot for assessing constant error variance

## 
## Suggested power transformation:  2.967555

Discussion for Fig5.4: The result from function car::spreadLevelPlot() suggested power transformation 3, means the suggest modeling would be: \(y^3= \beta_0 + \beta_1*x_1 + \beta_2*x_2 + ...\). The transformations of the response variable can help in situations of heteroscedasticity, but for this practice it might be difficult to explain the meaning of \(IMDb\_rating^3\) in the real world.

0.5.3.4 Independancy check

As we knew each record of the movie in the dataset is the independent values. Now we check the independence of the residuals.

fit.res <- as_tibble(data.frame(index = seq(1:length(fit01$residuals)),
                      residuals = fit01$residuals,
                      title = dra_rfs$title))
fit.res <- fit.res  %>% 
  mutate(plotname = ifelse( abs(residuals) >1,as.character(title), ""))
ggplot(fit.res,aes(x=index,y=residuals)) + geom_point() + geom_text(aes(x=index,label=plotname),vjust=-0.8)+ theme_bw()

Figure5.5 Independany Conditions

Discussion for Fig5.5: The residual values concentrat around 0 except the two outliers Filly Brown and The Room. They were discussed by Fig5.2.

durbinWatsonTest(fit01) # detect the serially correlated errors in package car

##  lag Autocorrelation D-W Statistic p-value
##    1      0.02003874      1.891553   0.484
##  Alternative hypothesis: rho != 0

The p-value(0.444) is not significant, this means a lack of autocorrelation.

0.6 Part 6: Prediction

0.6.1 Overview

We will use the previous MLR model fit01 to create a prediction algorithym. The response variable imdb_rating will be computed by the below formula: \[IMDb\_rating\ = 3.556 + 0.007*critics\_score + 0.034*audience\_score + 0.004*runtime + 0.144*osca\_opt\]

0.6.2 New movie

Here the movie Hacksaw Ridge(2016) is picked to do a prediction for the model.

The movie Hacksaw Ridge, which adapted from a real story, was released on Nov 4, 2016. It was rated by R for intense, prolonged realistically graphic sequences of war violence including grisly bloody images. The genre is Action & Adventure, Drama. Summit Entertainment produced it, and the runtime is 139 minutes. IMDb scored it with 8.1. This movie’s director Mel Columcille Gerard Gibson won Oscar with Braveheart on 1995.

The observations about the movie were acquired from Rotten Tomatoes and IMDb websites.

movie.new <- data.frame(title="Hacksaw Ridge", title_type="", genre="Drama", runtime=139,
                        mpaa_rating="R",studio="Summit Entertainment", imdb_rating=8.1, 
                        imdb_num_votes=363769, critics_rating="Certified Fresh",
                        audience_rating="Upright",critics_score=86, audience_score=91,
                        osca_opt=1,thtr_quat=4)

0.6.3 Predict Result

predict(fit01, movie.new, interval = "prediction", level = 0.95)

##        fit      lwr      upr
## 1 7.936636 7.258669 8.614602

The actual audience score for this movie is 8.1 and the prediction interval contains this value.

0.7 Part 7: Conclusion

Generally, we could use the model based on one movie’s score on the Rotten Tomatoes site and the runtime, some information about the movie’s Oscar award to predict the score on the IMDb site, the movies’ type is limited to Drama and R-rating.

The modeling is based on the Drama and R-rating movies in the provided dataset, some limitations about the data source might cause the issues for the modeling. If there were some external information about the movie’s cost and gross, it might be more accurate to do the regression analysis of the popularity of the films.

0.8 Part 8: Bibliography(Main ones only, the textbooks not included)

Luis Torgo(2016). Data Mining with R: LEARNING WITH CASE STUDIES,2nd EDITION. Chapman & Hall/CRC.
Winston Change(2018). R Graphics Cookbook:Pratical Recipes for Visualizing Data,2nd Edition. O’REILLY
Robert I. Kabacoff(2016). R in Action,Data Analysis and Graphics with R 2nd Edition. Manning Publications Co.
MPAA film ratings
IMDb
Rotten Tomatoes