Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.5.3
library(leaps)
## Warning: package 'leaps' was built under R version 3.5.3
library(grid)
library(gridExtra)

Load data

load("movies.Rdata")

Part 1: Data

The dataset contains information about movies listed on Rotten Tomatoes and IMDB. It comprises 651 randomly sampled movies produced and released before 2016, with 32 variables. Because the data are observational, this project can identify associations but cannot establish causation. Because the movies were randomly sampled, the results can be generalized to movies produced and released before 2016.

Some of the variables are irrelevant to identifying the popularity of a movie, for example the link to the movie's IMDB page and the link to its Rotten Tomatoes page.


Part 2: Research question

After reviewing the data, I decided to identify the most significant predictor of a movie's audience score (i.e. the movie's popularity) and to understand its relationship with that score.


Part 3: Exploratory data analysis

Analyse the data

Checking the structure of the data using the code below

str(movies)
## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  32 variables:
##  $ title           : chr  "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ thtr_rel_year   : num  2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num  4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num  19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num  2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num  7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num  30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num  73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ director        : chr  "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr  "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr  "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr  "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr  "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr  "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr  "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr  "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...

We can conclude that the dataset contains 32 variables and 651 observations. Of the 32 variables, 9 are character variables, 12 are factors, 10 are numeric and one is an integer; 6 of the 10 numeric variables are date components. Four variables relate to rating: mpaa_rating, imdb_rating, critics_rating and audience_rating. critics_rating and audience_rating categorize how critics and audiences scored a movie, mpaa_rating describes the maturity of the movie's content, and imdb_rating reflects how IMDB users voted on the movie. Not all variables are relevant for identifying the popularity of a movie. The award variables are not considered here because award ceremonies take place well after a movie is released and are therefore assumed not to affect the audience score. The variables actor1 to actor5, the URL variables and studio are also not taken into consideration.
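The variable counts above can be tallied in code rather than by reading the str() output. The following is a minimal sketch on a small made-up data frame (the `toy` object and its values are hypothetical stand-ins for `movies`); it takes the first class of each column and tabulates the classes.

```r
# Toy data frame standing in for the movies data (hypothetical values).
toy <- data.frame(
  title   = c("A", "B", "C"),
  genre   = factor(c("Drama", "Comedy", "Drama")),
  runtime = c(80, 101, 84),
  votes   = c(899L, 12285L, 22381L),
  stringsAsFactors = FALSE
)

# Take the first class of each column, then count how often each class occurs.
col_types   <- vapply(toy, function(x) class(x)[1], character(1))
type_counts <- table(col_types)
print(type_counts)
```

Applied to `movies` instead of `toy`, the same two lines should reproduce the character, factor, numeric and integer counts quoted above.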

Getting Data for the model

Getting the data of the potential predictor for the model using the code below.

MD <- movies %>%
  # select the potential predictors and the response from the movies dataset
  select(title_type, genre, runtime,
         mpaa_rating, thtr_rel_year, thtr_rel_month,
         imdb_rating, imdb_num_votes, critics_score,
         critics_rating, audience_rating, audience_score) %>%
  # shorten the long release-date variable names
  rename(rel_month = thtr_rel_month, rel_year = thtr_rel_year)

Analyze the structure of Model Data

Checking the structure of the selected data from the movie data using the code below

str(MD)
## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  12 variables:
##  $ title_type     : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre          : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime        : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating    : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ rel_year       : num  2013 2001 1996 1993 2004 ...
##  $ rel_month      : num  4 3 8 10 9 1 1 11 9 3 ...
##  $ imdb_rating    : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_score  : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ audience_rating: Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score : num  73 81 91 76 27 86 76 47 89 66 ...

On analysing the structure of the Model Data dataset, we can conclude that it contains 12 of the initial 32 variables and 651 observations. Of the 12 variables, 5 are factors, 6 are numeric and one is an integer; 2 of the 6 numeric variables are date components. Four variables relate to rating: mpaa_rating, imdb_rating, critics_rating and audience_rating. critics_rating and audience_rating categorize how critics and audiences scored a movie, mpaa_rating describes the maturity of the movie's content, and imdb_rating reflects how IMDB users voted on the movie.

Removing missing data and checking dimensionality

Removing the observations with missing data from the Model Data dataset using the code below.

# Remove NAs 
CompleteCases_Index <-complete.cases(MD)
MD <- MD[CompleteCases_Index, ]
dim(MD)
## [1] 650  12

Initially there were 651 observations; after removing the incomplete ones, 650 remain, i.e. 651 - 650 = 1 observation was incomplete.
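Before dropping rows it can help to see how many observations are incomplete and which columns carry the missing values. A minimal sketch, assuming a small made-up data frame (`toy` and its values are hypothetical, not taken from the movies data):

```r
# Toy data with one missing runtime (hypothetical values).
toy <- data.frame(
  runtime = c(80, NA, 84, 139),
  score   = c(73, 81, 91, 76)
)

keep          <- complete.cases(toy)   # TRUE for rows without any NA
n_incomplete  <- sum(!keep)            # number of rows that will be dropped
na_per_column <- colSums(is.na(toy))   # where the missing values sit
toy_complete  <- toy[keep, ]

n_incomplete
na_per_column
dim(toy_complete)
```

Running `colSums(is.na(MD))` on the model data in the same way would show which variable is responsible for the single dropped observation.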

Summary of Model Data

summary(MD)
##         title_type                 genre        runtime       mpaa_rating 
##  Documentary : 54   Drama             :305   Min.   : 39.0   G      : 19  
##  Feature Film:591   Comedy            : 87   1st Qu.: 92.0   NC-17  :  2  
##  TV Movie    :  5   Action & Adventure: 65   Median :103.0   PG     :118  
##                     Mystery & Suspense: 59   Mean   :105.8   PG-13  :133  
##                     Documentary       : 51   3rd Qu.:115.8   R      :329  
##                     Horror            : 23   Max.   :267.0   Unrated: 49  
##                     (Other)           : 60                                
##     rel_year      rel_month       imdb_rating    imdb_num_votes  
##  Min.   :1970   Min.   : 1.000   Min.   :1.900   Min.   :   180  
##  1st Qu.:1990   1st Qu.: 4.000   1st Qu.:5.900   1st Qu.:  4584  
##  Median :2000   Median : 7.000   Median :6.600   Median : 15204  
##  Mean   :1998   Mean   : 6.735   Mean   :6.492   Mean   : 57620  
##  3rd Qu.:2007   3rd Qu.:10.000   3rd Qu.:7.300   3rd Qu.: 58484  
##  Max.   :2014   Max.   :12.000   Max.   :9.000   Max.   :893008  
##                                                                  
##  critics_score            critics_rating audience_rating audience_score 
##  Min.   :  1.00   Certified Fresh:135    Spilled:275     Min.   :11.00  
##  1st Qu.: 33.00   Fresh          :208    Upright:375     1st Qu.:46.00  
##  Median : 61.00   Rotten         :307                    Median :65.00  
##  Mean   : 57.65                                          Mean   :62.35  
##  3rd Qu.: 83.00                                          3rd Qu.:80.00  
##  Max.   :100.00                                          Max.   :97.00  
## 

Of the 650 complete observations, 591 are feature films, 54 are documentaries and 5 are TV movies. By genre, 305 are dramas, 87 are comedies, 65 are action & adventure, 59 are mystery & suspense, 51 are documentaries, 23 are horror and the remaining 60 fall into other categories. Runtime ranges from 39 to 267 minutes and appears to be right skewed. By MPAA rating, 19 movies are rated G, 2 NC-17, 118 PG, 133 PG-13, 329 R and 49 are unrated. Release years range from 1970 to 2014 and are slightly left skewed. The release months suggest that somewhat more movies are released in the second half of the year. The IMDB rating ranges from 1.9 to 9.0, the critics score from 1 to 100 and the audience score from 11 to 97; all three are left skewed. The number of IMDB votes ranges from 180 to 893008 and is right skewed. The critics rating has three levels, of which the negative one (Rotten, 307 movies) is the most common. In contrast, the audience rating has two levels, of which the positive one (Upright, 375 movies) is the majority.
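The skew directions quoted above can be read off the summary by comparing each mean with its median. The sketch below applies that rule of thumb to the means and medians printed by summary(MD); the comparison is only a heuristic (mean above median hints at right skew, mean below median at left skew), and the `stats` data frame is an illustrative construction, not part of the original analysis.

```r
# Means and medians copied from the summary(MD) output.
stats <- data.frame(
  variable = c("runtime", "rel_year", "imdb_rating",
               "imdb_num_votes", "critics_score", "audience_score"),
  mean     = c(105.8, 1998, 6.492, 57620, 57.65, 62.35),
  median   = c(103.0, 2000, 6.600, 15204, 61.00, 65.00),
  stringsAsFactors = FALSE
)

# Heuristic: mean > median suggests right skew, mean < median left skew.
stats$skew_hint <- ifelse(stats$mean > stats$median, "right",
                   ifelse(stats$mean < stats$median, "left", "none"))
print(stats)
```

The histograms remain the better check, since a mean-median comparison can mislead for multimodal or discrete variables.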

Analyze the above discussion graphically

Checking the skewness of the variables using histograms and bar plots with the code below.

plot(MD$title_type, xlab = "Movies Type", ylab = "no. of movies", las = 0,  main="a) No. of movies of specific type", col=rainbow(7))

plot(MD$genre, xlab = "Movies Genre", ylab = "no. of movies", las = 2, cex.axis=0.6, main="b) No. of movies of specific genre", col=rainbow(7), col.lab = "Black", col.axis="dark grey")

hist(MD$runtime, xlab = "Movie Runtime", prob=TRUE, main = "c) Runtime Evaluation")
lines(density(MD$runtime), col="blue", lwd=2)

plot(MD$mpaa_rating, xlab = "mpaa rating", ylab = "no. of movies", las = 0,  main="d) Classification of no. of movies based on mpaa rating", col=rainbow(7), cex.lab = 1, col.lab = "Black")

hist(MD$rel_year, xlab = "Movie release year", xlim = c(1970, 2014), breaks = 44, prob=TRUE, main = "e) No. of movies released per year distribution")
lines(density(MD$rel_year), col="blue", lwd=2)

hist(MD$rel_month, xlim = c(1, 12), breaks = 12, xlab = "Movie release month", prob=TRUE, main = "f) Movie Release Month distribution")
lines(density(MD$rel_month), col="blue", lwd=2)

hist(MD$imdb_rating, xlab = "imdb rating", breaks = 18, prob=TRUE, main = "g) Movie IMDB rating")
lines(density(MD$imdb_rating), col="blue", lwd=2)

hist(MD$imdb_num_votes, xlim = c(180, 893008), breaks = 500, xlab = "imdb no. of votes", prob=TRUE, main = "h) Movie imdb number of votes")
lines(density(MD$imdb_num_votes), col="blue", lwd=2)

hist(MD$critics_score, xlab = "Critics score", xlim = c(1, 100), breaks = 20, prob=TRUE, main = "i) Critics score of the movie")
lines(density(MD$critics_score), col="blue", lwd=2)

plot(MD$critics_rating, xlab = "Critics Rating", ylab = "no. of movies", las = 0,  main="j) No. of movies classified by critics rating", col=rainbow(7))

plot(MD$audience_rating, xlab = "Audience Rating", ylab = "no. of movies", las = 0,  main="k) No. of movies classified by audience Rating", col=rainbow(7))

hist(MD$audience_score, xlim = c(1, 100), breaks = 20, xlab = "Audience Score", prob=TRUE, main = "l) Movie Audience Score")
lines(density(MD$audience_score), col="blue", lwd=2)

The plots are consistent with the summary discussed above.

Understanding the relationship between the numerical variables and audience score

Graph the runtime predictor

Checking the relationship between runtime (explanatory variable) and audience score (response variable) by plotting using the code below.

ggplot(MD, aes(x=runtime, y=audience_score)) + geom_point() + stat_smooth(method=lm, level=0.99)

The graph shows a weak positive linear relationship between runtime and audience score. This relationship should be confirmed by the correlation matrix (done below).

Graph the rel_year predictor

Checking the relationship between rel_year (explanatory variable) and audience score (response variable) by plotting using the code below.

ggplot(MD, aes(x=rel_year, y=audience_score)) + geom_point() + stat_smooth(method=lm, level=0.99)

There seems to be no relationship between rel_year and audience score, as depicted in the graph. This should be confirmed by the correlation matrix (done below).

Graph the rel_month predictor

Checking the relationship between rel_month (explanatory variable) and audience score (response variable) by plotting using the code below.

ggplot(MD, aes(x=rel_month, y=audience_score)) + geom_point() + stat_smooth(method=lm, level=0.99)

There seems to be no relationship between rel_month and audience score, as depicted in the graph. This should be confirmed by the correlation matrix (done below).

Graph the imdb_rating predictor

Checking the relationship between imdb_rating (explanatory variable) and audience score (response variable) by plotting using the code below.

ggplot(MD, aes(x=imdb_rating, y=audience_score)) + geom_point() + stat_smooth(method=lm, level=0.99)

There seems to be a strong positive linear relationship between imdb_rating and audience score, as depicted in the graph. This relationship should be confirmed by the correlation matrix (done below).

Graph the imdb_num_votes predictor

Checking the relationship between imdb_num_votes (explanatory variable) and audience score (response variable) by plotting using the code below.

ggplot(MD, aes(x=imdb_num_votes, y=audience_score)) + geom_point() + stat_smooth(method=lm, level=0.99)

There seems to be a moderate positive linear relationship between imdb_num_votes and audience score, as depicted in the graph. This relationship should be confirmed by the correlation matrix (done below).

Graph the critics_score predictor

Checking the relationship between critics_score (explanatory variable) and audience score (response variable) by plotting using the code below.

ggplot(MD, aes(x=critics_score, y=audience_score)) + geom_point() + stat_smooth(method=lm, level=0.99)

There seems to be a strong positive linear relationship between critics_score and audience score, as depicted in the graph. This relationship should be confirmed by the correlation matrix (done below).

Creating a Correlation Matrix and Graphing it

The correlation matrix was created using the code below.

# Selecting the numerical data
MD[ , sapply(MD, is.numeric)]
## # A tibble: 650 x 7
##    runtime rel_year rel_month imdb_rating imdb_num_votes critics_score
##      <dbl>    <dbl>     <dbl>       <dbl>          <int>         <dbl>
##  1      80     2013         4         5.5            899            45
##  2     101     2001         3         7.3          12285            96
##  3      84     1996         8         7.6          22381            91
##  4     139     1993        10         7.2          35096            80
##  5      90     2004         9         5.1           2386            33
##  6      78     2009         1         7.8            333            91
##  7     142     1986         1         7.2           5016            57
##  8      93     1996        11         5.5           2272            17
##  9      88     2012         9         7.5            880            90
## 10     119     2012         3         6.6          12496            83
## # ... with 640 more rows, and 1 more variable: audience_score <dbl>
# applying the numerical data to get correlation
CorMatrix <- cor(MD[ ,sapply(MD,is.numeric)], use= "complete.obs")
corrplot(CorMatrix, method="shade", shade.col=NA, cl.pos="n", tl.col="black", tl.srt=30, addCoef.col="black")

The correlation matrix shows collinearity between the two explanatory variables imdb_rating and critics_score. Their correlation is strong (0.76), which means the two variables contribute largely redundant information to the model and complicate model estimation. Hence the explanatory variable critics_score will not be used. The very high correlation of 0.86 between imdb_rating and audience_score indicates that imdb_rating should be the first predictor added to the model.
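The two readings taken from the correlation plot (which predictor correlates most with the response, and which predictors are collinear with each other) can also be pulled out of the matrix numerically. This is a minimal sketch on the built-in mtcars data as a stand-in, with mpg playing the role of the response; the 0.75 threshold is an arbitrary assumption, not a value from the analysis above.

```r
# Stand-in numeric data; "mpg" plays the role of audience_score.
num_data   <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
response   <- "mpg"
cor_matrix <- cor(num_data)

# Correlation of each predictor with the response, strongest first.
predictors    <- setdiff(colnames(cor_matrix), response)
with_response <- cor_matrix[predictors, response]
with_response <- with_response[order(abs(with_response), decreasing = TRUE)]

# Predictor pairs whose absolute correlation exceeds the chosen threshold.
threshold <- 0.75
pred_cor  <- cor_matrix[predictors, predictors]
high      <- which(abs(pred_cor) > threshold & upper.tri(pred_cor),
                   arr.ind = TRUE)
collinear_pairs <- data.frame(
  var1 = rownames(pred_cor)[high[, "row"]],
  var2 = colnames(pred_cor)[high[, "col"]],
  r    = pred_cor[high],
  stringsAsFactors = FALSE
)

with_response
collinear_pairs
```

With the movies data, `num_data` would be `MD[, sapply(MD, is.numeric)]` and `response` would be "audience_score".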


Part 4: Modeling

Developing the Model

To create a multiple linear regression (MLR) model that predicts audience score (AS), forward stepwise regression was selected, adding one predictor at a time in an iterative process. The model is built with the lm() function and summarized with the summary() function to examine its adjusted R-squared. Candidate predictors are compared on both AIC and p-value using the add1() function. This approach was chosen because it evaluates both significance (as measured by F-values and t-values) and the proportion of variability explained (as measured by adjusted R-squared) before a predictor is added.
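The iterative procedure described above can be written as a loop. The following is a hedged sketch rather than the code used in this report: `forward_select` is a hypothetical helper, it picks candidates by p-value only (the report also looks at AIC), and it runs on the built-in mtcars data as a stand-in for the movies data.

```r
# Forward selection by p-value: add the most significant candidate until
# none has a p-value below alpha. `forward_select` is an illustrative helper.
forward_select <- function(fit, scope, alpha = 0.05) {
  repeat {
    # Stop once every candidate term is already in the model.
    remaining <- add.scope(fit, update.formula(fit, scope))
    if (length(remaining) == 0) break

    candidates <- add1(fit, scope = scope, test = "F")
    candidates <- candidates[rownames(candidates) != "<none>", , drop = FALSE]
    p_values   <- candidates[["Pr(>F)"]]

    best <- which.min(p_values)
    if (p_values[best] >= alpha) break

    # Refit with the selected term appended to the formula.
    new_term <- rownames(candidates)[best]
    fit <- update(fit, as.formula(paste(". ~ . +", new_term)))
  }
  fit
}

null_fit  <- lm(mpg ~ 1, data = mtcars)
final_fit <- forward_select(null_fit, scope = ~ wt + hp + qsec + drat)
formula(final_fit)
```

The manual steps in the rest of this section follow the same pattern, with the added benefit that each intermediate summary can be inspected.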

Create blank Model for audience score

Create a blank model for audience score (response variable) using the code below.

# Multiple Linear Regression Model for Audience Score
MLRMAS <- lm(audience_score~1, data=MD)

Summarize the existing model

Ascertain significance, and check that adjusted R-squared is increasing and the degrees of freedom are decreasing.

summary(MLRMAS)
## 
## Call:
## lm(formula = audience_score ~ 1, data = MD)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.348 -16.348   2.652  17.652  34.652 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  62.3477     0.7937   78.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.23 on 649 degrees of freedom

Only the intercept is in the model and there is no predictor, so the residual degrees of freedom are 649 (650 - 0 - 1).
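The residual degrees of freedom follow n - k - 1, where k counts the estimated slope coefficients, so a factor with several levels uses up one degree of freedom per non-reference level. A small sketch on the built-in mtcars data (a stand-in, not the movies data) shows the bookkeeping:

```r
n <- nrow(mtcars)

null_fit   <- lm(mpg ~ 1, data = mtcars)            # k = 0
factor_fit <- lm(mpg ~ factor(cyl), data = mtcars)  # 3 levels -> k = 2

# Residual degrees of freedom: n - k - 1.
df_null   <- df.residual(null_fit)
df_factor <- df.residual(factor_fit)
c(n = n, null = df_null, factor = df_factor)
```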

Selecting the first predictor

To select the first predictor we need to find the predictor with the lowest AIC and p-value. The table of these values can be displayed using the code below.

add1(MLRMAS, scope=MD, test="F")
## Warning in model.matrix.default(Terms, m, contrasts.arg =
## object$contrasts): the response appeared on the right-hand side and was
## dropped
## Warning in model.matrix.default(Terms, m, contrasts.arg =
## object$contrasts): problem with term 11 in model.matrix: no columns are
## assigned
## Single term additions
## 
## Model:
## audience_score ~ 1
##                 Df Sum of Sq    RSS    AIC   F value    Pr(>F)    
## <none>                       265727 3910.6                        
## genre           10     51633 214094 3790.2   15.4108 < 2.2e-16 ***
## runtime          1      8702 257025 3891.0   21.9389 3.431e-06 ***
## mpaa_rating      5     18017 247710 3875.0    9.3684 1.248e-08 ***
## rel_year         1       798 264930 3910.7    1.9516    0.1629    
## rel_month        1       273 265455 3911.9    0.6660    0.4147    
## imdb_rating      1    198782  66945 3016.5 1924.1205 < 2.2e-16 ***
## imdb_num_votes   1     22393 243335 3855.4   59.6321 4.353e-14 ***
## critics_score    1    131758 133970 3467.5  637.3000 < 2.2e-16 ***
## critics_rating   2    100258 165469 3606.7  196.0095 < 2.2e-16 ***
## audience_rating  1    198625  67102 3018.0 1918.1116 < 2.2e-16 ***
## audience_score   0         0 265727 3910.6                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As expected, the significant predictor with the lowest AIC is imdb_rating (3016.5). Significance is determined by the F-test: the F-value is very high (1924.12), so the p-value is well below 0.05.
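Passing the data frame itself as `scope` appears to be what triggers the two warnings in the output above: R coerces the data frame to a formula with the first column on the left-hand side, which would also explain why title_type does not show up among the candidates and why audience_score is listed with 0 degrees of freedom. A minimal sketch of an explicit one-sided formula scope, on the built-in mtcars data as a stand-in (`candidate_scope` is an illustrative name):

```r
# Intercept-only model on stand-in data; "mpg" is the response here.
null_fit <- lm(mpg ~ 1, data = mtcars)

# Build "~ cyl + disp + ..." from every column except the response.
candidate_scope <- reformulate(setdiff(names(mtcars), "mpg"))

# Every remaining column is now a candidate and the response is not.
step1 <- add1(null_fit, scope = candidate_scope, test = "F")
print(step1)
```

For the movies model the analogous scope would be `reformulate(setdiff(names(MD), "audience_score"))`; its output may differ from the tables shown here because title_type would then be included.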

Adding the first predictor to model

The selected predictor is added to the model using the code below.

MLRMAS <- lm(audience_score~imdb_rating, data=MD)

Summarizing the first iteration

Ascertain significance, and check that adjusted R-squared is increasing and the degrees of freedom are decreasing. The model can be summarized using the code below.

summary(MLRMAS)
## 
## Call:
## lm(formula = audience_score ~ imdb_rating, data = MD)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.805  -6.550   0.676   5.676  52.912 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -42.3748     2.4205  -17.51   <2e-16 ***
## imdb_rating  16.1321     0.3678   43.87   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.16 on 648 degrees of freedom
## Multiple R-squared:  0.7481, Adjusted R-squared:  0.7477 
## F-statistic:  1924 on 1 and 648 DF,  p-value: < 2.2e-16

After adding the imdb_rating predictor to the model, three things were examined: 1. the t-value, to confirm a significant p-value; 2. adjusted R-squared, to confirm an increase; 3. degrees of freedom, to confirm a decrease. All three were confirmed. Note the values of Multiple R-squared: 0.7481, Adjusted R-squared: 0.7477, DF: 648, p-value: < 2.2e-16.
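Adjusted R-squared penalizes R-squared for the number of predictors, which is why it is the quantity tracked at each step. The sketch below recomputes it from the reported Multiple R-squared using the standard formula 1 - (1 - R²)(n - 1)/(n - k - 1); `adjusted_r2` is an illustrative helper, and the inputs are the rounded values printed in the summary above.

```r
# Adjusted R-squared from R-squared, sample size n and predictor count k.
adjusted_r2 <- function(r2, n, k) {
  1 - (1 - r2) * (n - 1) / (n - k - 1)
}

# First iteration: R-squared 0.7481, 650 observations, 1 predictor.
adj_first <- adjusted_r2(0.7481, n = 650, k = 1)
round(adj_first, 4)  # compare with the Adjusted R-squared in the summary
```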

Selecting the second predictor

To select the second predictor we need to find the predictor with the lowest AIC and p-value. The table of these values can be displayed using the code below.

add1(MLRMAS, scope=MD, test="F")
## Warning in model.matrix.default(Terms, m, contrasts.arg =
## object$contrasts): the response appeared on the right-hand side and was
## dropped
## Warning in model.matrix.default(Terms, m, contrasts.arg =
## object$contrasts): problem with term 11 in model.matrix: no columns are
## assigned
## Single term additions
## 
## Model:
## audience_score ~ imdb_rating
##                 Df Sum of Sq   RSS    AIC  F value    Pr(>F)    
## <none>                       66945 3016.5                       
## genre           10      4589 62357 2990.4   4.6949 1.706e-06 ***
## runtime          1       746 66199 3011.2   7.2904 0.0071137 ** 
## mpaa_rating      5      1016 65929 3016.6   1.9816 0.0794436 .  
## rel_year         1       204 66742 3016.6   1.9753 0.1603687    
## rel_month        1       231 66714 3016.3   2.2414 0.1348464    
## imdb_num_votes   1         3 66943 3018.5   0.0260 0.8719136    
## critics_score    1      1167 65779 3007.1  11.4745 0.0007483 ***
## critics_rating   2      1976 64969 3001.1   9.8231 6.273e-05 ***
## audience_rating  1     35703 31242 2523.2 739.3758 < 2.2e-16 ***
## audience_score   0         0 66945 3016.5                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

audience_rating is the next significant predictor with the lowest AIC (2523.2). The F-value is very high (739.38), so the p-value is well below 0.05.

Adding the second predictor to model

The selected predictor is added to the model using the code below.

MLRMAS <- lm(audience_score~imdb_rating + audience_rating, data=MD)

Summarizing the second iteration

Ascertain significance, and check that adjusted R-squared is increasing and the degrees of freedom are decreasing. The model can be summarized using the code below.

summary(MLRMAS)
## 
## Call:
## lm(formula = audience_score ~ imdb_rating + audience_rating, 
##     data = MD)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.1512  -4.7629   0.6289   4.3517  24.3283 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -11.5316     2.0062  -5.748 1.39e-08 ***
## imdb_rating              9.5271     0.3496  27.251  < 2e-16 ***
## audience_ratingUpright  20.8584     0.7671  27.191  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.949 on 647 degrees of freedom
## Multiple R-squared:  0.8824, Adjusted R-squared:  0.8821 
## F-statistic:  2428 on 2 and 647 DF,  p-value: < 2.2e-16

After adding the audience_rating predictor to the model, three things were examined: 1. the t-values, to confirm significant p-values; 2. adjusted R-squared, to confirm an increase; 3. degrees of freedom, to confirm a decrease. All three were confirmed. Note the values of Multiple R-squared: 0.8824, Adjusted R-squared: 0.8821, DF: 647, p-value: < 2.2e-16.
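To make the coefficients of the two-predictor model concrete, the sketch below plugs them into the fitted equation by hand. The coefficients are copied from the summary above; the movie with an IMDB rating of 7.0 is a hypothetical example, and `predict_score` is an illustrative helper rather than part of the analysis.

```r
# Fitted equation from the second iteration:
# audience_score = -11.5316 + 9.5271 * imdb_rating + 20.8584 * Upright
predict_score <- function(imdb_rating, upright) {
  -11.5316 + 9.5271 * imdb_rating + 20.8584 * as.numeric(upright)
}

# Hypothetical movie with an IMDB rating of 7.0.
upright_pred <- predict_score(7.0, upright = TRUE)
spilled_pred <- predict_score(7.0, upright = FALSE)

c(upright = upright_pred, spilled = spilled_pred,
  difference = upright_pred - spilled_pred)
```

The difference between the two predictions equals the audience_ratingUpright coefficient: holding the IMDB rating fixed, an Upright movie is predicted to score about 20.9 points higher than a Spilled one.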

Selecting the third predictor

To select the third predictor we need to find the predictor with the lowest AIC and p-value. The table of these values can be displayed using the code below.

add1(MLRMAS, scope=MD, test="F")
## Warning in model.matrix.default(Terms, m, contrasts.arg =
## object$contrasts): the response appeared on the right-hand side and was
## dropped
## Warning in model.matrix.default(Terms, m, contrasts.arg =
## object$contrasts): problem with term 11 in model.matrix: no columns are
## assigned
## Single term additions
## 
## Model:
## audience_score ~ imdb_rating + audience_rating
##                Df Sum of Sq   RSS    AIC F value  Pr(>F)   
## <none>                      31242 2523.2                   
## genre          10   1265.01 29977 2516.3  2.6881 0.00313 **
## runtime         1    240.18 31002 2520.2  5.0047 0.02562 * 
## mpaa_rating     5    399.93 30842 2524.8  1.6650 0.14093   
## rel_year        1     67.08 31175 2523.8  1.3900 0.23884   
## rel_month       1    126.78 31116 2522.5  2.6321 0.10521   
## imdb_num_votes  1      8.67 31234 2525.0  0.1793 0.67210   
## critics_score   1    140.61 31102 2522.2  2.9205 0.08794 . 
## critics_rating  2    154.79 31088 2523.9  1.6057 0.20154   
## audience_score  0      0.00 31242 2523.2                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

genre is the next significant predictor with the lowest AIC (2516.3). Its F-value (2.6881 on 10 degrees of freedom) gives a p-value of 0.00313, which is less than 0.05.

Adding the third predictor to model

The selected predictor is added to the model using the code below.

MLRMAS <- lm(audience_score~imdb_rating + audience_rating + genre, data=MD)

Summarizing the third iteration

Ascertain significance, and check that adjusted R-squared is increasing and the degrees of freedom are decreasing. The model can be summarized using the code below.

summary(MLRMAS)
## 
## Call:
## lm(formula = audience_score ~ imdb_rating + audience_rating + 
##     genre, data = MD)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.6395  -4.4288   0.5889   4.2970  25.0845 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -12.5605     2.1955  -5.721 1.63e-08 ***
## imdb_rating                      9.8028     0.3689  26.571  < 2e-16 ***
## audience_ratingUpright          20.3180     0.7746  26.231  < 2e-16 ***
## genreAnimation                   3.6228     2.4513   1.478  0.13991    
## genreArt House & International  -2.7912     2.0320  -1.374  0.17005    
## genreComedy                      1.5109     1.1269   1.341  0.18050    
## genreDocumentary                 0.6003     1.3696   0.438  0.66130    
## genreDrama                      -0.8339     0.9589  -0.870  0.38481    
## genreHorror                     -1.6199     1.6693  -0.970  0.33222    
## genreMusical & Performing Arts   2.5416     2.1899   1.161  0.24625    
## genreMystery & Suspense         -3.2744     1.2462  -2.627  0.00881 ** 
## genreOther                       0.2743     1.9251   0.142  0.88675    
## genreScience Fiction & Fantasy   0.2559     2.4406   0.105  0.91652    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.86 on 637 degrees of freedom
## Multiple R-squared:  0.8872, Adjusted R-squared:  0.8851 
## F-statistic: 417.5 on 12 and 637 DF,  p-value: < 2.2e-16

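After adding the genre predictor to the model, three things were examined: 1. the t-values, to confirm significant p-values; 2. adjusted R-squared, to confirm an increase; 3. degrees of freedom, to confirm a decrease. imdb_rating and audience_rating remain significant; among the genre levels only Mystery & Suspense is individually significant, although genre as a whole is significant by the F-test (p = 0.00313). Adjusted R-squared increased and the degrees of freedom decreased. Note the values of Multiple R-squared: 0.8872, Adjusted R-squared: 0.8851, DF: 637, p-value: < 2.2e-16.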

Selecting the fourth predictor

To select the fourth predictor we need to find the predictor with the lowest AIC and p-value. The table of these values can be displayed using the code below.

add1(MLRMAS, scope=MD, test="F")
## Warning in model.matrix.default(Terms, m, contrasts.arg =
## object$contrasts): the response appeared on the right-hand side and was
## dropped
## Warning in model.matrix.default(Terms, m, contrasts.arg =
## object$contrasts): problem with term 11 in model.matrix: no columns are
## assigned
## Single term additions
## 
## Model:
## audience_score ~ imdb_rating + audience_rating + genre
##                Df Sum of Sq   RSS    AIC F value  Pr(>F)  
## <none>                      29977 2516.3                  
## runtime         1   107.425 29870 2516.0  2.2873 0.13093  
## mpaa_rating     5   114.492 29863 2523.8  0.4846 0.78787  
## rel_year        1   102.085 29875 2516.1  2.1732 0.14093  
## rel_month       1   169.321 29808 2514.6  3.6127 0.05779 .
## imdb_num_votes  1    21.772 29956 2517.8  0.4622 0.49682  
## critics_score   1   121.255 29856 2515.7  2.5830 0.10851  
## critics_rating  2   154.302 29823 2516.9  1.6427 0.19427  
## audience_score  0     0.000 29977 2516.3                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

rel_month has the lowest AIC (2514.6), but its p-value (0.058) is above 0.05. No remaining predictor is significant at the 0.05 level, so no further predictor is added to the model.
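The stopping decision uses two criteria that need not agree, which is worth seeing side by side. The sketch below re-enters the AIC and p-values from the add1() table above into a data frame (an illustrative construction) and checks each criterion separately.

```r
# AIC and p-values copied from the add1() table for the fourth step.
candidates <- data.frame(
  term    = c("runtime", "mpaa_rating", "rel_year", "rel_month",
              "imdb_num_votes", "critics_score", "critics_rating"),
  aic     = c(2516.0, 2523.8, 2516.1, 2514.6, 2517.8, 2515.7, 2516.9),
  p_value = c(0.13093, 0.78787, 0.14093, 0.05779, 0.49682, 0.10851, 0.19427),
  stringsAsFactors = FALSE
)
current_aic <- 2516.3  # AIC of the model before adding anything

# Terms that would lower AIC versus terms that pass the 0.05 cut-off.
lowers_aic  <- candidates$term[candidates$aic < current_aic]
significant <- candidates$term[candidates$p_value < 0.05]

lowers_aic
significant
```

Several terms would lower the AIC slightly, but none reaches significance, so a selection rule that requires both stops here while a purely AIC-based rule would continue.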

Analyze the final model

Analyzing the final model's regression output, ANOVA output and formula using the code below.

summary(MLRMAS)
## 
## Call:
## lm(formula = audience_score ~ imdb_rating + audience_rating + 
##     genre, data = MD)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.6395  -4.4288   0.5889   4.2970  25.0845 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -12.5605     2.1955  -5.721 1.63e-08 ***
## imdb_rating                      9.8028     0.3689  26.571  < 2e-16 ***
## audience_ratingUpright          20.3180     0.7746  26.231  < 2e-16 ***
## genreAnimation                   3.6228     2.4513   1.478  0.13991    
## genreArt House & International  -2.7912     2.0320  -1.374  0.17005    
## genreComedy                      1.5109     1.1269   1.341  0.18050    
## genreDocumentary                 0.6003     1.3696   0.438  0.66130    
## genreDrama                      -0.8339     0.9589  -0.870  0.38481    
## genreHorror                     -1.6199     1.6693  -0.970  0.33222    
## genreMusical & Performing Arts   2.5416     2.1899   1.161  0.24625    
## genreMystery & Suspense         -3.2744     1.2462  -2.627  0.00881 ** 
## genreOther                       0.2743     1.9251   0.142  0.88675    
## genreScience Fiction & Fantasy   0.2559     2.4406   0.105  0.91652    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.86 on 637 degrees of freedom
## Multiple R-squared:  0.8872, Adjusted R-squared:  0.8851 
## F-statistic: 417.5 on 12 and 637 DF,  p-value: < 2.2e-16
anova(MLRMAS)
## Analysis of Variance Table
## 
## Response: audience_score
##                  Df Sum Sq Mean Sq   F value  Pr(>F)    
## imdb_rating       1 198782  198782 4224.0003 < 2e-16 ***
## audience_rating   1  35703   35703  758.6665 < 2e-16 ***
## genre            10   1265     127    2.6881 0.00313 ** 
## Residuals       637  29977      47                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
formula(MLRMAS)
## audience_score ~ imdb_rating + audience_rating + genre

The final model is a parsimonious model: the simplest model with the highest predictive power. Only three predictors are used: imdb_rating, audience_rating and genre. The ANOVA output confirms the significance of each predictor (i.e., p-values < 0.05). The linear regression output confirms the significance of imdb_rating and audience_rating as well, although among the genre levels only Mystery & Suspense differs significantly from the reference genre. It also confirms the significance of the model as a whole (i.e., F-statistic 417.5 on 12 and 637 DF, p-value < 2.2e-16). Finally, the model explains 88.72% of the variability in the response variable (R-squared), or 88.51% after adjusting for the number of predictors (adjusted R-squared).
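The link between the R-squared, adjusted R-squared and F-statistic values reported above can be checked by hand. The sketch below assumes n = 650 observations, inferred from the 637 residual degrees of freedom plus 12 model degrees of freedom plus the intercept; the variable names are illustrative only.

```r
# Values taken from the summary(MLRMAS) output.
r_squared <- 0.8872
k <- 12               # model degrees of freedom (number of estimated slopes)
resid_df <- 637       # residual degrees of freedom
n <- resid_df + k + 1 # assumed number of observations used in the fit

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / resid_df

# Overall F-statistic expressed through R-squared.
f_statistic <- (r_squared / k) / ((1 - r_squared) / resid_df)

print(round(adj_r_squared, 4))
print(round(f_statistic, 1))
```

Both results should land close to the reported 0.8851 and 417.5, with small differences due to rounding of the printed R-squared.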

Variables that were excluded from the model are listed below.

runtime - weak linear relationship and not significant
rel_year - no linear relationship and not significant
rel_month - no linear relationship and not significant
mpaa_rating - not significant
imdb_num_votes - not significant
critics_rating - not significant
critics_score - collinearity and not significant

Interpreting the coefficients

To list the coefficients of the model, use the code below.

coefficients(MLRMAS)
##                    (Intercept)                    imdb_rating 
##                    -12.5605354                      9.8028449 
##         audience_ratingUpright                 genreAnimation 
##                     20.3180279                      3.6228430 
## genreArt House & International                    genreComedy 
##                     -2.7911586                      1.5108717 
##               genreDocumentary                     genreDrama 
##                      0.6003104                     -0.8339436 
##                    genreHorror genreMusical & Performing Arts 
##                     -1.6198585                      2.5415754 
##        genreMystery & Suspense                     genreOther 
##                     -3.2743845                      0.2742760 
## genreScience Fiction & Fantasy 
##                      0.2559299

The interpretation of a multiple regression coefficient is the expected change in the response per unit change in a predictor, holding all of the other predictors constant.

Specific interpretations follow.

Intercept coefficient: the estimated audience score is -12.56053 when imdb_rating is 0 and the categorical predictors are at their reference levels (audience rating “Spilled”, genre “Action & Adventure”). IMDB ratings start at 1, so this combination cannot occur; the intercept only sets the height of the regression line and has no meaningful interpretation on its own.

imdb_rating coefficient: the estimated increase in audience score is 9.8028 points when imdb_rating goes up by 1, holding all other predictors constant.

audience_ratingUpright coefficient: the estimated audience score is 20.318 points higher when the audience rating is “Upright” rather than “Spilled”, holding all other predictors constant.

genreDrama coefficient: the estimated audience score is 0.83394 points lower for a drama than for a movie in the reference genre (Action & Adventure), holding all other predictors constant. The other genre coefficients are read the same way, so the audience score can increase or decrease depending on the genre.
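How R encodes a categorical predictor against a reference level can be seen on a small made-up example. The data frame `toy` and its values are hypothetical and chosen only so the arithmetic is easy to follow; they are not taken from the movies data.

```r
# Hypothetical toy data: three "Spilled" and three "Upright" movies.
toy <- data.frame(
  score  = c(40, 50, 60, 70, 80, 90),
  rating = factor(c("Spilled", "Spilled", "Spilled",
                    "Upright", "Upright", "Upright"))
)

# The first factor level ("Spilled") is the reference level.
print(levels(toy$rating))

toy_fit <- lm(score ~ rating, data = toy)

# Intercept = mean score of the reference group (50).
# ratingUpright = mean of "Upright" minus mean of "Spilled" (80 - 50 = 30).
print(coef(toy_fit))
```

The coefficient of a factor level is therefore a difference from the reference level, not a score in its own right, which is how audience_ratingUpright and the genre coefficients are read above.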

Model Diagnostics i.e. Checking the conditions graphically

Check for linearity

Checking for a linear relationship between the numerical predictor and the residuals using the code written below.

plot(MLRMAS$residuals ~ MD$imdb_rating, main="Linearity Condition")

Condition met - the plot shows a completely random scatter around zero, with no discernible pattern.

Check for normality

Checking for nearly normal residuals using the code written below.

qqnorm(MLRMAS$residuals, main="Normality Condition")
qqline(MLRMAS$residuals)

Condition met - the majority of the points lie on the line, but because of skewness a few points do not. Also note that there are no apparent outliers.

hist(MLRMAS$residuals, prob=TRUE, main="Normality Condition")
lines(density(MLRMAS$residuals), col="blue", lwd=2)

Condition met - the histogram confirms the skewness (right skewness), but the distribution still appears to be nearly normal.

Check for variability

Checking for constant variability of the residuals using the code written below.

plot(MLRMAS$residuals ~ MLRMAS$fitted.values, main="Variability conditions")

Condition met - the plot of residuals against fitted values shows that the residuals are equally variable for low and high fitted values, and there is no visible fan pattern.

plot(abs(MLRMAS$residuals) ~ MLRMAS$fitted.values, main="Variability conditions")

Condition met - the plot of the absolute values of the residuals does not show any unusual observations.

Check for independence

Checking for independence of the residuals using the code written below.

plot(MLRMAS$residuals, main="Independence Condition")

Condition met - the plot shows the residuals randomly scattered around zero.
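The normality, variability and independence checks above can be gathered into one helper so that they are drawn in a single panel for any fitted linear model. This is a sketch only: the function name `diagnose_lm` is hypothetical, and the linearity check is left out because it needs the specific numerical predictor of each model.

```r
# Hypothetical helper: draws the residual diagnostics for any lm fit.
diagnose_lm <- function(fit) {
  res <- residuals(fit)
  fitted_vals <- fitted(fit)

  # Arrange the four plots in a 2 x 2 grid and restore the layout on exit.
  old_par <- par(mfrow = c(2, 2))
  on.exit(par(old_par))

  # Normality: Q-Q plot and histogram with a density overlay.
  qqnorm(res, main = "Normality Condition")
  qqline(res)
  hist(res, prob = TRUE, main = "Normality Condition")
  lines(density(res), col = "blue", lwd = 2)

  # Constant variability: residuals against fitted values.
  plot(fitted_vals, res, main = "Variability Condition")
  abline(h = 0, lty = 2)

  # Independence: residuals in the order of data collection.
  plot(res, main = "Independence Condition")
  abline(h = 0, lty = 2)

  invisible(res)
}
```

Calling `diagnose_lm(MLRMAS)` would reproduce the plots of this section in one figure; any other `lm` object, for example `lm(mpg ~ wt + hp, data = mtcars)`, can be passed in the same way.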


Part 5: Prediction

Building a test data case

Building a test data case for the movie “Deadpool (2016)” using data gathered from the IMDB and Rotten Tomatoes websites, and storing it in the variable named TDC (test data case) using the following code.

audience_score <- 90
imdb_rating <- 8.1
audience_rating <- "Upright"
genre <- "Comedy"

TDC <- data.frame (audience_score, imdb_rating, audience_rating, genre)

As said above, the data come from the IMDB and Rotten Tomatoes websites. Once the movie was selected, it was looked up on both websites and the required values were extracted for use here.

Predicting the audience score

Predicting the audience score for the TDC (test data case) using the following code.

myPrediction <- round(predict(MLRMAS, TDC), digits = 0)
c(myPrediction, TDC$audience_score)
##  1    
## 89 90

The model predicts an audience score of 89 against an observed score of 90. The prediction is quite sensitive to the imdb_rating variable: each additional rating point raises the predicted score by about 9.8 points, so a small difference in imdb_rating shifts the prediction noticeably. The model seems to predict the audience score more accurately when both the audience score and the imdb_rating are relatively high.
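The prediction can be reproduced by hand from the coefficients listed earlier, which makes the contribution of each predictor visible. The sketch below copies the four coefficients that apply to the test case (intercept, imdb_rating, audience_ratingUpright and genreComedy); the vector and variable names are illustrative.

```r
# Coefficients copied from the coefficients(MLRMAS) output.
coefs <- c(
  intercept   = -12.5605354,
  imdb_rating =   9.8028449,
  upright     =  20.3180279,  # audience_ratingUpright
  comedy      =   1.5108717   # genreComedy
)

# Test case: imdb_rating 8.1, audience rating "Upright", genre "Comedy".
manual_fit <- coefs[["intercept"]] +
  coefs[["imdb_rating"]] * 8.1 +
  coefs[["upright"]] +
  coefs[["comedy"]]

print(manual_fit)
print(round(manual_fit, digits = 0))
```

The imdb_rating term alone contributes about 79 points, which is why the prediction reacts so strongly to that variable.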

Estimate and Interpret the prediction confidence interval

ConfidenceInterval <- predict(MLRMAS, TDC, interval="confidence")
ConfidenceInterval
##        fit      lwr      upr
## 1 88.67141 86.72074 90.62207

We are 95% confident that the mean audience score of movies with the same predictor values as ‘Deadpool’ lies between 86.72074 and 90.62207.
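The margin of error of this interval is half its width. A confidence interval describes the mean score of comparable movies, while a prediction interval for a single movie is wider. The sketch below first uses the bounds reported above and then illustrates the difference between the two interval types on R's built-in mtcars data; the mtcars model and the names `car_model` and `new_car` are assumptions made for the example.

```r
# Bounds copied from the ConfidenceInterval output.
lwr <- 86.72074
upr <- 90.62207
margin_of_error <- (upr - lwr) / 2
print(margin_of_error)

# Hypothetical illustration on mtcars: confidence vs. prediction interval.
car_model <- lm(mpg ~ wt + hp, data = mtcars)
new_car <- data.frame(wt = 3, hp = 120)

conf_int <- predict(car_model, new_car, interval = "confidence")
pred_int <- predict(car_model, new_car, interval = "prediction")

# Both intervals share the same fitted value; the prediction interval is
# wider because it also accounts for the scatter of individual observations.
conf_width <- conf_int[1, "upr"] - conf_int[1, "lwr"]
pred_width <- pred_int[1, "upr"] - pred_int[1, "lwr"]
print(c(confidence = conf_width, prediction = pred_width))
```

For the test case, `predict(MLRMAS, TDC, interval = "prediction")` would give the range expected for the score of the individual movie rather than for the mean.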


Part 6: Conclusion

Exploratory data analysis was of great help in deciding which variables to include in the model. The modeling methodology of evaluating both the significance and the variability of each predictor before adding it to the model produced a model that answered the research question. For the test case the model predicted an audience score of 89 against an observed score of 90, and the margin of error of the 95% confidence interval is +/- (90.62207 - 86.72074) / 2 = 1.950665.

State Concerns

The sample is not representative: the data is biased toward drama movies. Consequently the model was fitted primarily to drama movies, so it would have been better suited to predicting the audience score of a drama movie.