SUMMARY

This project assessment analyzes a movies dataset comprising 651 randomly sampled movies produced and released before 2016. We have been hired as Data Scientists at Paramount Pictures, and we need to study the data in order to answer some research questions based on the public’s responses on two of the largest movie-rating sites on the web: Rotten Tomatoes and IMDB (Internet Movie Database).

ROTTEN TOMATOES

Rotten Tomatoes is a top-1000 website, ranking around #400 globally and in the top 150 for the US. Its staff first collect online reviews from writers who are certified members of various writing guilds or film critic associations.

Critics’ score: If the positive reviews make up 60% or more, the film is considered “Fresh”, meaning a supermajority of the reviewers approve of the film. If the positive reviews fall below 60%, the film is considered “Rotten”. An average score on a 0 to 10 scale is also calculated. “Certified Fresh” is a special distinction awarded to the best-reviewed movies and TV shows.

Audience score: The audience rating, denoted by a popcorn bucket, is the percentage of all users who have rated the movie or TV show positively. The full popcorn bucket means the movie received 3.5 stars or higher from users and is rated “Upright”; the tipped-over popcorn bucket means the movie received less than 3.5 stars and is rated “Spilled”.
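As a rough illustration, the two labeling rules above can be sketched in R. This is a simplification: the real designations involve additional criteria (for instance, Certified Fresh has minimum review-count requirements).

#simplified sketch of the two thresholds described above
critic_label <- function(pct_positive) {
        ifelse(pct_positive >= 60, "Fresh", "Rotten")
}
audience_label <- function(avg_stars) {
        ifelse(avg_stars >= 3.5, "Upright", "Spilled")
}
critic_label(c(59, 60, 95))   # "Rotten" "Fresh" "Fresh"
audience_label(c(3.4, 4.2))   # "Spilled" "Upright"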

IMDB

IMDb is the world’s most popular and authoritative source for movie, TV and celebrity content. It offers a searchable database of more than 250 million data items, including more than 4 million movies, TV and entertainment programs and 8 million cast and crew members. IMDb launched online in 1990 and has been a subsidiary of Amazon.com since 1998.

Sources:

https://www.rottentomatoes.com/about/

https://www.imdb.com/

https://en.wikipedia.org/wiki/Rotten_Tomatoes

Load packages

library(dplyr)     #Useful package for data manipulation
library(knitr)     #Useful for presenting graphs and tables
library(ggplot2)   #Recognized package for creating graphs
library(reshape2)  #Useful package for data manipulation
library(psych)     #Useful package to do correlations
library(MASS)      #A statistical package with useful functions 
library(effects)   #A package for visualizing model effects (especially regression models)
library(modelr)    #Some functions for regression models

Load data

load("movies.Rdata")
head(movies, 5)
## # A tibble: 5 x 32
##   title     title_type  genre runtime mpaa_rating studio     thtr_rel_year
##   <chr>     <fct>       <fct>   <dbl> <fct>       <fct>              <dbl>
## 1 Filly Br~ Feature Fi~ Drama      80 R           Indomina ~          2013
## 2 The Dish  Feature Fi~ Drama     101 PG-13       Warner Br~          2001
## 3 Waiting ~ Feature Fi~ Come~      84 R           Sony Pict~          1996
## 4 The Age ~ Feature Fi~ Drama     139 PG          Columbia ~          1993
## 5 Malevole~ Feature Fi~ Horr~      90 R           Anchor Ba~          2004
## # ... with 25 more variables: thtr_rel_month <dbl>, thtr_rel_day <dbl>,
## #   dvd_rel_year <dbl>, dvd_rel_month <dbl>, dvd_rel_day <dbl>,
## #   imdb_rating <dbl>, imdb_num_votes <int>, critics_rating <fct>,
## #   critics_score <dbl>, audience_rating <fct>, audience_score <dbl>,
## #   best_pic_nom <fct>, best_pic_win <fct>, best_actor_win <fct>,
## #   best_actress_win <fct>, best_dir_win <fct>, top200_box <fct>,
## #   director <chr>, actor1 <chr>, actor2 <chr>, actor3 <chr>,
## #   actor4 <chr>, actor5 <chr>, imdb_url <chr>, rt_url <chr>
nrow(movies) #number of sampled movies
## [1] 651
str(movies)
## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  32 variables:
##  $ title           : chr  "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ thtr_rel_year   : num  2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num  4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num  19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num  2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num  7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num  30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num  73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ director        : chr  "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr  "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr  "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr  "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr  "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr  "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr  "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr  "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...

Part 1: Data

As indicated previously, we have identified numerical and categorical variables in our dataset. The numerical variables fall into two groups: dates (year, month and day) and scoring variables (such as critics score, audience score and vote counts). We have also identified variables that contain more detailed information about each sampled movie, such as the title, genre, director, MPAA rating and the names of the actors.

Among the most relevant variables are those that identify which rating site each measure comes from:

Rotten Tomatoes: audience_rating (Upright or Spilled); critics_rating (Certified Fresh, Fresh or Rotten); and audience_score/critics_score.

IMDB: imdb_num_votes (the number of the public’s votes) and imdb_rating (the only weighted measure available in our dataset).


Part 2: Research questions

These are the questions that we are going to answer using the movies dataset:

“Learning something new about new movies”

Our research questions focus on drawing conclusions about the population of movies from the sampled dataset. Since the dataset is a random sample of rated movies, generalizability applies: the attributes and predictive relationships we find through inference and significance tests can be transferred to the population at large. It is worth noting, however, that the larger the sample, the more confidently we can generalize the results.

One of our main research goals is “prediction”, but this does not mean we are attempting to establish direct causation. Since the data are observational, we can only state whether associations appear to exist among our selected variables; domain expertise would be needed to establish causal relationships.


Part 3: Exploratory data analysis (EDA)

Firstly, it is important to get to know the data, so our next step is to conduct an exploratory data analysis (EDA) and begin answering our research questions.

Collecting relevant variables: we proceed to store these variables in a new data frame.

detach("package:dplyr", character.only = TRUE)
library(dplyr)
detach("package:ggplot2", character.only = TRUE)
library(ggplot2)
#Collecting 
movies.data <- movies %>%
        select(thtr_rel_year, title, genre, mpaa_rating, 
               best_pic_nom,best_pic_win,
               director,actor1, actor2, actor3, actor4, actor5,
               imdb_num_votes, imdb_rating,  
               critics_rating, critics_score,
               audience_rating, audience_score)

We know that we have a representative group of movies. But overall, how are they rated and voted on?

movies %>%
        select(imdb_num_votes, critics_score,audience_score) %>%
        summary()
##  imdb_num_votes   critics_score    audience_score 
##  Min.   :   180   Min.   :  1.00   Min.   :11.00  
##  1st Qu.:  4546   1st Qu.: 33.00   1st Qu.:46.00  
##  Median : 15116   Median : 61.00   Median :65.00  
##  Mean   : 57533   Mean   : 57.69   Mean   :62.36  
##  3rd Qu.: 58301   3rd Qu.: 83.00   3rd Qu.:80.00  
##  Max.   :893008   Max.   :100.00   Max.   :97.00

We can get a clear idea of the critics’ and audience responses: both score distributions appear nearly normal, with a somewhat higher average for the audience score (62.36 versus 57.69 for critics). It is important to distinguish the two kinds of rating: the audience rating is casual, given mostly for entertainment and passing the time, with no technical criteria involved; the critics’ rating, on the other hand, comes from professional reviewers, so technical judgment is added.
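We can check the claimed shapes directly with quick histograms, a sketch using the dplyr, reshape2 and ggplot2 packages already loaded:

scores <- movies %>%
        select(critics_score, audience_score) %>%
        melt(measure.vars = c("critics_score", "audience_score"))   #long format for faceting
ggplot(scores, aes(x = value)) +
        geom_histogram(binwidth = 5, fill = "cornflowerblue") +
        facet_wrap(~ variable) +
        xlab("Score (0 - 100)") + ylab("Count of movies")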

This does not seem to be the case for the IMDB votes: the vote counts are highly right-skewed, with a median of 15,116 votes against a mean of 57,533. But why this difference? Is it related to the public’s preferences?

The IMDB vote counts suggest a general preference for Drama. Moreover, there appears to be a lack of interest in Animation and Science Fiction & Fantasy movies.
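A quick tally of votes per genre supports this reading (a sketch; the exact figures depend on the sample):

movies %>%
        group_by(genre) %>%
        summarize(total_votes  = sum(imdb_num_votes),
                  median_votes = median(imdb_num_votes)) %>%
        arrange(desc(total_votes))   #most-voted genres first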

#geom_bar() counts movies per MPAA rating; the unused fill aesthetic is dropped
ggplot(data = movies.data, aes(x = mpaa_rating)) +
        geom_bar(fill = "cornflowerblue") + coord_flip() +
        theme(plot.title = element_text(hjust = 0.5, size = 15, face = "italic")) +
        labs(title = "Movie Preferences on IMDB (1970-2014)") +
        xlab("MPAA Rating") + ylab("Number of Movies")


The highest counts belong to movies rated “R” (more than 300 of them), followed by “PG-13” and “PG” movies.
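The exact counts behind the bar chart can be verified with a quick tally:

movies %>% count(mpaa_rating, sort = TRUE)   #number of sampled movies per MPAA rating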

studio <- movies %>%
        select(studio, imdb_num_votes ) %>%
        group_by(studio) %>%
        summarize(votes = n()) %>%
        arrange(desc(votes)) %>%
        head(10)

kable(studio, format = 'markdown')
studio votes
Paramount Pictures 37
Warner Bros. Pictures 30
Sony Pictures Home Entertainment 27
Universal Pictures 23
Warner Home Video 19
20th Century Fox 18
Miramax Films 18
MGM 16
Twentieth Century Fox Home Entertainment 14
IFC Films 13

Now that we know there are preferences within our sampled dataset, we are interested in the best- and worst-received movies. For this, we rank by audience and critics score (for Rotten Tomatoes) and by IMDB rating.

movies.ranking.best <- movies.data %>%
        filter(critics_rating=="Certified Fresh" & critics_score >= 90) %>%
        filter(audience_score >= 90) %>% 
        filter(imdb_rating >= 8.5 ) %>% 
        select(title, critics_score, audience_score, imdb_rating) %>%
        arrange(desc(critics_score, audience_score, imdb_rating))

movies.ranking.worst <- movies.data %>%
        filter(critics_score <= 25) %>%
        filter(audience_score <= 25) %>% 
        filter(imdb_rating <= 3 ) %>% 
        select(title, critics_score, audience_score, imdb_rating) %>%
        arrange(desc(critics_score, audience_score, imdb_rating))

kable(movies.ranking.best, format = 'markdown')
| title                  | critics_score | audience_score | imdb_rating |
|:-----------------------|--------------:|---------------:|------------:|
| The Godfather, Part II |            97 |             97 |         9.0 |
| Promises               |            96 |             96 |         8.5 |
| Memento                |            92 |             94 |         8.5 |
kable(movies.ranking.worst, format = 'markdown')
| title             | critics_score | audience_score | imdb_rating |
|:------------------|--------------:|---------------:|------------:|
| Viva Knievel!     |            17 |             17 |         2.7 |
| Doogal            |             8 |             18 |         2.8 |
| Battlefield Earth |             3 |             11 |         2.4 |
| Disaster Movie    |             1 |             19 |         1.9 |

This clearly shows that the two sites, Rotten Tomatoes and IMDB, do use different rating systems. For instance, although “Disaster Movie” is the worst rated of them all, the Rotten Tomatoes audience scored it 19/100, noticeably higher than the other systems did.

Moreover, as mentioned previously, the IMDB rating does not state which type of public it comes from: is it just the audience, or critics as well? Maybe both? We can start by comparing the systems to see whether there is a significant difference across the groups, or whether any difference is simply due to random chance.

years <- movies.data %>%
        select(year = thtr_rel_year, critics = critics_score, audience = audience_score, imdb = imdb_rating ) 

#rescale critics and audience scores from 0-100 to the 0-10 IMDB scale
years$critics <- years$critics * 0.1
years$audience <- years$audience * 0.1

years.m <- melt(years, id=c("year")) #melt to long format: a 'variable' factor (critics, audience, imdb) and a 'value' column
 
years.s <- years.m %>%
        group_by(year, variable) %>%
        summarize(score = mean(value))

qplot(year, score, data = years.s, 
               geom = "line", color = variable,
               xlab = "Year", ylab = "Rating (0 - 10)", 
               main = "Public's Movie Responses Across the Years (1970-2014)"
) + theme(plot.title = element_text(hjust=0.5)) 

The year-to-year trend (ups and downs) appears similar for each type of public response: critics, audience (both Rotten Tomatoes) and the IMDB system. However, differences in rating levels across the categories are evident: critics’ opinions sit well apart from those of the audience and IMDB. Interestingly, the IMDB and Rotten Tomatoes audience systems appear closely related in both level and trend.

Now, we proceed to answer our inference research question:

Does there appear to be a relationship between the public’s ratings in the Rotten Tomatoes and IMDB systems?

ggplot(years.m, aes(x = variable, y = value)) + 
        geom_boxplot(fill = "darkseagreen1") + xlab("Rating System") + ylab("Rating (0 - 10)") + 
        theme(plot.title = element_text(hjust = 0.5, size = 15, face = "italic")) +
        labs(title = "Public's Movie Responses (Rotten Tomatoes / IMDB)")


In general, there appears to be a slight difference across the systems. The medians of the Rotten Tomatoes audience score and the IMDB rating are both close to 6.5 on the 0-10 scale. This is not the case for the Rotten Tomatoes critics score, which sits noticeably lower (roughly half a point to three-quarters of a point below the other two, judging by the group means of 5.77 vs 6.24 and 6.49).

The IMDB ratings look the most consistent, though we can see several outliers below the lower whisker (the 25th percentile sits around a 5.9 rating). The critics scores show by far the widest variation, with the middle 50% of ratings roughly spanning 3.3 to 8.3.

ANOVA TEST FOR ASSESSING DIFFERENCES ACROSS THE GROUPS. To answer our inference question, we run an ANOVA test to establish whether the differences in means across the systems are meaningful or simply due to random chance:

H0 (Null Hypothesis): the three systems have equal mean ratings; any observed differences are due to sampling variability (there is nothing going on)

HA (Alternative Hypothesis): at least one system’s mean rating differs from the others

years.m %>%
        group_by(variable) %>%
        summarize(score = mean(value))
## # A tibble: 3 x 2
##   variable score
##   <fct>    <dbl>
## 1 critics   5.77
## 2 audience  6.24
## 3 imdb      6.49
fit <- aov(value ~ variable, data = years.m)  
summary(fit)
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## variable       2    176   87.78   19.75 3.22e-09 ***
## Residuals   1950   8667    4.44                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on our results, we have strong evidence against the null hypothesis (the p-value of 3.22e-09 is far below 0.05). We therefore reject H0 in favor of the alternative hypothesis HA and conclude that there is a significant difference in mean rating across the groups.
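The ANOVA only tells us that at least one group mean differs. A post-hoc test such as Tukey’s HSD (built into base R) could show which pairs of systems actually differ, a quick sketch:

TukeyHSD(fit)   #pairwise mean differences with 95% family-wise confidence intervals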

From our first EDA and inferential assessment we can conclude that the movies dataset behaves like a randomized sample, showing the variety we would expect in variables such as genre and studio. The Rotten Tomatoes critics/audience scores and the IMDB rating clearly come from different qualification systems, and that difference is evident, so together they should be useful inputs when constructing a regression model for rating prediction.


Part 4: Modeling

Variable selection and NA’s

Firstly, we are going to create a new data frame containing the rating variables and some categorical variables that are plausibly related to the rating response. Assessing NA’s is also a relevant task, since missing values could make our model run into errors.

movies.model <- movies %>%
        select (critics_score, audience_score,imdb_rating, runtime, imdb_num_votes,
                title_type, genre, mpaa_rating, critics_rating, audience_rating, 
                best_pic_nom, best_actor_win, best_actress_win, best_dir_win, top200_box) 

#checking for NA's
movies.model[rowSums(is.na(movies.model[,1:5])) >= 1,]
## # A tibble: 1 x 15
##   critics_score audience_score imdb_rating runtime imdb_num_votes
##           <dbl>          <dbl>       <dbl>   <dbl>          <int>
## 1            80             72         7.5      NA            739
## # ... with 10 more variables: title_type <fct>, genre <fct>,
## #   mpaa_rating <fct>, critics_rating <fct>, audience_rating <fct>,
## #   best_pic_nom <fct>, best_actor_win <fct>, best_actress_win <fct>,
## #   best_dir_win <fct>, top200_box <fct>
#this missing value corresponds to row 334

movies[334, ] #the movie is called "The End of America" and is 73 minutes long
## # A tibble: 1 x 32
##   title       title_type  genre   runtime mpaa_rating studio thtr_rel_year
##   <chr>       <fct>       <fct>     <dbl> <fct>       <fct>          <dbl>
## 1 The End of~ Documentary Docume~      NA Unrated     Indip~          2008
## # ... with 25 more variables: thtr_rel_month <dbl>, thtr_rel_day <dbl>,
## #   dvd_rel_year <dbl>, dvd_rel_month <dbl>, dvd_rel_day <dbl>,
## #   imdb_rating <dbl>, imdb_num_votes <int>, critics_rating <fct>,
## #   critics_score <dbl>, audience_rating <fct>, audience_score <dbl>,
## #   best_pic_nom <fct>, best_pic_win <fct>, best_actor_win <fct>,
## #   best_actress_win <fct>, best_dir_win <fct>, top200_box <fct>,
## #   director <chr>, actor1 <chr>, actor2 <chr>, actor3 <chr>,
## #   actor4 <chr>, actor5 <chr>, imdb_url <chr>, rt_url <chr>
              #https://en.wikipedia.org/wiki/The_End_of_America_(film)
              #we replace the missing value:

movies.model[334, 4] <- 73

Correlation between variables

First of all, we should get to know the relationships between our selected numerical variables, which we can do with a scatterplot matrix:

pairs.panels(movies.model[c("critics_score", "audience_score", "imdb_rating", "runtime", "imdb_num_votes")])


We are particularly interested in the rating (IMDB) and score (Rotten Tomatoes) values. One of these variables should be selected as our main outcome, and the remaining variables should be assessed for whether they relate to that outcome. Based on the scatter plots, there appears to be a roughly linear relationship among the variables.

The imdb_rating variable appears nearly normally distributed compared to the others, with some left skew. A strong relationship between the score predictors (critics_score/audience_score) and the outcome is apparent, though we need to confirm this through statistical significance tests (including the rest of the predictors).

One thing to take into account is whether high correlation exists between predictors, because this could lead to multicollinearity problems. Inspecting the scatter plots, some variables appear correlated, such as critics_score and audience_score.

Perhaps obtaining the correlation coefficients will be quite useful:

corr <- cor(movies.model[c("critics_score", "audience_score", "imdb_rating", "runtime", "imdb_num_votes")])
kable(corr, format = 'markdown')
|                | critics_score | audience_score | imdb_rating |   runtime | imdb_num_votes |
|:---------------|--------------:|---------------:|------------:|----------:|---------------:|
| critics_score  |     1.0000000 |      0.7042762 |   0.7650355 | 0.1700033 |      0.2092508 |
| audience_score |     0.7042762 |      1.0000000 |   0.8648652 | 0.1793002 |      0.2898128 |
| imdb_rating    |     0.7650355 |      0.8648652 |   1.0000000 | 0.2650698 |      0.3311525 |
| runtime        |     0.1700033 |      0.1793002 |   0.2650698 | 1.0000000 |      0.3477014 |
| imdb_num_votes |     0.2092508 |      0.2898128 |   0.3311525 | 0.3477014 |      1.0000000 |

As we can see, whether the critics_score and audience_score variables should both be included in our predictions needs to be assessed, since the correlation between them is about 0.70.
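A common way to quantify multicollinearity is the Variance Inflation Factor (VIF), where values above roughly 5-10 are usually taken as problematic. A minimal sketch, assuming the car package is installed (it is not loaded above):

#install.packages("car") if needed; car::vif() returns one VIF per predictor
car::vif(lm(imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes,
            data = movies.model))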

Building the first model approach:
fit.movies.1 <- lm(imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes +
              title_type + genre + mpaa_rating  + critics_rating + audience_rating +
              best_pic_nom + best_actor_win + best_actress_win + best_dir_win + top200_box,   
              data = movies.model)

summary(fit.movies.1)
## 
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score + runtime + 
##     imdb_num_votes + title_type + genre + mpaa_rating + critics_rating + 
##     audience_rating + best_pic_nom + best_actor_win + best_actress_win + 
##     best_dir_win + top200_box, data = movies.model)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.35280 -0.18251  0.02743  0.25048  1.10595 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     2.860e+00  2.694e-01  10.616  < 2e-16 ***
## critics_score                   1.478e-02  1.531e-03   9.653  < 2e-16 ***
## audience_score                  3.968e-02  2.062e-03  19.246  < 2e-16 ***
## runtime                         3.335e-03  1.097e-03   3.040 0.002467 ** 
## imdb_num_votes                  9.580e-07  2.038e-07   4.699 3.22e-06 ***
## title_typeFeature Film         -1.066e-01  1.675e-01  -0.636 0.524737    
## title_typeTV Movie             -3.038e-01  2.623e-01  -1.158 0.247238    
## genreAnimation                 -3.614e-01  1.761e-01  -2.052 0.040550 *  
## genreArt House & International  3.328e-01  1.366e-01   2.437 0.015093 *  
## genreComedy                    -1.275e-01  7.513e-02  -1.697 0.090281 .  
## genreDocumentary                2.763e-01  1.792e-01   1.542 0.123614    
## genreDrama                      1.027e-01  6.640e-02   1.547 0.122476    
## genreHorror                     6.827e-02  1.123e-01   0.608 0.543307    
## genreMusical & Performing Arts  6.816e-02  1.543e-01   0.442 0.658834    
## genreMystery & Suspense         2.336e-01  8.455e-02   2.763 0.005896 ** 
## genreOther                     -1.089e-02  1.277e-01  -0.085 0.932051    
## genreScience Fiction & Fantasy -1.587e-01  1.599e-01  -0.992 0.321349    
## mpaa_ratingNC-17               -1.339e-01  3.400e-01  -0.394 0.693828    
## mpaa_ratingPG                  -1.318e-01  1.238e-01  -1.065 0.287288    
## mpaa_ratingPG-13               -1.074e-01  1.281e-01  -0.838 0.402345    
## mpaa_ratingR                   -5.411e-02  1.237e-01  -0.437 0.661988    
## mpaa_ratingUnrated             -1.257e-01  1.412e-01  -0.890 0.373864    
## critics_ratingFresh             8.668e-02  5.628e-02   1.540 0.124037    
## critics_ratingRotten            3.518e-01  9.000e-02   3.909 0.000103 ***
## audience_ratingUpright         -3.421e-01  7.202e-02  -4.751 2.52e-06 ***
## best_pic_nomyes                -1.065e-01  1.082e-01  -0.984 0.325354    
## best_actor_winyes               2.955e-02  5.328e-02   0.555 0.579346    
## best_actress_winyes             5.876e-02  5.911e-02   0.994 0.320506    
## best_dir_winyes                 3.214e-02  7.436e-02   0.432 0.665717    
## top200_boxyes                  -9.861e-02  1.257e-01  -0.784 0.433099    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4464 on 621 degrees of freedom
## Multiple R-squared:  0.8382, Adjusted R-squared:  0.8306 
## F-statistic: 110.9 on 29 and 621 DF,  p-value: < 2.2e-16

We do see some statistically significant numerical predictors, with p-values below 5%. The model also shows an adjusted R-squared of 83% (which penalizes the inclusion of additional predictors), which may represent a good approximation. However, the R fitting output does not by itself address issues such as correlation between predictors, which could lead us to a misleading model; indeed, several of the factor predictors appear non-significant as well.

The second step is to develop a STEPWISE process (forward and backward) to determine which predictors should stay and which ones we should leave out. This is done by finding the model that minimizes the AIC (Akaike Information Criterion, an estimator of the relative quality of statistical models):

AIC = 2k − 2 ln(L)

where k = the number of estimated parameters and L = the maximized value of the model’s likelihood function
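As a sanity check, AIC is available directly in R, and the formula can be verified by hand from the model’s log-likelihood (a quick sketch; note that k counts the residual variance as an extra parameter):

k <- length(coef(fit.movies.1)) + 1            #estimated parameters, incl. sigma
2 * k - 2 * as.numeric(logLik(fit.movies.1))   #AIC computed by hand
AIC(fit.movies.1)                              #built-in value; should match
#stepAIC() below prints extractAIC() values, which differ by an additive
#constant but produce the same model ranking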

#STEPWISE PROCESS
step.movies <- stepAIC(fit.movies.1, direction ="both")
## Start:  AIC=-1020.8
## imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes + 
##     title_type + genre + mpaa_rating + critics_rating + audience_rating + 
##     best_pic_nom + best_actor_win + best_actress_win + best_dir_win + 
##     top200_box
## 
##                    Df Sum of Sq    RSS      AIC
## - mpaa_rating       5     0.761 124.52 -1026.81
## - title_type        2     0.269 124.02 -1023.39
## - best_dir_win      1     0.037 123.79 -1022.61
## - best_actor_win    1     0.061 123.81 -1022.48
## - top200_box        1     0.123 123.88 -1022.16
## - best_pic_nom      1     0.193 123.95 -1021.79
## - best_actress_win  1     0.197 123.95 -1021.77
## <none>                          123.75 -1020.80
## - runtime           1     1.841 125.59 -1013.19
## - critics_rating    2     3.213 126.97 -1008.12
## - genre            10     7.685 131.44 -1001.58
## - imdb_num_votes    1     4.401 128.16 -1000.05
## - audience_rating   1     4.498 128.25  -999.56
## - critics_score     1    18.571 142.32  -931.78
## - audience_score    1    73.818 197.57  -718.26
## 
## Step:  AIC=-1026.81
## imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes + 
##     title_type + genre + critics_rating + audience_rating + best_pic_nom + 
##     best_actor_win + best_actress_win + best_dir_win + top200_box
## 
##                    Df Sum of Sq    RSS      AIC
## - title_type        2     0.249 124.76 -1029.51
## - best_dir_win      1     0.041 124.56 -1028.60
## - best_actor_win    1     0.044 124.56 -1028.58
## - top200_box        1     0.159 124.67 -1027.98
## - best_actress_win  1     0.170 124.69 -1027.92
## - best_pic_nom      1     0.221 124.74 -1027.65
## <none>                          124.52 -1026.81
## + mpaa_rating       5     0.761 123.75 -1020.80
## - runtime           1     1.734 126.25 -1019.81
## - critics_rating    2     3.143 127.66 -1014.58
## - audience_rating   1     4.486 129.00 -1005.77
## - imdb_num_votes    1     4.762 129.28 -1004.38
## - genre            10     9.064 133.58 -1001.07
## - critics_score     1    18.788 143.30  -937.32
## - audience_score    1    74.253 198.77  -724.33
## 
## Step:  AIC=-1029.51
## imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes + 
##     genre + critics_rating + audience_rating + best_pic_nom + 
##     best_actor_win + best_actress_win + best_dir_win + top200_box
## 
##                    Df Sum of Sq    RSS      AIC
## - best_dir_win      1     0.040 124.80 -1031.30
## - best_actor_win    1     0.050 124.81 -1031.25
## - best_actress_win  1     0.159 124.92 -1030.68
## - top200_box        1     0.161 124.92 -1030.67
## - best_pic_nom      1     0.221 124.98 -1030.36
## <none>                          124.76 -1029.51
## + title_type        2     0.249 124.52 -1026.81
## + mpaa_rating       5     0.741 124.02 -1023.39
## - runtime           1     1.759 126.52 -1022.39
## - critics_rating    2     3.169 127.93 -1017.18
## - audience_rating   1     4.543 129.31 -1008.23
## - imdb_num_votes    1     4.749 129.51 -1007.19
## - genre            10    11.296 136.06  -993.08
## - critics_score     1    19.225 143.99  -938.21
## - audience_score    1    74.817 199.58  -725.67
## 
## Step:  AIC=-1031.3
## imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes + 
##     genre + critics_rating + audience_rating + best_pic_nom + 
##     best_actor_win + best_actress_win + top200_box
## 
##                    Df Sum of Sq    RSS      AIC
## - best_actor_win    1     0.053 124.86 -1033.03
## - best_actress_win  1     0.162 124.97 -1032.46
## - top200_box        1     0.165 124.97 -1032.44
## - best_pic_nom      1     0.211 125.01 -1032.20
## <none>                          124.80 -1031.30
## + best_dir_win      1     0.040 124.76 -1029.51
## + title_type        2     0.248 124.56 -1028.60
## + mpaa_rating       5     0.745 124.06 -1025.20
## - runtime           1     1.892 126.69 -1023.51
## - critics_rating    2     3.155 127.96 -1019.05
## - audience_rating   1     4.580 129.38 -1009.84
## - imdb_num_votes    1     4.844 129.65 -1008.52
## - genre            10    11.259 136.06  -995.07
## - critics_score     1    19.304 144.11  -939.68
## - audience_score    1    74.896 199.70  -727.28
## 
## Step:  AIC=-1033.03
## imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes + 
##     genre + critics_rating + audience_rating + best_pic_nom + 
##     best_actress_win + top200_box
## 
##                    Df Sum of Sq    RSS      AIC
## - top200_box        1     0.159 125.02 -1034.20
## - best_actress_win  1     0.172 125.03 -1034.13
## - best_pic_nom      1     0.192 125.05 -1034.03
## <none>                          124.86 -1033.03
## + best_actor_win    1     0.053 124.80 -1031.30
## + best_dir_win      1     0.042 124.81 -1031.25
## + title_type        2     0.254 124.60 -1030.35
## + mpaa_rating       5     0.729 124.13 -1026.84
## - runtime           1     2.091 126.95 -1024.21
## - critics_rating    2     3.157 128.01 -1020.77
## - audience_rating   1     4.644 129.50 -1011.25
## - imdb_num_votes    1     4.813 129.67 -1010.40
## - genre            10    11.281 136.14  -996.72
## - critics_score     1    19.320 144.18  -941.36
## - audience_score    1    75.027 199.88  -728.69
## 
## Step:  AIC=-1034.2
## imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes + 
##     genre + critics_rating + audience_rating + best_pic_nom + 
##     best_actress_win
## 
##                    Df Sum of Sq    RSS      AIC
## - best_actress_win  1     0.155 125.17 -1035.39
## - best_pic_nom      1     0.183 125.20 -1035.25
## <none>                          125.02 -1034.20
## + top200_box        1     0.159 124.86 -1033.03
## + best_actor_win    1     0.047 124.97 -1032.44
## + best_dir_win      1     0.047 124.97 -1032.44
## + title_type        2     0.255 124.76 -1031.53
## + mpaa_rating       5     0.763 124.25 -1028.19
## - runtime           1     2.047 127.06 -1025.62
## - critics_rating    2     3.174 128.19 -1021.88
## - imdb_num_votes    1     4.655 129.67 -1012.40
## - audience_rating   1     4.713 129.73 -1012.11
## - genre            10    11.456 136.47  -997.12
## - critics_score     1    19.229 144.24  -943.06
## - audience_score    1    75.349 200.36  -729.12
## 
## Step:  AIC=-1035.39
## imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes + 
##     genre + critics_rating + audience_rating + best_pic_nom
## 
##                    Df Sum of Sq    RSS      AIC
## - best_pic_nom      1     0.142 125.31 -1036.65
## <none>                          125.17 -1035.39
## + best_actress_win  1     0.155 125.02 -1034.20
## + top200_box        1     0.142 125.03 -1034.13
## + best_actor_win    1     0.057 125.11 -1033.69
## + best_dir_win      1     0.050 125.12 -1033.65
## + title_type        2     0.243 124.93 -1032.66
## + mpaa_rating       5     0.734 124.44 -1029.22
## - runtime           1     2.233 127.40 -1025.88
## - critics_rating    2     3.214 128.38 -1022.89
## - imdb_num_votes    1     4.680 129.85 -1013.49
## - audience_rating   1     4.697 129.87 -1013.41
## - genre            10    11.489 136.66  -998.22
## - critics_score     1    19.436 144.61  -943.43
## - audience_score    1    75.194 200.36  -731.12
## 
## Step:  AIC=-1036.65
## imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes + 
##     genre + critics_rating + audience_rating
## 
##                    Df Sum of Sq    RSS      AIC
## <none>                          125.31 -1036.65
## + best_pic_nom      1     0.142 125.17 -1035.39
## + top200_box        1     0.136 125.18 -1035.36
## + best_actress_win  1     0.115 125.20 -1035.25
## + best_dir_win      1     0.039 125.27 -1034.86
## + best_actor_win    1     0.038 125.28 -1034.85
## + title_type        2     0.243 125.07 -1033.92
## + mpaa_rating       5     0.760 124.55 -1030.61
## - runtime           1     2.115 127.43 -1027.75
## - critics_rating    2     3.252 128.56 -1023.98
## - imdb_num_votes    1     4.539 129.85 -1015.49
## - audience_rating   1     4.635 129.95 -1015.01
## - genre            10    11.562 136.87  -999.20
## - critics_score     1    19.370 144.68  -945.08
## - audience_score    1    75.089 200.40  -733.00
print(step.movies$anova)
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes + 
##     title_type + genre + mpaa_rating + critics_rating + audience_rating + 
##     best_pic_nom + best_actor_win + best_actress_win + best_dir_win + 
##     top200_box
## 
## Final Model:
## imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes + 
##     genre + critics_rating + audience_rating
## 
## 
##                 Step Df   Deviance Resid. Df Resid. Dev       AIC
## 1                                        621   123.7535 -1020.802
## 2      - mpaa_rating  5 0.76108763       626   124.5146 -1026.810
## 3       - title_type  2 0.24901821       628   124.7637 -1029.510
## 4     - best_dir_win  1 0.03963448       629   124.8033 -1031.303
## 5   - best_actor_win  1 0.05268222       630   124.8560 -1033.028
## 6       - top200_box  1 0.15904131       631   125.0150 -1034.199
## 7 - best_actress_win  1 0.15544337       632   125.1705 -1035.390
## 8     - best_pic_nom  1 0.14212696       633   125.3126 -1036.652

The output reveals that the AIC is reduced by eliminating predictors that do not add value to the model. The final model is now:

imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes + genre + critics_rating + audience_rating

Building the second model approach:
fit.movies.2 <- lm(imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes + genre + critics_rating + audience_rating, data = movies.model)

summary(fit.movies.2)
## 
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score + runtime + 
##     imdb_num_votes + genre + critics_rating + audience_rating, 
##     data = movies.model)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.33507 -0.18181  0.02534  0.25007  1.08827 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     2.662e+00  1.688e-01  15.770  < 2e-16 ***
## critics_score                   1.482e-02  1.498e-03   9.892  < 2e-16 ***
## audience_score                  3.971e-02  2.039e-03  19.476  < 2e-16 ***
## runtime                         3.340e-03  1.022e-03   3.269 0.001138 ** 
## imdb_num_votes                  9.242e-07  1.930e-07   4.788 2.10e-06 ***
## genreAnimation                 -3.046e-01  1.598e-01  -1.906 0.057130 .  
## genreArt House & International  3.423e-01  1.329e-01   2.575 0.010239 *  
## genreComedy                    -1.190e-01  7.338e-02  -1.622 0.105401    
## genreDocumentary                3.596e-01  9.256e-02   3.885 0.000113 ***
## genreDrama                      1.199e-01  6.342e-02   1.891 0.059050 .  
## genreHorror                     9.174e-02  1.092e-01   0.840 0.401043    
## genreMusical & Performing Arts  1.076e-01  1.443e-01   0.746 0.456123    
## genreMystery & Suspense         2.729e-01  8.113e-02   3.364 0.000816 ***
## genreOther                     -3.482e-02  1.255e-01  -0.277 0.781581    
## genreScience Fiction & Fantasy -1.548e-01  1.590e-01  -0.974 0.330618    
## critics_ratingFresh             8.685e-02  5.534e-02   1.569 0.117035    
## critics_ratingRotten            3.518e-01  8.900e-02   3.952 8.61e-05 ***
## audience_ratingUpright         -3.451e-01  7.132e-02  -4.839 1.64e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4449 on 633 degrees of freedom
## Multiple R-squared:  0.8362, Adjusted R-squared:  0.8318 
## F-statistic:   190 on 17 and 633 DF,  p-value: < 2.2e-16

The adjusted R-squared is now 83.18%, and several factor levels show significance.

All else held constant, each additional point of critics score is associated with a 0.0148-point increase in IMDB rating.

All else held constant, each additional point of audience score is associated with a 0.0397-point increase in IMDB rating.

All else held constant, each additional minute of runtime is associated with a 0.0033-point increase in IMDB rating.

Because the predictors sit on very different scales (scores run 0-100, vote counts run into the hundreds of thousands), standardizing them (Z-scores) puts all the slopes (beta values) on a comparable footing, giving a better view of the relationship between the predictors (including number of votes) and the outcome:

means   <- sapply(movies.model[,c("critics_score", "audience_score", "runtime", "imdb_num_votes")],mean)
stdev   <- sapply(movies.model[,c("critics_score", "audience_score", "runtime", "imdb_num_votes")],sd)

movies.2.scaled <- as.data.frame(scale(movies.model[,c("critics_score", "audience_score", "runtime", "imdb_num_votes")],center=means,scale=stdev))

movies.2.scaled <- movies.2.scaled %>%
        mutate(imdb_rating = movies.model$imdb_rating, genre = 
        movies.model$genre, critics_rating = movies.model$critics_rating,  
        audience_rating = movies.model$audience_rating  )

fit.movies.2.scaled <- lm(imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes + genre + critics_rating + audience_rating, data = movies.2.scaled)

summary(fit.movies.2.scaled)
## 
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score + runtime + 
##     imdb_num_votes + genre + critics_rating + audience_rating, 
##     data = movies.2.scaled)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.33507 -0.18181  0.02534  0.25007  1.08827 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     6.39951    0.09142  69.998  < 2e-16 ***
## critics_score                   0.42097    0.04256   9.892  < 2e-16 ***
## audience_score                  0.80305    0.04123  19.476  < 2e-16 ***
## runtime                         0.06504    0.01990   3.269 0.001138 ** 
## imdb_num_votes                  0.10362    0.02164   4.788 2.10e-06 ***
## genreAnimation                 -0.30456    0.15981  -1.906 0.057130 .  
## genreArt House & International  0.34230    0.13291   2.575 0.010239 *  
## genreComedy                    -0.11898    0.07338  -1.622 0.105401    
## genreDocumentary                0.35958    0.09256   3.885 0.000113 ***
## genreDrama                      0.11995    0.06342   1.891 0.059050 .  
## genreHorror                     0.09174    0.10918   0.840 0.401043    
## genreMusical & Performing Arts  0.10763    0.14433   0.746 0.456123    
## genreMystery & Suspense         0.27288    0.08113   3.364 0.000816 ***
## genreOther                     -0.03482    0.12553  -0.277 0.781581    
## genreScience Fiction & Fantasy -0.15481    0.15901  -0.974 0.330618    
## critics_ratingFresh             0.08685    0.05534   1.569 0.117035    
## critics_ratingRotten            0.35178    0.08900   3.952 8.61e-05 ***
## audience_ratingUpright         -0.34510    0.07132  -4.839 1.64e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4449 on 633 degrees of freedom
## Multiple R-squared:  0.8362, Adjusted R-squared:  0.8318 
## F-statistic:   190 on 17 and 633 DF,  p-value: < 2.2e-16

Now we do see an interesting pattern. Thanks to the standardization, we can say that with every increase of one standard deviation in audience score, the IMDB rating increases by about 0.80 points (the outcome itself was left unscaled), holding the other variables constant; only a small effect is explained by the runtime predictor.
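To rank the predictors by the magnitude of their standardized effect, we can sort the absolute beta values (a quick sketch):

betas <- coef(fit.movies.2.scaled)[-1]   #drop the intercept
sort(abs(betas), decreasing = TRUE)      #largest standardized effects first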

Regression Diagnostics

par(mfrow = c(2,2))
plot(fit.movies.2, main = "MODEL 2")

We must take into account that a linear regression model makes several assumptions about the data: linearity, normality of the residuals, limited multicollinearity, and homogeneity of the residual variance (constant variance, or homoscedasticity). The four diagnostic plots above are Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage.

  1. Ideally, we would expect the residuals-vs-fitted line to be nearly horizontal, meaning small, patternless residuals around the fitted values. In general, the model looks good on this count.

  2. The residuals appear to be nearly normally distributed.

  3. By standardizing the residuals we can check for heteroscedasticity. Ideally the trend line should be horizontal, not increasing across the fitted values, which seems to be the case here (formal tests are sketched after this list).

  4. Through Cook’s distance we can identify leverage points and potential outliers, and they do not appear to affect our model.
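Formal counterparts to these visual checks exist; a quick sketch, assuming the car package is installed (it is not loaded above):

shapiro.test(resid(fit.movies.2))   #null hypothesis: residuals are normally distributed
car::ncvTest(fit.movies.2)          #null hypothesis: residual variance is constant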

We can even test whether some predictor variables interact. Are the critics and audience ratings (Fresh / Rotten / Spilled / Upright) somehow related to how the scores map to the IMDB rating (0 - 10)?

Building a third model approach:
fit.movies.3 <- lm(imdb_rating ~ critics_score * critics_rating + audience_score * audience_rating + runtime + imdb_num_votes + genre, data = movies.model)
plot(allEffects(fit.movies.3), multiline = TRUE)

We can clearly see that when we interact the critics_score and critics_rating predictors, the relationship with the IMDB rating changes according to the Rotten or Fresh qualification. There does not seem to be a comparable change when the audience rating is Spilled or Upright.

By running an ANOVA test we can compare models 2 and 3: model 3 is significantly better than model 2, and the residual sum of squares (RSS) is also reduced in model 3.

anova(fit.movies.2, fit.movies.3)
## Analysis of Variance Table
## 
## Model 1: imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes + 
##     genre + critics_rating + audience_rating
## Model 2: imdb_rating ~ critics_score * critics_rating + audience_score * 
##     audience_rating + runtime + imdb_num_votes + genre
##   Res.Df    RSS Df Sum of Sq     F    Pr(>F)    
## 1    633 125.31                                 
## 2    630 119.75  3     5.567 9.763 2.651e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Judging from the plots and the fitted models, we can see that critics and audiences in general rate movies according to their genre. Again, the ANOVA shows that the RSS has been reduced, and model 3 appears to perform better than the second model.

Comparing models 2 and 3 (predicted values vs actual values), we can see that both seem to fit a linear regression for predicting the IMDB rating about equally well:

par(mfrow = c(1,2))  #two panels: model 2 and model 3

plot(predict(fit.movies.2), movies$imdb_rating, col = "Blue", cex = 0.6,
     xlab="predicted",ylab="actual", main = "MODEL 2")

plot(predict(fit.movies.3), movies$imdb_rating, col = "Red", cex = 0.6,
     xlab="predicted",ylab="actual", main = "MODEL 3")

ASSESSING ERROR ACCURACY

RSS: Residual Sum of Squares
R2: R-squared
RMSE: Root Mean Square Error
MAE: Mean Absolute Error
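The modelr helpers used below are equivalent to computing the metrics by hand from the residuals; for instance, for model 2:

res <- resid(fit.movies.2)
sqrt(mean(res^2))   #RMSE
mean(abs(res))      #MAE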

m1 <- data.frame(
        MODEL = "MODEL 1",
        RSS = sum(resid(fit.movies.1)^2),
        R2 = rsquare(fit.movies.1, data = movies.model ), 
        RMSE = rmse(fit.movies.1, data = movies.model), 
        MAE = mae(fit.movies.1, data = movies.model) )

m2 <- data.frame(
        MODEL = "MODEL 2",
        RSS = sum(resid(fit.movies.2)^2),
        R2 = rsquare(fit.movies.2, data = movies.model ), 
        RMSE = rmse(fit.movies.2, data = movies.model), 
        MAE = mae(fit.movies.2, data = movies.model) )

m3 <- data.frame(
        MODEL = "MODEL 3",
        RSS = sum(resid(fit.movies.3)^2),
        R2 = rsquare(fit.movies.3, data = movies.model ), 
        RMSE = rmse(fit.movies.3, data = movies.model), 
        MAE = mae(fit.movies.3, data = movies.model) )


metrics <- rbind(m1,m2,m3)


kable(metrics, format = 'markdown')
| MODEL   |      RSS |        R2 |      RMSE |       MAE |
|:--------|---------:|----------:|----------:|----------:|
| MODEL 1 | 123.7535 | 0.8381966 | 0.4360018 | 0.3044807 |
| MODEL 2 | 125.3126 | 0.8361582 | 0.4387396 | 0.3067051 |
| MODEL 3 | 119.7456 | 0.8434369 | 0.4288834 | 0.2984583 |

Our first attempt at constructing the model included many variables in the hope of finding the best-fitting regression. However, our main goal is to find the best model while keeping the number of predictors at a minimum. Though the RMSE and MAE stay at roughly the same level across models, we can see that the RSS is lowest for the last model.


Part 5: Prediction

For this part we have been asked to pick movies from 2016 (that are not in the sample) and predict their ratings using the models developed, quantifying the uncertainty with a prediction interval.

We have collected 5 movies from the websites:

https://www.rottentomatoes.com/about/ https://www.imdb.com/

The data have been divided into two parts: TRAIN DATA (our movies dataset) and TEST DATA (a data frame containing the newly sampled movies).

#train data
movies.df <- movies.model %>%
        select(imdb_rating, critics_score, audience_score, runtime, 
                       imdb_num_votes, genre, critics_rating, audience_rating )
#test data

movies2016 <- read.csv("movies2016.csv", header = TRUE)
kable(movies2016, format = 'markdown')
| movie              | imdb_rating | critics_score | audience_score | runtime | imdb_num_votes | genre              | critics_rating  | audience_rating |
|:-------------------|------------:|--------------:|---------------:|--------:|---------------:|:-------------------|:----------------|:----------------|
| Deadpool (2016)    |         8.0 |            83 |             90 |     108 |         159894 | Action & Adventure | Certified Fresh | Upright         |
| Billions (2016)    |         8.4 |            85 |             88 |      60 |          42259 | Drama              | Fresh           | Upright         |
| Moana (2016)       |         7.6 |            96 |             89 |     107 |         190286 | Animation          | Certified Fresh | Upright         |
| Mothers Day (2016) |         5.6 |             6 |             44 |     118 |          25471 | Comedy             | Rotten          | Spilled         |
| Rogue One (2016)   |         7.8 |            85 |             87 |     133 |         424795 | Action & Adventure | Certified Fresh | Upright         |

PREDICTION INTERVAL MODEL 2

pred.2 <- cbind(actual = movies2016$imdb_rating, predict(fit.movies.2, newdata = movies2016, interval="prediction", level = 0.95))
kable(cbind(as.character(movies2016$movie), pred.2, error = movies2016$imdb_rating - pred.2[,2]), format = 'markdown')
| movie              | actual |   fit |   lwr |   upr |  error |
|:-------------------|-------:|------:|------:|------:|-------:|
| Deadpool (2016)    |    8.0 | 7.629 | 6.744 | 8.514 |  0.371 |
| Billions (2016)    |    8.4 | 7.517 | 6.633 | 8.401 |  0.883 |
| Moana (2016)       |    7.6 | 7.502 | 6.578 | 8.427 |  0.098 |
| Mothers Day (2016) |    5.6 | 5.148 | 4.265 | 6.031 |  0.452 |
| Rogue One (2016)   |    7.8 | 7.868 | 6.980 | 8.756 | -0.068 |

PREDICTION INTERVAL MODEL 3

pred.3 <- cbind(actual = movies2016$imdb_rating, predict(fit.movies.3, newdata = movies2016, interval="prediction", level = 0.95))
kable(cbind(as.character(movies2016$movie), pred.3, error = movies2016$imdb_rating - pred.3[,2]), format = 'markdown')
| movie              | actual |   fit |   lwr |   upr |  error |
|:-------------------|-------:|------:|------:|------:|-------:|
| Deadpool (2016)    |    8.0 | 7.680 | 6.810 | 8.550 |  0.320 |
| Billions (2016)    |    8.4 | 7.442 | 6.574 | 8.310 |  0.958 |
| Moana (2016)       |    7.6 | 7.391 | 6.478 | 8.304 |  0.209 |
| Mothers Day (2016) |    5.6 | 5.017 | 4.150 | 5.884 |  0.583 |
| Rogue One (2016)   |    7.8 | 7.868 | 6.997 | 8.739 | -0.068 |

Both models 2 and 3 perform well when predicting Rogue One (2016) and Deadpool (2016), considering that both are classified as Action & Adventure. The prediction for the drama (Billions) is closer under the second model, and the same holds for the movies categorized as Animation and Comedy.

In general, both models predict with 95% prediction intervals, and all of the actual IMDB ratings are captured within those intervals.
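We can confirm that coverage claim programmatically by checking whether each actual rating falls inside its interval (shown for model 2; the same works for pred.3):

covered <- with(as.data.frame(pred.2), actual >= lwr & actual <= upr)
mean(covered)   #proportion of 2016 movies inside the 95% prediction interval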


Part 6: Conclusion

  1. Through an exploratory data analysis we saw that the public’s preferences are reflected in our given dataset. This claim was supported by analyzing categories such as genre, MPAA rating and studio.

  2. The IMDB and Rotten Tomatoes rating systems seem to follow the same trend across the years, but they differ noticeably in rating levels, as we showed through statistical inference. Although IMDB does not state which public participates in its ratings, the Rotten Tomatoes audience rating closely approximates the IMDB rating.

  3. Two models were developed:

imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes + genre + critics_rating + audience_rating

imdb_rating ~ critics_score * critics_rating + audience_score * audience_rating + runtime + imdb_num_votes + genre

Both predict the IMDB rating of the newly sampled movies (obtained from the Rotten Tomatoes and IMDB websites) within 95% prediction intervals.

  4. Though the two models fit the predictions quite well, there are improvements that should be taken into account:
  • The original dataset has only 651 records. The larger the sample, the more accurate the model should be, allowing it to capture more of the population’s variability.
  • Attributes of the voters are missing; these could explain the movie preferences in more detail.
  • Attributes of the movies are missing, for instance popularity, reviews, number of likes, etc.