This project analyzes a movies dataset comprising 651 randomly sampled movies produced and released before 2016. We have been hired as data scientists at Paramount Pictures, and we need to study the data in order to answer some research questions based on the public's responses on two of the largest rating sites on the web: Rotten Tomatoes and IMDB (Internet Movie Database).
Rotten Tomatoes is a top-1000 website, ranking around #400 globally and in the top 150 for the US. Its staff first collect online reviews from writers who are certified members of various writing guilds or film critic associations.
Critics' score: if the positive reviews make up 60% or more, the film is considered “fresh”, meaning a supermajority of the reviewers approve of it. If the positive reviews are below 60%, the film is considered “rotten”. An average score on a 0 to 10 scale is also calculated. Certified Fresh status is a special distinction awarded to the best-reviewed movies and TV shows.
Audience score: the audience rating, denoted by a popcorn bucket, is the percentage of all users who have rated the movie or TV show positively. A full popcorn bucket means the title received an average of 3.5 stars or higher from users and is labeled “Upright”; a tipped-over popcorn bucket means it received less than 3.5 stars and is labeled “Spilled”.
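As a minimal illustration, the two rules above could be coded as follows (hypothetical helper functions written for this summary; they are not part of the dataset or of Rotten Tomatoes' actual implementation):

#Hypothetical helpers illustrating the Rotten Tomatoes rules described above
tomatometer <- function(pct_positive) ifelse(pct_positive >= 60, "Fresh", "Rotten")
audience_bucket <- function(avg_stars) ifelse(avg_stars >= 3.5, "Upright", "Spilled")
tomatometer(c(59, 60, 85)) #"Rotten" "Fresh" "Fresh"
audience_bucket(c(3.4, 3.5)) #"Spilled" "Upright"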
IMDb is the world's most popular and authoritative source for movie, TV and celebrity content. It offers a searchable database of more than 250 million data items, including more than 4 million movies, TV and entertainment programs and 8 million cast and crew members. IMDb launched online in 1990 and has been a subsidiary of Amazon.com since 1998.
Sources: https://www.rottentomatoes.com/about/ and https://www.imdb.com/
library(dplyr) #Useful package for data manipulation
library(knitr) #Useful for presenting graphs and tables
library(ggplot2) #Recognized package for creating graphs
library(reshape2) #Useful package for data manipulation
library(psych) #Useful package to do correlations
library(MASS) #A statistical package with useful functions
library(effects) #A package for visualizing effects plots (especially for regression models)
library(modelr) #Some functions for regression models
load("movies.Rdata")
head(movies, 5)
## # A tibble: 5 x 32
## title title_type genre runtime mpaa_rating studio thtr_rel_year
## <chr> <fct> <fct> <dbl> <fct> <fct> <dbl>
## 1 Filly Br~ Feature Fi~ Drama 80 R Indomina ~ 2013
## 2 The Dish Feature Fi~ Drama 101 PG-13 Warner Br~ 2001
## 3 Waiting ~ Feature Fi~ Come~ 84 R Sony Pict~ 1996
## 4 The Age ~ Feature Fi~ Drama 139 PG Columbia ~ 1993
## 5 Malevole~ Feature Fi~ Horr~ 90 R Anchor Ba~ 2004
## # ... with 25 more variables: thtr_rel_month <dbl>, thtr_rel_day <dbl>,
## # dvd_rel_year <dbl>, dvd_rel_month <dbl>, dvd_rel_day <dbl>,
## # imdb_rating <dbl>, imdb_num_votes <int>, critics_rating <fct>,
## # critics_score <dbl>, audience_rating <fct>, audience_score <dbl>,
## # best_pic_nom <fct>, best_pic_win <fct>, best_actor_win <fct>,
## # best_actress_win <fct>, best_dir_win <fct>, top200_box <fct>,
## # director <chr>, actor1 <chr>, actor2 <chr>, actor3 <chr>,
## # actor4 <chr>, actor5 <chr>, imdb_url <chr>, rt_url <chr>
nrow(movies) #number of sampled movies
## [1] 651
str(movies)
## Classes 'tbl_df', 'tbl' and 'data.frame': 651 obs. of 32 variables:
## $ title : chr "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ thtr_rel_year : num 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ director : chr "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
As indicated previously, we have identified numerical and categorical variables in our dataset. The numerical variables divide into two groups: date components (year, month and day) and scoring variables (such as critics score, audience score and vote counts). We have also identified variables containing more detailed information about each sampled movie, such as the title, genre, director, MPAA rating and the names of the actors.
One of the most relevant distinctions to keep in mind is which rating site each variable refers to:

Rotten Tomatoes: audience_rating (Upright or Spilled); critics_rating (Certified Fresh, Fresh or Rotten); and audience_score / critics_score

IMDB: imdb_num_votes (the number of public votes) and imdb_rating (the only weighted measure available in our dataset)
We answer our research questions throughout the analysis of the movies dataset. These questions focus on drawing conclusions about the population of movies from which the dataset was sampled. Since the observations are a random sample from a representative group of rated movies, generalizability applies: the attributes and predictor relationships we find through inference and significance tests can be extended to the population in general. It is important to note, however, that the larger the sample, the more confidently we can generalize the results.
One of our main research goals is prediction, but this does not mean we claim to establish direct causation through this analysis. We can, however, state whether relationships appear to exist among our selected variables, keeping in mind that domain expertise would be needed to establish causal relationships.
First, it is important to get to know the data, so our next step is an exploratory data analysis (EDA) to start answering our research questions.
We collect the relevant variables and store them in a new data frame.
#Reload dplyr and ggplot2 so that dplyr::select() is not masked by MASS::select()
detach("package:dplyr", character.only = TRUE)
library(dplyr)
detach("package:ggplot2", character.only = TRUE)
library(ggplot2)
#Collecting
movies.data <- movies %>%
select(thtr_rel_year, title, genre, mpaa_rating,
best_pic_nom,best_pic_win,
director,actor1, actor2, actor3, actor4, actor5,
imdb_num_votes, imdb_rating,
critics_rating, critics_score,
audience_rating, audience_score)
movies %>%
select(imdb_num_votes, critics_score,audience_score) %>%
summary()
## imdb_num_votes critics_score audience_score
## Min. : 180 Min. : 1.00 Min. :11.00
## 1st Qu.: 4546 1st Qu.: 33.00 1st Qu.:46.00
## Median : 15116 Median : 61.00 Median :65.00
## Mean : 57533 Mean : 57.69 Mean :62.36
## 3rd Qu.: 58301 3rd Qu.: 83.00 3rd Qu.:80.00
## Max. :893008 Max. :100.00 Max. :97.00
The summary gives us a first idea of the critics' and audience responses: both scores appear roughly symmetric, with the audience score averaging higher (62.36 versus 57.69 for critics). It is important to distinguish the two kinds of rating: audience ratings come from viewers watching for entertainment and pastime, with no technical judgment involved; critics' scores, on the other hand, are professional ratings produced by reviewers with formal expertise, so technical value is added.
This does not seem to be the case for the IMDB votes: their distribution is highly right-skewed, with a median of 15,116 votes well below the mean of 57,533. Why this difference? Is it related to the public's preferences?
The IMDB voting pattern suggests a general preference for dramas, along with an apparent lack of interest in animation and science fiction & fantasy movies.
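The figure supporting this observation is not reproduced here; a minimal sketch of the comparison, summing the IMDB votes per genre (an assumed but reasonable proxy for public interest), could be:

#Sketch: total IMDB votes per genre (proxy for public interest)
movies %>%
  group_by(genre) %>%
  summarize(total_votes = sum(imdb_num_votes)) %>%
  ggplot(aes(x = reorder(genre, total_votes), y = total_votes)) +
  geom_col(fill = "cornflowerblue") + coord_flip() +
  xlab("Genre") + ylab("Total IMDB Votes")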
ggplot(data = movies.data, aes(x = mpaa_rating)) + geom_bar(fill = "cornflowerblue") + coord_flip() +
  theme(plot.title = element_text(hjust = 0.5, size = 15, face = "italic")) + labs(title = "Movie Preferences on IMDB (1970-2014)") + xlab("MPAA Rating") + ylab("Number of Movies")
Movies rated “R” attract the highest interest, with more than 300 titles in the sample, followed by “PG-13” and “PG” movies.
studio <- movies %>%
  select(studio) %>%
  group_by(studio) %>%
  summarize(count = n()) %>% #n() counts sampled movies per studio, not votes
  arrange(desc(count)) %>%
  head(10)
kable(studio, format = 'markdown')
| studio | count |
|---|---|
| Paramount Pictures | 37 |
| Warner Bros. Pictures | 30 |
| Sony Pictures Home Entertainment | 27 |
| Universal Pictures | 23 |
| Warner Home Video | 19 |
| 20th Century Fox | 18 |
| Miramax Films | 18 |
| MGM | 16 |
| Twentieth Century Fox Home Entertainment | 14 |
| IFC Films | 13 |
movies.ranking.best <- movies.data %>%
  filter(critics_rating == "Certified Fresh" & critics_score >= 90) %>%
  filter(audience_score >= 90) %>%
  filter(imdb_rating >= 8.5) %>%
  select(title, critics_score, audience_score, imdb_rating) %>%
  arrange(desc(critics_score), desc(audience_score), desc(imdb_rating))
movies.ranking.worst <- movies.data %>%
  filter(critics_score <= 25) %>%
  filter(audience_score <= 25) %>%
  filter(imdb_rating <= 3) %>%
  select(title, critics_score, audience_score, imdb_rating) %>%
  arrange(desc(critics_score), desc(audience_score), desc(imdb_rating))
kable(movies.ranking.best, format = 'markdown')
| title | critics_score | audience_score | imdb_rating |
|---|---|---|---|
| The Godfather, Part II | 97 | 97 | 9.0 |
| Promises | 96 | 96 | 8.5 |
| Memento | 92 | 94 | 8.5 |
kable(movies.ranking.worst, format = 'markdown')
| title | critics_score | audience_score | imdb_rating |
|---|---|---|---|
| Viva Knievel! | 17 | 17 | 2.7 |
| Doogal | 8 | 18 | 2.8 |
| Battlefield Earth | 3 | 11 | 2.4 |
| Disaster Movie | 1 | 19 | 1.9 |
These tables show that the two sites do in fact rate differently. For instance, although “Disaster Movie” is the worst of them all, the Rotten Tomatoes audience still scored it 19/100, noticeably higher than the other systems.

Moreover, as mentioned previously, the IMDB rating does not state which type of public it comes from: just the audience, the critics, or both? We can start by comparing the systems to see whether the differences across groups are significant or simply due to random chance.
years <- movies.data %>%
select(year = thtr_rel_year, critics = critics_score, audience = audience_score, imdb = imdb_rating )
#rescale critics and audience scores from 0-100 to the 0-10 IMDB scale
years$critics <- years$critics * 0.1
years$audience <- years$audience * 0.1
years.m <- melt(years, id = c("year")) #reshape to long format: a 'variable' factor (critics, audience, imdb) and a 'value' column
years.s <- years.m %>%
group_by(year, variable) %>%
summarize(score = mean(value))
qplot(year, score, data = years.s,
geom ="line", color = variable,
xlab = "Year", ylab = "Rating (0 - 10)",
main = "Public's Movie Responses Across the Years (1970-2014)"
) + theme(plot.title = element_text(hjust=0.5))
The year-to-year ups and downs behave similarly for each type of public response: critics and audience (both Rotten Tomatoes) and the IMDB system. However, differences in rating levels across the categories are evident: critics' opinions sit well apart from the audience's and IMDB's. Interestingly, the IMDB and Rotten Tomatoes audience series appear closely related, in both level and trend.
Now we proceed to answer our inference research question:

Does there appear to be a difference in the public's ratings between the Rotten Tomatoes and IMDB systems?
ggplot(years.m, aes(x = variable, y = value)) +
  geom_boxplot(fill = "darkseagreen1") + xlab("Rating System") + ylab("Rating (0 - 10)") +
  theme(plot.title = element_text(hjust = 0.5, size = 15, face = "italic")) +
  labs(title = "Public's Movie Responses (Rotten Tomatoes / IMDB)")
In general, there appears to be a slight difference across the systems. The medians for both the Rotten Tomatoes audience and IMDB are close to 6.5; this is not the case for the Rotten Tomatoes critics, whose median is lower by roughly 0.4 points. The IMDB ratings are the most consistent (the narrowest box), though several low outliers fall below the lower whisker at about 5.9. The critics' scores show by far the widest variation, with the middle 50% spanning roughly 3.3 to 8.3.
ANOVA TEST FOR ASSESSING DIFFERENCES ACROSS THE GROUPS. To answer our inference question, we run an ANOVA test to establish whether the differences in mean rating across the systems are significant or simply due to random chance:

H0 (null hypothesis): the differences are due to sampling variability (there is nothing going on; the group means are equal)

HA (alternative hypothesis): at least one rating system's mean differs from the others
years.m %>%
group_by(variable) %>%
summarize(score = mean(value))
## # A tibble: 3 x 2
## variable score
## <fct> <dbl>
## 1 critics 5.77
## 2 audience 6.24
## 3 imdb 6.49
fit <- aov(value ~ variable, data = years.m)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## variable 2 176 87.78 19.75 3.22e-09 ***
## Residuals 1950 8667 4.44
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on these results, we have strong evidence against the null hypothesis (the p-value of 3.22e-09 is well below 0.05). We therefore reject H0 in favor of the alternative hypothesis HA: the mean rating differs significantly across the rating systems.
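Since the ANOVA only tells us that at least one group mean differs, a natural follow-up (not part of the original analysis) is Tukey's Honest Significant Differences test on the same aov fit, which reports which pairs of systems differ:

#Post-hoc pairwise comparisons between the three rating systems
TukeyHSD(fit)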
From this first EDA and inferential assessment we conclude that the movies dataset comes from a randomized sampling process, as we can clearly see variation across categories such as genre and studio. The Rotten Tomatoes critics/audience scores and the IMDB rating use measurably different scales, and that distinction suggests they should fit well as inputs when constructing a regression model for rating prediction.
First, we create a new data frame containing the numerical rating values and some categorical variables plausibly related to the rating response. Checking for NAs is also a relevant task, since missing values could cause errors when fitting the model.
movies.model <- movies %>%
select (critics_score, audience_score,imdb_rating, runtime, imdb_num_votes,
title_type, genre, mpaa_rating, critics_rating, audience_rating,
best_pic_nom, best_actor_win, best_actress_win, best_dir_win, top200_box)
#checking for NA's
movies.model[rowSums(is.na(movies.model[,1:5])) >= 1,]
## # A tibble: 1 x 15
## critics_score audience_score imdb_rating runtime imdb_num_votes
## <dbl> <dbl> <dbl> <dbl> <int>
## 1 80 72 7.5 NA 739
## # ... with 10 more variables: title_type <fct>, genre <fct>,
## # mpaa_rating <fct>, critics_rating <fct>, audience_rating <fct>,
## # best_pic_nom <fct>, best_actor_win <fct>, best_actress_win <fct>,
## # best_dir_win <fct>, top200_box <fct>
#this missing value corresponds to row 334
movies[334, ] #the movie is called "The End of America" and is 73 minutes long
## # A tibble: 1 x 32
## title title_type genre runtime mpaa_rating studio thtr_rel_year
## <chr> <fct> <fct> <dbl> <fct> <fct> <dbl>
## 1 The End of~ Documentary Docume~ NA Unrated Indip~ 2008
## # ... with 25 more variables: thtr_rel_month <dbl>, thtr_rel_day <dbl>,
## # dvd_rel_year <dbl>, dvd_rel_month <dbl>, dvd_rel_day <dbl>,
## # imdb_rating <dbl>, imdb_num_votes <int>, critics_rating <fct>,
## # critics_score <dbl>, audience_rating <fct>, audience_score <dbl>,
## # best_pic_nom <fct>, best_pic_win <fct>, best_actor_win <fct>,
## # best_actress_win <fct>, best_dir_win <fct>, top200_box <fct>,
## # director <chr>, actor1 <chr>, actor2 <chr>, actor3 <chr>,
## # actor4 <chr>, actor5 <chr>, imdb_url <chr>, rt_url <chr>
#https://en.wikipedia.org/wiki/The_End_of_America_(film)
#we replace the missing value:
movies.model[334, 4] <- 73
First of all, we should get to know the relationships between our selected numerical variables. A scatterplot matrix does this well:
pairs.panels(movies.model[c("critics_score", "audience_score", "imdb_rating", "runtime", "imdb_num_votes")])
We are particularly interested in the rating (IMDB) and score (Rotten Tomatoes) values. One of these variables should be selected as our main outcome, and the remaining variables assessed for how they relate to it. Based on the scatter plots, there appears to be a roughly linear relationship between all the variables.

The imdb_rating variable appears closer to normally distributed than the others, though with some left skew. A strong relationship between the predictors (critics_score / audience_score) and the outcome is apparent, but we still need to assess this conclusion with statistical significance methods (including the rest of the predictors).
One thing to take into account is whether high correlation exists between predictors, because this could lead to multicollinearity problems. Scanning the scatter plots, some variables appear correlated, such as critics_score and audience_score.

Obtaining the correlation coefficients makes this concrete:
corr <- cor(movies.model[c("critics_score", "audience_score", "imdb_rating", "runtime", "imdb_num_votes")])
kable(corr, format = 'markdown')
| | critics_score | audience_score | imdb_rating | runtime | imdb_num_votes |
|---|---|---|---|---|---|
| critics_score | 1.0000000 | 0.7042762 | 0.7650355 | 0.1700033 | 0.2092508 |
| audience_score | 0.7042762 | 1.0000000 | 0.8648652 | 0.1793002 | 0.2898128 |
| imdb_rating | 0.7650355 | 0.8648652 | 1.0000000 | 0.2650698 | 0.3311525 |
| runtime | 0.1700033 | 0.1793002 | 0.2650698 | 1.0000000 | 0.3477014 |
| imdb_num_votes | 0.2092508 | 0.2898128 | 0.3311525 | 0.3477014 | 1.0000000 |
As we can see, we need to assess whether both critics_score and audience_score belong in our predictions, since the correlation between them is about 0.70.
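One common way to quantify this concern is the variance inflation factor; a sketch follows, assuming the car package (not loaded above) is installed:

#Variance inflation factors for the numerical predictors (assumes the 'car' package)
library(car)
vif(lm(imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes,
       data = movies.model)) #values above ~5 would suggest problematic collinearity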
fit.movies.1 <- lm(imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes +
title_type + genre + mpaa_rating + critics_rating + audience_rating +
best_pic_nom + best_actor_win + best_actress_win + best_dir_win + top200_box,
data = movies.model)
summary(fit.movies.1)
##
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score + runtime +
## imdb_num_votes + title_type + genre + mpaa_rating + critics_rating +
## audience_rating + best_pic_nom + best_actor_win + best_actress_win +
## best_dir_win + top200_box, data = movies.model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.35280 -0.18251 0.02743 0.25048 1.10595
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.860e+00 2.694e-01 10.616 < 2e-16 ***
## critics_score 1.478e-02 1.531e-03 9.653 < 2e-16 ***
## audience_score 3.968e-02 2.062e-03 19.246 < 2e-16 ***
## runtime 3.335e-03 1.097e-03 3.040 0.002467 **
## imdb_num_votes 9.580e-07 2.038e-07 4.699 3.22e-06 ***
## title_typeFeature Film -1.066e-01 1.675e-01 -0.636 0.524737
## title_typeTV Movie -3.038e-01 2.623e-01 -1.158 0.247238
## genreAnimation -3.614e-01 1.761e-01 -2.052 0.040550 *
## genreArt House & International 3.328e-01 1.366e-01 2.437 0.015093 *
## genreComedy -1.275e-01 7.513e-02 -1.697 0.090281 .
## genreDocumentary 2.763e-01 1.792e-01 1.542 0.123614
## genreDrama 1.027e-01 6.640e-02 1.547 0.122476
## genreHorror 6.827e-02 1.123e-01 0.608 0.543307
## genreMusical & Performing Arts 6.816e-02 1.543e-01 0.442 0.658834
## genreMystery & Suspense 2.336e-01 8.455e-02 2.763 0.005896 **
## genreOther -1.089e-02 1.277e-01 -0.085 0.932051
## genreScience Fiction & Fantasy -1.587e-01 1.599e-01 -0.992 0.321349
## mpaa_ratingNC-17 -1.339e-01 3.400e-01 -0.394 0.693828
## mpaa_ratingPG -1.318e-01 1.238e-01 -1.065 0.287288
## mpaa_ratingPG-13 -1.074e-01 1.281e-01 -0.838 0.402345
## mpaa_ratingR -5.411e-02 1.237e-01 -0.437 0.661988
## mpaa_ratingUnrated -1.257e-01 1.412e-01 -0.890 0.373864
## critics_ratingFresh 8.668e-02 5.628e-02 1.540 0.124037
## critics_ratingRotten 3.518e-01 9.000e-02 3.909 0.000103 ***
## audience_ratingUpright -3.421e-01 7.202e-02 -4.751 2.52e-06 ***
## best_pic_nomyes -1.065e-01 1.082e-01 -0.984 0.325354
## best_actor_winyes 2.955e-02 5.328e-02 0.555 0.579346
## best_actress_winyes 5.876e-02 5.911e-02 0.994 0.320506
## best_dir_winyes 3.214e-02 7.436e-02 0.432 0.665717
## top200_boxyes -9.861e-02 1.257e-01 -0.784 0.433099
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4464 on 621 degrees of freedom
## Multiple R-squared: 0.8382, Adjusted R-squared: 0.8306
## F-statistic: 110.9 on 29 and 621 DF, p-value: < 2.2e-16
We see some interesting statistical significance among our numerical predictors, whose p-values are below 5%. The model also shows an adjusted R-squared of 83% (which penalizes the inclusion of extra predictors), which may represent a good approximation; however, the R output alone does not address issues such as correlated predictors and selection error that could leave us with a misspecified model. Indeed, several factor predictors appear non-significant as well.
The second step is a STEPWISE procedure (forward and backward) to determine which predictors should stay and which should be dropped. The assessment selects the model that minimizes the AIC (Akaike Information Criterion, an estimator of the relative quality of statistical models):

AIC = 2k - 2 ln(L)

where k is the number of estimated parameters and L is the maximized value of the model's likelihood function.
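For a fitted lm, the AIC can be obtained directly. Note that stepAIC() reports extractAIC(), which drops additive constants, so its values differ from AIC() by a fixed offset:

AIC(fit.movies.1) #full-likelihood AIC
extractAIC(fit.movies.1) #(edf, AIC up to a constant) -- the scale stepAIC() prints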
#STEPWISE PROCESS
step.movies <- stepAIC(fit.movies.1, direction ="both")
## Start: AIC=-1020.8
## imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes +
## title_type + genre + mpaa_rating + critics_rating + audience_rating +
## best_pic_nom + best_actor_win + best_actress_win + best_dir_win +
## top200_box
##
## Df Sum of Sq RSS AIC
## - mpaa_rating 5 0.761 124.52 -1026.81
## - title_type 2 0.269 124.02 -1023.39
## - best_dir_win 1 0.037 123.79 -1022.61
## - best_actor_win 1 0.061 123.81 -1022.48
## - top200_box 1 0.123 123.88 -1022.16
## - best_pic_nom 1 0.193 123.95 -1021.79
## - best_actress_win 1 0.197 123.95 -1021.77
## <none> 123.75 -1020.80
## - runtime 1 1.841 125.59 -1013.19
## - critics_rating 2 3.213 126.97 -1008.12
## - genre 10 7.685 131.44 -1001.58
## - imdb_num_votes 1 4.401 128.16 -1000.05
## - audience_rating 1 4.498 128.25 -999.56
## - critics_score 1 18.571 142.32 -931.78
## - audience_score 1 73.818 197.57 -718.26
##
## Step: AIC=-1026.81
## imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes +
## title_type + genre + critics_rating + audience_rating + best_pic_nom +
## best_actor_win + best_actress_win + best_dir_win + top200_box
##
## Df Sum of Sq RSS AIC
## - title_type 2 0.249 124.76 -1029.51
## - best_dir_win 1 0.041 124.56 -1028.60
## - best_actor_win 1 0.044 124.56 -1028.58
## - top200_box 1 0.159 124.67 -1027.98
## - best_actress_win 1 0.170 124.69 -1027.92
## - best_pic_nom 1 0.221 124.74 -1027.65
## <none> 124.52 -1026.81
## + mpaa_rating 5 0.761 123.75 -1020.80
## - runtime 1 1.734 126.25 -1019.81
## - critics_rating 2 3.143 127.66 -1014.58
## - audience_rating 1 4.486 129.00 -1005.77
## - imdb_num_votes 1 4.762 129.28 -1004.38
## - genre 10 9.064 133.58 -1001.07
## - critics_score 1 18.788 143.30 -937.32
## - audience_score 1 74.253 198.77 -724.33
##
## Step: AIC=-1029.51
## imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes +
## genre + critics_rating + audience_rating + best_pic_nom +
## best_actor_win + best_actress_win + best_dir_win + top200_box
##
## Df Sum of Sq RSS AIC
## - best_dir_win 1 0.040 124.80 -1031.30
## - best_actor_win 1 0.050 124.81 -1031.25
## - best_actress_win 1 0.159 124.92 -1030.68
## - top200_box 1 0.161 124.92 -1030.67
## - best_pic_nom 1 0.221 124.98 -1030.36
## <none> 124.76 -1029.51
## + title_type 2 0.249 124.52 -1026.81
## + mpaa_rating 5 0.741 124.02 -1023.39
## - runtime 1 1.759 126.52 -1022.39
## - critics_rating 2 3.169 127.93 -1017.18
## - audience_rating 1 4.543 129.31 -1008.23
## - imdb_num_votes 1 4.749 129.51 -1007.19
## - genre 10 11.296 136.06 -993.08
## - critics_score 1 19.225 143.99 -938.21
## - audience_score 1 74.817 199.58 -725.67
##
## Step: AIC=-1031.3
## imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes +
## genre + critics_rating + audience_rating + best_pic_nom +
## best_actor_win + best_actress_win + top200_box
##
## Df Sum of Sq RSS AIC
## - best_actor_win 1 0.053 124.86 -1033.03
## - best_actress_win 1 0.162 124.97 -1032.46
## - top200_box 1 0.165 124.97 -1032.44
## - best_pic_nom 1 0.211 125.01 -1032.20
## <none> 124.80 -1031.30
## + best_dir_win 1 0.040 124.76 -1029.51
## + title_type 2 0.248 124.56 -1028.60
## + mpaa_rating 5 0.745 124.06 -1025.20
## - runtime 1 1.892 126.69 -1023.51
## - critics_rating 2 3.155 127.96 -1019.05
## - audience_rating 1 4.580 129.38 -1009.84
## - imdb_num_votes 1 4.844 129.65 -1008.52
## - genre 10 11.259 136.06 -995.07
## - critics_score 1 19.304 144.11 -939.68
## - audience_score 1 74.896 199.70 -727.28
##
## Step: AIC=-1033.03
## imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes +
## genre + critics_rating + audience_rating + best_pic_nom +
## best_actress_win + top200_box
##
## Df Sum of Sq RSS AIC
## - top200_box 1 0.159 125.02 -1034.20
## - best_actress_win 1 0.172 125.03 -1034.13
## - best_pic_nom 1 0.192 125.05 -1034.03
## <none> 124.86 -1033.03
## + best_actor_win 1 0.053 124.80 -1031.30
## + best_dir_win 1 0.042 124.81 -1031.25
## + title_type 2 0.254 124.60 -1030.35
## + mpaa_rating 5 0.729 124.13 -1026.84
## - runtime 1 2.091 126.95 -1024.21
## - critics_rating 2 3.157 128.01 -1020.77
## - audience_rating 1 4.644 129.50 -1011.25
## - imdb_num_votes 1 4.813 129.67 -1010.40
## - genre 10 11.281 136.14 -996.72
## - critics_score 1 19.320 144.18 -941.36
## - audience_score 1 75.027 199.88 -728.69
##
## Step: AIC=-1034.2
## imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes +
## genre + critics_rating + audience_rating + best_pic_nom +
## best_actress_win
##
## Df Sum of Sq RSS AIC
## - best_actress_win 1 0.155 125.17 -1035.39
## - best_pic_nom 1 0.183 125.20 -1035.25
## <none> 125.02 -1034.20
## + top200_box 1 0.159 124.86 -1033.03
## + best_actor_win 1 0.047 124.97 -1032.44
## + best_dir_win 1 0.047 124.97 -1032.44
## + title_type 2 0.255 124.76 -1031.53
## + mpaa_rating 5 0.763 124.25 -1028.19
## - runtime 1 2.047 127.06 -1025.62
## - critics_rating 2 3.174 128.19 -1021.88
## - imdb_num_votes 1 4.655 129.67 -1012.40
## - audience_rating 1 4.713 129.73 -1012.11
## - genre 10 11.456 136.47 -997.12
## - critics_score 1 19.229 144.24 -943.06
## - audience_score 1 75.349 200.36 -729.12
##
## Step: AIC=-1035.39
## imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes +
## genre + critics_rating + audience_rating + best_pic_nom
##
## Df Sum of Sq RSS AIC
## - best_pic_nom 1 0.142 125.31 -1036.65
## <none> 125.17 -1035.39
## + best_actress_win 1 0.155 125.02 -1034.20
## + top200_box 1 0.142 125.03 -1034.13
## + best_actor_win 1 0.057 125.11 -1033.69
## + best_dir_win 1 0.050 125.12 -1033.65
## + title_type 2 0.243 124.93 -1032.66
## + mpaa_rating 5 0.734 124.44 -1029.22
## - runtime 1 2.233 127.40 -1025.88
## - critics_rating 2 3.214 128.38 -1022.89
## - imdb_num_votes 1 4.680 129.85 -1013.49
## - audience_rating 1 4.697 129.87 -1013.41
## - genre 10 11.489 136.66 -998.22
## - critics_score 1 19.436 144.61 -943.43
## - audience_score 1 75.194 200.36 -731.12
##
## Step: AIC=-1036.65
## imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes +
## genre + critics_rating + audience_rating
##
## Df Sum of Sq RSS AIC
## <none> 125.31 -1036.65
## + best_pic_nom 1 0.142 125.17 -1035.39
## + top200_box 1 0.136 125.18 -1035.36
## + best_actress_win 1 0.115 125.20 -1035.25
## + best_dir_win 1 0.039 125.27 -1034.86
## + best_actor_win 1 0.038 125.28 -1034.85
## + title_type 2 0.243 125.07 -1033.92
## + mpaa_rating 5 0.760 124.55 -1030.61
## - runtime 1 2.115 127.43 -1027.75
## - critics_rating 2 3.252 128.56 -1023.98
## - imdb_num_votes 1 4.539 129.85 -1015.49
## - audience_rating 1 4.635 129.95 -1015.01
## - genre 10 11.562 136.87 -999.20
## - critics_score 1 19.370 144.68 -945.08
## - audience_score 1 75.089 200.40 -733.00
print(step.movies$anova)
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes +
## title_type + genre + mpaa_rating + critics_rating + audience_rating +
## best_pic_nom + best_actor_win + best_actress_win + best_dir_win +
## top200_box
##
## Final Model:
## imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes +
## genre + critics_rating + audience_rating
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 621 123.7535 -1020.802
## 2 - mpaa_rating 5 0.76108763 626 124.5146 -1026.810
## 3 - title_type 2 0.24901821 628 124.7637 -1029.510
## 4 - best_dir_win 1 0.03963448 629 124.8033 -1031.303
## 5 - best_actor_win 1 0.05268222 630 124.8560 -1033.028
## 6 - top200_box 1 0.15904131 631 125.0150 -1034.199
## 7 - best_actress_win 1 0.15544337 632 125.1705 -1035.390
## 8 - best_pic_nom 1 0.14212696 633 125.3126 -1036.652
The output shows that the residual deviance is reduced by eliminating predictors that do not add value to the model. The final model is now:

imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes + genre + critics_rating + audience_rating
fit.movies.2 <- lm(imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes + genre + critics_rating + audience_rating, data = movies.model)
summary(fit.movies.2)
##
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score + runtime +
## imdb_num_votes + genre + critics_rating + audience_rating,
## data = movies.model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.33507 -0.18181 0.02534 0.25007 1.08827
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.662e+00 1.688e-01 15.770 < 2e-16 ***
## critics_score 1.482e-02 1.498e-03 9.892 < 2e-16 ***
## audience_score 3.971e-02 2.039e-03 19.476 < 2e-16 ***
## runtime 3.340e-03 1.022e-03 3.269 0.001138 **
## imdb_num_votes 9.242e-07 1.930e-07 4.788 2.10e-06 ***
## genreAnimation -3.046e-01 1.598e-01 -1.906 0.057130 .
## genreArt House & International 3.423e-01 1.329e-01 2.575 0.010239 *
## genreComedy -1.190e-01 7.338e-02 -1.622 0.105401
## genreDocumentary 3.596e-01 9.256e-02 3.885 0.000113 ***
## genreDrama 1.199e-01 6.342e-02 1.891 0.059050 .
## genreHorror 9.174e-02 1.092e-01 0.840 0.401043
## genreMusical & Performing Arts 1.076e-01 1.443e-01 0.746 0.456123
## genreMystery & Suspense 2.729e-01 8.113e-02 3.364 0.000816 ***
## genreOther -3.482e-02 1.255e-01 -0.277 0.781581
## genreScience Fiction & Fantasy -1.548e-01 1.590e-01 -0.974 0.330618
## critics_ratingFresh 8.685e-02 5.534e-02 1.569 0.117035
## critics_ratingRotten 3.518e-01 8.900e-02 3.952 8.61e-05 ***
## audience_ratingUpright -3.451e-01 7.132e-02 -4.839 1.64e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4449 on 633 degrees of freedom
## Multiple R-squared: 0.8362, Adjusted R-squared: 0.8318
## F-statistic: 190 on 17 and 633 DF, p-value: < 2.2e-16
Adjusted R-squared is now 83.18%, and several factor levels are significant.

All else held constant, a 1-point increase in critics score is associated with a 0.0148-point increase in IMDB rating.

All else held constant, a 1-point increase in audience score is associated with a 0.0397-point increase in IMDB rating.

All else held constant, each additional minute of runtime is associated with a 0.0033-point increase in IMDB rating.
Standardizing the slopes into beta values (Z-scores) puts all the numerical variables on the same scale, giving a better view of how strongly each predictor (scores, runtime, number of votes) relates to the outcome:
means <- sapply(movies.model[,c("critics_score", "audience_score", "runtime", "imdb_num_votes")],mean)
stdev <- sapply(movies.model[,c("critics_score", "audience_score", "runtime", "imdb_num_votes")],sd)
movies.2.scaled <- as.data.frame(scale(movies.model[,c("critics_score", "audience_score", "runtime", "imdb_num_votes")],center=means,scale=stdev))
movies.2.scaled <- movies.2.scaled %>%
mutate(imdb_rating = movies.model$imdb_rating, genre =
movies.model$genre, critics_rating = movies.model$critics_rating,
audience_rating = movies.model$audience_rating )
fit.movies.2.scaled <- lm(imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes + genre + critics_rating + audience_rating, data = movies.2.scaled)
summary(fit.movies.2.scaled)
##
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score + runtime +
## imdb_num_votes + genre + critics_rating + audience_rating,
## data = movies.2.scaled)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.33507 -0.18181 0.02534 0.25007 1.08827
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.39951 0.09142 69.998 < 2e-16 ***
## critics_score 0.42097 0.04256 9.892 < 2e-16 ***
## audience_score 0.80305 0.04123 19.476 < 2e-16 ***
## runtime 0.06504 0.01990 3.269 0.001138 **
## imdb_num_votes 0.10362 0.02164 4.788 2.10e-06 ***
## genreAnimation -0.30456 0.15981 -1.906 0.057130 .
## genreArt House & International 0.34230 0.13291 2.575 0.010239 *
## genreComedy -0.11898 0.07338 -1.622 0.105401
## genreDocumentary 0.35958 0.09256 3.885 0.000113 ***
## genreDrama 0.11995 0.06342 1.891 0.059050 .
## genreHorror 0.09174 0.10918 0.840 0.401043
## genreMusical & Performing Arts 0.10763 0.14433 0.746 0.456123
## genreMystery & Suspense 0.27288 0.08113 3.364 0.000816 ***
## genreOther -0.03482 0.12553 -0.277 0.781581
## genreScience Fiction & Fantasy -0.15481 0.15901 -0.974 0.330618
## critics_ratingFresh 0.08685 0.05534 1.569 0.117035
## critics_ratingRotten 0.35178 0.08900 3.952 8.61e-05 ***
## audience_ratingUpright -0.34510 0.07132 -4.839 1.64e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4449 on 633 degrees of freedom
## Multiple R-squared: 0.8362, Adjusted R-squared: 0.8318
## F-statistic: 190 on 17 and 633 DF, p-value: < 2.2e-16
Now we see an interesting pattern. With standardized predictors, each increase of one standard deviation in audience score changes the predicted IMDB rating by 0.80 points, holding the other variables constant; only a slight effect is explained by the runtime predictor (0.065 points per standard deviation).
par(mfrow = c(2,2))
plot(fit.movies.2, main = "MODEL 2")
We must take into account that a regression model relies on assumptions about the data: linearity; normality of the residuals; no severe multicollinearity; and homogeneity of the residual variance (constant variance, or homoscedasticity). The four diagnostic plots above are Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage.

Ideally, the residuals-vs-fitted line should be roughly horizontal, meaning little systematic distance between the residuals and the fitted values. In general, it looks like a good model.

The residuals appear nearly normally distributed.

By standardizing the residuals (Scale-Location plot), we can check for heteroscedasticity. Ideally, the fitted line is horizontal and the spread does not grow across fitted values, which appears to be the case.

Using Cook's distance, we can identify leverage points and potential outliers; none appear to unduly affect our model.
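To make that check explicit, a quick sketch using the common 4/n rule of thumb (a heuristic, not the only possible cutoff):

#Flag potentially influential observations via Cook's distance
cooks <- cooks.distance(fit.movies.2)
which(cooks > 4 / nrow(movies.model)) #candidate high-influence rows, if any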
**We can even test whether some predictor variables interact. Are the critics and audience ratings (Fresh / Rotten / Spilled / Upright) somehow related to the IMDB rating (0 - 10)?**
fit.movies.3 <- lm(imdb_rating ~ critics_score * critics_rating + audience_score * audience_rating + runtime + imdb_num_votes + genre, data = movies.model)
plot(allEffects(fit.movies.3), multiline = TRUE)
We can clearly see that when the critics_score and critics_rating predictors interact, the predicted IMDB rating changes depending on the Rotten or Fresh qualification. There does not seem to be a comparable change when the audience rating is Spilled versus Upright.

Running an ANOVA test, we can compare models 2 and 3: model 3 is significantly better than model 2, and the residual sum of squares (RSS) is also reduced.
anova(fit.movies.2, fit.movies.3)
## Analysis of Variance Table
##
## Model 1: imdb_rating ~ critics_score + audience_score + runtime + imdb_num_votes +
## genre + critics_rating + audience_rating
## Model 2: imdb_rating ~ critics_score * critics_rating + audience_score *
## audience_rating + runtime + imdb_num_votes + genre
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 633 125.31
## 2 630 119.75 3 5.567 9.763 2.651e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
By building a box-plot (a sketch follows below) and fitting a new model, we can clearly see that critics and audiences in general rate movies differently according to genre. Again, the ANOVA comparison shows the RSS is reduced, and model 3 apparently performs better than the second model.
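The box-plot itself is not reproduced in this report; a minimal sketch of the kind of plot described (using audience score; critics_score would work the same way) is:

#Sketch: audience score distribution by genre
ggplot(movies.model, aes(x = genre, y = audience_score)) +
  geom_boxplot(fill = "darkseagreen1") + coord_flip() +
  xlab("Genre") + ylab("Audience Score")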
Comparing the models' predicted values against the actual values, they all seem to fit a linear regression for predicting the IMDB rating about equally well:
par(mfrow = c(1,3))
plot(predict(fit.movies.2), movies$imdb_rating, col = "Blue", cex = 0.6,
xlab="predicted",ylab="actual", main = "MODEL 2")
plot(predict(fit.movies.3), movies$imdb_rating, col = "Red", cex = 0.6,
xlab="predicted",ylab="actual", main = "MODEL 3")
RSS: Residual Sum of Squares; R2: R-Squared; RMSE: Root Mean Square Error; MAE: Mean Absolute Error
m1 <- data.frame(
MODEL = "MODEL 1",
RSS = sum(resid(fit.movies.1)^2),
R2 = rsquare(fit.movies.1, data = movies.model ),
RMSE = rmse(fit.movies.1, data = movies.model),
MAE = mae(fit.movies.1, data = movies.model) )
m2 <- data.frame(
MODEL = "MODEL 2",
RSS = sum(resid(fit.movies.2)^2),
R2 = rsquare(fit.movies.2, data = movies.model ),
RMSE = rmse(fit.movies.2, data = movies.model),
MAE = mae(fit.movies.2, data = movies.model) )
m3 <- data.frame(
MODEL = "MODEL 3",
RSS = sum(resid(fit.movies.3)^2),
R2 = rsquare(fit.movies.3, data = movies.model ),
RMSE = rmse(fit.movies.3, data = movies.model),
MAE = mae(fit.movies.3, data = movies.model) )
metrics <- rbind(m1,m2,m3)
kable(metrics, format = 'markdown')
| MODEL | RSS | R2 | RMSE | MAE |
|---|---|---|---|---|
| MODEL 1 | 123.7535 | 0.8381966 | 0.4360018 | 0.3044807 |
| MODEL 2 | 125.3126 | 0.8361582 | 0.4387396 | 0.3067051 |
| MODEL 3 | 119.7456 | 0.8434369 | 0.4288834 | 0.2984583 |
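As a quick sanity check, on the training data modelr's rmse() should match the root mean square of the model residuals:

#Manual RMSE for model 2; should equal rmse(fit.movies.2, data = movies.model)
sqrt(mean(residuals(fit.movies.2)^2))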
Our first attempt at constructing the model included many variables in search of the best-fitting regression. Our main goal, however, is the best model that keeps the number of predictors to a minimum. Though RMSE and MAE stay at roughly the same level across models, the RSS is lowest for the last model.
For this part we have been asked to pick movies from 2016 (not in the sample), predict their ratings using the models developed, and quantify the uncertainty with a prediction interval.

We collected 5 movies from the websites:

https://www.rottentomatoes.com/about/ https://www.imdb.com/

The data has been divided in two parts: TRAIN DATA (our movies dataset) and TEST DATA (a data frame containing the newly sampled movies).
#train data
movies.df <- movies.model %>%
select(imdb_rating, critics_score, audience_score, runtime,
imdb_num_votes, genre, critics_rating, audience_rating )
#test data
movies2016 <- read.csv("movies2016.csv", header = TRUE)
kable(movies2016, format = 'markdown')
| movie | imdb_rating | critics_score | audience_score | runtime | imdb_num_votes | genre | critics_rating | audience_rating |
|---|---|---|---|---|---|---|---|---|
| Deadpool (2016) | 8.0 | 83 | 90 | 108 | 159894 | Action & Adventure | Certified Fresh | Upright |
| Billions (2016) | 8.4 | 85 | 88 | 60 | 42259 | Drama | Fresh | Upright |
| Moana (2016) | 7.6 | 96 | 89 | 107 | 190286 | Animation | Certified Fresh | Upright |
| Mothers Day (2016) | 5.6 | 6 | 44 | 118 | 25471 | Comedy | Rotten | Spilled |
| Rogue One (2016) | 7.8 | 85 | 87 | 133 | 424795 | Action & Adventure | Certified Fresh | Upright |
pred.2 <- cbind(actual = movies2016$imdb_rating, predict(fit.movies.2, newdata = movies2016, interval="prediction", level = 0.95))
kable(cbind(as.character(movies2016$movie), pred.2, error = movies2016$imdb_rating - pred.2[,2]), format = 'markdown')
| movie | actual | fit | lwr | upr | error |
|---|---|---|---|---|---|
| Deadpool (2016) | 8.0 | 7.6291 | 6.7438 | 8.5144 | 0.3709 |
| Billions (2016) | 8.4 | 7.5171 | 6.6332 | 8.4010 | 0.8829 |
| Moana (2016) | 7.6 | 7.5023 | 6.5776 | 8.4269 | 0.0977 |
| Mothers Day (2016) | 5.6 | 5.1482 | 4.2651 | 6.0314 | 0.4518 |
| Rogue One (2016) | 7.8 | 7.8679 | 6.9799 | 8.7560 | -0.0679 |
pred.3 <- cbind(actual = movies2016$imdb_rating, predict(fit.movies.3, newdata = movies2016, interval="prediction", level = 0.95))
kable(cbind(as.character(movies2016$movie), pred.3, error = movies2016$imdb_rating - pred.3[,2]), format = 'markdown')
| movie | actual | fit | lwr | upr | error |
|---|---|---|---|---|---|
| Deadpool (2016) | 8.0 | 7.6800 | 6.8097 | 8.5504 | 0.3200 |
| Billions (2016) | 8.4 | 7.4424 | 6.5744 | 8.3103 | 0.9576 |
| Moana (2016) | 7.6 | 7.3911 | 6.4783 | 8.3038 | 0.2089 |
| Mothers Day (2016) | 5.6 | 5.0172 | 4.1502 | 5.8841 | 0.5828 |
| Rogue One (2016) | 7.8 | 7.8680 | 6.9972 | 8.7387 | -0.0680 |
Both models 2 and 3 perform well when predicting Rogue One (2016) and Deadpool (2016), both classified as Action & Adventure. The prediction for the drama (Billions) is better under the second model, and the same holds for the movies categorized as Animation and Comedy.

In general, both models produce 95% prediction intervals, and the actual IMDB ratings are captured well within those intervals.
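We can verify this coverage directly by checking whether each actual rating falls inside its interval (shown here for model 2; pred.3 works the same way):

#Proportion of 2016 movies whose actual rating falls inside model 2's 95% interval
covered <- movies2016$imdb_rating >= pred.2[, "lwr"] &
  movies2016$imdb_rating <= pred.2[, "upr"]
mean(covered)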
Through the exploratory data analysis we saw that the public's preferences are reflected in the dataset. This assertion was supported by analyzing categories such as genre, MPAA rating and studio.

The IMDB and Rotten Tomatoes rating systems seem to follow the same pattern of behavior across the years, but they differ markedly in rating levels, as shown by statistical inference. The type of public participating on IMDB is not stated; however, the Rotten Tomatoes audience rating closely approximates the IMDB rating.
Two models were developed:
IMDB_Rating ~ Critics_Score + Audience_Score + Runtime + IMDB_num_votes + Genre + Critics_Rating + Audience_Rating

IMDB_Rating ~ Critics_Score * Critics_Rating + Audience_Score * Audience_Rating + Runtime + IMDB_num_votes + Genre
Both predict the IMDB rating score, with 95% prediction intervals, for the sampled movies obtained from the Rotten Tomatoes and IMDB websites.