In this project we developed a multiple linear regression model that explains the factors that make a movie popular and helps predict movie popularity. The data in the dataset was collected from Rotten Tomatoes and IMDb.
library(ggplot2)
library(dplyr)
library(caret)
library(statsr)
library(gridExtra)
library(GGally)
library(ggthemes)
library(BAS)
library(lubridate)
library(reshape2)
dim(movies)
## [1] 651 32
str(movies)
## tibble [651 x 32] (S3: tbl_df/tbl/data.frame)
## $ title : chr [1:651] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ thtr_rel_year : num [1:651] 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num [1:651] 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num [1:651] 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num [1:651] 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num [1:651] 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num [1:651] 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ director : chr [1:651] "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr [1:651] "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr [1:651] "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr [1:651] "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr [1:651] "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr [1:651] "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr [1:651] "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr [1:651] "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
We can see that the dataset provided consists of 651 rows and 32 columns.
Each row is a record for a movie and each column is an attribute of that movie.
To test the model's predictions, we set aside one row of the dataset from the rest of the data used to build the model; it will be used later for our prediction. We will call the single held-out observation our testing set and the remaining bulk of the data our training set.
set.seed(3999)
# using caret package
inTrain <- createDataPartition(y=movies$imdb_rating, p=0.994, list=FALSE)
training <- movies[inTrain,]
testing <- movies[-inTrain,]
dim(training)
## [1] 650 32
dim(testing)
## [1] 1 32
summary(training)
## title title_type genre runtime
## Length:650 Documentary : 55 Drama :304 Min. : 39.0
## Class :character Feature Film:590 Comedy : 87 1st Qu.: 92.0
## Mode :character TV Movie : 5 Action & Adventure: 65 Median :103.0
## Mystery & Suspense: 59 Mean :105.8
## Documentary : 52 3rd Qu.:116.0
## Horror : 23 Max. :267.0
## (Other) : 60 NA's :1
## mpaa_rating studio thtr_rel_year
## G : 19 Paramount Pictures : 36 Min. :1970
## NC-17 : 2 Warner Bros. Pictures : 30 1st Qu.:1990
## PG :118 Sony Pictures Home Entertainment: 27 Median :2000
## PG-13 :132 Universal Pictures : 23 Mean :1998
## R :329 Warner Home Video : 19 3rd Qu.:2007
## Unrated: 50 (Other) :507 Max. :2014
## NA's : 8
## thtr_rel_month thtr_rel_day dvd_rel_year dvd_rel_month dvd_rel_day
## Min. : 1.000 Min. : 1.0 Min. :1991 Min. : 1.00 Min. : 1.00
## 1st Qu.: 4.000 1st Qu.: 7.0 1st Qu.:2001 1st Qu.: 3.00 1st Qu.: 7.00
## Median : 7.000 Median :15.0 Median :2004 Median : 6.00 Median :15.00
## Mean : 6.746 Mean :14.4 Mean :2004 Mean : 6.34 Mean :14.99
## 3rd Qu.:10.000 3rd Qu.:21.0 3rd Qu.:2008 3rd Qu.: 9.00 3rd Qu.:23.00
## Max. :12.000 Max. :31.0 Max. :2015 Max. :12.00 Max. :31.00
## NA's :8 NA's :8 NA's :8
## imdb_rating imdb_num_votes critics_rating critics_score
## Min. :1.900 Min. : 180 Certified Fresh:135 Min. : 1.00
## 1st Qu.:5.900 1st Qu.: 4584 Fresh :209 1st Qu.: 33.00
## Median :6.600 Median : 15204 Rotten :306 Median : 61.00
## Mean :6.494 Mean : 57621 Mean : 57.73
## 3rd Qu.:7.300 3rd Qu.: 58484 3rd Qu.: 83.00
## Max. :9.000 Max. :893008 Max. :100.00
##
## audience_rating audience_score best_pic_nom best_pic_win best_actor_win
## Spilled:274 Min. :11.00 no :628 no :643 no :557
## Upright:376 1st Qu.:46.00 yes: 22 yes: 7 yes: 93
## Median :65.00
## Mean :62.42
## 3rd Qu.:80.00
## Max. :97.00
##
## best_actress_win best_dir_win top200_box director actor1
## no :578 no :607 no :635 Length:650 Length:650
## yes: 72 yes: 43 yes: 15 Class :character Class :character
## Mode :character Mode :character
##
##
##
##
## actor2 actor3 actor4 actor5
## Length:650 Length:650 Length:650 Length:650
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## imdb_url rt_url
## Length:650 Length:650
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
List of variables in the dataset provided:
Variable | Description | Data Type |
---|---|---|
title | Title of movie | character |
title_type | Type of movie (Documentary, Feature Film, TV Movie…) | factor |
genre | Genre of movie (Action & Adventure, Comedy, Documentary, Drama…) | factor |
runtime | Runtime of movie (in minutes) | numeric |
mpaa_rating | MPAA rating of the movie (G, PG, PG-13, R, Unrated) | factor |
studio | Studio that produced the movie | factor |
thtr_rel_year | Year the movie is released in theaters | numeric |
thtr_rel_month | Month the movie is released in theaters | numeric |
thtr_rel_day | Day of the month the movie is released in theaters | numeric |
dvd_rel_year | Year the movie is released on DVD | numeric |
dvd_rel_month | Month the movie is released on DVD | numeric |
dvd_rel_day | Day of the month the movie is released on DVD | numeric |
imdb_rating | Rating on IMDB | numeric |
imdb_num_votes | Number of votes on IMDB | integer |
critics_rating | Critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten) | factor |
critics_score | Critics score on Rotten Tomatoes | numeric |
audience_rating | Audience rating on Rotten Tomatoes (Spilled, Upright) | factor |
audience_score | Audience score on Rotten Tomatoes | numeric |
best_pic_nom | Whether or not the movie was nominated for a best picture Oscar (no, yes) | factor |
best_pic_win | Whether or not the movie won a best picture Oscar (no, yes) | factor |
best_actor_win | Whether or not one of the main actors in the movie ever won an Oscar (no, yes) | factor |
best_actress_win | Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) | factor |
best_dir_win | Whether or not the director of the movie ever won an Oscar (no, yes) | factor |
top200_box | Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes) | factor |
director | Director of the movie | character |
actor1 | First main actor/actress in the abridged cast of the movie | character |
actor2 | Second main actor/actress in the abridged cast of the movie | character |
actor3 | Third main actor/actress in the abridged cast of the movie | character |
actor4 | Fourth main actor/actress in the abridged cast of the movie | character |
actor5 | Fifth main actor/actress in the abridged cast of the movie | character |
imdb_url | Link to IMDB page for the movie | character |
rt_url | Link to Rotten Tomatoes page for the movie | character |
The sample data is a random sample of movies released in the United States; the results of our analysis can therefore be generalized to the population of movies released in the US.
The characteristics of movies that form the variables in our dataset are derived from the websites Rotten Tomatoes and IMDb. Since voting and rating on these websites are voluntary, the data suffers from voluntary response bias: the opinions of those who chose not to vote or rate are not represented.
Random assignment was not performed. Causality therefore cannot be inferred from the results of our analysis.
Before answering any question, we should have some knowledge of the field. Before building a multiple linear regression model, we identified the variables that will clearly not be useful for the model. They are as follows (see the sketch after this list):
- imdb_url and rt_url: Uniform Resource Locators, commonly known as URLs, provide an easy way to find more information about each movie but tell us nothing about whether a movie is popular, so we can drop both attributes. If these variables instead recorded how many visitors viewed each movie's page, they would be far more useful for our purpose.
- runtime: the length of the movie in minutes is not a key ingredient of popularity, since most movies are of similar length. Runtime would probably be a better predictor of movie genre.
- title: the title of a movie is usually what a moviegoer remembers when a movie is popular, but it is not what makes a movie popular.
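A minimal sketch, for illustration only, of how these columns could be dropped; the analysis below keeps working from the full dataset.
# Illustrative only: drop the variables identified above (dplyr is loaded).
movies_reduced <- movies %>% select(-imdb_url, -rt_url, -runtime, -title)
dim(movies_reduced)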
Which attribute should we select as our response variable, i.e., to quantify the popularity of a movie? Below are the options we can consider:
library(knitr)
Variable <- c('imdb_rating', 'imdb_num_votes','critics_rating', 'critics_score','audience_rating', 'audience_score')
Description<-c('Rating on IMDB', 'Number of votes on IMDB','Critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten)','Critics score on Rotten Tomatoes','Audience rating on Rotten Tomatoes (Spilled, Upright)','Audience score on Rotten Tomatoes' )
DataType <- c("num", "int", "factor", "num", "factor", "num")
df <- cbind(Variable, Description, DataType)
kable(df)
Variable | Description | DataType |
---|---|---|
imdb_rating | Rating on IMDB | num |
imdb_num_votes | Number of votes on IMDB | int |
critics_rating | Critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten) | factor |
critics_score | Critics score on Rotten Tomatoes | num |
audience_rating | Audience rating on Rotten Tomatoes (Spilled, Upright) | factor |
audience_score | Audience score on Rotten Tomatoes | num |
We will not take a categorical variable as our response variable, to avoid logistic regression.
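For contrast, a sketch (not fit anywhere in this report) of what that alternative would look like: a two-level response such as audience_rating would call for logistic regression via glm() with a binomial family instead of lm().
# Illustrative only: logistic regression for a binary response.
# audience_rating (Spilled/Upright) modeled from critics_score; with a
# two-level factor response, glm() models P(second level) = P(Upright).
logit_fit <- glm(audience_rating ~ critics_score, data = training, family = binomial)
summary(logit_fit)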
library(knitr)
minv <- training %>% select(imdb_rating, imdb_num_votes, critics_score, audience_score) %>% sapply(min) %>% sapply(round,2)
maxv <- training %>% select(imdb_rating, imdb_num_votes, critics_score, audience_score) %>% sapply(max) %>% sapply(round,2)
meanv <- training %>% select(imdb_rating, imdb_num_votes, critics_score, audience_score) %>% sapply(mean) %>% sapply(round,2)
medianv <- training %>% select(imdb_rating, imdb_num_votes, critics_score, audience_score) %>% sapply(median) %>% sapply(round,2)
df <- rbind(minv, maxv, meanv, medianv)
rownames(df) <- c("min", "max", "mean", "median")
kable(df)
imdb_rating | imdb_num_votes | critics_score | audience_score | |
---|---|---|---|---|
min | 1.90 | 180.0 | 1.00 | 11.00 |
max | 9.00 | 893008.0 | 100.00 | 97.00 |
mean | 6.49 | 57620.9 | 57.73 | 62.42 |
median | 6.60 | 15203.5 | 61.00 | 65.00 |
imdb_rating is on a scale of 1 to 10, scored by users of IMDb. An IMDb user has one vote per title and can change that vote at any time. The totals are converted into a weighted mean rating that is displayed beside each title on the website.
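IMDb does not publish the exact weighting it uses site-wide, but its well-known Top 250 formula shrinks a title's mean rating toward the overall site mean when the vote count is low. A hypothetical sketch of that idea, with made-up constants:
# Hypothetical sketch of a weighted mean rating in the spirit of IMDb's
# published Top 250 formula; the site-wide weighting itself is not public.
# R = mean rating for the title, v = number of votes for the title,
# m = vote threshold (tuning constant), C = mean rating across all titles.
weighted_rating <- function(R, v, m, C) {
  (v / (v + m)) * R + (m / (v + m)) * C
}
# A title rated 8.2 from 500 votes is shrunk toward a site mean of 6.5:
weighted_rating(R = 8.2, v = 500, m = 1000, C = 6.5)  # ~7.07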
The imdb_num_votes is the number of IMDb users who voted for a particular film. It is an open-ended scale starting at 0; the maximum depends on how many users of the website voted for that film.
critics_score and audience_score are ratings from the website Rotten Tomatoes. Both are on a scale of 1 to 100. Each movie features a "user average," which is the percentage of users who rated the film positively. Users rate a movie on a scale of 0-10, while critics' reviews generally use 4-star ratings and are often qualitative.
We cannot make a judgment without looking at the distributions of these variables.
gp1 <- ggplot(data = training, aes(x = imdb_rating)) + geom_histogram(colour = "black", fill = "salmon", binwidth = .3)
gp2 <- ggplot(data = training, aes(x = imdb_num_votes)) + geom_histogram(colour = "black", fill = "magenta", binwidth = 40000, alpha = 0.5)
gp3 <- ggplot(data = training, aes(x = critics_score)) + geom_histogram(colour = "black", fill = "cyan", binwidth = 5, alpha = 0.5)
gp4 <- ggplot(data = training, aes(x = audience_score)) + geom_histogram(colour = "black", fill = "yellow", binwidth = 5, alpha = 0.7)
grid.arrange(gp1, gp2, gp3, gp4, nrow = 2, ncol = 2)
Looking at the above plots and the quantiles below, the significant observations are:
- imdb_rating is unimodal and left-skewed (its mean of 6.49 sits below its median of 6.6).
- imdb_num_votes is strongly right-skewed (its mean of 57,621 is far above its median of 15,204), with a long tail of heavily voted films.
- critics_score is spread fairly evenly across its range, with a slight left skew.
- audience_score is moderately left-skewed (mean 62.42 below median 65).
quantile(training$imdb_rating, c(0, 0.25, 0.5, 0.75, 1))
## 0% 25% 50% 75% 100%
## 1.9 5.9 6.6 7.3 9.0
quantile(training$imdb_num_votes, c(0, 0.25, 0.5, 0.75, 0.9, 1))
## 0% 25% 50% 75% 90% 100%
## 180.00 4584.25 15203.50 58484.25 151962.20 893008.00
quantile(training$critics_score, c(0, 0.25, 0.5, 0.75, 1))
## 0% 25% 50% 75% 100%
## 1 33 61 83 100
quantile(training$audience_score, c(0, 0.25, 0.5, 0.75, 1))
## 0% 25% 50% 75% 100%
## 11 46 65 80 97
We have two categorical variables, critics_rating and audience_rating, both taken from the same website and measuring closely related things. We will keep only one of them in our model so that the predictors stay independent of each other.
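Before choosing which one to keep, a quick sketch (not in the original output) of how strongly the two are associated; a lopsided contingency table would support keeping just one:
# Cross-tabulate the two Rotten Tomatoes categorical variables to see how
# strongly they agree before keeping only one of them.
table(training$critics_rating, training$audience_rating)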
gp1 <- ggplot(data = training, aes(y = imdb_rating, x = critics_score, colour = critics_rating)) + geom_point()
plot(gp1)
gp2 <- ggplot(data = training, aes(y = imdb_rating, x = critics_score, colour = audience_rating)) + geom_point()
plot(gp2)
- From the above plots, a similar trend can be seen between imdb_rating and audience_rating, even though imdb_rating comes from IMDb and audience_rating from Rotten Tomatoes.
- audience_rating, however, has only 2 levels in its categorical variable.
Is movie popularity, as measured by IMDb rating, associated with other film characteristics, such as having actors, actresses, or directors who won an Oscar, critic reviews, the studio that produced the movie, or the year the movie was released in theaters?
actor <- table(training$best_actor_win)
actress <- table(training$best_actress_win)
director <- table(training$best_dir_win)
flm2 <- training %>% mutate(oscar = ifelse(best_actor_win == "yes" | best_actress_win == "yes" | best_dir_win == "yes", "yes", "no"))
testing = mutate(testing,
feature_film = ifelse(title_type == 'Feature Film', "yes", "no"),
drama = ifelse(genre == 'Drama', "yes", "no"),
mpaa_rating_R = ifelse(mpaa_rating == 'R', "yes", "no"),
oscar_season = ifelse(thtr_rel_month %in% c(10, 11, 12), "yes", "no"),
summer_season = ifelse(thtr_rel_month %in% c(5, 6, 7, 8), "yes", "no"))
flm2 = mutate(flm2,
feature_film = ifelse(title_type == 'Feature Film', "yes", "no"),
drama = ifelse(genre == 'Drama', "yes", "no"),
mpaa_rating_R = ifelse(mpaa_rating == 'R', "yes", "no"),
oscar_season = ifelse(thtr_rel_month %in% c(10, 11, 12), "yes", "no"),
summer_season = ifelse(thtr_rel_month %in% c(5, 6, 7, 8), "yes", "no"))
osc <- flm2 %>% select(oscar) %>% group_by(oscar) %>% table() %>% rbind(actor, actress, director)
rownames(osc) <- c("At.least.one.Oscar", "best.actor", "best.actress", "best.director")
osc
## no yes
## At.least.one.Oscar 479 171
## best.actor 557 93
## best.actress 578 72
## best.director 607 43
We have added the following new variables:
1. oscar, which has two levels, combining all 3 Oscar variables for cast and director.
2. feature_film, which checks whether the title type is "Feature Film".
3. mpaa_rating_R, which checks whether the MPAA rating is R.
4. oscar_season, which checks whether the theater release month is October, November, or December.
5. summer_season, which checks whether the theater release month is May, June, July, or August.
6. drama, which checks whether the genre is Drama.
The goal is to observe whether having at least one Oscar winner among the cast and director, irrespective of the field, is a good predictor of movie popularity.
oscar_in_cast <- flm2 %>% filter(oscar == "yes") %>% arrange(imdb_rating) %>% select(title) %>% data.frame() %>% head(6)
x1 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = oscar)) + geom_point() + scale_colour_discrete(name="Combined") + scale_fill_hue(name="Combined")
x2 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = best_actor_win)) + geom_point() + scale_colour_discrete(name="Actor") + scale_fill_hue(name="Actor")
x3 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = best_actress_win)) + geom_point() + scale_colour_discrete(name="Actress") + scale_fill_hue(name="Actress")
x4 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = best_dir_win)) + geom_point() + scale_colour_discrete(name="Director") + scale_fill_hue(name="Director")
grid.arrange(x1, x2, x3, x4, nrow = 2, ncol = 2)
- From the above plots, having an Oscar winner in the cast or crew does not cleanly separate highly rated movies from poorly rated ones; a handful of movies with Oscar winners still received very low ratings, listed below:
flm2 %>% select(title, imdb_rating) %>% filter(oscar == 'yes' & imdb_rating <= 5) %>% arrange(imdb_rating)
## # A tibble: 5 x 2
## title imdb_rating
## <chr> <dbl>
## 1 Battlefield Earth 2.4
## 2 Jack and Jill 3.4
## 3 Town & Country 4.3
## 4 Ghost Rider: Spirit of Vengeance 4.3
## 5 Body of Evidence 4.3
nom <- table(flm2$best_pic_nom)
win <- table(flm2$best_pic_win)
flm2 <- flm2 %>% mutate(oscar_nom_win = ifelse(best_pic_nom == "yes" | best_pic_win == "yes", "yes", "no"))
nom_win <- table(flm2$oscar_nom_win)
comb_nom_win <- rbind(nom_win, nom, win)
rownames(comb_nom_win) <- c("combined", "nominations", "wins")
comb_nom_win
## no yes
## combined 627 23
## nominations 628 22
## wins 643 7
flm2 %>% select(title, best_pic_nom, oscar_nom_win) %>% filter(best_pic_nom == "no" & oscar_nom_win == "yes")
## # A tibble: 1 x 3
## title best_pic_nom oscar_nom_win
## <chr> <fct> <chr>
## 1 The Hurt Locker no yes
According to the data, the movie The Hurt Locker won a best picture Oscar without being nominated for one. We cross-checked online resources and found that it was in fact nominated and won, so this appears to be a data entry error.
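A minimal sketch of how the record could be corrected, assuming we trust the external sources over the dataset; we did not apply this fix in the analysis that follows.
# Illustrative only: correct the nomination flag for The Hurt Locker on a
# copy, so the results above and below are unchanged. "yes" is an existing
# factor level, so plain assignment is safe here.
flm2_fixed <- flm2
flm2_fixed$best_pic_nom[flm2_fixed$title == "The Hurt Locker"] <- "yes"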
gp1 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = oscar_nom_win)) + geom_point()
gp2 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = best_pic_nom)) + geom_point()
gp3 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = best_pic_win)) + geom_point()
grid.arrange(gp1, gp2, gp3, nrow = 1, ncol = 3)
gp1 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = drama)) + geom_point()+scale_colour_discrete(name="Drama") + scale_fill_hue(name="Director")
gp2 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = mpaa_rating_R)) + geom_point()+scale_colour_discrete(name="R mpaa_rating") + scale_fill_hue(name="Director")
gp3 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = feature_film)) + geom_point()+scale_colour_discrete(name="feature film") + scale_fill_hue(name="Director")
grid.arrange(gp1, gp2, gp3, nrow = 1, ncol = 3)
flm2 %>% select(title, imdb_rating, best_pic_nom, oscar_nom_win) %>% filter(best_pic_nom == "yes" & imdb_rating < 7)
## # A tibble: 1 x 4
## title imdb_rating best_pic_nom oscar_nom_win
## <chr> <dbl> <fct> <chr>
## 1 A Star is Born 6.5 yes yes
outlier_best_pic_nom <- flm2 %>% filter(best_pic_nom == "yes") %>% arrange(imdb_rating) %>% data.frame() %>% head(1)
g1 <- ggplot(data = flm2, aes(x = thtr_rel_year)) + geom_histogram(colour = "black", fill = "orange", alpha = 0.5)
g2 <- ggplot(data = flm2, aes(x = thtr_rel_month)) + geom_histogram(colour = "black", fill = "blue", alpha = 0.5)
g3 <- ggplot(data = flm2, aes(x = thtr_rel_day)) + geom_histogram(colour = "black", fill = "green", alpha = 0.5)
grid.arrange(g1,g2,g3,nrow =1, ncol =3)
g4 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = factor(thtr_rel_year))) + geom_point()
plot(g4)
g5 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = factor(thtr_rel_month))) + geom_point()
plot(g5)
g6 <- ggplot(data = movies, aes(y = imdb_rating, x = critics_score, colour = factor(thtr_rel_day))) + geom_point()
plot(g6)
g1 <- ggplot(data = flm2, aes(x = dvd_rel_year)) + geom_histogram(colour = "black", fill = "orange", alpha = 0.5)
g2 <- ggplot(data = flm2, aes(x = dvd_rel_month)) + geom_histogram(colour = "black", fill = "blue", alpha = 0.5)
g3 <- ggplot(data = flm2, aes(x = dvd_rel_day)) + geom_histogram(colour = "black", fill = "green", alpha = 0.5)
grid.arrange(g1,g2,g3,nrow =1, ncol =3)
gp1 <-ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = factor(dvd_rel_year))) + geom_point()
plot(gp1)
gp2 <- ggplot(data = movies, aes(y = imdb_rating, x = critics_score, colour = factor(dvd_rel_month))) + geom_point()
plot(gp2)
gp3 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = factor(dvd_rel_day))) + geom_point()
plot(gp3)
The histograms above show that there are particular years, months, and days on which more DVDs are released for the first time. We do not observe any clustering of these points in the scatterplots along the imdb_rating and critics_score scales.
num_var <- flm2 %>% select(imdb_rating,imdb_num_votes, critics_score,audience_score)
ggcorr(num_var, name = "Correlation", label = TRUE, alpha = TRUE, palette = "PuOr") + ggtitle("Correlation Matrix Plot") + theme_light()
Our goal is to create a model of what makes a movie popular. We choose imdb_rating as our response variable because its distribution is the closest to normal among the numerical variables that measure movie popularity.
Much of why a movie is popular comes down to how many moviegoers appreciate it. We choose audience_score as our first predictor because it is the most strongly correlated with our response variable in the correlation plot above, and because it makes intuitive sense that audience approval matters for a movie to be popular.
fit1 <- lm(imdb_rating ~ audience_score, data = flm2)
summary(fit1)
##
## Call:
## lm(formula = imdb_rating ~ audience_score, data = flm2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2074 -0.1865 0.0704 0.3096 1.1557
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.590240 0.069413 51.72 <2e-16 ***
## audience_score 0.046525 0.001058 43.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5441 on 648 degrees of freedom
## Multiple R-squared: 0.7489, Adjusted R-squared: 0.7486
## F-statistic: 1933 on 1 and 648 DF, p-value: < 2.2e-16
fit2 <- lm(imdb_rating ~ audience_score + critics_score, data = flm2)
summary(fit2)
##
## Call:
## lm(formula = imdb_rating ~ audience_score + critics_score, data = flm2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.52044 -0.20001 0.03047 0.30616 1.22958
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6383938 0.0625863 58.13 <2e-16 ***
## audience_score 0.0348491 0.0013405 26.00 <2e-16 ***
## critics_score 0.0117895 0.0009526 12.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4897 on 647 degrees of freedom
## Multiple R-squared: 0.797, Adjusted R-squared: 0.7964
## F-statistic: 1270 on 2 and 647 DF, p-value: < 2.2e-16
fit3 <- lm(imdb_rating ~ audience_score + critics_score + imdb_num_votes, data = flm2)
summary(fit3)
##
## Call:
## lm(formula = imdb_rating ~ audience_score + critics_score + imdb_num_votes,
## data = flm2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.48392 -0.18513 0.02148 0.29555 1.17567
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.674e+00 6.203e-02 59.226 < 2e-16 ***
## audience_score 3.354e-02 1.347e-03 24.895 < 2e-16 ***
## critics_score 1.176e-02 9.373e-04 12.543 < 2e-16 ***
## imdb_num_votes 8.332e-07 1.761e-07 4.732 2.74e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4818 on 646 degrees of freedom
## Multiple R-squared: 0.8038, Adjusted R-squared: 0.8029
## F-statistic: 882.2 on 3 and 646 DF, p-value: < 2.2e-16
fit4 <- lm(imdb_rating ~ audience_score + critics_score + imdb_num_votes+oscar, data = flm2)
summary(fit4)
##
## Call:
## lm(formula = imdb_rating ~ audience_score + critics_score + imdb_num_votes +
## oscar, data = flm2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.45897 -0.19332 0.01364 0.28747 1.20491
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.647e+00 6.278e-02 58.102 < 2e-16 ***
## audience_score 3.374e-02 1.345e-03 25.085 < 2e-16 ***
## critics_score 1.159e-02 9.363e-04 12.382 < 2e-16 ***
## imdb_num_votes 7.673e-07 1.776e-07 4.320 1.8e-05 ***
## oscaryes 1.039e-01 4.342e-02 2.393 0.017 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.48 on 645 degrees of freedom
## Multiple R-squared: 0.8055, Adjusted R-squared: 0.8043
## F-statistic: 667.9 on 4 and 645 DF, p-value: < 2.2e-16
fit5 <- lm(imdb_rating ~ audience_score + critics_score + imdb_num_votes+oscar+summer_season, data = flm2)
summary(fit5)
##
## Call:
## lm(formula = imdb_rating ~ audience_score + critics_score + imdb_num_votes +
## oscar + summer_season, data = flm2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.48947 -0.19010 0.02537 0.27370 1.27897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.688e+00 6.417e-02 57.475 < 2e-16 ***
## audience_score 3.358e-02 1.339e-03 25.070 < 2e-16 ***
## critics_score 1.169e-02 9.322e-04 12.541 < 2e-16 ***
## imdb_num_votes 7.769e-07 1.767e-07 4.396 1.29e-05 ***
## oscaryes 9.911e-02 4.323e-02 2.293 0.02220 *
## summer_seasonyes -1.115e-01 4.022e-02 -2.771 0.00575 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4775 on 644 degrees of freedom
## Multiple R-squared: 0.8078, Adjusted R-squared: 0.8063
## F-statistic: 541.4 on 5 and 644 DF, p-value: < 2.2e-16
We used a forward selection approach with a combined criterion of p-value, adjusted R-squared, and logical reasoning.
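As a sketch of that comparison (not in the original output), the adjusted R-squared of each nested fit above can be collected in one call:
# Adjusted R-squared at each step of the forward selection.
sapply(list(fit1 = fit1, fit2 = fit2, fit3 = fit3, fit4 = fit4, fit5 = fit5),
       function(f) summary(f)$adj.r.squared)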
The final model is: imdb_rating = 3.688 + 0.0336 × audience_score + 0.0117 × critics_score + 7.769e-07 × imdb_num_votes + 0.0991 × oscar:yes - 0.1115 × summer_season:yes
Intercept - The intercept (3.688) is interpreted as the predicted mean imdb_rating when audience_score, critics_score, imdb_num_votes, oscar, and summer_season are all 0. The intercept is usually meaningless when the predictor variables are continuous; moreover, the critics_score scale starts at 1 and the minimum audience_score in our data is 11, so a value of 0 lies beyond the range of our data points and assuming it would be extrapolation.
Coefficient for audience_score - All else held constant, for every one-unit increase in audience_score the model predicts a 0.0336 increase in imdb_rating on average. The coefficient is small because audience_score runs from 1 to 100 while imdb_rating runs from 1 to 10; put differently, all else held constant, a ten-unit increase in audience_score predicts roughly a 0.34 increase in imdb_rating on average.
Coefficient for critics_score - All else held constant, for every one-unit increase in critics_score the model predicts a 0.0117 increase in imdb_rating on average.
Coefficient for imdb_num_votes - All else held constant, for every additional vote in imdb_num_votes the model predicts a 7.77 x 10^-7 increase in imdb_rating on average. The coefficient is tiny because imdb_num_votes reaches into the hundreds of thousands whereas imdb_rating lies between 1 and 10.
Coefficient for oscar - All else held constant, the model predicts that the imdb_rating for movies with at least one Oscar winner in the cast (actor, actress, or director) is on average 0.099 points higher than for movies without one. The difference in popularity based on imdb_rating between the two groups is small.
Coefficient for summer_season - All else held constant, the model predicts that the imdb_rating for movies released in the summer season is on average 0.112 points lower than for movies released in other months.
R squared - 80.63% (adjusted R squared) of the variability of imdb_rating is explained by the model.
P values - All the coefficients in our model have p-values less than 0.05. The explanatory variables in our model are significant predictors of the response variable (imdb_rating); their slopes differ from 0.
We will now proceed to perform model diagnostics to check whether our model meets the conditions required for linear regression to be valid.
### Linear relationships between each (numerical) predictor and the response
t1 <- ggplot(data = flm2, aes(x = audience_score, y = resid(fit5))) + geom_hline(yintercept = 0, size = 1) + xlab("audience_score") + ylab("Residual") + geom_point()
t2 <- ggplot(data = flm2, aes(x = critics_score, y = resid(fit5))) + geom_hline(yintercept = 0, size = 1) + xlab("critics_score") + ylab("Residual") + geom_point()
t3 <- ggplot(data = flm2, aes(x =imdb_num_votes, y = resid(fit5))) + geom_hline(yintercept = 0, size = 1) + xlab("IMDB no of votes") + ylab("Residual") + geom_point()
grid.arrange(t1, t2,t3, nrow = 2, ncol = 2)
Based on the residual plots above, the residuals scatter randomly around zero with no discernible pattern against any of the predictors, indicating a linear relationship between each numerical predictor and the response. This condition is met by our model.
par(mfrow = c(1,2))
hist(fit5$residuals, breaks = 25, main = "Histogram of Residuals", col = "salmon", border = "yellow", prob = TRUE)
curve(dnorm(x, mean = mean(fit5$residuals), sd = sd(fit5$residuals)), col="cyan", add=T, lwd = 3)
qqnorm(fit5$residuals)
qqline(fit5$residuals)
We can see a strong left skew in the histogram of our residuals (left) and in the normal probability plot (right) above.
Our model fails to meet the condition of nearly normal residuals. This means that our model is not very reliable when audience score or critics score is low.
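A complementary formal check, not part of the original analysis: the Shapiro-Wilk test on the residuals, where a small p-value would agree with the visual evidence of skew.
# Shapiro-Wilk normality test on the model residuals (n = 650 is within
# the test's supported range of 3 to 5000 observations).
shapiro.test(resid(fit5))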
t3 <- ggplot(data.frame(x = fit5$fitted.values, y = resid(fit5)), aes(x=x, y=y)) + geom_hline(yintercept = 0, size = 1) + xlab("fitted.values") + ylab("Residual") + geom_point()
grid.arrange(t3, nrow = 1, ncol = 1)
The plot of the residuals against the fitted values above reveals a random scatter with no fan shape, so the variability of the residuals is roughly constant and the observations appear to be independent of each other. This condition is met.
## Prediction
testing2 <- testing %>% mutate(oscar = ifelse(best_actor_win == "yes" | best_actress_win == "yes" | best_dir_win == "yes", "yes", "no"))
newprof <- testing2 %>% select(audience_score, critics_score, imdb_num_votes,oscar, summer_season)
pred <- predict(fit5, newprof, interval = "prediction", level = 0.95, se.fit = TRUE)
pred
## $fit
## fit lwr upr
## 1 4.867111 3.925388 5.808834
##
## $se.fit
## [1] 0.04406653
##
## $df
## [1] 644
##
## $residual.scale
## [1] 0.4775478
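As a quick sanity check, a sketch (not in the original output) comparing the held-out movie's actual rating with the interval above:
# Does the actual imdb_rating of the held-out movie fall inside the
# 95% prediction interval computed above?
actual <- testing2$imdb_rating
c(actual = actual, inside = actual >= pred$fit[, "lwr"] & actual <= pred$fit[, "upr"])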
Our model has demonstrated that with a few predictors we can predict, with a certain amount of accuracy, the popularity of movies using imdb_rating as the measure of popularity. However, we have to remember that our predictors audience_score and critics_score are subjective measures that are prone to bias: both critics and audiences are finicky in their choices, influenced by many variables not found in our dataset, such as age, race, and gender.
Looking at the model's residual plots, there seems to be greater variance when rating bad movies than good movies, and critics are more likely than the audience to give extreme ratings.
The small number of movies in our dataset that appear in the Top 200 Box Office list might have limited that variable's ability to achieve a small p-value; it might have been better to use a continuous variable such as gross movie sales.
We also could have used a larger number of observations in our testing set to get a better measure of the model's accuracy.
df <- select(flm2, audience_score, feature_film,drama,mpaa_rating_R, oscar_season,summer_season)
dfmelt <- melt(df, measure.vars = 2:6)
ggplot(dfmelt, aes(x=value, y=audience_score,fill=variable))+
geom_boxplot()+
facet_grid(.~variable)+
labs(x="Type",y="Audience Score")
dfmelt %>%
group_by(variable,value) %>%
summarize(avg_rating = mean(audience_score))
## # A tibble: 10 x 3
## # Groups: variable [5]
## variable value avg_rating
## <fct> <chr> <dbl>
## 1 feature_film no 81.0
## 2 feature_film yes 60.5
## 3 drama no 59.7
## 4 drama yes 65.5
## 5 mpaa_rating_R no 62.8
## 6 mpaa_rating_R yes 62.0
## 7 oscar_season no 61.9
## 8 oscar_season yes 63.7
## 9 summer_season no 62.7
## 10 summer_season yes 61.8
Looking at the summary and the boxplots, we can see that except for feature_film, the other categorical variables have little to no effect on audience_score. A reason could be that feature films, released in theaters, are judged much more harshly than documentaries and TV movies, whose audiences are more self-selected.
df2 <- select(flm2,audience_score, feature_film, drama, runtime, mpaa_rating_R, thtr_rel_year, oscar_season, summer_season, imdb_rating, imdb_num_votes, critics_score, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box)
fit1 <- bas.lm(audience_score ~ ., data=df2,
prior="BIC",
modelprior = uniform())
summary(fit1)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.0000000 1.0000000
## feature_filmyes 0.06568794 0.0000 0.0000000 0.0000000
## dramayes 0.04658612 0.0000 0.0000000 0.0000000
## runtime 0.46961842 1.0000 0.0000000 0.0000000
## mpaa_rating_Ryes 0.23638849 0.0000 0.0000000 1.0000000
## thtr_rel_year 0.10407313 0.0000 0.0000000 0.0000000
## oscar_seasonyes 0.08113439 0.0000 0.0000000 0.0000000
## summer_seasonyes 0.07534015 0.0000 0.0000000 0.0000000
## imdb_rating 1.00000000 1.0000 1.0000000 1.0000000
## imdb_num_votes 0.05702086 0.0000 0.0000000 0.0000000
## critics_score 0.86805460 1.0000 1.0000000 1.0000000
## best_pic_nomyes 0.13420351 0.0000 0.0000000 0.0000000
## best_pic_winyes 0.03992350 0.0000 0.0000000 0.0000000
## best_actor_winyes 0.15378167 0.0000 0.0000000 0.0000000
## best_actress_winyes 0.14879547 0.0000 0.0000000 0.0000000
## best_dir_winyes 0.06762404 0.0000 0.0000000 0.0000000
## top200_boxyes 0.04746833 0.0000 0.0000000 0.0000000
## BF NA 1.0000 0.9896583 0.3074889
## PostProbs NA 0.1172 0.1160000 0.0360000
## R2 NA 0.7557 0.7533000 0.7548000
## dim NA 4.0000 3.0000000 4.0000000
## logmarg NA -3606.9449 -3606.9552712 -3608.1241920
## model 4 model 5
## Intercept 1.0000000 1.0000000
## feature_filmyes 0.0000000 0.0000000
## dramayes 0.0000000 0.0000000
## runtime 1.0000000 0.0000000
## mpaa_rating_Ryes 1.0000000 0.0000000
## thtr_rel_year 0.0000000 0.0000000
## oscar_seasonyes 0.0000000 0.0000000
## summer_seasonyes 0.0000000 0.0000000
## imdb_rating 1.0000000 1.0000000
## imdb_num_votes 0.0000000 0.0000000
## critics_score 1.0000000 1.0000000
## best_pic_nomyes 0.0000000 0.0000000
## best_pic_winyes 0.0000000 0.0000000
## best_actor_winyes 0.0000000 1.0000000
## best_actress_winyes 0.0000000 0.0000000
## best_dir_winyes 0.0000000 0.0000000
## top200_boxyes 0.0000000 0.0000000
## BF 0.2930721 0.2738347
## PostProbs 0.0343000 0.0321000
## R2 0.7572000 0.7548000
## dim 5.0000000 4.0000000
## logmarg -3608.1722124 -3608.2401063
Printing the model object and calling summary gives us both the posterior inclusion probability of each variable and the most probable models. The most likely model, which has a posterior probability of 0.1172, includes only the intercept, runtime, imdb_rating, and critics_score. While a posterior probability of 0.1172 sounds small, it is much larger than the uniform prior probability assigned to each model, since with 16 candidate predictors there are 2^16 possible models.
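The same output can also be read off a plot: BAS can draw the model space directly, with each column a model (shaded by posterior probability) and each row a candidate predictor. A sketch, not in the original output:
# Visualise the top models explored by BAS; rotate = FALSE keeps the
# predictors on the y-axis and the models on the x-axis.
image(fit1, rotate = FALSE)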
We refit the final model with the 3 variables above.
df3 <- select(df2,runtime, imdb_rating,critics_score,audience_score)
fit2 <- bas.lm(audience_score ~ ., df3,
prior="BIC",
modelprior = uniform())
summary(fit2)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.0000000 1.0000 1.0000000 1.0000000
## runtime 0.5133911 1.0000 0.0000000 1.0000000
## imdb_rating 1.0000000 1.0000 1.0000000 1.0000000
## critics_score 0.8888276 1.0000 1.0000000 0.0000000
## BF NA 1.0000 0.9896583 0.1492362
## PostProbs NA 0.4467 0.4421000 0.0667000
## R2 NA 0.7557 0.7533000 0.7518000
## dim NA 4.0000 3.0000000 3.0000000
## logmarg NA -3606.9449 -3606.9552712 -3608.8471008
## model 4 model 5
## Intercept 1.000000e+00 1.000000e+00
## runtime 0.000000e+00 0.000000e+00
## imdb_rating 1.000000e+00 0.000000e+00
## critics_score 0.000000e+00 1.000000e+00
## BF 9.962547e-02 3.224912e-100
## PostProbs 4.450000e-02 0.000000e+00
## R2 7.490000e-01 4.952000e-01
## dim 2.000000e+00 2.000000e+00
## logmarg -3.609251e+03 -3.836032e+03
The residuals vs. fitted values plot (below) shows no pattern. Moreover, the posterior inclusion probabilities of all three variables are above 0.5, with imdb_rating near 1.
par(mfrow = c(1,2))
plot(fit2,which=c(1,2))
par(mfrow = c(1,2))
plot(fit2,which=c(3,4))
audience_score = 62.41 - 0.0277 × runtime + 14.99 × imdb_rating + 0.0626 × critics_score
We also have credible intervals for each coefficient. Since this is Bayesian statistics, there is a 95% probability that the true coefficient lies within this interval (from lower to upper):
confint(coefficients(fit2))
## 2.5% 97.5% beta
## Intercept 61.63023177 63.1765969 62.40523883
## runtime -0.08196549 0.0000000 -0.02769210
## imdb_rating 13.70673380 16.5734523 14.98733796
## critics_score 0.00000000 0.1051097 0.06257525
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
deathpool <- data.frame(runtime=103, imdb_rating = 8, critics_score = 84)
pre <- predict(fit2,deathpool,estimator="MPM", se.fit=TRUE)
pre$fit
## [1] 86.95391
## attr(,"model")
## [1] 0 1 2 3
## attr(,"best")
## [1] 1
## attr(,"estimator")
## [1] "MPM"
Our model predicts an audience score of roughly 87 points. This is a very good prediction: the real audience score is 90.
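We can also attach an uncertainty band to this point prediction. Assuming the installed BAS version provides a confint method for prediction objects (available when se.fit = TRUE is requested, as above), a sketch:
# 95% credible interval for the predicted audience score above.
confint(pre, parm = "pred")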
We conclude that, based on the given data, critics_score, runtime, and imdb_rating have the most effect on the audience score of a movie on average.
One shortcoming of this analysis is the lack of an informative prior. There may also be reverse causality between audience score and critics score, which we should take into account.
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_India.1252 LC_CTYPE=English_India.1252
## [3] LC_MONETARY=English_India.1252 LC_NUMERIC=C
## [5] LC_TIME=English_India.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.29 reshape2_1.4.4 lubridate_1.7.9 BAS_1.5.5
## [5] ggthemes_4.2.0 GGally_2.0.0 gridExtra_2.3 statsr_0.2.1
## [9] caret_6.0-86 lattice_0.20-41 dplyr_1.0.1 ggplot2_3.3.2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.5 class_7.3-17 utf8_1.1.4
## [4] assertthat_0.2.1 digest_0.6.25 ipred_0.9-9
## [7] foreach_1.5.0 mime_0.9 R6_2.4.1
## [10] plyr_1.8.6 stats4_4.0.2 evaluate_0.14
## [13] highr_0.8 pillar_1.4.6 rlang_0.4.7
## [16] data.table_1.13.0 rpart_4.1-15 Matrix_1.2-18
## [19] rmarkdown_2.3 labeling_0.3 splines_4.0.2
## [22] gower_0.2.2 stringr_1.4.0 munsell_0.5.0
## [25] shiny_1.5.0 compiler_4.0.2 httpuv_1.5.4
## [28] xfun_0.15 pkgconfig_2.0.3 htmltools_0.5.0
## [31] nnet_7.3-14 tidyselect_1.1.0 tibble_3.0.3
## [34] prodlim_2019.11.13 codetools_0.2-16 fansi_0.4.1
## [37] reshape_0.8.8 crayon_1.3.4 withr_2.2.0
## [40] later_1.1.0.1 MASS_7.3-51.6 recipes_0.1.13
## [43] ModelMetrics_1.2.2.2 grid_4.0.2 nlme_3.1-148
## [46] xtable_1.8-4 gtable_0.3.0 lifecycle_0.2.0
## [49] magrittr_1.5 pROC_1.16.2 scales_1.1.1
## [52] cli_2.0.2 stringi_1.4.6 farver_2.0.3
## [55] promises_1.1.1 timeDate_3043.102 ellipsis_0.3.1
## [58] generics_0.0.2 vctrs_0.3.2 lava_1.6.7
## [61] RColorBrewer_1.1-2 iterators_1.0.12 tools_4.0.2
## [64] glue_1.4.1 purrr_0.3.4 fastmap_1.0.1
## [67] survival_3.1-12 yaml_2.2.1 colorspace_1.4-1