In this project we developed a multiple linear regression model that explains the factors that make a movie popular and helps predict movie popularity. The data in the dataset was collected from Rotten Tomatoes and IMDb.
library(ggplot2)
library(dplyr)
library(caret)
library(statsr)
library(gridExtra)
library(GGally)
library(ggthemes)
library(BAS)
library(lubridate)
library(reshape2)
dim(movies)
## [1] 651 32
str(movies)
## tibble [651 x 32] (S3: tbl_df/tbl/data.frame)
## $ title : chr [1:651] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ thtr_rel_year : num [1:651] 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num [1:651] 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num [1:651] 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num [1:651] 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num [1:651] 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num [1:651] 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ director : chr [1:651] "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr [1:651] "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr [1:651] "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr [1:651] "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr [1:651] "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr [1:651] "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr [1:651] "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr [1:651] "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
We can see that the dataset provided consists of 651 rows and 32 columns.
Each row is a record for a movie and each column is an attribute of that movie.
To test the model's predictions, we set aside one row of the dataset from the rest of the data used to build the model; it will be used later for our prediction. We will call the single held-out observation our testing set and the remaining bulk of the data our training set.
set.seed(3999)
# using caret package
inTrain <- createDataPartition(y=movies$imdb_rating, p=0.994, list=FALSE)
training <- movies[inTrain,]
testing <- movies[-inTrain,]
dim(training)
## [1] 650 32
dim(testing)
## [1] 1 32
summary(training)
## title title_type genre runtime
## Length:650 Documentary : 55 Drama :304 Min. : 39.0
## Class :character Feature Film:590 Comedy : 87 1st Qu.: 92.0
## Mode :character TV Movie : 5 Action & Adventure: 65 Median :103.0
## Mystery & Suspense: 59 Mean :105.8
## Documentary : 52 3rd Qu.:116.0
## Horror : 23 Max. :267.0
## (Other) : 60 NA's :1
## mpaa_rating studio thtr_rel_year
## G : 19 Paramount Pictures : 36 Min. :1970
## NC-17 : 2 Warner Bros. Pictures : 30 1st Qu.:1990
## PG :118 Sony Pictures Home Entertainment: 27 Median :2000
## PG-13 :132 Universal Pictures : 23 Mean :1998
## R :329 Warner Home Video : 19 3rd Qu.:2007
## Unrated: 50 (Other) :507 Max. :2014
## NA's : 8
## thtr_rel_month thtr_rel_day dvd_rel_year dvd_rel_month dvd_rel_day
## Min. : 1.000 Min. : 1.0 Min. :1991 Min. : 1.00 Min. : 1.00
## 1st Qu.: 4.000 1st Qu.: 7.0 1st Qu.:2001 1st Qu.: 3.00 1st Qu.: 7.00
## Median : 7.000 Median :15.0 Median :2004 Median : 6.00 Median :15.00
## Mean : 6.746 Mean :14.4 Mean :2004 Mean : 6.34 Mean :14.99
## 3rd Qu.:10.000 3rd Qu.:21.0 3rd Qu.:2008 3rd Qu.: 9.00 3rd Qu.:23.00
## Max. :12.000 Max. :31.0 Max. :2015 Max. :12.00 Max. :31.00
## NA's :8 NA's :8 NA's :8
## imdb_rating imdb_num_votes critics_rating critics_score
## Min. :1.900 Min. : 180 Certified Fresh:135 Min. : 1.00
## 1st Qu.:5.900 1st Qu.: 4584 Fresh :209 1st Qu.: 33.00
## Median :6.600 Median : 15204 Rotten :306 Median : 61.00
## Mean :6.494 Mean : 57621 Mean : 57.73
## 3rd Qu.:7.300 3rd Qu.: 58484 3rd Qu.: 83.00
## Max. :9.000 Max. :893008 Max. :100.00
##
## audience_rating audience_score best_pic_nom best_pic_win best_actor_win
## Spilled:274 Min. :11.00 no :628 no :643 no :557
## Upright:376 1st Qu.:46.00 yes: 22 yes: 7 yes: 93
## Median :65.00
## Mean :62.42
## 3rd Qu.:80.00
## Max. :97.00
##
## best_actress_win best_dir_win top200_box director actor1
## no :578 no :607 no :635 Length:650 Length:650
## yes: 72 yes: 43 yes: 15 Class :character Class :character
## Mode :character Mode :character
##
##
##
##
## actor2 actor3 actor4 actor5
## Length:650 Length:650 Length:650 Length:650
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## imdb_url rt_url
## Length:650 Length:650
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
List of variables in the dataset provided:
Variable | Description | Data Type |
---|---|---|
title | Title of movie | character |
title_type | Type of movie (Documentary, Feature Film, TV Movie…) | factor |
genre | Genre of movie (Action & Adventure, Comedy, Documentary, Drama…) | factor |
runtime | Runtime of movie (in minutes) | numeric |
mpaa_rating | MPAA rating of the movie (G, PG, PG-13, R, Unrated) | factor |
studio | Studio that produced the movie | factor |
thtr_rel_year | Year the movie is released in theaters | numeric |
thtr_rel_month | Month the movie is released in theaters | numeric |
thtr_rel_day | Day of the month the movie is released in theaters | numeric |
dvd_rel_year | Year the movie is released on DVD | numeric |
dvd_rel_month | Month the movie is released on DVD | numeric |
dvd_rel_day | Day of the month the movie is released on DVD | numeric |
imdb_rating | Rating on IMDB | numeric |
imdb_num_votes | Number of votes on IMDB | integer |
critics_rating | Critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten) | factor |
critics_score | Critics score on Rotten Tomatoes | numeric |
audience_rating | Audience rating on Rotten Tomatoes (Spilled, Upright) | factor |
audience_score | Audience score on Rotten Tomatoes | numeric |
best_pic_nom | Whether or not the movie was nominated for a best picture Oscar (no, yes) | factor |
best_pic_win | Whether or not the movie won a best picture Oscar (no, yes) | factor |
best_actor_win | Whether or not one of the main actors in the movie ever won an Oscar (no, yes) | factor |
best_actress_win | Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) | factor |
best_dir_win | Whether or not the director of the movie ever won an Oscar (no, yes) | factor |
top200_box | Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes) | factor |
director | Director of the movie | character |
actor1 | First main actor/actress in the abridged cast of the movie | character |
actor2 | Second main actor/actress in the abridged cast of the movie | character |
actor3 | Third main actor/actress in the abridged cast of the movie | character |
actor4 | Fourth main actor/actress in the abridged cast of the movie | character |
actor5 | Fifth main actor/actress in the abridged cast of the movie | character |
imdb_url | Link to IMDB page for the movie | character |
rt_url | Link to Rotten Tomatoes page for the movie | character |
The sample data is a random sample of movies released in the United States; the results of our analysis can therefore be generalized to the population of movies released in the US.
The characteristics of movies that form the variables in our dataset are derived from the websites Rotten Tomatoes and IMDb. Since voting and rating on these websites are voluntary, the data suffers from voluntary response bias: the opinions of those who chose not to vote or rate are not represented.
Random assignment was not performed. Causality therefore cannot be inferred from the results of our analysis.
Before answering any question, we should have some knowledge of the field. Before building a multiple linear regression model, we identified the variables that will clearly not be useful for the model. They are as follows (see the sketch after this list):
- imdb_url and rt_url: Uniform Resource Locators, commonly known as URLs, provide an easy way to find more information about each movie but tell us nothing about whether a movie is popular, so we can drop both attributes. If these variables instead recorded how many visitors viewed each movie's page, they would be far more useful for our purpose.
- runtime: the length of the movie in minutes is not a key ingredient of popularity, since most movies are of similar length. Runtime would probably be a better predictor of movie genre.
- title: the title of a movie is usually what a moviegoer remembers when a movie is popular, but it is not what makes a movie popular.
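A minimal sketch, for illustration only, of how these columns could be dropped; the analysis below keeps working from the full dataset.
# Illustrative only: drop the variables identified above (dplyr is loaded).
movies_reduced <- movies %>% select(-imdb_url, -rt_url, -runtime, -title)
dim(movies_reduced)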
Which attribute should we select as our response variable, i.e., to quantify the popularity of a movie? Below are the options we can consider:
library(knitr)
Variable <- c('imdb_rating', 'imdb_num_votes','critics_rating', 'critics_score','audience_rating', 'audience_score')
Description<-c('Rating on IMDB', 'Number of votes on IMDB','Critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten)','Critics score on Rotten Tomatoes','Audience rating on Rotten Tomatoes (Spilled, Upright)','Audience score on Rotten Tomatoes' )
DataType <- c("num", "int", "factor", "num", "factor", "num")
df <- cbind(Variable, Description, DataType)
kable(df)
Variable | Description | DataType |
---|---|---|
imdb_rating | Rating on IMDB | num |
imdb_num_votes | Number of votes on IMDB | int |
critics_rating | Critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten) | factor |
critics_score | Critics score on Rotten Tomatoes | num |
audience_rating | Audience rating on Rotten Tomatoes (Spilled, Upright) | factor |
audience_score | Audience score on Rotten Tomatoes | num |
We will not take a categorical variable as our response variable, to avoid logistic regression.
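For contrast, a sketch (not fit anywhere in this report) of what that alternative would look like: a two-level response such as audience_rating would call for logistic regression via glm() with a binomial family instead of lm().
# Illustrative only: logistic regression for a binary response.
# audience_rating (Spilled/Upright) modeled from critics_score; with a
# two-level factor response, glm() models P(second level) = P(Upright).
logit_fit <- glm(audience_rating ~ critics_score, data = training, family = binomial)
summary(logit_fit)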
library(knitr)
minv <- training %>% select(imdb_rating, imdb_num_votes, critics_score, audience_score) %>% sapply(min) %>% sapply(round,2)
maxv <- training %>% select(imdb_rating, imdb_num_votes, critics_score, audience_score) %>% sapply(max) %>% sapply(round,2)
meanv <- training %>% select(imdb_rating, imdb_num_votes, critics_score, audience_score) %>% sapply(mean) %>% sapply(round,2)
medianv <- training %>% select(imdb_rating, imdb_num_votes, critics_score, audience_score) %>% sapply(median) %>% sapply(round,2)
df <- rbind(minv, maxv, meanv, medianv)
rownames(df) <- c("min", "max", "mean", "median")
kable(df)
imdb_rating | imdb_num_votes | critics_score | audience_score | |
---|---|---|---|---|
min | 1.90 | 180.0 | 1.00 | 11.00 |
max | 9.00 | 893008.0 | 100.00 | 97.00 |
mean | 6.49 | 57620.9 | 57.73 | 62.42 |
median | 6.60 | 15203.5 | 61.00 | 65.00 |
imdb_rating is on a scale of 1 to 10, scored by users of IMDb. An IMDb user has one vote per title and can change that vote at any time. The totals are converted into a weighted mean rating that is displayed beside each title on the website.
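IMDb does not publish the exact weighting it uses site-wide, but its well-known Top 250 formula shrinks a title's mean rating toward the overall site mean when the vote count is low. A hypothetical sketch of that idea, with made-up constants:
# Hypothetical sketch of a weighted mean rating in the spirit of IMDb's
# published Top 250 formula; the site-wide weighting itself is not public.
# R = mean rating for the title, v = number of votes for the title,
# m = vote threshold (tuning constant), C = mean rating across all titles.
weighted_rating <- function(R, v, m, C) {
  (v / (v + m)) * R + (m / (v + m)) * C
}
# A title rated 8.2 from 500 votes is shrunk toward a site mean of 6.5:
weighted_rating(R = 8.2, v = 500, m = 1000, C = 6.5)  # ~7.07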
The imdb_num_votes is the number of IMDb users who voted for a particular film. It is an open-ended scale starting at 0; the maximum depends on how many users of the website voted for that film.
critics_score and audience_score are ratings from the website Rotten Tomatoes. Both are on a scale of 1 to 100. Each movie features a "user average," which is the percentage of users who rated the film positively. Users rate a movie on a scale of 0-10, while critics' reviews generally use 4-star ratings and are often qualitative.
We cannot make a judgment without looking at the distributions of these variables.
gp1 <- ggplot(data = training, aes(x = imdb_rating)) + geom_histogram(colour = "black", fill = "salmon", binwidth = .3)
gp2 <- ggplot(data = training, aes(x = imdb_num_votes)) + geom_histogram(colour = "black", fill = "magenta", binwidth = 40000, alpha = 0.5)
gp3 <- ggplot(data = training, aes(x = critics_score)) + geom_histogram(colour = "black", fill = "cyan", binwidth = 5, alpha = 0.5)
gp4 <- ggplot(data = training, aes(x = audience_score)) + geom_histogram(colour = "black", fill = "yellow", binwidth = 5, alpha = 0.7)
grid.arrange(gp1, gp2, gp3, gp4, nrow = 2, ncol = 2)
Looking at the above plots and the quantiles below, the significant observations are:
- imdb_rating is unimodal and left-skewed (its mean of 6.49 sits below its median of 6.6).
- imdb_num_votes is strongly right-skewed (its mean of 57,621 is far above its median of 15,204), with a long tail of heavily voted films.
- critics_score is spread fairly evenly across its range, with a slight left skew.
- audience_score is moderately left-skewed (mean 62.42 below median 65).
quantile(training$imdb_rating, c(0, 0.25, 0.5, 0.75, 1))
## 0% 25% 50% 75% 100%
## 1.9 5.9 6.6 7.3 9.0
quantile(training$imdb_num_votes, c(0, 0.25, 0.5, 0.75, 0.9, 1))
## 0% 25% 50% 75% 90% 100%
## 180.00 4584.25 15203.50 58484.25 151962.20 893008.00
quantile(training$critics_score, c(0, 0.25, 0.5, 0.75, 1))
## 0% 25% 50% 75% 100%
## 1 33 61 83 100
quantile(training$audience_score, c(0, 0.25, 0.5, 0.75, 1))
## 0% 25% 50% 75% 100%
## 11 46 65 80 97
We have two categorical variables, critics_rating and audience_rating, both taken from the same website and measuring closely related things. We will keep only one of them in our model so that the predictors stay independent of each other.
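Before choosing which one to keep, a quick sketch (not in the original output) of how strongly the two are associated; a lopsided contingency table would support keeping just one:
# Cross-tabulate the two Rotten Tomatoes categorical variables to see how
# strongly they agree before keeping only one of them.
table(training$critics_rating, training$audience_rating)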
gp1 <- ggplot(data = training, aes(y = imdb_rating, x = critics_score, colour = critics_rating)) + geom_point()
plot(gp1)
gp2 <- ggplot(data = training, aes(y = imdb_rating, x = critics_score, colour = audience_rating)) + geom_point()
plot(gp2)
- From the above plots, a similar trend can be seen between imdb_rating and audience_rating, even though imdb_rating comes from IMDb and audience_rating from Rotten Tomatoes.
- audience_rating, however, has only 2 levels in its categorical variable.
Is movie popularity, as measured by IMDb rating, associated with other film characteristics, such as having actors, actresses, or directors who won an Oscar, critic reviews, the studio that produced the movie, or the year the movie was released in theaters?
actor <- table(training$best_actor_win)
actress <- table(training$best_actress_win)
director <- table(training$best_dir_win)
flm2 <- training %>% mutate(oscar = ifelse(best_actor_win == "yes" | best_actress_win == "yes" | best_dir_win == "yes", "yes", "no"))
testing = mutate(testing,
feature_film = ifelse(title_type == 'Feature Film', "yes", "no"),
drama = ifelse(genre == 'Drama', "yes", "no"),
mpaa_rating_R = ifelse(mpaa_rating == 'R', "yes", "no"),
oscar_season = ifelse(thtr_rel_month %in% c(10, 11, 12), "yes", "no"),
summer_season = ifelse(thtr_rel_month %in% c(5, 6, 7, 8), "yes", "no"))
flm2 = mutate(flm2,
feature_film = ifelse(title_type == 'Feature Film', "yes", "no"),
drama = ifelse(genre == 'Drama', "yes", "no"),
mpaa_rating_R = ifelse(mpaa_rating == 'R', "yes", "no"),
oscar_season = ifelse(thtr_rel_month %in% c(10, 11, 12), "yes", "no"),
summer_season = ifelse(thtr_rel_month %in% c(5, 6, 7, 8), "yes", "no"))
osc <- flm2 %>% select(oscar) %>% group_by(oscar) %>% table() %>% rbind(actor, actress, director)
rownames(osc) <- c("At.least.one.Oscar", "best.actor", "best.actress", "best.director")
osc
## no yes
## At.least.one.Oscar 479 171
## best.actor 557 93
## best.actress 578 72
## best.director 607 43
We have added the following new variables:
1. oscar, which has two levels, combining all 3 Oscar variables for cast and director.
2. feature_film, which checks whether the title type is "Feature Film".
3. mpaa_rating_R, which checks whether the MPAA rating is R.
4. oscar_season, which checks whether the theater release month is October, November, or December.
5. summer_season, which checks whether the theater release month is May, June, July, or August.
6. drama, which checks whether the genre is Drama.
The goal is to observe whether having at least one Oscar winner among the cast and director, irrespective of the field, is a good predictor of movie popularity.
oscar_in_cast <- flm2 %>% filter(oscar == "yes") %>% arrange(imdb_rating) %>% select(title) %>% data.frame() %>% head(6)
x1 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = oscar)) + geom_point() + scale_colour_discrete(name="Combined") + scale_fill_hue(name="Combined")
x2 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = best_actor_win)) + geom_point() + scale_colour_discrete(name="Actor") + scale_fill_hue(name="Actor")
x3 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = best_actress_win)) + geom_point() + scale_colour_discrete(name="Actress") + scale_fill_hue(name="Actress")
x4 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = best_dir_win)) + geom_point() + scale_colour_discrete(name="Director") + scale_fill_hue(name="Director")
grid.arrange(x1, x2, x3, x4, nrow = 2, ncol = 2)
- From the above plots, having an Oscar winner in the cast or crew does not cleanly separate highly rated movies from poorly rated ones; a handful of movies with Oscar winners still received very low ratings, listed below:
flm2 %>% select(title, imdb_rating) %>% filter(oscar == 'yes' & imdb_rating <= 5) %>% arrange(imdb_rating)
## # A tibble: 5 x 2
## title imdb_rating
## <chr> <dbl>
## 1 Battlefield Earth 2.4
## 2 Jack and Jill 3.4
## 3 Town & Country 4.3
## 4 Ghost Rider: Spirit of Vengeance 4.3
## 5 Body of Evidence 4.3
nom <- table(flm2$best_pic_nom)
win <- table(flm2$best_pic_win)
flm2 <- flm2 %>% mutate(oscar_nom_win = ifelse(best_pic_nom == "yes" | best_pic_win == "yes", "yes", "no"))
nom_win <- table(flm2$oscar_nom_win)
comb_nom_win <- rbind(nom_win, nom, win)
rownames(comb_nom_win) <- c("combined", "nominations", "wins")
comb_nom_win
## no yes
## combined 627 23
## nominations 628 22
## wins 643 7
flm2 %>% select(title, best_pic_nom, oscar_nom_win) %>% filter(best_pic_nom == "no" & oscar_nom_win == "yes")
## # A tibble: 1 x 3
## title best_pic_nom oscar_nom_win
## <chr> <fct> <chr>
## 1 The Hurt Locker no yes
According to the data, the movie The Hurt Locker won a best picture Oscar without being nominated for one. We cross-checked online resources and found that it was in fact nominated and won, so this appears to be a data entry error.
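A minimal sketch of how the record could be corrected, assuming we trust the external sources over the dataset; we did not apply this fix in the analysis that follows.
# Illustrative only: correct the nomination flag for The Hurt Locker on a
# copy, so the results above and below are unchanged. "yes" is an existing
# factor level, so plain assignment is safe here.
flm2_fixed <- flm2
flm2_fixed$best_pic_nom[flm2_fixed$title == "The Hurt Locker"] <- "yes"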
gp1 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = oscar_nom_win)) + geom_point()
gp2 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = best_pic_nom)) + geom_point()
gp3 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = best_pic_win)) + geom_point()
grid.arrange(gp1, gp2, gp3, nrow = 1, ncol = 3)
gp1 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = drama)) + geom_point()+scale_colour_discrete(name="Drama") + scale_fill_hue(name="Director")
gp2 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = mpaa_rating_R)) + geom_point()+scale_colour_discrete(name="R mpaa_rating") + scale_fill_hue(name="Director")
gp3 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = feature_film)) + geom_point()+scale_colour_discrete(name="feature film") + scale_fill_hue(name="Director")
grid.arrange(gp1, gp2, gp3, nrow = 1, ncol = 3)
flm2 %>% select(title, imdb_rating, best_pic_nom, oscar_nom_win) %>% filter(best_pic_nom == "yes" & imdb_rating < 7)
## # A tibble: 1 x 4
## title imdb_rating best_pic_nom oscar_nom_win
## <chr> <dbl> <fct> <chr>
## 1 A Star is Born 6.5 yes yes
outlier_best_pic_nom <- flm2 %>% filter(best_pic_nom == "yes") %>% arrange(imdb_rating) %>% data.frame() %>% head(1)
g1 <- ggplot(data = flm2, aes(x = thtr_rel_year)) + geom_histogram(colour = "black", fill = "orange", alpha = 0.5)
g2 <- ggplot(data = flm2, aes(x = thtr_rel_month)) + geom_histogram(colour = "black", fill = "blue", alpha = 0.5)
g3 <- ggplot(data = flm2, aes(x = thtr_rel_day)) + geom_histogram(colour = "black", fill = "green", alpha = 0.5)
grid.arrange(g1,g2,g3,nrow =1, ncol =3)
g4 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = factor(thtr_rel_year))) + geom_point()
plot(g4)
g5 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = factor(thtr_rel_month))) + geom_point()
plot(g5)
g6 <- ggplot(data = movies, aes(y = imdb_rating, x = critics_score, colour = factor(thtr_rel_day))) + geom_point()
plot(g6)
g1 <- ggplot(data = flm2, aes(x = dvd_rel_year)) + geom_histogram(colour = "black", fill = "orange", alpha = 0.5)
g2 <- ggplot(data = flm2, aes(x = dvd_rel_month)) + geom_histogram(colour = "black", fill = "blue", alpha = 0.5)
g3 <- ggplot(data = flm2, aes(x = dvd_rel_day)) + geom_histogram(colour = "black", fill = "green", alpha = 0.5)
grid.arrange(g1,g2,g3,nrow =1, ncol =3)
gp1 <-ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = factor(dvd_rel_year))) + geom_point()
plot(gp1)
gp2 <- ggplot(data = movies, aes(y = imdb_rating, x = critics_score, colour = factor(dvd_rel_month))) + geom_point()
plot(gp2)
gp3 <- ggplot(data = flm2, aes(y = imdb_rating, x = critics_score, colour = factor(dvd_rel_day))) + geom_point()
plot(gp3)
The histograms above show that there are particular years, months, and days on which more DVDs are released for the first time. We do not observe any clustering of these points in the scatterplots along the imdb_rating and critics_score scales.
num_var <- flm2 %>% select(imdb_rating,imdb_num_votes, critics_score,audience_score)
ggcorr(num_var, name = "Correlation", label = TRUE, alpha = TRUE, palette = "PuOr") + ggtitle("Correlation Matrix Plot") + theme_light()
Our goal is to create a model of what makes a movie popular. We choose imdb_rating as our response variable because its distribution is the closest to normal among the numerical variables that measure movie popularity.
Much of why a movie is popular comes down to how many moviegoers appreciate it. We choose audience_score as our first predictor because it is the most strongly correlated with our response variable in the correlation plot above, and because it makes intuitive sense that audience approval matters for a movie to be popular.
fit1 <- lm(imdb_rating ~ audience_score, data = flm2)
summary(fit1)
##
## Call:
## lm(formula = imdb_rating ~ audience_score, data = flm2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2074 -0.1865 0.0704 0.3096 1.1557
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.590240 0.069413 51.72 <2e-16 ***
## audience_score 0.046525 0.001058 43.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5441 on 648 degrees of freedom
## Multiple R-squared: 0.7489, Adjusted R-squared: 0.7486
## F-statistic: 1933 on 1 and 648 DF, p-value: < 2.2e-16
fit2 <- lm(imdb_rating ~ audience_score + critics_score, data = flm2)
summary(fit2)
##
## Call:
## lm(formula = imdb_rating ~ audience_score + critics_score, data = flm2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.52044 -0.20001 0.03047 0.30616 1.22958
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6383938 0.0625863 58.13 <2e-16 ***
## audience_score 0.0348491 0.0013405 26.00 <2e-16 ***
## critics_score 0.0117895 0.0009526 12.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4897 on 647 degrees of freedom
## Multiple R-squared: 0.797, Adjusted R-squared: 0.7964
## F-statistic: 1270 on 2 and 647 DF, p-value: < 2.2e-16
fit3 <- lm(imdb_rating ~ audience_score + critics_score + imdb_num_votes, data = flm2)
summary(fit3)
##
## Call:
## lm(formula = imdb_rating ~ audience_score + critics_score + imdb_num_votes,
## data = flm2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.48392 -0.18513 0.02148 0.29555 1.17567
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.674e+00 6.203e-02 59.226 < 2e-16 ***
## audience_score 3.354e-02 1.347e-03 24.895 < 2e-16 ***
## critics_score 1.176e-02 9.373e-04 12.543 < 2e-16 ***
## imdb_num_votes 8.332e-07 1.761e-07 4.732 2.74e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4818 on 646 degrees of freedom
## Multiple R-squared: 0.8038, Adjusted R-squared: 0.8029
## F-statistic: 882.2 on 3 and 646 DF, p-value: < 2.2e-16
fit4 <- lm(imdb_rating ~ audience_score + critics_score + imdb_num_votes+oscar, data = flm2)
summary(fit4)
##
## Call:
## lm(formula = imdb_rating ~ audience_score + critics_score + imdb_num_votes +
## oscar, data = flm2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.45897 -0.19332 0.01364 0.28747 1.20491
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.647e+00 6.278e-02 58.102 < 2e-16 ***
## audience_score 3.374e-02 1.345e-03 25.085 < 2e-16 ***
## critics_score 1.159e-02 9.363e-04 12.382 < 2e-16 ***
## imdb_num_votes 7.673e-07 1.776e-07 4.320 1.8e-05 ***
## oscaryes 1.039e-01 4.342e-02 2.393 0.017 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.48 on 645 degrees of freedom
## Multiple R-squared: 0.8055, Adjusted R-squared: 0.8043
## F-statistic: 667.9 on 4 and 645 DF, p-value: < 2.2e-16
fit5 <- lm(imdb_rating ~ audience_score + critics_score + imdb_num_votes+oscar+summer_season, data = flm2)
summary(fit5)
##
## Call:
## lm(formula = imdb_rating ~ audience_score + critics_score + imdb_num_votes +
## oscar + summer_season, data = flm2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.48947 -0.19010 0.02537 0.27370 1.27897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.688e+00 6.417e-02 57.475 < 2e-16 ***
## audience_score 3.358e-02 1.339e-03 25.070 < 2e-16 ***
## critics_score 1.169e-02 9.322e-04 12.541 < 2e-16 ***
## imdb_num_votes 7.769e-07 1.767e-07 4.396 1.29e-05 ***
## oscaryes 9.911e-02 4.323e-02 2.293 0.02220 *
## summer_seasonyes -1.115e-01 4.022e-02 -2.771 0.00575 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4775 on 644 degrees of freedom
## Multiple R-squared: 0.8078, Adjusted R-squared: 0.8063
## F-statistic: 541.4 on 5 and 644 DF, p-value: < 2.2e-16
We used a forward selection approach with a combined criterion of p-value, adjusted R-squared, and logical reasoning.
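As a sketch of that comparison (not in the original output), the adjusted R-squared of each nested fit above can be collected in one call:
# Adjusted R-squared at each step of the forward selection.
sapply(list(fit1 = fit1, fit2 = fit2, fit3 = fit3, fit4 = fit4, fit5 = fit5),
       function(f) summary(f)$adj.r.squared)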
The final model is: imdb_rating = 3.688 + 0.0336 × audience_score + 0.0117 × critics_score + 7.769e-07 × imdb_num_votes + 0.0991 × oscar:yes - 0.1115 × summer_season:yes
Intercept - The intercept (3.688) is interpreted as the predicted mean imdb_rating when audience_score, critics_score, imdb_num_votes, oscar, and summer_season are all 0. The intercept is usually meaningless when the predictor variables are continuous; moreover, the critics_score scale starts at 1 and the minimum audience_score in our data is 11, so a value of 0 lies beyond the range of our data points and assuming it would be extrapolation.
Coefficient for audience_score - All else held constant, for every one-unit increase in audience_score the model predicts a 0.0336 increase in imdb_rating on average. The coefficient is small because audience_score runs from 1 to 100 while imdb_rating runs from 1 to 10; put differently, all else held constant, a ten-unit increase in audience_score predicts roughly a 0.34 increase in imdb_rating on average.
Coefficient for critics_score - All else held constant, for every one-unit increase in critics_score the model predicts a 0.0117 increase in imdb_rating on average.
Coefficient for imdb_num_votes - All else held constant, for every additional vote in imdb_num_votes the model predicts a 7.77 x 10^-7 increase in imdb_rating on average. The coefficient is tiny because imdb_num_votes reaches into the hundreds of thousands whereas imdb_rating lies between 1 and 10.
Coefficient for oscar - All else held constant, the model predicts that the imdb_rating for movies with at least one Oscar winner in the cast (actor, actress, or director) is on average 0.099 points higher than for movies without one. The difference in popularity based on imdb_rating between the two groups is small.
Coefficient for summer_season - All else held constant, the model predicts that the imdb_rating for movies released in the summer season is on average 0.112 points lower than for movies released in other months.
R squared - 80.63% (adjusted R squared) of the variability of imdb_rating is explained by the model.
P values - All the coefficients in our model have p-values less than 0.05. The explanatory variables in our model are significant predictors of the response variable (imdb_rating); their slopes differ from 0.
We will now proceed to perform model diagnostics to check whether our model meets the conditions required for linear regression to be valid.
### Linear relationships between each (numerical) predictor and the response
t1 <- ggplot(data = flm2, aes(x = audience_score, y = resid(fit5))) + geom_hline(yintercept = 0, size = 1) + xlab("audience_score") + ylab("Residual") + geom_point()
t2 <- ggplot(data = flm2, aes(x = critics_score, y = resid(fit5))) + geom_hline(yintercept = 0, size = 1) + xlab("critics_score") + ylab("Residual") + geom_point()
t3 <- ggplot(data = flm2, aes(x =imdb_num_votes, y = resid(fit5))) + geom_hline(yintercept = 0, size = 1) + xlab("IMDB no of votes") + ylab("Residual") + geom_point()
grid.arrange(t1, t2,t3, nrow = 2, ncol = 2)
Based on the residual plots above, the residuals scatter randomly around zero with no discernible pattern against any of the predictors, indicating a linear relationship between each numerical predictor and the response. This condition is met by our model.
par(mfrow = c(1,2))
hist(fit5$residuals, breaks = 25, main = "Histogram of Residuals", col = "salmon", border = "yellow", prob = TRUE)
curve(dnorm(x, mean = mean(fit5$residuals), sd = sd(fit5$residuals)), col="cyan", add=T, lwd = 3)
qqnorm(fit5$residuals)
qqline(fit5$residuals)
We can see a strong left skew in the histogram of our residuals (left) and in the normal probability plot (right) above.
Our model fails to meet the condition of nearly normal residuals. This means that our model is not very reliable when audience score or critics score is low.
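A complementary formal check, not part of the original analysis: the Shapiro-Wilk test on the residuals, where a small p-value would agree with the visual evidence of skew.
# Shapiro-Wilk normality test on the model residuals (n = 650 is within
# the test's supported range of 3 to 5000 observations).
shapiro.test(resid(fit5))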
t3 <- ggplot(data.frame(x = fit5$fitted.values, y = resid(fit5)), aes(x=x, y=y)) + geom_hline(yintercept = 0, size = 1) + xlab("fitted.values") + ylab("Residual") + geom_point()
grid.arrange(t3, nrow = 1, ncol = 1)
The plot of the residuals against the fitted values above reveals a random scatter with no fan shape, so the variability of the residuals is roughly constant and the observations appear to be independent of each other. This condition is met.
## Prediction
testing2 <- testing %>% mutate(oscar = ifelse(best_actor_win == "yes" | best_actress_win == "yes" | best_dir_win == "yes", "yes", "no"))
newprof <- testing2 %>% select(audience_score, critics_score, imdb_num_votes,oscar, summer_season)
pred <- predict(fit5, newprof, interval = "prediction", level = 0.95, se.fit = TRUE)
pred
## $fit
## fit lwr upr
## 1 4.867111 3.925388 5.808834
##
## $se.fit
## [1] 0.04406653
##
## $df
## [1] 644
##
## $residual.scale
## [1] 0.4775478
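As a quick sanity check, a sketch (not in the original output) comparing the held-out movie's actual rating with the interval above:
# Does the actual imdb_rating of the held-out movie fall inside the
# 95% prediction interval computed above?
actual <- testing2$imdb_rating
c(actual = actual, inside = actual >= pred$fit[, "lwr"] & actual <= pred$fit[, "upr"])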
Our model has demonstrated that with a few predictors we can predict, with a certain amount of accuracy, the popularity of movies using imdb_rating as the measure of popularity. However, we have to remember that our predictors audience_score and critics_score are subjective measures that are prone to bias: both critics and audiences are finicky in their choices, influenced by many variables not found in our dataset, such as age, race, and gender.
Looking at the model's residual plots, there seems to be greater variance when rating bad movies than good movies, and critics are more likely than the audience to give extreme ratings.
The small number of movies in our dataset that appear in the Top 200 Box Office list might have limited that variable's ability to achieve a small p-value; it might have been better to use a continuous variable such as gross movie sales.
We also could have used a larger number of observations in our testing set to get a better measure of the model's accuracy.
df <- select(flm2, audience_score, feature_film,drama,mpaa_rating_R, oscar_season,summer_season)
dfmelt <- melt(df, measure.vars = 2:6)
ggplot(dfmelt, aes(x=value, y=audience_score,fill=variable))+
geom_boxplot()+
facet_grid(.~variable)+
labs(x="Type",y="Audience Score")
dfmelt %>%
group_by(variable,value) %>%
summarize(avg_rating = mean(audience_score))
## # A tibble: 10 x 3
## # Groups: variable [5]
## variable value avg_rating
## <fct> <chr> <dbl>
## 1 feature_film no 81.0
## 2 feature_film yes 60.5
## 3 drama no 59.7
## 4 drama yes 65.5
## 5 mpaa_rating_R no 62.8
## 6 mpaa_rating_R yes 62.0
## 7 oscar_season no 61.9
## 8 oscar_season yes 63.7
## 9 summer_season no 62.7
## 10 summer_season yes 61.8
Looking at the summary and the boxplots, we can see that except for feature_film, the other categorical variables have little to no effect on audience_score. A reason could be that feature films, released in theaters, are judged much more harshly than documentaries and TV movies, whose audiences are more self-selected.
df2 <- select(flm2,audience_score, feature_film, drama, runtime, mpaa_rating_R, thtr_rel_year, oscar_season, summer_season, imdb_rating, imdb_num_votes, critics_score, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box)
fit1 <- bas.lm(audience_score ~ ., data=df2,
prior="BIC",
modelprior = uniform())
summary(fit1)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.0000000 1.0000000
## feature_filmyes 0.06568794 0.0000 0.0000000 0.0000000
## dramayes 0.04658612 0.0000 0.0000000 0.0000000
## runtime 0.46961842 1.0000 0.0000000 0.0000000
## mpaa_rating_Ryes 0.23638849 0.0000 0.0000000 1.0000000
## thtr_rel_year 0.10407313 0.0000 0.0000000 0.0000000
## oscar_seasonyes 0.08113439 0.0000 0.0000000 0.0000000
## summer_seasonyes 0.07534015 0.0000 0.0000000 0.0000000
## imdb_rating 1.00000000 1.0000 1.0000000 1.0000000
## imdb_num_votes 0.05702086 0.0000 0.0000000 0.0000000
## critics_score 0.86805460 1.0000 1.0000000 1.0000000
## best_pic_nomyes 0.13420351 0.0000 0.0000000 0.0000000
## best_pic_winyes 0.03992350 0.0000 0.0000000 0.0000000
## best_actor_winyes 0.15378167 0.0000 0.0000000 0.0000000
## best_actress_winyes 0.14879547 0.0000 0.0000000 0.0000000
## best_dir_winyes 0.06762404 0.0000 0.0000000 0.0000000
## top200_boxyes 0.04746833 0.0000 0.0000000 0.0000000
## BF NA 1.0000 0.9896583 0.3074889
## PostProbs NA 0.1172 0.1160000 0.0360000
## R2 NA 0.7557 0.7533000 0.7548000
## dim NA 4.0000 3.0000000 4.0000000
## logmarg NA -3606.9449 -3606.9552712 -3608.1241920
## model 4 model 5
## Intercept 1.0000000 1.0000000
## feature_filmyes 0.0000000 0.0000000
## dramayes 0.0000000 0.0000000
## runtime 1.0000000 0.0000000
## mpaa_rating_Ryes 1.0000000 0.0000000
## thtr_rel_year 0.0000000 0.0000000
## oscar_seasonyes 0.0000000 0.0000000
## summer_seasonyes 0.0000000 0.0000000
## imdb_rating 1.0000000 1.0000000
## imdb_num_votes 0.0000000 0.0000000
## critics_score 1.0000000 1.0000000
## best_pic_nomyes 0.0000000 0.0000000
## best_pic_winyes 0.0000000 0.0000000
## best_actor_winyes 0.0000000 1.0000000
## best_actress_winyes 0.0000000 0.0000000
## best_dir_winyes 0.0000000 0.0000000
## top200_boxyes 0.0000000 0.0000000
## BF 0.2930721 0.2738347
## PostProbs 0.0343000 0.0321000
## R2 0.7572000 0.7548000
## dim 5.0000000 4.0000000
## logmarg -3608.1722124 -3608.2401063
Printing the model object and calling summary gives us both the posterior inclusion probability of each variable and the most probable models. The most likely model, which has a posterior probability of 0.1172, includes only the intercept, runtime, imdb_rating, and critics_score. While a posterior probability of 0.1172 sounds small, it is much larger than the uniform prior probability assigned to each model, since with 16 candidate predictors there are 2^16 possible models.
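The same output can also be read off a plot: BAS can draw the model space directly, with each column a model (shaded by posterior probability) and each row a candidate predictor. A sketch, not in the original output:
# Visualise the top models explored by BAS; rotate = FALSE keeps the
# predictors on the y-axis and the models on the x-axis.
image(fit1, rotate = FALSE)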
We refit the final model with the 3 variables above.
df3 <- select(df2,runtime, imdb_rating,critics_score,audience_score)
fit2 <- bas.lm(audience_score ~ ., df3,
prior="BIC",
modelprior = uniform())
summary(fit2)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.0000000 1.0000 1.0000000 1.0000000
## runtime 0.5133911 1.0000 0.0000000 1.0000000
## imdb_rating 1.0000000 1.0000 1.0000000 1.0000000
## critics_score 0.8888276 1.0000 1.0000000 0.0000000
## BF NA 1.0000 0.9896583 0.1492362
## PostProbs NA 0.4467 0.4421000 0.0667000
## R2 NA 0.7557 0.7533000 0.7518000
## dim NA 4.0000 3.0000000 3.0000000
## logmarg NA -3606.9449 -3606.9552712 -3608.8471008
## model 4 model 5
## Intercept 1.000000e+00 1.000000e+00
## runtime 0.000000e+00 0.000000e+00
## imdb_rating 1.000000e+00 0.000000e+00
## critics_score 0.000000e+00 1.000000e+00
## BF 9.962547e-02 3.224912e-100
## PostProbs 4.450000e-02 0.000000e+00
## R2 7.490000e-01 4.952000e-01
## dim 2.000000e+00 2.000000e+00
## logmarg -3.609251e+03 -3.836032e+03
The residuals vs. fitted values plot (below) shows no pattern. Moreover, the posterior inclusion probabilities of all three variables are above 0.5, with imdb_rating near 1.
par(mfrow = c(1,2))
plot(fit2,which=c(1,2))
par(mfrow = c(1,2))
plot(fit2,which=c(3,4))
audience_score = 62.41 - 0.0277 × runtime + 14.99 × imdb_rating + 0.0626 × critics_score
We also have credible intervals for each coefficient. Since this is Bayesian statistics, there is a 95% probability that the true coefficient lies within this interval (from lower to upper):
confint(coefficients(fit2))
## 2.5% 97.5% beta
## Intercept 61.63023177 63.1765969 62.40523883
## runtime -0.08196549 0.0000000 -0.02769210
## imdb_rating 13.70673380 16.5734523 14.98733796
## critics_score 0.00000000 0.1051097 0.06257525
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
deathpool <- data.frame(runtime=103, imdb_rating = 8, critics_score = 84)
pre <- predict(fit2,deathpool,estimator="MPM", se.fit=TRUE)
pre$fit
## [1] 86.95391
## attr(,"model")
## [1] 0 1 2 3
## attr(,"best")
## [1] 1
## attr(,"estimator")
## [1] "MPM"
Our model predicts an audience score of roughly 87 points. This is a very good prediction: the real audience score is 90.
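We can also attach an uncertainty band to this point prediction. Assuming the installed BAS version provides a confint method for prediction objects (available when se.fit = TRUE is requested, as above), a sketch:
# 95% credible interval for the predicted audience score above.
confint(pre, parm = "pred")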
We conclude that, based on the given data, critics_score, runtime, and imdb_rating have the most effect on the audience score of a movie on average.
One shortcoming of this analysis is the lack of an informative prior. There may also be reverse causality between audience score and critics score, which we should take into account.
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_India.1252 LC_CTYPE=English_India.1252
## [3] LC_MONETARY=English_India.1252 LC_NUMERIC=C
## [5] LC_TIME=English_India.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.29 reshape2_1.4.4 lubridate_1.7.9 BAS_1.5.5
## [5] ggthemes_4.2.0 GGally_2.0.0 gridExtra_2.3 statsr_0.2.1
## [9] caret_6.0-86 lattice_0.20-41 dplyr_1.0.1 ggplot2_3.3.2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.5 class_7.3-17 utf8_1.1.4
## [4] assertthat_0.2.1 digest_0.6.25 ipred_0.9-9
## [7] foreach_1.5.0 mime_0.9 R6_2.4.1
## [10] plyr_1.8.6 stats4_4.0.2 evaluate_0.14
## [13] highr_0.8 pillar_1.4.6 rlang_0.4.7
## [16] data.table_1.13.0 rpart_4.1-15 Matrix_1.2-18
## [19] rmarkdown_2.3 labeling_0.3 splines_4.0.2
## [22] gower_0.2.2 stringr_1.4.0 munsell_0.5.0
## [25] shiny_1.5.0 compiler_4.0.2 httpuv_1.5.4
## [28] xfun_0.15 pkgconfig_2.0.3 htmltools_0.5.0
## [31] nnet_7.3-14 tidyselect_1.1.0 tibble_3.0.3
## [34] prodlim_2019.11.13 codetools_0.2-16 fansi_0.4.1
## [37] reshape_0.8.8 crayon_1.3.4 withr_2.2.0
## [40] later_1.1.0.1 MASS_7.3-51.6 recipes_0.1.13
## [43] ModelMetrics_1.2.2.2 grid_4.0.2 nlme_3.1-148
## [46] xtable_1.8-4 gtable_0.3.0 lifecycle_0.2.0
## [49] magrittr_1.5 pROC_1.16.2 scales_1.1.1
## [52] cli_2.0.2 stringi_1.4.6 farver_2.0.3
## [55] promises_1.1.1 timeDate_3043.102 ellipsis_0.3.1
## [58] generics_0.0.2 vctrs_0.3.2 lava_1.6.7
## [61] RColorBrewer_1.1-2 iterators_1.0.12 tools_4.0.2
## [64] glue_1.4.1 purrr_0.3.4 fastmap_1.0.1
## [67] survival_3.1-12 yaml_2.2.1 colorspace_1.4-1