library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)
Setting the file path and loading the data
setwd('C:/Users/sanso/Documents/Git/StatsR/Bayesian')
load("movies.Rdata")
dim(movies)
## [1] 651 32
The data contains 651 observations with 32 variables.
The data pertains to movies released before 2016 and have been gathered from Rotten tomatoes and IMDB.
This data has been randomly sampled. Because it is randomly sampled, inferences on this data can be generalized to the population.
However, since no random assignment has been used (this is an observational study), we would not be able to make any causal statements irresepective of the kind of inference
The main purpose of this project is to develop a model for predicting “audience_score” using the following variables. Some already exist which we will not be changing. The other vraibles are new variables which would have to be created
Hence we would be creating 5 new variables namely
Code for creating the new variables
movies$feature_film <- ifelse(movies$title_type == "Feature Film", "yes", "no")
movies$drama <- ifelse(movies$genre =="Drama", "yes", "no")
movies$mpaa_rating_R <- ifelse(movies$mpaa_rating == "R", "yes", "no")
movies$oscar_season <- ifelse(movies$thtr_rel_month == 10 | movies$thtr_rel_month == 11 | movies$thtr_rel_month ==12, "yes","no")
movies$summer_season <- ifelse(movies$thtr_rel_month == 5 | movies$thtr_rel_month == 6 | movies$thtr_rel_month ==7 | movies$thtr_rel_month ==8,
"yes","no")
We have added 5 new variables which should now make the number of variables 37
dim(movies)
## [1] 651 37
However, of this 37 variables, we would only be using some- the 16 variables specified. I would be creating a new data set with only these specific variables for all further analysis along with the outcome variable which is audience_score
movie1 <- data.frame(movies$feature_film, movies$drama, movies$runtime,
movies$mpaa_rating_R, movies$thtr_rel_year,
movies$oscar_season, movies$summer_season,
movies$imdb_rating, movies$imdb_num_votes,
movies$critics_score, movies$best_pic_nom,
movies$best_pic_win, movies$best_actor_win,
movies$best_actress_win, movies$best_dir_win,
movies$top200_box, movies$audience_score)
dim(movie1)
## [1] 651 17
We have created a new data set “movie1”, which we would be using for all further analysis. I would be checking what the new data set looks like
names(movie1)
## [1] "movies.feature_film" "movies.drama"
## [3] "movies.runtime" "movies.mpaa_rating_R"
## [5] "movies.thtr_rel_year" "movies.oscar_season"
## [7] "movies.summer_season" "movies.imdb_rating"
## [9] "movies.imdb_num_votes" "movies.critics_score"
## [11] "movies.best_pic_nom" "movies.best_pic_win"
## [13] "movies.best_actor_win" "movies.best_actress_win"
## [15] "movies.best_dir_win" "movies.top200_box"
## [17] "movies.audience_score"
I would be renaming all the column names with the names in the original data set without the “movies” prefiix
colnames(movie1) <- c ("feature_film","drama","runtime","mpaa_rating_R",
"thtr_rel_year", "oscar_season",
"summer_season", "imdb_rating",
"imdb_num_votes", "critics_score",
"best_pic_nom","best_pic_win",
"best_actor_win", "best_actress_win",
"best_dir_win", "top_200_box",
"audience_score")
names(movie1)
## [1] "feature_film" "drama" "runtime"
## [4] "mpaa_rating_R" "thtr_rel_year" "oscar_season"
## [7] "summer_season" "imdb_rating" "imdb_num_votes"
## [10] "critics_score" "best_pic_nom" "best_pic_win"
## [13] "best_actor_win" "best_actress_win" "best_dir_win"
## [16] "top_200_box" "audience_score"
In this section, we would be conducting exploratory analysis to check what are the factors that contribute to audience_score which is our outcome variable
First I would like to see the distribution of the outcome variable - audience_score. To do this I would be making a histogram
hist(movie1$audience_score, col = "light blue", probability = T,
main = "Distribution of Audience Score",
xlab = "Audience Score"); lines(density(movie1$audience_score))
We see that audience score is a left skewed distribution with the highest density being in the 70 - 90 range
Now, I would first check the impact of the new variables that we created
g <- ggplot( data = movie1, aes(x = feature_film, y = audience_score))
g <- g + geom_boxplot(aes(fill= feature_film))
g
This shows that there exists a major difference in audince score of feature films vs non feature films.
I will be validating the difference with a t-test
t.test(audience_score ~ feature_film, data = movie1, conf.level = 0.99)
##
## Welch Two Sample t-test
##
## data: audience_score by feature_film
## t = 10.648, df = 86.899, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 99 percent confidence interval:
## 15.49363 25.67574
## sample estimates:
## mean in group no mean in group yes
## 81.05000 60.46531
Here we see that p-value is almost zero. Also the 99% confidence interval for difference of means is between 15.5 and 25.7 signifying that there is indeed a statistically signficiant difference in audience scores of feature films vs non featire films
g <- ggplot( data = movie1, aes(x = drama, y = audience_score))
g <- g + geom_boxplot(aes(fill= drama))
g
Visually there does not exist any major difference in audience score between drama and non drama movies.
I would be further conducting a t-test to validate this. Since there does not seem to be any major difference I would be using a 95% confidence level instead of the 99% I used in the earlier test
t.test(audience_score ~ drama, data = movie1)
##
## Welch Two Sample t-test
##
## data: audience_score by drama
## t = -3.5987, df = 648.92, p-value = 0.0003442
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8.680895 -2.551760
## sample estimates:
## mean in group no mean in group yes
## 59.73121 65.34754
While visually I did not perceive any major difference, the test tells me that there is a statistically significant difference between audience score of Drama vs Non drama movies.
the p- value is 0.0003 helping us reject the Null hypothesis of no difference at a 5% level of significance
Also the 95% confidence interval does not contain 0
We can infer that there is indeed a difference in the observed values of audience scores of Drama vs Non Drama movies
g <- ggplot( data = movie1, aes(x = mpaa_rating_R, y = audience_score))
g <- g + geom_boxplot(aes(fill= mpaa_rating_R))
g
Again, I do not visually see any major difference visually.
I would be validating using a t-test
t.test(audience_score ~ mpaa_rating_R, data = movie1)
##
## Welch Two Sample t-test
##
## data: audience_score by mpaa_rating_R
## t = 0.40777, df = 648.44, p-value = 0.6836
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.468251 3.762027
## sample estimates:
## mean in group no mean in group yes
## 62.68944 62.04255
Here the p-value is 0.68. Hence we will accept the Null Hypothesis that there is no difference
The 95% confidence level also contains 0 ranging from -2.5 to 3.8.
Hence we can infer that we do not observe any difference in audience score between movies rated R and movies that are not
g <- ggplot( data = movie1, aes(x = oscar_season, y = audience_score))
g <- g + geom_boxplot(aes(fill= oscar_season))
g
Again, I do not see any major visual difference (except medians) between audience scores of movies which have released in the oscar season vs movies that have released in other times
Validating with a t-test
t.test(audience_score ~ oscar_season, data = movie1)
##
## Welch Two Sample t-test
##
## data: audience_score by oscar_season
## t = -1.0685, df = 349.84, p-value = 0.286
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -5.319956 1.574316
## sample estimates:
## mean in group no mean in group yes
## 61.81304 63.68586
The t-test findings also corroborate the visual findings with a p value of 0.29. The 95% confidence interval also contains 0 ranging from -5.3 to 1.6
We accept the Null hypothesis and infer that there is no observed difference between the audience scores of movies released in the oscar season vs movies that have not
g <- ggplot( data = movie1, aes(x = summer_season, y = audience_score))
g <- g + geom_boxplot(aes(fill= summer_season))
g
Again, no visual difference between audience scores of movies released in the summer season vs those in other seasons
Validating with a t-test
t.test(audience_score ~ summer_season, data = movie1)
##
## Welch Two Sample t-test
##
## data: audience_score by summer_season
## t = 0.48351, df = 414, p-value = 0.629
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.499424 4.130089
## sample estimates:
## mean in group no mean in group yes
## 62.62302 61.80769
The t- test also reiterates what we observed visually. The p-value is 0.63 and the 95% confidence interval contains 0 ranging from -2.5 to 4.13. Hence, we would be accepting the Null hypothesis thereby infering that there is no observed difference between movies released in the summer season vs movies that have released in other seasons
The reason I want to check this is to see if there is any relation between the year of release and the audience score – to also see of it behaves in any way like a time series and if there are any trends available.
g <- ggplot(data = movie1, aes(x=thtr_rel_year, y = audience_score))
g <- g + geom_line() + geom_smooth(method = "lm",se = F)
g
There does not seem to be any trend emerging from this. While this shows that audience scores of movies have decreased very marginally over time, this is debatable since the strength of the relation is suspect.
g <- ggplot(data = movie1, aes(x = imdb_rating, y = audience_score))
g <- g + geom_point() + geom_smooth(method = "lm" , se = FALSE)
g <- g + ggtitle(label = "IMDB Rating vs Score")
g
Very Strong linear relation is observed
Validating with the correlaion coefficient
cor(movie1$imdb_rating, movie1$audience_score)
## [1] 0.8648652
The correlation also corroborates the visual finding
g <- ggplot(data = movie1, aes(x = imdb_num_votes, y = audience_score))
g <- g + geom_point() + geom_smooth(method = "lm" , se = FALSE)
g <- g + ggtitle(label = "IMDB num Votes vs Score")
g
Not a particularly strong linear relation but it exists nevertheless
Checking the correlation coefficient
cor(movie1$imdb_num_votes, movie1$audience_score)
## [1] 0.2898128
The correlation coefficient also shows that the linear relationship is not particularly strong
In this section we would be developing a bayesian linear model for predicting Audience Score (audience_score) using all the other variables specified ( data set movie1)
fit1 <- bas.lm(audience_score ~ ., data = movie1, method = 'MCMC',
modelprior = uniform())
## Warning in bas.lm(audience_score ~ ., data = movie1, method = "MCMC",
## modelprior = uniform()): dropping 1 rows due to missing data
## Warning in bas.lm(audience_score ~ ., data = movie1, method = "MCMC",
## modelprior = uniform()): We recommend using the implementation using the
## Jeffreys-Zellner-Siow prior (prior='JZS') which uses numerical integration
## rahter than the Laplace approximation
summary(fit1)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.0000000 1.0000000
## feature_filmyes 0.06777115 0.0000 0.0000000 0.0000000
## dramayes 0.04671097 0.0000 0.0000000 0.0000000
## runtime 0.46314926 0.0000 1.0000000 0.0000000
## mpaa_rating_Ryes 0.20214996 0.0000 0.0000000 1.0000000
## thtr_rel_year 0.09446869 0.0000 0.0000000 0.0000000
## oscar_seasonyes 0.07770004 0.0000 0.0000000 0.0000000
## summer_seasonyes 0.08284302 0.0000 0.0000000 0.0000000
## imdb_rating 0.99999619 1.0000 1.0000000 1.0000000
## imdb_num_votes 0.06122208 0.0000 0.0000000 0.0000000
## critics_score 0.88149872 1.0000 1.0000000 1.0000000
## best_pic_nomyes 0.13775024 0.0000 0.0000000 0.0000000
## best_pic_winyes 0.04161072 0.0000 0.0000000 0.0000000
## best_actor_winyes 0.14723282 0.0000 0.0000000 0.0000000
## best_actress_winyes 0.14515762 0.0000 0.0000000 0.0000000
## best_dir_winyes 0.06972046 0.0000 0.0000000 0.0000000
## top_200_boxyes 0.05005798 0.0000 0.0000000 0.0000000
## BF NA 1.0000 0.8702806 0.2217602
## PostProbs NA 0.1383 0.1203000 0.0309000
## R2 NA 0.7525 0.7549000 0.7539000
## dim NA 3.0000 4.0000000 4.0000000
## logmarg NA 443.9495 443.8105657 442.4433468
## model 4 model 5
## Intercept 1.0000000 1.0000000
## feature_filmyes 0.0000000 0.0000000
## dramayes 0.0000000 0.0000000
## runtime 0.0000000 1.0000000
## mpaa_rating_Ryes 0.0000000 1.0000000
## thtr_rel_year 0.0000000 0.0000000
## oscar_seasonyes 0.0000000 0.0000000
## summer_seasonyes 0.0000000 0.0000000
## imdb_rating 1.0000000 1.0000000
## imdb_num_votes 0.0000000 0.0000000
## critics_score 1.0000000 1.0000000
## best_pic_nomyes 0.0000000 0.0000000
## best_pic_winyes 0.0000000 0.0000000
## best_actor_winyes 1.0000000 0.0000000
## best_actress_winyes 0.0000000 0.0000000
## best_dir_winyes 0.0000000 0.0000000
## top_200_boxyes 0.0000000 0.0000000
## BF 0.2236679 0.2055844
## PostProbs 0.0303000 0.0287000
## R2 0.7539000 0.7563000
## dim 4.0000000 5.0000000
## logmarg 442.4519125 442.3676066
I would be plotting the residuals
plot(fit1, which = 1)
This does not show a random scattering of data points. There is a bit of a wave pattern which indicates that the model predicts skewed rating for ratings under 30
plot(fit1, which = 2)
image(fit1)
The model with the highest probablity contains only 2 parameters imdb_rating and critics_score
Checking for Normal Distribution of posterior probabilities
diagnostics(fit1, type = "model", pch = 5)
This shows that the posterior model probabilities follow a normal distribution.
In this section, I would be using the model I built to predict the audience score of a movie which is not part of the data set and is a movie which released in 2016
The link for the data set is given below
http://www.imdb.com/title/tt1431045/?ref_=nv_sr_2
https://www.rottentomatoes.com/m/deadpool/
The movie is deadpool which was released in 2016
I would be first creating a dataframe with the parameters that I would be needing
deadpool <- data.frame(runtime=108,
thtr_rel_year=2016,
imdb_rating=8.1,
imdb_num_votes=500049,
critics_score=84,
audience_score=0,
best_pic_nom=factor("no", levels=c("no", "yes")),
best_pic_win=factor("no", levels=c("no", "yes")),
best_actor_win=factor("no", levels=c("no", "yes")),
best_actress_win=factor("no", levels=c("no", "yes")),
best_dir_win=factor("no", levels=c("no", "yes")),
top200_box=factor("no", levels=c("no", "yes")),
feature_film=factor("yes", levels=c("no", "yes")),
drama=factor("no", levels=c("no", "yes")),
mpaa_rating_R=factor("yes", levels=c("no", "yes")),
oscar_season=factor("no", levels=c("no", "yes")),
summer_season=factor("no", levels=c("no", "yes")),
top_200_box = factor("no", levels = c("no", "yes")))
I would now use the model that we developed “fit1” to predict the audience score of Deadpool and check what the variance with the actual is.
While prediccting the score, I would be using the estimator as “BMA” which is the bayesian
pred <- predict(fit1, newdata = deadpool, estimator = "BMA")
pred$Ybma
## [,1]
## [1,] 87.90591
The predicted value is 88 which is close to the actual value of 90
Overall this model has performed well for the movie Deadpool.
However, most of the parameters that have been used are data points that would be available after a movies has released and running.
If I was a producer, who wanted to know what is the kind of movie to be made so as to maximise profits (Assuming high audience score relates to higher profits), this analysis and model might not be very helpful.
Ideally I would start with parameters which can be set while designing the movie like runtime, genre, rating