library(ggplot2)
library(dplyr)## Warning: Installed Rcpp (0.12.10) different from Rcpp used to build dplyr (0.12.11).
## Please reinstall dplyr to avoid random crashes or undefined behavior.
library(statsr)
library(BAS)
getwd()## [1] "C:/Users/HP/Documents"
#load("movies.Rdata")
load("movies.Rdata")
movies<-data.frame(movies)The dataset consists of variables pertaining to audience and critic review scores from Rotten Tomatoes and IMDB web sites for 651 movies released to theaters during 1970-2014.
Though Rotten Tomatoes and IMDB presumably uses randomly sampled data, no solid information was available regarding the specific sampling method used. This may impede the ability of the conclusions to be generalized to the population of all movies.
A possible selection bias may arise with regard to the audience if only the theater visiting audience is included and DVD-watching and online movie watching participants are excluded. Recall bias may affect the responses if there were a considerable time gap between the movie watched and the data collection. Moreover, a sample size of 651 may be considered modest but perhaps not adequate to unearth subtle patterns underlying in the populations which could be elicited through big data analytics and data mining with a bigger sample. This may nevertheless be adequate for classical cross-sectional analyses.
This is an observational study, more specifically, a cross-sectional survey and therefore no causal relationships can be inferred or assumed from the conclusions drawn because temporality of associations cannot be elicited as well as confounding and bias may affect the results achieved.
str(movies)## 'data.frame': 651 obs. of 32 variables:
## $ title : chr "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ thtr_rel_year : num 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ director : chr "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
head(movies)## title title_type genre runtime mpaa_rating
## 1 Filly Brown Feature Film Drama 80 R
## 2 The Dish Feature Film Drama 101 PG-13
## 3 Waiting for Guffman Feature Film Comedy 84 R
## 4 The Age of Innocence Feature Film Drama 139 PG
## 5 Malevolence Feature Film Horror 90 R
## 6 Old Partner Documentary Documentary 78 Unrated
## studio thtr_rel_year thtr_rel_month thtr_rel_day
## 1 Indomina Media Inc. 2013 4 19
## 2 Warner Bros. Pictures 2001 3 14
## 3 Sony Pictures Classics 1996 8 21
## 4 Columbia Pictures 1993 10 1
## 5 Anchor Bay Entertainment 2004 9 10
## 6 Shcalo Media Group 2009 1 15
## dvd_rel_year dvd_rel_month dvd_rel_day imdb_rating imdb_num_votes
## 1 2013 7 30 5.5 899
## 2 2001 8 28 7.3 12285
## 3 2001 8 21 7.6 22381
## 4 2001 11 6 7.2 35096
## 5 2005 4 19 5.1 2386
## 6 2010 4 20 7.8 333
## critics_rating critics_score audience_rating audience_score
## 1 Rotten 45 Upright 73
## 2 Certified Fresh 96 Upright 81
## 3 Certified Fresh 91 Upright 91
## 4 Certified Fresh 80 Upright 76
## 5 Rotten 33 Spilled 27
## 6 Fresh 91 Upright 86
## best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
## 1 no no no no no
## 2 no no no no no
## 3 no no no no no
## 4 no no yes no yes
## 5 no no no no no
## 6 no no no no no
## top200_box director actor1 actor2
## 1 no Michael D. Olmos Gina Rodriguez Jenni Rivera
## 2 no Rob Sitch Sam Neill Kevin Harrington
## 3 no Christopher Guest Christopher Guest Catherine O'Hara
## 4 no Martin Scorsese Daniel Day-Lewis Michelle Pfeiffer
## 5 no Stevan Mena Samantha Dark R. Brandon Johnson
## 6 no Chung-ryoul Lee Choi Won-kyun Lee Sam-soon
## actor3 actor4 actor5
## 1 Lou Diamond Phillips Emilio Rivera Joseph Julian Soria
## 2 Patrick Warburton Tom Long Genevieve Mooy
## 3 Parker Posey Eugene Levy Bob Balaban
## 4 Winona Ryder Richard E. Grant Alec McCowen
## 5 Brandon Johnson Heather Magee Richard Glover
## 6 Moo <NA> <NA>
## imdb_url
## 1 http://www.imdb.com/title/tt1869425/
## 2 http://www.imdb.com/title/tt0205873/
## 3 http://www.imdb.com/title/tt0118111/
## 4 http://www.imdb.com/title/tt0106226/
## 5 http://www.imdb.com/title/tt0388230/
## 6 http://www.imdb.com/title/tt1334549/
## rt_url
## 1 //www.rottentomatoes.com/m/filly_brown_2012/
## 2 //www.rottentomatoes.com/m/dish/
## 3 //www.rottentomatoes.com/m/waiting_for_guffman/
## 4 //www.rottentomatoes.com/m/age_of_innocence/
## 5 //www.rottentomatoes.com/m/10004684-malevolence/
## 6 //www.rottentomatoes.com/m/old-partner/
summary(movies)## title title_type genre
## Length:651 Documentary : 55 Drama :305
## Class :character Feature Film:591 Comedy : 87
## Mode :character TV Movie : 5 Action & Adventure: 65
## Mystery & Suspense: 59
## Documentary : 52
## Horror : 23
## (Other) : 60
## runtime mpaa_rating studio
## Min. : 39.0 G : 19 Paramount Pictures : 37
## 1st Qu.: 92.0 NC-17 : 2 Warner Bros. Pictures : 30
## Median :103.0 PG :118 Sony Pictures Home Entertainment: 27
## Mean :105.8 PG-13 :133 Universal Pictures : 23
## 3rd Qu.:115.8 R :329 Warner Home Video : 19
## Max. :267.0 Unrated: 50 (Other) :507
## NA's :1 NA's : 8
## thtr_rel_year thtr_rel_month thtr_rel_day dvd_rel_year
## Min. :1970 Min. : 1.00 Min. : 1.00 Min. :1991
## 1st Qu.:1990 1st Qu.: 4.00 1st Qu.: 7.00 1st Qu.:2001
## Median :2000 Median : 7.00 Median :15.00 Median :2004
## Mean :1998 Mean : 6.74 Mean :14.42 Mean :2004
## 3rd Qu.:2007 3rd Qu.:10.00 3rd Qu.:21.00 3rd Qu.:2008
## Max. :2014 Max. :12.00 Max. :31.00 Max. :2015
## NA's :8
## dvd_rel_month dvd_rel_day imdb_rating imdb_num_votes
## Min. : 1.000 Min. : 1.00 Min. :1.900 Min. : 180
## 1st Qu.: 3.000 1st Qu.: 7.00 1st Qu.:5.900 1st Qu.: 4546
## Median : 6.000 Median :15.00 Median :6.600 Median : 15116
## Mean : 6.333 Mean :15.01 Mean :6.493 Mean : 57533
## 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:7.300 3rd Qu.: 58301
## Max. :12.000 Max. :31.00 Max. :9.000 Max. :893008
## NA's :8 NA's :8
## critics_rating critics_score audience_rating audience_score
## Certified Fresh:135 Min. : 1.00 Spilled:275 Min. :11.00
## Fresh :209 1st Qu.: 33.00 Upright:376 1st Qu.:46.00
## Rotten :307 Median : 61.00 Median :65.00
## Mean : 57.69 Mean :62.36
## 3rd Qu.: 83.00 3rd Qu.:80.00
## Max. :100.00 Max. :97.00
##
## best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
## no :629 no :644 no :558 no :579 no :608
## yes: 22 yes: 7 yes: 93 yes: 72 yes: 43
##
##
##
##
##
## top200_box director actor1 actor2
## no :636 Length:651 Length:651 Length:651
## yes: 15 Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## actor3 actor4 actor5
## Length:651 Length:651 Length:651
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## imdb_url rt_url
## Length:651 Length:651
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
Following variables were synthesized as given in the instructions for the project.
feature_film: “yes” if title_type is “Feature Film”, “no” otherwise.
drama: “yes” if genre is “Drama”, “no” otherwise.
mpaa_rating_R: “yes” if mpaa_rating is “R”, “no” otherwise.
oscar_season: “yes” if movie is released in November, October, or December (based on thtr_rel_month), “no” otherwise.
summer_season: “yes” if movie is released in May, June, July, or August (based on thtr_rel_month), “no” otherwise.
movies <- mutate(movies, feature_film = factor(ifelse(title_type == 'Feature Film', 'yes', 'no')))
movies <- mutate(movies, drama = factor(ifelse(genre =='Drama', 'yes', 'no')))
movies <- mutate(movies, mpaa_rating_R = factor(ifelse(mpaa_rating == 'R', 'yes', 'no')))
movies <- mutate(movies, oscar_season = factor(ifelse(thtr_rel_month >= 10, 'yes', 'no')))
movies <- mutate(movies, summer_season = factor(ifelse(thtr_rel_month %in% c(5,6,7,8), 'yes', 'no')))
head(movies[c("feature_film","drama", "mpaa_rating_R", "oscar_season", "summer_season")])## feature_film drama mpaa_rating_R oscar_season summer_season
## 1 yes yes yes no no
## 2 yes yes no no no
## 3 yes no yes no yes
## 4 yes yes no yes no
## 5 yes no yes no no
## 6 no no no no no
Calling the summary command showed that there is a single missing data point in the runtime variable.
# Deleting the single missing data point in the runtime variable.
movies <- filter(movies, !is.na(runtime))#histograms and barplots
ggplot(data=movies, aes(x=title_type)) +
geom_bar(fill="red") +
xlab("Movie Type")ggplot(data=movies, aes(x=genre)) +
geom_bar(fill="green") +
xlab("Genre")ggplot(data=movies, aes(x=runtime)) +
geom_histogram(fill="blue") +
xlab("Runtime")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data=movies, aes(x=mpaa_rating)) +
geom_bar(fill="black") +
xlab("MPAA Rating")ggplot(data=movies, aes(x=thtr_rel_month)) +
geom_histogram(binwidth=1, fill="purple") +
xlab("Theater Release Month")ggplot(data=movies, aes(x=imdb_rating)) +
geom_histogram(binwidth=1, fill="green") +
xlab("IMDb Rating")ggplot(data=movies, aes(x=critics_score)) +
geom_histogram(binwidth=5, fill="orange") +
xlab("critics Score")ggplot(data=movies, aes(x=audience_score)) +
geom_histogram(binwidth=5, fill="blue") +
xlab("Audience Score")The runtime variable ranges from 39.0 to 267.0 minutes with a mean of 105.8 and a median of 103.0 minutes and the corresponding histogram indicates that the distribution is roughly normal.The audience score variable has a minimum of 11 and a maximum of 97 with a mean of 62.36 and a median of 65.00.
#distribution of the newly created dichotomous variables
summary(movies[,c("feature_film","drama", "mpaa_rating_R", "oscar_season", "summer_season")])## feature_film drama mpaa_rating_R oscar_season summer_season
## no : 59 no :345 no :321 no :460 no :442
## yes:591 yes:305 yes:329 yes:190 yes:208
ggplot(movies, aes(factor(feature_film), audience_score, fill=feature_film)) +
geom_boxplot() +
ggtitle('Box plot of audience Score by Feature Film') +
xlab('feature_film') +
ylab('audience_score')ggplot(movies, aes(factor(drama), audience_score, fill= drama)) +
geom_boxplot() +
ggtitle('Box plot of audience score by drama type') +
xlab('drama') +
ylab('audience_score')ggplot(movies, aes(factor(mpaa_rating_R), audience_score, fill= mpaa_rating_R)) +
geom_boxplot() +
ggtitle('Box plot of audience score by mpaa_rating_R') +
xlab('mpaa_rating_R') +
ylab('audience_score')ggplot(movies, aes(factor(oscar_season), audience_score, fill= oscar_season)) +
geom_boxplot() +
ggtitle('Box plot of audience score by oscar_season') +
xlab('oscar_season') +
ylab('audience_score')ggplot(movies, aes(factor(summer_season), audience_score, fill= summer_season)) +
geom_boxplot() +
ggtitle('Box plot of audience score by summer_season') +
xlab('summer_season') +
ylab('audience_score')According to the boxplots, out of the 5 newly created binary variables, three variables, namely, feature_film, drama, and oscar_season, show a clearly visible difference in the distribution by the audience score.
bayes_inference(y = audience_score, x = feature_film, data = movies, statistic = "mean", type = "ht", null = 0, alternative = "twosided")## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 59, y_bar_no = 81.2034, s_no = 13.6404
## n_yes = 591, y_bar_yes = 60.4653, s_yes = 19.824
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_no = mu_yes
## H2: mu_no != mu_yes
##
## Priors:
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H2:H1] = 2.794604e+13
## P(H1|data) = 0
## P(H2|data) = 1
bayes_inference(y = audience_score, x = drama, data = movies, statistic = "mean", type = "ht", null = 0, alternative = "twosided")## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 345, y_bar_no = 59.6957, s_no = 21.2981
## n_yes = 305, y_bar_yes = 65.3475, s_yes = 18.5418
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_no = mu_yes
## H2: mu_no != mu_yes
##
## Priors:
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H2:H1] = 24.1609
## P(H1|data) = 0.0397
## P(H2|data) = 0.9603
bayes_inference(y = audience_score, x = oscar_season, data = movies, statistic = "mean", type = "ht", null = 0, alternative = "twosided")## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 460, y_bar_no = 61.813, s_no = 20.1196
## n_yes = 190, y_bar_yes = 63.6421, s_yes = 20.5062
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_no = mu_yes
## H2: mu_no != mu_yes
##
## Priors:
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H1:H2] = 13.7738
## P(H1|data) = 0.9323
## P(H2|data) = 0.0677
In terms of Bayes Factors, after Jeffrey’s criteria, there seem to be a clear difference in the audience score by the three selected binary variables above.
The audience_score was the dependent variable chosen as the proxy of the popularity of a movie, and a linear regression model and bayesian model averaging were used for the modelling. Following 12 variables were selected as predictors; run_time, thtr_rel_month, dvd_rel_month, imdb_rating, critics_score,best_pic_win, best_actor_win, best_actress_win, top200_box, feature_film, drama, and oscar_season.
#extracting the variables
movies <- select(movies, runtime, thtr_rel_month, dvd_rel_month, imdb_rating, critics_score, audience_score, best_pic_win, best_actor_win, best_actress_win, top200_box, feature_film, drama, oscar_season)
#model fitting
model <- bas.lm(audience_score ~ ., data=movies, method='MCMC', prior='ZS-null', modelprior=uniform())## Warning in bas.lm(audience_score ~ ., data = movies, method = "MCMC", prior
## = "ZS-null", : dropping 8 rows due to missing data
#Diagnostic plots
par(mfrow=c(2,2))
plot(model, which=c(1, 2, 3, 4), ask=FALSE)The first plot of residuals versus fitted values shows the random error variance assumption is not met throughout the predicted values. In fact, the random scatter is acceptable in the range 30-90 but outside that range it is not well met. The second plot shows that the convergnce of the posterior probability occured with about 350 model combinations sampling. The fourth plot suggests that imdb_rating and critics_score are important predictors with higher inclusion probabilities.
image(model, rotate = FALSE)The best model with the highest posterior probability contains only two predictors, namely, imdb_rating and critics_score. In fact, both these variables are almost always included in the best 15 models.
par(mfrow=c(1,2))
plot(coefficients(model), subset=c(5, 6), ask=FALSE)#checking the normality of posterior probabilities
diagnostics(model, type="model")The above plot indicates that normality is satisfactory and the MCMC iterations were suffificient.
#predictions on the dataset and making credible intervals
BMA_model = predict(model, estimator="BMA", se.fit=TRUE)
BMA_confint_fit = confint(BMA_model, parm="mean")
BMA_confint_pred = confint(BMA_model, parm="pred")
head(cbind(BMA_confint_fit, BMA_confint_pred))## 2.5% 97.5% mean 2.5% 97.5% pred
## [1,] 45.76137 48.98393 47.29617 27.10530 66.52969 47.29617
## [2,] 75.20927 78.76845 77.06186 57.66762 97.38306 77.06186
## [3,] 79.71160 83.61221 81.53372 61.23741 100.73663 81.53372
## [4,] 71.12974 75.41777 73.37822 53.35927 93.01701 73.37822
## [5,] 38.84557 41.75027 40.27196 19.96140 60.21597 40.27196
## [6,] 82.71204 87.21853 84.81863 64.84948 104.79403 84.81863
The best movie of the year 2016 that won the Oscar was chosen for prediction. Relevant data were gained from the IMDB and Rotten Tomatoes web sites.
#making a dataframe with moonlight movie's data
moonlight <- data.frame(runtime= 111,
thtr_rel_month = 11,
dvd_rel_month = 2,
imdb_rating= 7.5,
critics_score= 98,
audience_score= 80,
best_pic_win= "yes",
best_actor_win= "no",
best_actress_win= "no",
top200_box= "yes",
feature_film= "yes",
drama= "yes",
oscar_season= "yes"
)
# predicting of audience_score using bayesian model averaging.
BMA_moonlight <- predict(model, newdata= moonlight, estimator="BMA", se.fit=TRUE)
# prediction intervals
moonlight_pred <- qt(0.95, df=BMA_moonlight$se.bma.pred[1]) *
mean(BMA_moonlight$se.bma.pred)
# Show prediction results.
outcome <- data.frame(t="moonlight",
p=sprintf("%2.1f", BMA_moonlight$Ybma),
i=sprintf("%2.1f - %2.1f", BMA_moonlight$Ybma - moonlight_pred,
BMA_moonlight$Ybma + moonlight_pred),
r= 80)
colnames(outcome) <- c("Movie", "Predicted", "95% Interval",
"Actual")
print(outcome)## Movie Predicted 95% Interval Actual
## 1 moonlight 79.9 61.5 - 98.4 80
And lo behold! :)
In conclusion, a parsimonious bayesian regression model has been produced that made pretty impressive predictions on a new movie. However, in view of the model assumptions that were not met according to the diagnostic plots, there is room for further analyses and improving the model.