This is a project for Bayesian Statistics course, which is a part of Coursera’s Statistics with R Specialization.
The Project aims to find out what attributes make a movie popular. To achieve this, an exploratory data analysis (EDA) has been completed, and Bayesian regression model predicting audience score for a movie has been developed.
Code buttonlibrary(ggplot2); library(dplyr); library(statsr)
library(BAS); library(tidyr); library(corrplot)
if(!file.exists("./data/1209_SR-LR-w4_Movies/movies.RData")) {
download.file("https://d3c33hcgiwev3.cloudfront.net/_e1fe0c85abec6f73c72d73926884eaca_movies.Rdata?Expires=1609459200&Signature=IQ4DIjVl-9DlmYVsHOIVdkNqCtFCLKrCV5mGIq15iVdqGl280YSagXpxBNSsn7iIKQYbaDRtynFeSet~09G22DA5KXvM7aa3U661bXIIu0JO3QW6ZMvzcW6sx1ZbpuvcAuy7wZ~iS78-rNUQnGGxOsbWXYnyyUF5UWSuJrdbJJ8_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A",
destfile = "./data/1209_SR-LR-w4_Movies/movies.RData",
method = "curl")
}
load("./data/1209_SR-LR-w4_Movies/movies.RData")The data set includes \(651\) observations on \(32\) variables composed of information from Rotten Tomatoes and IMDb for a random sample of movies. The movies are released from 1970 to 2014 and comes from US-based Studios as MPAA Ratings in the data applies only to American films.
Start with examination the data set variables, their description can be found in the Codebook.
Get rid of unnecessary variables, and create new ones for analysis:
feature_film (yes if title_type is Feature Film, no otherwisedrama (yes if genre is Drama, no otherwise)mpaa_rating_R (yes if mpaa_rating is R, no otherwise)oscar_season (based on thtr_rel_month, yes if movie is released in Nov/ Oct/ Dec, no otherwise)summer_season (based on thtr_rel_month, yes if movie is released in May/ Jun/ Jul/ Aug, no otherwise)movies <- movies %>%
transmute(
audience_score,
feature_film=as.factor(
ifelse(title_type == 'Feature Film', 'yes', 'no')),
drama=as.factor(ifelse(genre == 'Drama', 'yes', 'no')),
runtime,
mpaa_rating_R=as.factor(ifelse(mpaa_rating == 'R', 'yes', 'no')),
thtr_rel_year,
oscar_season=as.factor(
ifelse(thtr_rel_month %in% c(10:12), 'yes', 'no')),
summer_season=as.factor(
ifelse(thtr_rel_month %in% c(5:8), 'yes', 'no')),
imdb_rating, imdb_num_votes, critics_score, best_pic_nom,
best_pic_win, best_actor_win, best_actress_win, best_dir_win,
top200_box)NArows <- sum(!complete.cases(movies))Now, there is only 1 row (0.15\(\%\)) with NA values. So, it can be removed
movies<- na.omit(movies)Examine the structure of the resulting data set:
str(movies)tibble [650 × 17] (S3: tbl_df/tbl/data.frame)
$ audience_score : num [1:650] 73 81 91 76 27 86 76 47 89 66 ...
$ feature_film : Factor w/ 2 levels "no","yes": 2 2 2 2 2 1 2 2 1 2 ...
$ drama : Factor w/ 2 levels "no","yes": 2 2 1 2 1 1 2 2 1 2 ...
$ runtime : num [1:650] 80 101 84 139 90 78 142 93 88 119 ...
$ mpaa_rating_R : Factor w/ 2 levels "no","yes": 2 1 2 1 2 1 1 2 1 1 ...
$ thtr_rel_year : num [1:650] 2013 2001 1996 1993 2004 ...
$ oscar_season : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
$ summer_season : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 1 1 1 1 ...
$ imdb_rating : num [1:650] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
$ imdb_num_votes : int [1:650] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
$ critics_score : num [1:650] 45 96 91 80 33 91 57 17 90 83 ...
$ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
$ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
$ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
$ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
$ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
$ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "na.action")= 'omit' Named int 334
..- attr(*, "names")= chr "334"
There are 11 categorical variables (all with two levels yes/no), 5 numeric, and one integer (for imdb_num_votes).
Take a closer look at audience_score summary and histogram:
summ<-summary(movies$audience_score)
summ Min. 1st Qu. Median Mean 3rd Qu. Max.
11.00 46.00 65.00 62.35 80.00 97.00
The mean of audience_score is around 62.35, the median is 65, the IQR is 34.
ggplot(movies, aes(x = audience_score, y = ..density..)) +
geom_histogram(fill = "cornflowerblue", colour = "darkgrey") +
geom_density(size = 1, colour = "brown") audience_score distribution has a slightly left skewed structureUnderstand the relationship between audience_score and other numeric/ integer variables:
res <- cor.mtest(movies[,c(1,4,6,9:11)], conf.level = .95)
corrmat<- corrplot.mixed(cor(movies[,c(1,4,6,9:11)]), diag = NULL, order="FPC",
p.mat = res$p, number.font = 4, number.cex=.8,
tl.col = "purple3", tl.cex=.7, mar = c(0, 0, 1, 0))The correlation matrix shows:
audience_score and imdb_rating/ critics_scoreimdb_rating and critics_scoreCompare variability in audience_score for the main categorical variables:
data <- movies %>%
gather('category','yesno', feature_film, drama, mpaa_rating_R,
oscar_season,summer_season)
ggplot(data=data,aes(x=category, y= audience_score,fill=yesno))+
geom_boxplot(color="darkgrey", outlier.colour="cornflowerblue") +
scale_fill_manual(name="", values =
c("firebrick3","springgreen3")) +
theme(legend.title = element_blank())The boxplots shows:
drama movies have higher median than other genresaudience_score for feature_film is noticeably less than “no feature” oneaudience_score for movies released in Nov/ Oct/ Dec in average is higher in respect to other monthsmpaa_rating_R and summer_season, there is almost no difference if movie is from the respective category, or notThe goal is to predict the popularity of a movie quantified by audience_score using a linear regression model and Bayesian model averaging.
model<- bas.lm(data=movies, audience_score ~ ., method='MCMC',
prior="ZS-null", modelprior=uniform())REGRESSION FITTING:
par(mfrow=c(2,2))
plot(model, ask=FALSE, cex.lab=0.6)audience_scoreimdb_rating and critics_score are the most important predictorsMODELS RANKS:
image(model, rotate = F, cex.axis=0.9)imdb_rating and critics_score present in almost all of the top-20 highest probability modelsCONVERGENCE
diagnostics(model, type="model", col = "purple", pch=16)COEFFICIENT VALUES
coefficients(model)
Marginal Posterior Summaries of Coefficients:
Using BMA
Based on the top 6318 models
post mean post SD post p(B != 0)
Intercept 62.3476923077 0.3946336397 1.0000000000
feature_filmyes -0.1056235443 0.5674997344 0.0663467407
dramayes 0.0185910799 0.2052516528 0.0474327087
runtime -0.0254840615 0.0311544044 0.4662933350
mpaa_rating_Ryes -0.3060973255 0.7049385287 0.2018829346
thtr_rel_year -0.0047953020 0.0186969548 0.0952659607
oscar_seasonyes -0.0821517208 0.3812201276 0.0775772095
summer_seasonyes 0.0880613194 0.3850596057 0.0817718506
imdb_rating 14.9667087348 0.7400006234 0.9999870300
imdb_num_votes 0.0000002254 0.0000013659 0.0615058899
critics_score 0.0621940361 0.0307559771 0.8796325684
best_pic_nomyes 0.5375804370 1.6132832773 0.1380172729
best_pic_winyes -0.0122632925 0.8816404902 0.0422111511
best_actor_winyes -0.2849496573 0.8267121907 0.1440933228
best_actress_winyes -0.3167611464 0.9150215809 0.1453155518
best_dir_winyes -0.1231790787 0.6315837028 0.0699012756
top200_boxyes 0.0906963186 0.7215089841 0.0499572754
imdb_rating and critics_score variables have the highest probabilities that their coefficients are not zero: about \(100\%\) and \(90\%\) respectivelyimdb_rating by one point increases audience_score by almost 15 points
critics_score by one point increases audience_score by only 0.06 point because of high correlation between this variable and imdb_ratingaudience_scoreaudience_scoreno levelCREDIBLE INTERVALS
par(mfrow=c(1,3))
plot(coefficients(model), subset=c(1, 9, 11), ask=FALSE, col="cornflowerblue")imdb_rating coefficients distribution does not visually look entirely normal, there appears to be a small non-symmetric bump on the right side
audience score distributionaudience_score median
Now test the predictive capability of Bayesian model averaging on movie not included in the original data set: Mad Max: Fury Road, named the best film of 2015 by the International Federation of Film Critics. According to IMDb and Rotten Tomatoes, the movie was released in May-2015, lasts two hours, earned imdb_rating of \(8.1\), critics_score of \(97\%\), and audience_score of \(85\%\). Compare it to predicted value:
madmax <- tibble(runtime=120,
thtr_rel_year=2015,
imdb_rating=8.1,
imdb_num_votes=870737,
critics_score=97,
audience_score=0,
best_pic_nom=factor("yes", levels=c("no", "yes")),
best_pic_win=factor("no", levels=c("no", "yes")),
best_actor_win=factor("no", levels=c("no", "yes")),
best_actress_win=factor("no", levels=c("no", "yes")),
best_dir_win=factor("no", levels=c("no", "yes")),
top200_box=factor("no", levels=c("no", "yes")),
feature_film=factor("yes", levels=c("no", "yes")),
drama=factor("no", levels=c("no", "yes")),
mpaa_rating_R=factor("yes", levels=c("no", "yes")),
oscar_season=factor("no", levels=c("no", "yes")),
summer_season=factor("yes", levels=c("no", "yes")))
score_madmax <- predict(model, newdata=madmax, estimator="BMA", se.fit=TRUE)| Movie Title | Release Year | Runtime, min | IMDb Rating | Critics Score | Predicted Audience Score | True Audience Score |
|---|---|---|---|---|---|---|
| Mad Max: Fury Road | 2015 | 120 | 8.1 | 97\(\%\) | 89.12\(\%\) | \(85\%\) |
NOTE: although theaters release of the selected film is 2015, the streaming release was made in 2016, and the film is not included in the original data set
audience_score (\(89\%\) vs. \(85\%\))audience_score are critics_score and imdb_rating
imdb_rating coefficients distribution doesn’t show a normal curve?
audience_score values?audience_score?
genre variable lead to better results?