library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)
library(BMS)
library(tidyr)
library(MASS)
load("movies.Rdata")
data <- movies
John Eugene Driscoll | Module 4 - Data Analysis Project | Sept 2017 Submission
The purpose of this document is to complete the data analysis project required during week 5 of the Bayesian Statistics course by Duke University (Coursera).
The background context regarding the assignment can be found at: https://www.coursera.org/learn/bayesian/supplement/VuNpl/project-information
Research question
Investigate which parameters are major influences on the audience score (audience_score) using Bayesian statistical methods. Parameters are drawn from the movies data set.
If a meaningful prediction can be drawn from this exercise, I would be interested in repurposing this approach for video game ratings, a form of media I am passionate about.
Scope of Inference
For the purposes of inference, this should be considered an observational study that uses a random sampling approach to obtain a representative sample of U.S. movies released between 1974 and 2016. Since a random sampling method is applied in data collection, the results can be generalized to movies released between 1974 and 2016.
Causation can only be inferred from a randomized experiment. This study does not meet the requirements of a randomized experiment; therefore, causation cannot be determined.
Sources of Bias
As the Rotten Tomatoes audience score is created by volunteers, the study may suffer from voluntary response bias, since people with strong opinions are more likely to participate. The voluntary participants may not be representative of the U.S. population.
Per the assignment, we must create new variables from the movies data source to complement the original variables. Then we must strip the original data set down to only the relevant features.
The new variables are:
Summary - feature_film, drama, mpaa_rating_R, oscar_season, summer_season
Detail -
Create new variable based on title_type: New variable should be called feature_film with levels yes (movies that are feature films) and no
Create new variable based on genre: New variable should be called drama with levels yes (movies that are dramas) and no
Create new variable based on mpaa_rating: New variable should be called mpaa_rating_R with levels yes (movies that are R rated) and no
Create new variable called oscar_season with levels yes (if movie is released in November, October, or December) and no
Create new variable called summer_season with levels yes (if movie is released in May, June, July, or August) and no
The relevant features for our model per the instructions are:
feature_film
drama
runtime
mpaa_rating_R
thtr_rel_year
oscar_season
summer_season
imdb_rating
imdb_num_votes
critics_score
best_pic_nom
best_pic_win
best_actor_win
best_actress_win
best_dir_win
top200_box
Code for the creation of the new parameters, along with quick summary statistics of the new and relevant features, is provided below.
# Create the new indicator variables required by the assignment
movies <- mutate(movies, mpaa_rating_R = ifelse(mpaa_rating == "R", "yes", "no"))
movies <- mutate(movies, oscar_season = ifelse(thtr_rel_month %in% c(10, 11, 12), "yes", "no"))
movies <- mutate(movies, drama = ifelse(genre == "Drama", "yes", "no"))
movies <- mutate(movies, feature_film = ifelse(title_type == "Feature Film", "yes", "no"))
movies <- mutate(movies, summer_season = ifelse(thtr_rel_month %in% c(5, 6, 7, 8), "yes", "no"))
# Convert the new character variables to factors
movies$feature_film <- as.factor(movies$feature_film)
movies$drama <- as.factor(movies$drama)
movies$mpaa_rating_R <- as.factor(movies$mpaa_rating_R)
movies$oscar_season <- as.factor(movies$oscar_season)
movies$summer_season <- as.factor(movies$summer_season)
vars <- names(movies) %in% c("feature_film", "drama", "runtime", "mpaa_rating",
  "mpaa_rating_R", "thtr_rel_year", "oscar_season", "summer_season", "imdb_rating",
  "imdb_num_votes", "critics_score", "best_pic_win", "best_actor_win",
  "best_actress_win", "best_dir_win", "top200_box", "audience_score")
subset <- movies[vars]
summary(subset)
## runtime mpaa_rating thtr_rel_year imdb_rating
## Min. : 39.0 G : 19 Min. :1970 Min. :1.900
## 1st Qu.: 92.0 NC-17 : 2 1st Qu.:1990 1st Qu.:5.900
## Median :103.0 PG :118 Median :2000 Median :6.600
## Mean :105.8 PG-13 :133 Mean :1998 Mean :6.493
## 3rd Qu.:115.8 R :329 3rd Qu.:2007 3rd Qu.:7.300
## Max. :267.0 Unrated: 50 Max. :2014 Max. :9.000
## NA's :1
## imdb_num_votes critics_score audience_score best_pic_win
## Min. : 180 Min. : 1.00 Min. :11.00 no :644
## 1st Qu.: 4546 1st Qu.: 33.00 1st Qu.:46.00 yes: 7
## Median : 15116 Median : 61.00 Median :65.00
## Mean : 57533 Mean : 57.69 Mean :62.36
## 3rd Qu.: 58301 3rd Qu.: 83.00 3rd Qu.:80.00
## Max. :893008 Max. :100.00 Max. :97.00
##
## best_actor_win best_actress_win best_dir_win top200_box mpaa_rating_R
## no :558 no :579 no :608 no :636 no :322
## yes: 93 yes: 72 yes: 43 yes: 15 yes:329
##
##
##
##
##
## oscar_season drama feature_film summer_season
## no :460 no :346 no : 60 no :443
## yes:191 yes:305 yes:591 yes:208
##
##
##
##
##
df <- arrange(subset,audience_score,critics_score,runtime,imdb_rating,imdb_num_votes,feature_film,
drama,mpaa_rating,mpaa_rating_R,thtr_rel_year,oscar_season,summer_season,best_pic_win,
best_actor_win,best_dir_win,top200_box)
arrange(df, desc(audience_score))
## # A tibble: 651 x 17
## runtime mpaa_rating thtr_rel_year imdb_rating imdb_num_votes
## <dbl> <fctr> <dbl> <dbl> <int>
## 1 202 R 1974 9.0 749783
## 2 106 Unrated 2002 8.5 2362
## 3 121 R 1991 7.5 2551
## 4 133 R 1993 8.1 103378
## 5 154 PG-13 1985 7.8 58668
## 6 113 R 2000 8.5 806911
## 7 126 R 1997 8.3 572236
## 8 137 R 1986 8.4 466400
## 9 94 Unrated 2010 7.4 1935
## 10 94 R 1996 8.2 448434
## # ... with 641 more rows, and 12 more variables: critics_score <dbl>,
## # audience_score <dbl>, best_pic_win <fctr>, best_actor_win <fctr>,
## # best_actress_win <fctr>, best_dir_win <fctr>, top200_box <fctr>,
## # mpaa_rating_R <fctr>, oscar_season <fctr>, drama <fctr>,
## # feature_film <fctr>, summer_season <fctr>
Analysis of the newly created features:
For the EDA, we have created a boxplot for each newly created feature, comparing them with the audience_score. We will also analyse the variability of the new features by comparing them to each other.
In the first graph, we compare the variability of all of the newly created features at once.
The new features show little difference in audience_score between their yes and no levels.
This suggests that none of the above features may be valuable for predicting the audience score. The only feature that might be relevant is feature_film, where there appears to be an observable difference. More investigation follows below.
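As a quick numeric check on that apparent feature_film gap, here is a minimal sketch summarising audience_score by group (a sanity check only; the exact values are not reproduced here):
# Sketch: compare audience_score between feature films and non-feature films.
movies %>%
  group_by(feature_film) %>%
  summarise(n = n(),
            mean_score = mean(audience_score),
            median_score = median(audience_score))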
grouped_movies <- gather(movies, 'variable', 'flag', 33:37)
ggplot(grouped_movies, aes(x = variable, y = audience_score, fill = flag)) + geom_boxplot()
ggplot(df, aes(x = feature_film, y = audience_score, fill = feature_film)) + geom_boxplot()
ggplot(df, aes(x = drama, y = audience_score, fill = drama)) + geom_boxplot()
ggplot(df, aes(x = mpaa_rating_R, y = audience_score, fill = mpaa_rating_R)) + geom_boxplot()
ggplot(df, aes(x = oscar_season, y = audience_score, fill = oscar_season)) + geom_boxplot()
ggplot(df, aes(x = summer_season, y = audience_score, fill = summer_season)) + geom_boxplot()
We will use a stepwise backwards elimination process to construct the final model, starting with the full model and removing variables until the information criterion (AIC, the stepAIC default) can no longer be lowered; a BIC-penalised variant is sketched after the stepAIC output below. BMA will then be used to find the best model.
In Bayesian model averaging (BMA), multiple models are averaged, weighted by their posterior probabilities, to obtain posterior distributions of coefficients and of predictions for new data.
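Concretely, for a quantity of interest $\Delta$ (a coefficient or a prediction for new data), BMA averages over the candidate models $M_1, \ldots, M_K$, weighting each by its posterior probability:
$$P(\Delta \mid Y) = \sum_{k=1}^{K} P(\Delta \mid M_k, Y)\, P(M_k \mid Y), \qquad P(M_k \mid Y) \propto P(Y \mid M_k)\, P(M_k).$$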
Full model
library(MASS)
lm1 <- lm(audience_score ~ ., data = df)
score_step <- stepAIC(lm1, trace = FALSE)
score_step$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## audience_score ~ runtime + mpaa_rating + thtr_rel_year + imdb_rating +
## imdb_num_votes + critics_score + best_pic_win + best_actor_win +
## best_actress_win + best_dir_win + top200_box + mpaa_rating_R +
## oscar_season + drama + feature_film + summer_season
##
## Final Model:
## audience_score ~ runtime + thtr_rel_year + imdb_rating + imdb_num_votes +
## critics_score + best_actress_win
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 630 63172.60 3014.825
## 2 - mpaa_rating_R 0 0.000000 630 63172.60 3014.825
## 3 - mpaa_rating 5 563.905967 635 63736.51 3010.601
## 4 - best_pic_win 1 1.948927 636 63738.46 3008.621
## 5 - oscar_season 1 9.743905 637 63748.20 3006.721
## 6 - top200_box 1 16.676369 638 63764.88 3004.891
## 7 - best_dir_win 1 90.517834 639 63855.39 3003.813
## 8 - best_actor_win 1 119.151581 640 63974.55 3003.024
## 9 - summer_season 1 168.616474 641 64143.16 3002.735
## 10 - drama 1 167.988533 642 64311.15 3002.435
## 11 - feature_film 1 166.871295 643 64478.02 3002.120
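The stepAIC output above uses the default AIC penalty (k = 2). Since BIC was the stated target criterion, here is a minimal sketch of a BIC-penalised backward elimination, assuming we simply switch the penalty to k = log(n):
# Sketch: backward elimination with the BIC penalty (k = log(n)) instead of AIC.
n <- nobs(lm1)                      # number of observations actually used in the fit
score_step_bic <- stepAIC(lm1, trace = FALSE, k = log(n))
score_step_bic$anova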
Reduced Model
vars1 <- names(movies) %in% c("audience_score","thtr_rel_year","critics_score","imdb_rating",
"imdb_num_votes")
df1 <- movies[vars1]
df1 <- na.omit(df1)
mf <- bms(df1, mprior = "unif", g = 1700, mcmc = "enumeration", user.int = FALSE)
image(mf[1:5], order = FALSE)
Using the BMA, BPM, and MPM estimators with the reduced dataset (df1, with five numerical variables); a sketch comparing these estimators follows the model summary below.
movies_no_na = na.omit(df1)
bma_aud = bas.lm(audience_score ~ ., data = movies_no_na,
prior = "BIC",
modelprior = uniform())
bma_aud
##
## Call:
## bas.lm(formula = audience_score ~ ., data = movies_no_na, prior = "BIC", modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept thtr_rel_year imdb_rating imdb_num_votes
## 1.00000 0.07265 1.00000 0.04135
## critics_score
## 0.91989
summary(bma_aud)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.000000e+00 1.000000e+00
## thtr_rel_year 0.07264667 0.0000 0.000000e+00 1.000000e+00
## imdb_rating 1.00000000 1.0000 1.000000e+00 1.000000e+00
## imdb_num_votes 0.04134915 0.0000 0.000000e+00 0.000000e+00
## critics_score 0.91988729 1.0000 0.000000e+00 1.000000e+00
## BF NA 1.0000 8.469154e-02 7.534253e-02
## PostProbs NA 0.8199 6.940000e-02 6.180000e-02
## R2 NA 0.7524 7.480000e-01 7.529000e-01
## dim NA 3.0000 2.000000e+00 4.000000e+00
## logmarg NA -3621.0579 -3.623527e+03 -3.623644e+03
## model 4 model 5
## Intercept 1.000000e+00 1.000000e+00
## thtr_rel_year 0.000000e+00 1.000000e+00
## imdb_rating 1.000000e+00 1.000000e+00
## imdb_num_votes 1.000000e+00 0.000000e+00
## critics_score 1.000000e+00 0.000000e+00
## BF 4.304024e-02 9.247604e-03
## PostProbs 3.530000e-02 7.600000e-03
## R2 7.524000e-01 7.488000e-01
## dim 4.000000e+00 3.000000e+00
## logmarg -3.624203e+03 -3.625741e+03
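As noted above, the BPM and MPM estimators can be used alongside BMA. A minimal sketch of how they could be compared on this fit, assuming the estimator argument of the BAS fitted()/predict() methods:
# Sketch: in-sample fitted values under different BAS estimators.
fit_bma <- fitted(bma_aud, estimator = "BMA")   # posterior-mean (model-averaged) fit
fit_mpm <- fitted(bma_aud, estimator = "MPM")   # median probability model
fit_bpm <- fitted(bma_aud, estimator = "BPM")   # best predictive model
head(cbind(BMA = fit_bma, MPM = fit_mpm, BPM = fit_bpm))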
library(MASS)
full_model <- bas.lm(data=na.omit(movies), audience_score ~ feature_film + drama + runtime + mpaa_rating_R + thtr_rel_year + oscar_season + summer_season + imdb_rating + imdb_num_votes
+ critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win
+ best_dir_win + top200_box, prior = 'BIC', modelprior = uniform())
full_model
##
## Call:
## bas.lm(formula = audience_score ~ feature_film + drama + runtime + mpaa_rating_R + thtr_rel_year + oscar_season + summer_season + imdb_rating + imdb_num_votes + critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box, data = na.omit(movies), prior = "BIC", modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept feature_filmyes dramayes
## 1.00000 0.05876 0.04509
## runtime mpaa_rating_Ryes thtr_rel_year
## 0.51400 0.16498 0.08089
## oscar_seasonyes summer_seasonyes imdb_rating
## 0.06526 0.07935 1.00000
## imdb_num_votes critics_score best_pic_nomyes
## 0.06242 0.92016 0.13201
## best_pic_winyes best_actor_winyes best_actress_winyes
## 0.04077 0.11565 0.14770
## best_dir_winyes top200_boxyes
## 0.06701 0.04876
coef_aud = coefficients(full_model)
confint(coef_aud)
## 2.5% 97.5% beta
## Intercept 6.141029e+01 6.301862e+01 6.221002e+01
## feature_filmyes -8.335074e-01 0.000000e+00 -9.162435e-02
## dramayes 0.000000e+00 0.000000e+00 1.956484e-02
## runtime -9.002118e-02 0.000000e+00 -3.042190e-02
## mpaa_rating_Ryes -2.011820e+00 1.120155e-02 -2.409288e-01
## thtr_rel_year -4.990744e-02 9.710504e-05 -3.878883e-03
## oscar_seasonyes -7.714591e-01 2.688653e-03 -6.293171e-02
## summer_seasonyes -5.748289e-03 1.123023e+00 8.660919e-02
## imdb_rating 1.359866e+01 1.654841e+01 1.490680e+01
## imdb_num_votes -2.477967e-08 2.543358e-06 2.466254e-07
## critics_score 0.000000e+00 1.119926e-01 6.951964e-02
## best_pic_nomyes -4.635467e-04 5.172906e+00 5.130600e-01
## best_pic_winyes 0.000000e+00 0.000000e+00 -6.383140e-03
## best_actor_winyes -2.240907e+00 7.032379e-03 -2.123363e-01
## best_actress_winyes -3.005555e+00 0.000000e+00 -3.323225e-01
## best_dir_winyes -1.221851e+00 2.782579e-02 -1.188402e-01
## top200_boxyes -1.499603e-01 4.110591e-01 8.972019e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
summary(full_model)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.0000000 1.0000000
## feature_filmyes 0.05876309 0.0000 0.0000000 0.0000000
## dramayes 0.04508592 0.0000 0.0000000 0.0000000
## runtime 0.51399873 1.0000 0.0000000 0.0000000
## mpaa_rating_Ryes 0.16498090 0.0000 0.0000000 0.0000000
## thtr_rel_year 0.08088668 0.0000 0.0000000 0.0000000
## oscar_seasonyes 0.06525993 0.0000 0.0000000 0.0000000
## summer_seasonyes 0.07935408 0.0000 0.0000000 0.0000000
## imdb_rating 1.00000000 1.0000 1.0000000 1.0000000
## imdb_num_votes 0.06242230 0.0000 0.0000000 0.0000000
## critics_score 0.92015573 1.0000 1.0000000 1.0000000
## best_pic_nomyes 0.13200908 0.0000 0.0000000 0.0000000
## best_pic_winyes 0.04076727 0.0000 0.0000000 0.0000000
## best_actor_winyes 0.11565473 0.0000 0.0000000 0.0000000
## best_actress_winyes 0.14770126 0.0000 0.0000000 1.0000000
## best_dir_winyes 0.06701259 0.0000 0.0000000 0.0000000
## top200_boxyes 0.04875994 0.0000 0.0000000 0.0000000
## BF NA 1.0000 0.8715404 0.2048238
## PostProbs NA 0.1558 0.1358000 0.0319000
## R2 NA 0.7483 0.7455000 0.7470000
## dim NA 4.0000 3.0000000 4.0000000
## logmarg NA -3434.7520 -3434.8894481 -3436.3375603
## model 4 model 5
## Intercept 1.0000000 1.0000000
## feature_filmyes 0.0000000 0.0000000
## dramayes 0.0000000 0.0000000
## runtime 1.0000000 1.0000000
## mpaa_rating_Ryes 1.0000000 0.0000000
## thtr_rel_year 0.0000000 0.0000000
## oscar_seasonyes 0.0000000 0.0000000
## summer_seasonyes 0.0000000 0.0000000
## imdb_rating 1.0000000 1.0000000
## imdb_num_votes 0.0000000 0.0000000
## critics_score 1.0000000 1.0000000
## best_pic_nomyes 0.0000000 1.0000000
## best_pic_winyes 0.0000000 0.0000000
## best_actor_winyes 0.0000000 0.0000000
## best_actress_winyes 0.0000000 0.0000000
## best_dir_winyes 0.0000000 0.0000000
## top200_boxyes 0.0000000 0.0000000
## BF 0.2039916 0.1851908
## PostProbs 0.0318000 0.0289000
## R2 0.7496000 0.7495000
## dim 5.0000000 5.0000000
## logmarg -3436.3416317 -3436.4383237
Reduced_model <- lm(data = na.omit(movies), audience_score ~ imdb_rating + imdb_num_votes + critics_score + thtr_rel_year)
final_reduced <- bas.lm(data = na.omit(movies), audience_score ~ imdb_rating + imdb_num_votes + critics_score + thtr_rel_year, prior = "BIC", modelprior = uniform())
coef_reduced = coefficients(final_reduced)
confint(coef_reduced)
## 2.5% 97.5% beta
## Intercept 61.37940834 62.975467028 6.221002e+01
## imdb_rating 13.39974861 16.190537288 1.468210e+01
## imdb_num_votes 0.00000000 0.000000000 8.008232e-08
## critics_score 0.00000000 0.113115780 7.319061e-02
## thtr_rel_year -0.01926706 0.003306831 -2.454141e-03
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
From the output above, we see a similar outcome when comparing the frequentist model analysis (previous module) with the Bayesian model analysis. Most of the variables, including the newly created ones, were eliminated from the full model to arrive at the reduced model.
Looking at the model summaries above, the highest posterior probability model contains the intercept, imdb_rating, and critics_score. The imdb_rating variable is the single biggest predictor of audience score.
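As a quick illustration of that top model (a sketch only, refitting it by ordinary least squares to eyeball the coefficient magnitudes; it is not part of the Bayesian analysis):
# Sketch: OLS refit of the highest posterior probability model.
summary(lm(audience_score ~ imdb_rating + critics_score, data = movies_no_na))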
We now predict the audience score for a new movie that was not used to fit the model: “Kung Fu Panda 3.” The data below, obtained from IMDB and Rotten Tomatoes, provide each data point required by the reduced model.
Impressively, the model predicts an audience score of 73 (rounded down to the nearest whole number). The actual audience score for Kung Fu Panda 3 per Rotten Tomatoes is 79.
Sources: Rotten Tomato Page for Kung Fu Panda 3 https://www.rottentomatoes.com/m/kung_fu_panda_3/
IMDB Page for Kung Fu Panda 3 http://www.imdb.com/title/tt2267968/
kfp <- data.frame(runtime=95,mpaa_rating="PG",thtr_rel_year=2016,imdb_rating=7.1,
imdb_num_votes=94961,critics_score=87,audience_score=79,
best_pic_win="no",best_actor_win="no",best_actress_win="no",
best_dir_win="no",top200_box="yes",mpaa_rating_R="no",oscar_season="no",
drama="no",feature_film="yes",summer_season="no")
df<-rbind(df,kfp)
df2<-tail(df,1)
names(kfp)
## [1] "runtime" "mpaa_rating" "thtr_rel_year"
## [4] "imdb_rating" "imdb_num_votes" "critics_score"
## [7] "audience_score" "best_pic_win" "best_actor_win"
## [10] "best_actress_win" "best_dir_win" "top200_box"
## [13] "mpaa_rating_R" "oscar_season" "drama"
## [16] "feature_film" "summer_season"
kfp <- data.frame(audience_score=79, imdb_rating=7.1, critics_score=87,imdb_num_votes=94961, thtr_rel_year=2016 )
BMA_pred_aud = predict(bma_aud, newdata= df2, estimator="BMA", se.fit=T)
predict(final_reduced, kfp)
## $fit
## [,1]
## [1,] 73.35345
##
## $Ybma
## [,1]
## [1,] 73.35345
##
## $Ypred
## [,1]
## [1,] 73.47649
## [2,] 72.75486
## [3,] 72.13615
## [4,] 73.51227
## [5,] 71.23846
## [6,] 72.71466
## [7,] 72.14702
## [8,] 71.19747
## [9,] 77.22007
## [10,] 76.05445
## [11,] 76.98682
## [12,] 76.84748
## [13,] 60.03568
## [14,] 64.08266
## [15,] 62.21002
## [16,] 59.72481
##
## $postprobs
## [1] 8.454053e-01 5.564066e-02 5.095065e-02 3.817022e-02 4.707887e-03
## [6] 2.809584e-03 2.103112e-03 2.125490e-04 1.927465e-87 1.517291e-88
## [11] 2.736049e-92 1.110021e-93 1.769794e-169 2.304423e-170 5.770755e-182
## [16] 1.306043e-182
##
## $se.fit
## NULL
##
## $se.pred
## NULL
##
## $se.bma.fit
## NULL
##
## $se.bma.pred
## NULL
##
## $df
## [1] 616 615 617 615 616 614 616 615 616 615 617 616 616 617 618 617
##
## $best
## [1] 13 10 6 4 11 5 3 8 12 2 9 15 7 16 1 14
##
## $bestmodel
## $bestmodel[[1]]
## [1] 0 1 3
##
## $bestmodel[[2]]
## [1] 0 1 3 4
##
## $bestmodel[[3]]
## [1] 0 1
##
## $bestmodel[[4]]
## [1] 0 1 2 3
##
## $bestmodel[[5]]
## [1] 0 1 4
##
## $bestmodel[[6]]
## [1] 0 1 2 3 4
##
## $bestmodel[[7]]
## [1] 0 1 2
##
## $bestmodel[[8]]
## [1] 0 1 2 4
##
## $bestmodel[[9]]
## [1] 0 2 3
##
## $bestmodel[[10]]
## [1] 0 2 3 4
##
## $bestmodel[[11]]
## [1] 0 3
##
## $bestmodel[[12]]
## [1] 0 3 4
##
## $bestmodel[[13]]
## [1] 0 2 4
##
## $bestmodel[[14]]
## [1] 0 2
##
## $bestmodel[[15]]
## [1] 0
##
## $bestmodel[[16]]
## [1] 0 4
##
##
## $prediction
## [1] FALSE
##
## $estimator
## [1] "BMA"
##
## attr(,"class")
## [1] "pred.bas"
The predictive model presented here is used to predict the audience score for a movie.
Using Bayesian model averaging, many models can be combined to produce better predictions. The proposed linear model gives a reasonably accurate prediction, but it is not perfectly accurate.
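To quantify "reasonably accurate", a 95% prediction interval around the BMA prediction could be reported, assuming the confint method for BAS prediction objects (se.fit must be requested in the predict call):
# Sketch: 95% prediction interval for Kung Fu Panda 3 under BMA.
pred_kfp <- predict(final_reduced, newdata = kfp, estimator = "BMA", se.fit = TRUE)
confint(pred_kfp, parm = "pred")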
It may be helpful to use this model along with the expert opinion of a well-practiced movie critic to create a better prior expectation and ultimately a stronger Bayesian prediction machine.
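Expert opinion would most naturally enter through an informative prior. As a simpler illustration of how sensitive the results are to the prior choice, here is a sketch refitting the reduced model under a Zellner-Siow prior instead of BIC (prior names as documented in the BAS package are assumed):
# Sketch: refit the reduced model with an alternative coefficient prior.
final_reduced_zs <- bas.lm(audience_score ~ imdb_rating + imdb_num_votes +
                             critics_score + thtr_rel_year,
                           data = na.omit(movies),
                           prior = "ZS-null",      # Zellner-Siow null-based prior
                           modelprior = uniform())
summary(final_reduced_zs)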