##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: survival
## Loading required package: leaps
## Loading required package: robustbase
##
## Attaching package: 'robustbase'
## The following object is masked from 'package:survival':
##
## heart
## Loading required package: inline
## Loading required package: rrcov
## Scalable Robust Estimators with High Breakdown Point (version 1.3-11)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Loading required package: coda
## ##
## ## Markov Chain Monte Carlo Package (MCMCpack)
## ## Copyright (C) 2003-2016 Andrew D. Martin, Kevin M. Quinn, and Jong Hee Park
## ##
## ## Support provided by the U.S. National Science Foundation
## ## (Grants SES-0350646 and SES-0350613)
## ##
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
## Skipping install of 'statsr' from a github remote, the SHA1 (a9717aa9) has not changed since last install.
## Use `force = TRUE` to force installation
##
## Attaching package: 'shiny'
## The following object is masked from 'package:inline':
##
## code
setwd("C:/R Data Files")
load("C:/R Data Files/movies.RData")
The data set “movies.RData” was downloaded from Coursera Course page into working directory of laptop. The data set consists of 651 randomly sampled movies produced and released before 2016. Some of these variables are only there for informational purposes and do not make any sense to include in a statistical analysis. This dataset includes information from Rotten Tomatoes and IMDB for a random sample of movies. Since this data is collected from randomly selected movies (random sampling) for a large movie sets. Thus these results can only be generalized, instead of establish a causal relationship.
The Bayesian approach to inference differs from the frequentist approach in that probabilities are used directly to quantify anything that is uncertain. Parameters are random variables. The posterior density is proportional to the product of the likelihood and prior density. inference is based on Bayes Theorem which states
P(A|B) = P(B|A)P(A)/P(B)
Estimate is done on the model parameters based on the observed data and any prior belief about the parameters, which is setup as follows
P(theta|X) = P(X|theta)P(X)pi(theta)P(X|theta) pi(theta)
pi(theta) - Prior distribution - This distribution reflects any preexisting
information / belief about the distribution of the parameter(s).
P(X|theta) -Likelihood / Sampling distribution - Distribution of the data
given the parameters, which is the probability model believed
to have generated the data.
P(X) - Marginal distribution of the data - Distribution of the observed
data marginalized over all possible values of the parameter(s).
P(theta|X) -Posterior distribution - Distribution of the parameter(s)
after taking the observed data into account.
Simplified description:
Bayes’ theorem is nothing more than a generalization into algebra a way to work out the likelihood of something in the face of some particular piece, or pieces, of evidence. The theorem is usually written like this:
p(A|B) = p(B|A) p(A) / p(B)
p(A|B) (usually read 'probability of A given B')is the probability of finding
observation A, given that some piece of evidence B is present.
A and B are the two evidences of an event.
From a conceptual point of view, the Bayesian approach to decision theory implies viewing probabilities as measures of knowledge or belief, the so-called personal, or subjective, probabilities.
For inference about a single population proportion, the Bayesian approach to estimate on the the posterior density and then cut out a given percentage on each end to state that there is, say, a 95% probability that the unknown p is in the given interval.
A confidence interval from the same data will have slightly different end points, but the practical difference between the two approaches gets small as the sample size increases, at least if a somewhat vague prior density is used for the Bayesian approach.
A sound knowledge on prior might help to produce a better posterior probability distribution. To sum up the bayesian approach we can say that a deciding criteria of bayes factor is more plausible than the probabilities approach.
As suggested by the references, I will create new five variables and add to the data set.
1.variable feature_film with levels yes (movies that are feature films) and no.
2.variable based on genre: New variable should be called drama with levels yes and no.
3.variable based on mpaa_rating: New variable should be called mpaa_rating_R with levels yes and no.
4.variables based on thtr_rel_month: New variable called oscar_season with levels yes and no.
5.variable called summer_season with levels yes and no.
Then transform the whole dataset to include only the columns needed for the future prediction model.
I shall draw plots showing various estimation of posterior probabilities and bayfactors of hypotheses. In this part, I will study the influence of these newly added variables and how they will influence the audience score. The audience score is much higher than the feature films. There is strong evidence against H1, which means that there is a significant difference of mean audience score for non-feature and feature films.
We do similar comparisons for drama films, mpaa_rating_R films, oscar season films and summer season films.The comparisons show that there is difference in the mean audience score for drama and non-drama films. The Bayesian inference provides us relatively strong evidence against the H1.
The data provides us positive evidence that there is no difference between audience scores for rest of the new variables(mpaa_rating R or not R films, oscar season or not films, summer season or not films). I assume that only feature_film and drama variables may be very important predictors for final audience score.
movies <- mutate(movies, mpaa_rating_R = ifelse(mpaa_rating=="R","yes", "no"))
movies <- mutate(movies,oscar_season=ifelse(thtr_rel_month %in% c(10, 11, 12), "yes", "no"))
movies <- mutate(movies,drama=ifelse(genre=='Drama', 'yes', 'no'))
movies<-mutate(movies, feature_film=ifelse(title_type=='Feature Film', 'yes', 'no'))
movies<-mutate(movies, summer_season=ifelse(thtr_rel_month %in% c(5,6,7,8), 'yes', 'no'))
movies$feature_film<-as.factor(movies$feature_film)
movies$drama<-as.factor(movies$drama)
movies$mpaa_rating_R<-as.factor(movies$mpaa_rating_R)
movies$oscar_season<-as.factor(movies$oscar_season)
movies$summer_season<-as.factor(movies$summer_season)
vars <- names(movies)%in% c("feature_film","drama","runtime","mpaa_rating",
"mpaa_rating_R","thtr_rel_year","oscar_season","summer_season","imdb_rating",
"imdb_num_votes","critics_score","best_pic_win","best_actor_win",
"best_actress_win","best_dir_win","top200_box","audience_score")
tahir <- movies[vars]
glimpse(tahir)
## Observations: 651
## Variables: 17
## $ runtime <dbl> 80, 101, 84, 139, 90, 78, 142, 93, 88, 119, 1...
## $ mpaa_rating <fctr> R, PG-13, R, PG, R, Unrated, PG-13, R, Unrat...
## $ thtr_rel_year <dbl> 2013, 2001, 1996, 1993, 2004, 2009, 1986, 199...
## $ imdb_rating <dbl> 5.5, 7.3, 7.6, 7.2, 5.1, 7.8, 7.2, 5.5, 7.5, ...
## $ imdb_num_votes <int> 899, 12285, 22381, 35096, 2386, 333, 5016, 22...
## $ critics_score <dbl> 45, 96, 91, 80, 33, 91, 57, 17, 90, 83, 89, 6...
## $ audience_score <dbl> 73, 81, 91, 76, 27, 86, 76, 47, 89, 66, 75, 4...
## $ best_pic_win <fctr> no, no, no, no, no, no, no, no, no, no, no, ...
## $ best_actor_win <fctr> no, no, no, yes, no, no, no, yes, no, no, ye...
## $ best_actress_win <fctr> no, no, no, no, no, no, no, no, no, no, no, ...
## $ best_dir_win <fctr> no, no, no, yes, no, no, no, no, no, no, no,...
## $ top200_box <fctr> no, no, no, no, no, no, no, no, no, no, yes,...
## $ mpaa_rating_R <fctr> yes, no, yes, no, yes, no, no, yes, no, no, ...
## $ oscar_season <fctr> no, no, no, yes, no, no, no, yes, no, no, no...
## $ drama <fctr> yes, yes, no, yes, no, no, yes, yes, no, yes...
## $ feature_film <fctr> yes, yes, yes, yes, yes, no, yes, yes, no, y...
## $ summer_season <fctr> no, no, yes, no, no, no, no, no, no, no, yes...
df <- arrange(tahir,audience_score,critics_score,runtime,imdb_rating,imdb_num_votes,feature_film,
drama,mpaa_rating,mpaa_rating_R,thtr_rel_year,oscar_season,summer_season,best_pic_win,
best_actor_win,best_dir_win,top200_box)
arrange(df, desc(audience_score))
## # A tibble: 651 × 17
## runtime mpaa_rating thtr_rel_year imdb_rating imdb_num_votes
## <dbl> <fctr> <dbl> <dbl> <int>
## 1 202 R 1974 9.0 749783
## 2 106 Unrated 2002 8.5 2362
## 3 121 R 1991 7.5 2551
## 4 133 R 1993 8.1 103378
## 5 154 PG-13 1985 7.8 58668
## 6 113 R 2000 8.5 806911
## 7 126 R 1997 8.3 572236
## 8 137 R 1986 8.4 466400
## 9 94 Unrated 2010 7.4 1935
## 10 94 R 1996 8.2 448434
## # ... with 641 more rows, and 12 more variables: critics_score <dbl>,
## # audience_score <dbl>, best_pic_win <fctr>, best_actor_win <fctr>,
## # best_actress_win <fctr>, best_dir_win <fctr>, top200_box <fctr>,
## # mpaa_rating_R <fctr>, oscar_season <fctr>, drama <fctr>,
## # feature_film <fctr>, summer_season <fctr>
rm(tahir)
movies_grouped <- gather(movies, 'variable', 'flag', 33:37)
p1=ggplot(movies_grouped, aes(x=variable, y=audience_score, fill=flag)) + geom_boxplot()
p2=ggplot(df, aes(x=feature_film, y=audience_score, fill=feature_film))+ geom_boxplot()
p3=ggplot(df,aes(x=drama, y=audience_score, fill=drama)) + geom_boxplot()
p4=ggplot(df,aes(x=mpaa_rating_R, y=audience_score, fill=mpaa_rating_R)) + geom_boxplot()
p5=ggplot(df, aes(x=df$oscar_season, y=df$audience_score, fill=df$oscar_season)) + geom_boxplot()
p6=ggplot(df, aes(x=summer_season, y=audience_score, fill=summer_season)) + geom_boxplot()
grid.arrange(p1, p2, p3, p4,p5,p6, ncol=2)
movies_grouped %>%
group_by(variable, flag) %>%
summarise(mean=mean(audience_score), median=median(audience_score), min=min(audience_score), max=max(audience_score), IQR=IQR(audience_score))
## Source: local data frame [10 x 7]
## Groups: variable [?]
##
## variable flag mean median min max IQR
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 drama no 59.73121 61.0 11 97 38.00
## 2 drama yes 65.34754 70.0 13 95 28.00
## 3 feature_film no 81.05000 85.5 19 96 12.50
## 4 feature_film yes 60.46531 62.0 11 97 33.50
## 5 mpaa_rating_R no 62.68944 65.5 11 96 31.75
## 6 mpaa_rating_R yes 62.04255 64.0 14 97 35.00
## 7 oscar_season no 61.81304 64.0 11 96 33.00
## 8 oscar_season yes 63.68586 69.0 13 97 33.50
## 9 summer_season no 62.62302 66.0 13 97 34.00
## 10 summer_season yes 61.80769 65.0 11 94 33.25
source("http://bit.ly/dasi_inference")
par(mfrow = c(3, 2))
bayes_inference(y = audience_score, x = feature_film, data = movies, statistic = "mean", type = "ht", null = 0, alternative = "twosided")
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 60, y_bar_no = 81.05, s_no = 13.5764
## n_yes = 591, y_bar_yes = 60.4653, s_yes = 19.824
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_no = mu_yes
## H2: mu_no != mu_yes
##
## Priors:
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H2:H1] = 4.221452e+13
## P(H1|data) = 0
## P(H2|data) = 1
bayes_inference(y = audience_score, x = drama, data = movies, statistic = "mean", type = "ht", null = 0, alternative = "twosided")
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 346, y_bar_no = 59.7312, s_no = 21.2775
## n_yes = 305, y_bar_yes = 65.3475, s_yes = 18.5418
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_no = mu_yes
## H2: mu_no != mu_yes
##
## Priors:
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H2:H1] = 22.6567
## P(H1|data) = 0.0423
## P(H2|data) = 0.9577
bayes_inference(y = audience_score, x = mpaa_rating_R, data = movies, statistic = "mean", type = "ht", null = 0, alternative = "twosided")
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 322, y_bar_no = 62.6894, s_no = 20.3167
## n_yes = 329, y_bar_yes = 62.0426, s_yes = 20.1559
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_no = mu_yes
## H2: mu_no != mu_yes
##
## Priors:
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H1:H2] = 23.9679
## P(H1|data) = 0.9599
## P(H2|data) = 0.0401
bayes_inference(y = audience_score, x = oscar_season, data = movies, statistic = "mean", type = "ht", null = 0, alternative = "twosided")
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 460, y_bar_no = 61.813, s_no = 20.1196
## n_yes = 191, y_bar_yes = 63.6859, s_yes = 20.4612
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_no = mu_yes
## H2: mu_no != mu_yes
##
## Priors:
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H1:H2] = 13.3993
## P(H1|data) = 0.9306
## P(H2|data) = 0.0694
bayes_inference(y = audience_score, x = summer_season, data = movies, statistic = "mean", type = "ht", null = 0, alternative = "twosided")
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 443, y_bar_no = 62.623, s_no = 20.3857
## n_yes = 208, y_bar_yes = 61.8077, s_yes = 19.9083
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_no = mu_yes
## H2: mu_no != mu_yes
##
## Priors:
## P(H1) = 0.5
## P(H2) = 0.5
##
## Results:
## BF[H1:H2] = 21.7118
## P(H1|data) = 0.956
## P(H2|data) = 0.044
I will use a step wise BACKWARDS ELIMINATION Method to construct final model.Starting with full model and removing variables till BIC can not be lowered further. I will try to find the best model according to BIC (in which case k=log(n)k=loga(n)).
Using BMA:
I will use both BMS and BMA for model averaging. In Bayesian model averaging (BMA) multiple models are averaged to obtain posteriors of coefficients and predictions from new data.
From the output below I deduce can that lot of variables have been eliminated from the full model. Only mpaa_rating_R variable is left out of five new variables.
summary command gives us both the posterior model inclusion probability for each variable and the most probable models. Model averaging is shown below. By looking at the image, we can see that the most probable model is the one with only intercept.
The most likely model has posterior probability of 0.269(intercept).The posterior probability of 0.269 is greater than the uniform prior probability assigned to it, since there can be 2^k = 2^17 possible models.
Monte Carlo simulation with prior with ZS g function can be used. We find that the only model with interacept has the highest probability.
Using BMS:
The second important function is bms to perform bayesian model sampling for Bayesian model Averaging or Bayesian Model Selection. It basically offers to sample data according to different g-priors and model priors, and leaves the choice of different samplers (MCMC samplers, full or partial enumeration,and interaction samplers). The results provide analysis into models according to MCMC frequencies, and according to the posterior likelihood of the best nmodel models (cf. bms). The functions coef.bma and summary.bma summarize the most important results. The plotting functions plot.bma, image.bma, density.bma, pred.density, and gdensity are the most important plotting functions (inter alia). Most of them also produce numerical output.
Two different models have been constructed with same parameters and results have been compared. The results are closer and this shows that the inference drawn from Bayesian approach can be more accurate. Starting with Initial Model with all variables I proceed to Reduced model. From there I select five variables to study the comparisons of different methods shown in log posterior odds model rank plot. Various regression plots are drawn for visualization.
library(MASS)
lm1 <- lm(audience_score ~ ., data = df)
score_step <- stepAIC(lm1, trace = FALSE)
score_step$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## audience_score ~ runtime + mpaa_rating + thtr_rel_year + imdb_rating +
## imdb_num_votes + critics_score + best_pic_win + best_actor_win +
## best_actress_win + best_dir_win + top200_box + mpaa_rating_R +
## oscar_season + drama + feature_film + summer_season
##
## Final Model:
## audience_score ~ runtime + thtr_rel_year + imdb_rating + imdb_num_votes +
## critics_score + best_actress_win
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 630 63172.60 3014.825
## 2 - mpaa_rating_R 0 0.000000 630 63172.60 3014.825
## 3 - mpaa_rating 5 563.905967 635 63736.51 3010.601
## 4 - best_pic_win 1 1.948927 636 63738.46 3008.621
## 5 - oscar_season 1 9.743905 637 63748.20 3006.721
## 6 - top200_box 1 16.676369 638 63764.88 3004.891
## 7 - best_dir_win 1 90.517834 639 63855.39 3003.813
## 8 - best_actor_win 1 119.151581 640 63974.55 3003.024
## 9 - summer_season 1 168.616474 641 64143.16 3002.735
## 10 - drama 1 167.988533 642 64311.15 3002.435
## 11 - feature_film 1 166.871295 643 64478.02 3002.120
Reduced Model
vars1 <- names(movies) %in% c("audience_score","critics_score","runtime","imdb_rating",
"imdb_num_votes")
df1 <- movies[vars1]
df1 <- na.omit(df1)
mf= bms(df1,mprior="unif",g=1700, mcmc="enumeration", user.int=FALSE)
image(mf[1:5],order=FALSE)
coef(mf,std.coefs=TRUE)
## PIP Post Mean Post SD Cond.Pos.Sign Idx
## imdb_num_votes 1.00000000 0.2904043264 0.03840702 1.00000000 2
## imdb_rating 0.99783015 0.2769392485 0.10743518 1.00000000 1
## audience_score 0.58918197 -0.1211751214 0.11536300 0.00052768 4
## critics_score 0.03040774 -0.0009824434 0.01220151 0.03343085 3
density(mf,reg="critics_score",std.coefs=TRUE)
density(mf,reg="imdb_rating",std.coefs=TRUE)
density(mf,reg="imdb_num_votes",std.coefs=TRUE)
plotModelsize(mf)
Comparisons of different Prediction Methods:
Using BMA,BMP and MPM with reduced dataset(df1 with 5 numerical variables.)
movies_no_na = na.omit(df1)
bma_aud = bas.lm(audience_score ~ ., data = movies_no_na,
prior = "BIC",
modelprior = uniform())
bma_aud
##
## Call:
## bas.lm(formula = audience_score ~ ., data = movies_no_na, prior = "BIC", modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept runtime imdb_rating imdb_num_votes
## 1.00000 0.51826 1.00000 0.05658
## critics_score
## 0.90562
summary(bma_aud)
## Intercept runtime imdb_rating imdb_num_votes critics_score BF
## [1,] 1 1 1 0 1 1.00000000
## [2,] 1 0 1 0 1 0.99684895
## [3,] 1 1 1 0 0 0.12557073
## [4,] 1 0 1 0 0 0.08390939
## [5,] 1 1 1 1 1 0.07856349
## PostProbs R2 dim logmarg
## [1,] 0.4276 0.7549 4 -3615.279
## [2,] 0.4262 0.7525 3 -3615.282
## [3,] 0.0537 0.7509 3 -3617.354
## [4,] 0.0359 0.7481 2 -3617.757
## [5,] 0.0336 0.7554 5 -3617.823
par(mfrow = c(1,2))
coef_aud = coefficients(bma_aud)
coef_aud
##
## Marginal Posterior Summaries of Coefficients:
## post mean post SD post p(B != 0)
## Intercept 6.235e+01 3.949e-01 1.000e+00
## runtime -2.826e-02 3.125e-02 5.183e-01
## imdb_rating 1.496e+01 7.215e-01 1.000e+00
## imdb_num_votes 1.951e-07 1.262e-06 5.658e-02
## critics_score 6.506e-02 2.942e-02 9.056e-01
plot(coef_aud, subset = c(1,5), ask=FALSE)
confint(coef_aud)
## 2.5 % 97.5 % beta
## Intercept 6.157266e+01 6.311475e+01 6.234769e+01
## runtime -8.220001e-02 0.000000e+00 -2.825721e-02
## imdb_rating 1.373318e+01 1.657685e+01 1.495818e+01
## imdb_num_votes -2.189901e-08 2.021179e-06 1.951166e-07
## critics_score 0.000000e+00 1.069073e-01 6.506226e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
Prediction with BMP
BPM_pred_aud = predict(bma_aud, estimator="BPM", se.fit=TRUE)
bma_aud$namesx[BPM_pred_aud$bestmodel+1]
## [1] "Intercept" "runtime" "imdb_rating" "critics_score"
Highest probability model vs Median Probability Model (MPM)
MPM_pred_aud = predict(bma_aud, estimator="MPM")
bma_aud$namesx[MPM_pred_aud$bestmodel+1]
## [1] "Intercept" "runtime" "imdb_rating" "critics_score"
opt = which.max(BPM_pred_aud$fit)
t(movies_no_na[opt, ])
## [,1]
## runtime 202
## imdb_rating 9
## imdb_num_votes 749783
## critics_score 97
## audience_score 97
ci_aud = confint(BPM_pred_aud, parm="pred") #95% credible interval
ci_aud[opt,]
## 2.5 % 97.5 % pred
## 77.42143 117.65459 97.53801
exp(ci_aud[opt,])#expotential interval
## 2.5 % 97.5 % pred
## 4.204341e+33 1.249513e+51 2.292025e+42
For BMA method, the interval would be
BMA_pred_aud = predict(bma_aud, estimator="BMA", se.fit=TRUE)
ci_bma_aud = confint(BMA_pred_aud, estimator="BMA") # 95% credible interval
opt_bma = which.max(BMA_pred_aud$fit)
exp(ci_bma_aud[opt_bma,])
## 2.5 % 97.5 % pred
## 4.509762e+34 1.083399e+52 2.306718e+43
** Regression with BMA**
library(MASS)
full_model <- bas.lm(data=na.omit(movies), audience_score ~ feature_film + drama + runtime + mpaa_rating_R + thtr_rel_year + oscar_season + summer_season + imdb_rating + imdb_num_votes
+ critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win
+ best_dir_win + top200_box, prior = 'BIC', modelprior = uniform())
full_model
##
## Call:
## bas.lm(formula = audience_score ~ feature_film + drama + runtime + mpaa_rating_R + thtr_rel_year + oscar_season + summer_season + imdb_rating + imdb_num_votes + critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box, data = na.omit(movies), prior = "BIC", modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept feature_filmyes dramayes
## 1.00000 0.05876 0.04509
## runtime mpaa_rating_Ryes thtr_rel_year
## 0.51400 0.16498 0.08089
## oscar_seasonyes summer_seasonyes imdb_rating
## 0.06526 0.07935 1.00000
## imdb_num_votes critics_score best_pic_nomyes
## 0.06242 0.92016 0.13201
## best_pic_winyes best_actor_winyes best_actress_winyes
## 0.04077 0.11565 0.14770
## best_dir_winyes top200_boxyes
## 0.06701 0.04876
coef_aud = coefficients(full_model)
confint(coef_aud)
## 2.5 % 97.5 % beta
## Intercept 61.41378718 6.299937e+01 6.221002e+01
## feature_filmyes -0.74118331 7.187630e-03 -9.162435e-02
## dramayes 0.00000000 0.000000e+00 1.956484e-02
## runtime -0.09000381 0.000000e+00 -3.042190e-02
## mpaa_rating_Ryes -1.96884341 0.000000e+00 -2.409288e-01
## thtr_rel_year -0.04830590 0.000000e+00 -3.878883e-03
## oscar_seasonyes -0.80215522 3.540429e-03 -6.293171e-02
## summer_seasonyes -0.01004381 9.938248e-01 8.660919e-02
## imdb_rating 13.58058810 1.655107e+01 1.490680e+01
## imdb_num_votes 0.00000000 2.348443e-06 2.466254e-07
## critics_score 0.00000000 1.116010e-01 6.951964e-02
## best_pic_nomyes 0.00000000 4.901124e+00 5.130600e-01
## best_pic_winyes 0.00000000 0.000000e+00 -6.383140e-03
## best_actor_winyes -2.23086473 1.104825e-05 -2.123363e-01
## best_actress_winyes -2.94829437 0.000000e+00 -3.323225e-01
## best_dir_winyes -1.32813192 0.000000e+00 -1.188402e-01
## top200_boxyes -0.02269258 5.192369e-02 8.972019e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
summary(full_model)
## Intercept feature_filmyes dramayes runtime mpaa_rating_Ryes
## [1,] 1 0 0 1 0
## [2,] 1 0 0 0 0
## [3,] 1 0 0 0 0
## [4,] 1 0 0 1 1
## [5,] 1 0 0 1 0
## thtr_rel_year oscar_seasonyes summer_seasonyes imdb_rating
## [1,] 0 0 0 1
## [2,] 0 0 0 1
## [3,] 0 0 0 1
## [4,] 0 0 0 1
## [5,] 0 0 0 1
## imdb_num_votes critics_score best_pic_nomyes best_pic_winyes
## [1,] 0 1 0 0
## [2,] 0 1 0 0
## [3,] 0 1 0 0
## [4,] 0 1 0 0
## [5,] 0 1 1 0
## best_actor_winyes best_actress_winyes best_dir_winyes top200_boxyes
## [1,] 0 0 0 0
## [2,] 0 0 0 0
## [3,] 0 1 0 0
## [4,] 0 0 0 0
## [5,] 0 0 0 0
## BF PostProbs R2 dim logmarg
## [1,] 1.0000000 0.1558 0.7483 4 -3434.752
## [2,] 0.8715404 0.1358 0.7455 3 -3434.889
## [3,] 0.2048238 0.0319 0.7470 4 -3436.338
## [4,] 0.2039916 0.0318 0.7496 5 -3436.342
## [5,] 0.1851908 0.0289 0.7495 5 -3436.438
image(full_model,rotate=FALSE)
Reduced_model <- lm(data=na.omit(movies), audience_score ~ imdb_rating + critics_score + runtime)
plot(Reduced_model)
final_reduced <- bas.lm(data=na.omit(movies), audience_score ~ imdb_rating + critics_score + runtime,
prior= 'BIC', modelprior = uniform())
coef_reduced = coefficients(final_reduced)
confint(coef_reduced)
## 2.5 % 97.5 % beta
## Intercept 61.430560 63.0316820 62.21001616
## imdb_rating 13.542469 16.5125165 14.89079350
## critics_score 0.000000 0.1122759 0.07128748
## runtime -0.088141 0.0000000 -0.03138428
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
summary(final_reduced)
## Intercept imdb_rating critics_score runtime BF PostProbs
## [1,] 1 1 1 1 1.000000e+00 0.4982
## [2,] 1 1 1 0 8.715404e-01 0.4342
## [3,] 1 1 0 1 8.330447e-02 0.0415
## [4,] 1 1 0 0 5.252575e-02 0.0262
## [5,] 1 0 1 0 2.820632e-92 0.0000
## R2 dim logmarg
## [1,] 0.7483 4 -3434.752
## [2,] 0.7455 3 -3434.889
## [3,] 0.7436 3 -3437.237
## [4,] 0.7405 2 -3437.698
## [5,] 0.4921 2 -3645.553
image(final_reduced,rotate=FALSE)
plot(final_reduced)
I will do the prediction for’Kubo and the two strings’, the movie is released in this summer. Since at this moment there is no information about wether this movie will be nominated with oscar award, I will suppose that this movie will not be nominated as best pic and it will not win any oscar award, like best pic, directory, actor or actress.
The movie reviews references on Rotten Tomatoes and IMDB are:
http://www.imdb.com/title/tt4302938/
https://www.rottentomatoes.com/m/kubo_and_the_two_strings_2016
KUBO AND THE TWO STRINGS(2016)
http://www.imdb.com/title/tt4302938/
https://www.rottentomatoes.com/m/kubo_and_the_two_strings_2016
The 4th film created by Laika studios. TOOK FIVE YEARS TO PRODUCE.Release date 19 August(USA).
In the epic fantasy, scruffy, kindhearted Kubo ekes out a humble living while devotedly caring for his mother in their sleepy shoreside village. It is a quiet existence - until a spirit from the past catches up with him to enforce an age-old vendetta. Suddenly on the run from gods and monsters, Kubo’s chance for survival rests on finding the magical suit of armor once worn by his fallen father, the greatest samurai the world has ever known. Summoning courage, Kubo embarks on a thrilling odyssey as he faces his family’s history, navigates the elements, and bravely fights for the earth and the stars. Critics Consensus: Kubo and the Two Strings matches its incredible animation with an absorbing – and bravely melancholy – story that has something to offer audiences of all ages.PG RATING
ROTTEN TOMATO audience rating 91,critics score 96 average rating 8.3/10 viewers vote 14541, fresh 133 ,rotten 5 ,IMDB RATINGS:RUNTIME 101 RATING 8.4/10,ANIMATION ADVENTURE,FAMILY CLASS AS DRAMA,TOP RATED FILM.
df2 <- data.frame(runtime=101,mpaa_rating="R",thtr_rel_year=2016,imdb_rating=8.3,
imdb_num_votes=14541,critics_score=96,audience_score=93,
best_pic_win="no",best_actor_win="no",best_actress_win="no",
best_dir_win="no",top200_box="yes",mpaa_rating_R="yes",oscar_season="no",
drama="no",feature_film="yes",summer_season="yes")
df<-rbind(df,df2)
df2<-tail(df,1)
names(df2)
## [1] "runtime" "mpaa_rating" "thtr_rel_year"
## [4] "imdb_rating" "imdb_num_votes" "critics_score"
## [7] "audience_score" "best_pic_win" "best_actor_win"
## [10] "best_actress_win" "best_dir_win" "top200_box"
## [13] "mpaa_rating_R" "oscar_season" "drama"
## [16] "feature_film" "summer_season"
#new_movie <- data.frame(audience_score=93,runtime=101, imdb_rating=8.3, critics_score=96)
#new_movie
# audience_score runtime imdb_rating critics_score
# 1 93 101 8.3 96
#BMA_pred_aud = predict(bma_aud, newdata= df2, estimator="BMA", se.fit=T)
# predict(score_step,df2, interval="predict")
# fit lwr upr
# 1 90.92774 71.13529 110.7202
# predict(final_model, new_movie)
# 92.56143
There are several R packages which are used to quantify and validate modeling and predictions. Some well known packages are: MCMCPackage,BMA,BAS,MPM,JAGS,STAN and BayesValidate.Some of the packages were utilized to run modeling and prediction of ‘movies’ popularity(BMA BMS,and MCMC). Another package for hypotheses test and ploting called “inference” was also used.An algorithm by Harold Jeffreys is also often used for evaluation purposes. This is Bayesian advanced topic and is currently out of scope of this report.
In this project, I studied the audience score for a movie and how it depends on other variables in the sample. Using Bayesian model average many models can be constructed to perform a better prediction.
The fact that imdb_rating has the highest posterior probability, and that two of our five newly created variables which seem to have predictive power are not included in the best model.
A few personal notes can be added like the Drama films are more popular, feature/non-feature films are different and length of runtime has no bearing on the popularity of a movie.The critics score and IMDB rating are similar. One of these could be ignored.
The used approach was unable to show our full sentiments “prior” knowledge. Perhaps some other (other than IMDB and Rotten Tomatoes) ratings with more priors density variables could be used to utilize the power of Bayesian prediction. A knowledgable prior can be better.