Load packages

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## Loading required package: survival

## Loading required package: leaps

## Loading required package: robustbase

## 
## Attaching package: 'robustbase'

## The following object is masked from 'package:survival':
## 
##     heart

## Loading required package: inline

## Loading required package: rrcov

## Scalable Robust Estimators with High Breakdown Point (version 1.3-11)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

## Loading required package: coda

## ##
## ## Markov Chain Monte Carlo Package (MCMCpack)

## ## Copyright (C) 2003-2016 Andrew D. Martin, Kevin M. Quinn, and Jong Hee Park

## ##
## ## Support provided by the U.S. National Science Foundation

## ## (Grants SES-0350646 and SES-0350613)
## ##

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

## Skipping install of 'statsr' from a github remote, the SHA1 (a9717aa9) has not changed since last install.
##   Use `force = TRUE` to force installation

## 
## Attaching package: 'shiny'

## The following object is masked from 'package:inline':
## 
##     code

Load data

    setwd("C:/R Data Files")
    load("C:/R Data Files/movies.RData")

Part 1: Data

The data set “movies.RData” was downloaded from Coursera Course page into working directory of laptop. The data set consists of 651 randomly sampled movies produced and released before 2016. Some of these variables are only there for informational purposes and do not make any sense to include in a statistical analysis. This dataset includes information from Rotten Tomatoes and IMDB for a random sample of movies. Since this data is collected from randomly selected movies (random sampling) for a large movie sets. Thus these results can only be generalized, instead of establish a causal relationship.

Principles of Bayesian Statistics

The Bayesian approach to inference differs from the frequentist approach in that probabilities are used directly to quantify anything that is uncertain. Parameters are random variables. The posterior density is proportional to the product of the likelihood and prior density. inference is based on Bayes Theorem which states

                   P(A|B) = P(B|A)P(A)/P(B)

Estimate is done on the model parameters based on the observed data and any prior belief about the parameters, which is setup as follows

           P(theta|X) = P(X|theta)P(X)pi(theta)P(X|theta) pi(theta)                           

    pi(theta) - Prior distribution - This distribution reflects any preexisting
                information / belief about the distribution of the parameter(s).
    P(X|theta) -Likelihood / Sampling distribution - Distribution of the data
                given the parameters, which is the probability model believed
                to have generated the data.
    P(X) -      Marginal distribution of the data - Distribution of the observed
                data marginalized over all possible values of the parameter(s).
    P(theta|X) -Posterior distribution - Distribution of the parameter(s)
                after taking the observed data into account.

Simplified description:

Bayes’ theorem is nothing more than a generalization into algebra a way to work out the likelihood of something in the face of some particular piece, or pieces, of evidence. The theorem is usually written like this:

      p(A|B) = p(B|A) p(A) / p(B)
                        
      p(A|B) (usually read 'probability of A given B')is the probability of finding 
                observation A, given that some piece of evidence B is present. 
                         
              A and B are the two evidences of an event.

From a conceptual point of view, the Bayesian approach to decision theory implies viewing probabilities as measures of knowledge or belief, the so-called personal, or subjective, probabilities.

For inference about a single population proportion, the Bayesian approach to estimate on the the posterior density and then cut out a given percentage on each end to state that there is, say, a 95% probability that the unknown p is in the given interval.

A confidence interval from the same data will have slightly different end points, but the practical difference between the two approaches gets small as the sample size increases, at least if a somewhat vague prior density is used for the Bayesian approach.

A sound knowledge on prior might help to produce a better posterior probability distribution. To sum up the bayesian approach we can say that a deciding criteria of bayes factor is more plausible than the probabilities approach.

Part 2: Data manipulation

As suggested by the references, I will create new five variables and add to the data set.

1.variable feature_film with levels yes (movies that are feature films) and no.

2.variable based on genre: New variable should be called drama with levels yes and no.

3.variable based on mpaa_rating: New variable should be called mpaa_rating_R with levels yes and no.

4.variables based on thtr_rel_month: New variable called oscar_season with levels yes and no.

5.variable called summer_season with levels yes and no.

Then transform the whole dataset to include only the columns needed for the future prediction model.

I shall draw plots showing various estimation of posterior probabilities and bayfactors of hypotheses. In this part, I will study the influence of these newly added variables and how they will influence the audience score. The audience score is much higher than the feature films. There is strong evidence against H1, which means that there is a significant difference of mean audience score for non-feature and feature films.

We do similar comparisons for drama films, mpaa_rating_R films, oscar season films and summer season films.The comparisons show that there is difference in the mean audience score for drama and non-drama films. The Bayesian inference provides us relatively strong evidence against the H1.

The data provides us positive evidence that there is no difference between audience scores for rest of the new variables(mpaa_rating R or not R films, oscar season or not films, summer season or not films). I assume that only feature_film and drama variables may be very important predictors for final audience score.

 movies <- mutate(movies, mpaa_rating_R = ifelse(mpaa_rating=="R","yes", "no"))
 movies <- mutate(movies,oscar_season=ifelse(thtr_rel_month %in% c(10, 11, 12), "yes", "no")) 
 movies <- mutate(movies,drama=ifelse(genre=='Drama', 'yes', 'no'))
 movies<-mutate(movies, feature_film=ifelse(title_type=='Feature Film', 'yes', 'no'))
 movies<-mutate(movies, summer_season=ifelse(thtr_rel_month %in% c(5,6,7,8), 'yes', 'no'))
 movies$feature_film<-as.factor(movies$feature_film)
 movies$drama<-as.factor(movies$drama)
 movies$mpaa_rating_R<-as.factor(movies$mpaa_rating_R)
 movies$oscar_season<-as.factor(movies$oscar_season)
 movies$summer_season<-as.factor(movies$summer_season)

  vars <- names(movies)%in% c("feature_film","drama","runtime","mpaa_rating",
                      "mpaa_rating_R","thtr_rel_year","oscar_season","summer_season","imdb_rating",
                      "imdb_num_votes","critics_score","best_pic_win","best_actor_win",
                      "best_actress_win","best_dir_win","top200_box","audience_score")
                                                               
 tahir <- movies[vars] 
 glimpse(tahir)

## Observations: 651
## Variables: 17
## $ runtime          <dbl> 80, 101, 84, 139, 90, 78, 142, 93, 88, 119, 1...
## $ mpaa_rating      <fctr> R, PG-13, R, PG, R, Unrated, PG-13, R, Unrat...
## $ thtr_rel_year    <dbl> 2013, 2001, 1996, 1993, 2004, 2009, 1986, 199...
## $ imdb_rating      <dbl> 5.5, 7.3, 7.6, 7.2, 5.1, 7.8, 7.2, 5.5, 7.5, ...
## $ imdb_num_votes   <int> 899, 12285, 22381, 35096, 2386, 333, 5016, 22...
## $ critics_score    <dbl> 45, 96, 91, 80, 33, 91, 57, 17, 90, 83, 89, 6...
## $ audience_score   <dbl> 73, 81, 91, 76, 27, 86, 76, 47, 89, 66, 75, 4...
## $ best_pic_win     <fctr> no, no, no, no, no, no, no, no, no, no, no, ...
## $ best_actor_win   <fctr> no, no, no, yes, no, no, no, yes, no, no, ye...
## $ best_actress_win <fctr> no, no, no, no, no, no, no, no, no, no, no, ...
## $ best_dir_win     <fctr> no, no, no, yes, no, no, no, no, no, no, no,...
## $ top200_box       <fctr> no, no, no, no, no, no, no, no, no, no, yes,...
## $ mpaa_rating_R    <fctr> yes, no, yes, no, yes, no, no, yes, no, no, ...
## $ oscar_season     <fctr> no, no, no, yes, no, no, no, yes, no, no, no...
## $ drama            <fctr> yes, yes, no, yes, no, no, yes, yes, no, yes...
## $ feature_film     <fctr> yes, yes, yes, yes, yes, no, yes, yes, no, y...
## $ summer_season    <fctr> no, no, yes, no, no, no, no, no, no, no, yes...

 df <- arrange(tahir,audience_score,critics_score,runtime,imdb_rating,imdb_num_votes,feature_film,
         drama,mpaa_rating,mpaa_rating_R,thtr_rel_year,oscar_season,summer_season,best_pic_win,
         best_actor_win,best_dir_win,top200_box)

 arrange(df, desc(audience_score))

## # A tibble: 651 × 17
##    runtime mpaa_rating thtr_rel_year imdb_rating imdb_num_votes
##      <dbl>      <fctr>         <dbl>       <dbl>          <int>
## 1      202           R          1974         9.0         749783
## 2      106     Unrated          2002         8.5           2362
## 3      121           R          1991         7.5           2551
## 4      133           R          1993         8.1         103378
## 5      154       PG-13          1985         7.8          58668
## 6      113           R          2000         8.5         806911
## 7      126           R          1997         8.3         572236
## 8      137           R          1986         8.4         466400
## 9       94     Unrated          2010         7.4           1935
## 10      94           R          1996         8.2         448434
## # ... with 641 more rows, and 12 more variables: critics_score <dbl>,
## #   audience_score <dbl>, best_pic_win <fctr>, best_actor_win <fctr>,
## #   best_actress_win <fctr>, best_dir_win <fctr>, top200_box <fctr>,
## #   mpaa_rating_R <fctr>, oscar_season <fctr>, drama <fctr>,
## #   feature_film <fctr>, summer_season <fctr>

 rm(tahir)
 
 movies_grouped <- gather(movies, 'variable', 'flag', 33:37) 
                                                               
  
 p1=ggplot(movies_grouped, aes(x=variable, y=audience_score, fill=flag)) + geom_boxplot()
 
 p2=ggplot(df, aes(x=feature_film, y=audience_score, fill=feature_film))+ geom_boxplot()
 
 p3=ggplot(df,aes(x=drama, y=audience_score, fill=drama)) + geom_boxplot()
 
 p4=ggplot(df,aes(x=mpaa_rating_R, y=audience_score, fill=mpaa_rating_R)) + geom_boxplot()
 
 p5=ggplot(df, aes(x=df$oscar_season, y=df$audience_score, fill=df$oscar_season)) + geom_boxplot()

 p6=ggplot(df, aes(x=summer_season, y=audience_score, fill=summer_season)) + geom_boxplot()

 grid.arrange(p1, p2, p3, p4,p5,p6, ncol=2)

 movies_grouped %>%
 group_by(variable, flag) %>%
 summarise(mean=mean(audience_score), median=median(audience_score), min=min(audience_score),              max=max(audience_score), IQR=IQR(audience_score))

## Source: local data frame [10 x 7]
## Groups: variable [?]
## 
##         variable  flag     mean median   min   max   IQR
##            <chr> <chr>    <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1          drama    no 59.73121   61.0    11    97 38.00
## 2          drama   yes 65.34754   70.0    13    95 28.00
## 3   feature_film    no 81.05000   85.5    19    96 12.50
## 4   feature_film   yes 60.46531   62.0    11    97 33.50
## 5  mpaa_rating_R    no 62.68944   65.5    11    96 31.75
## 6  mpaa_rating_R   yes 62.04255   64.0    14    97 35.00
## 7   oscar_season    no 61.81304   64.0    11    96 33.00
## 8   oscar_season   yes 63.68586   69.0    13    97 33.50
## 9  summer_season    no 62.62302   66.0    13    97 34.00
## 10 summer_season   yes 61.80769   65.0    11    94 33.25

 source("http://bit.ly/dasi_inference")
 par(mfrow = c(3, 2))  
 bayes_inference(y = audience_score, x = feature_film, data = movies, statistic = "mean", type = "ht",     null  = 0, alternative = "twosided")

## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 60, y_bar_no = 81.05, s_no = 13.5764
## n_yes = 591, y_bar_yes = 60.4653, s_yes = 19.824
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_no  = mu_yes
## H2: mu_no != mu_yes
## 
## Priors:
## P(H1) = 0.5 
## P(H2) = 0.5 
## 
## Results:
## BF[H2:H1] = 4.221452e+13
## P(H1|data) = 0 
## P(H2|data) = 1

 bayes_inference(y = audience_score, x = drama, data = movies, statistic = "mean", type = "ht", null = 0,   alternative = "twosided")

## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 346, y_bar_no = 59.7312, s_no = 21.2775
## n_yes = 305, y_bar_yes = 65.3475, s_yes = 18.5418
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_no  = mu_yes
## H2: mu_no != mu_yes
## 
## Priors:
## P(H1) = 0.5 
## P(H2) = 0.5 
## 
## Results:
## BF[H2:H1] = 22.6567
## P(H1|data) = 0.0423 
## P(H2|data) = 0.9577

 bayes_inference(y = audience_score, x = mpaa_rating_R, data = movies, statistic = "mean", type = "ht",    null  = 0, alternative = "twosided")

## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 322, y_bar_no = 62.6894, s_no = 20.3167
## n_yes = 329, y_bar_yes = 62.0426, s_yes = 20.1559
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_no  = mu_yes
## H2: mu_no != mu_yes
## 
## Priors:
## P(H1) = 0.5 
## P(H2) = 0.5 
## 
## Results:
## BF[H1:H2] = 23.9679
## P(H1|data) = 0.9599 
## P(H2|data) = 0.0401

 bayes_inference(y = audience_score, x = oscar_season, data = movies, statistic = "mean", type = "ht",     null  = 0, alternative = "twosided")

## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 460, y_bar_no = 61.813, s_no = 20.1196
## n_yes = 191, y_bar_yes = 63.6859, s_yes = 20.4612
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_no  = mu_yes
## H2: mu_no != mu_yes
## 
## Priors:
## P(H1) = 0.5 
## P(H2) = 0.5 
## 
## Results:
## BF[H1:H2] = 13.3993
## P(H1|data) = 0.9306 
## P(H2|data) = 0.0694

 bayes_inference(y = audience_score, x = summer_season, data = movies, statistic = "mean", type = "ht",    null  = 0, alternative = "twosided")

## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_no = 443, y_bar_no = 62.623, s_no = 20.3857
## n_yes = 208, y_bar_yes = 61.8077, s_yes = 19.9083
## (Assuming intrinsic prior on parameters)
## Hypotheses:
## H1: mu_no  = mu_yes
## H2: mu_no != mu_yes
## 
## Priors:
## P(H1) = 0.5 
## P(H2) = 0.5 
## 
## Results:
## BF[H1:H2] = 21.7118
## P(H1|data) = 0.956 
## P(H2|data) = 0.044

Part 4: Modeling

I will use a step wise BACKWARDS ELIMINATION Method to construct final model.Starting with full model and removing variables till BIC can not be lowered further. I will try to find the best model according to BIC (in which case k=log(n)k=loga(n)).

Using BMA:

I will use both BMS and BMA for model averaging. In Bayesian model averaging (BMA) multiple models are averaged to obtain posteriors of coefficients and predictions from new data.

From the output below I deduce can that lot of variables have been eliminated from the full model. Only mpaa_rating_R variable is left out of five new variables.

summary command gives us both the posterior model inclusion probability for each variable and the most probable models. Model averaging is shown below. By looking at the image, we can see that the most probable model is the one with only intercept.

The most likely model has posterior probability of 0.269(intercept).The posterior probability of 0.269 is greater than the uniform prior probability assigned to it, since there can be 2^k = 2^17 possible models.

Monte Carlo simulation with prior with ZS g function can be used. We find that the only model with interacept has the highest probability.

Using BMS:

The second important function is bms to perform bayesian model sampling for Bayesian model Averaging or Bayesian Model Selection. It basically offers to sample data according to different g-priors and model priors, and leaves the choice of different samplers (MCMC samplers, full or partial enumeration,and interaction samplers). The results provide analysis into models according to MCMC frequencies, and according to the posterior likelihood of the best nmodel models (cf. bms). The functions coef.bma and summary.bma summarize the most important results. The plotting functions plot.bma, image.bma, density.bma, pred.density, and gdensity are the most important plotting functions (inter alia). Most of them also produce numerical output.

Two different models have been constructed with same parameters and results have been compared. The results are closer and this shows that the inference drawn from Bayesian approach can be more accurate. Starting with Initial Model with all variables I proceed to Reduced model. From there I select five variables to study the comparisons of different methods shown in log posterior odds model rank plot. Various regression plots are drawn for visualization.

  library(MASS)
  lm1 <- lm(audience_score ~ ., data = df)
  score_step <- stepAIC(lm1, trace = FALSE)
  score_step$anova

## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## audience_score ~ runtime + mpaa_rating + thtr_rel_year + imdb_rating + 
##     imdb_num_votes + critics_score + best_pic_win + best_actor_win + 
##     best_actress_win + best_dir_win + top200_box + mpaa_rating_R + 
##     oscar_season + drama + feature_film + summer_season
## 
## Final Model:
## audience_score ~ runtime + thtr_rel_year + imdb_rating + imdb_num_votes + 
##     critics_score + best_actress_win
## 
## 
##                Step Df   Deviance Resid. Df Resid. Dev      AIC
## 1                                       630   63172.60 3014.825
## 2   - mpaa_rating_R  0   0.000000       630   63172.60 3014.825
## 3     - mpaa_rating  5 563.905967       635   63736.51 3010.601
## 4    - best_pic_win  1   1.948927       636   63738.46 3008.621
## 5    - oscar_season  1   9.743905       637   63748.20 3006.721
## 6      - top200_box  1  16.676369       638   63764.88 3004.891
## 7    - best_dir_win  1  90.517834       639   63855.39 3003.813
## 8  - best_actor_win  1 119.151581       640   63974.55 3003.024
## 9   - summer_season  1 168.616474       641   64143.16 3002.735
## 10          - drama  1 167.988533       642   64311.15 3002.435
## 11   - feature_film  1 166.871295       643   64478.02 3002.120

Reduced Model

  vars1 <- names(movies) %in% c("audience_score","critics_score","runtime","imdb_rating",
                                "imdb_num_votes")
                                                          
  df1 <- movies[vars1] 
 
  df1 <- na.omit(df1) 
 
  mf= bms(df1,mprior="unif",g=1700, mcmc="enumeration", user.int=FALSE)
  image(mf[1:5],order=FALSE)

  coef(mf,std.coefs=TRUE)

##                       PIP     Post Mean    Post SD Cond.Pos.Sign Idx
## imdb_num_votes 1.00000000  0.2904043264 0.03840702    1.00000000   2
## imdb_rating    0.99783015  0.2769392485 0.10743518    1.00000000   1
## audience_score 0.58918197 -0.1211751214 0.11536300    0.00052768   4
## critics_score  0.03040774 -0.0009824434 0.01220151    0.03343085   3

   density(mf,reg="critics_score",std.coefs=TRUE)

   density(mf,reg="imdb_rating",std.coefs=TRUE)

   density(mf,reg="imdb_num_votes",std.coefs=TRUE)

   plotModelsize(mf)

Comparisons of different Prediction Methods:

Using BMA,BMP and MPM with reduced dataset(df1 with 5 numerical variables.)

movies_no_na = na.omit(df1)

bma_aud = bas.lm(audience_score ~ ., data = movies_no_na,
                   prior = "BIC", 
                   modelprior = uniform())
bma_aud

## 
## Call:
## bas.lm(formula = audience_score ~ ., data = movies_no_na, prior = "BIC",     modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##      Intercept         runtime     imdb_rating  imdb_num_votes  
##        1.00000         0.51826         1.00000         0.05658  
##  critics_score  
##        0.90562

summary(bma_aud)

##      Intercept runtime imdb_rating imdb_num_votes critics_score         BF
## [1,]         1       1           1              0             1 1.00000000
## [2,]         1       0           1              0             1 0.99684895
## [3,]         1       1           1              0             0 0.12557073
## [4,]         1       0           1              0             0 0.08390939
## [5,]         1       1           1              1             1 0.07856349
##      PostProbs     R2 dim   logmarg
## [1,]    0.4276 0.7549   4 -3615.279
## [2,]    0.4262 0.7525   3 -3615.282
## [3,]    0.0537 0.7509   3 -3617.354
## [4,]    0.0359 0.7481   2 -3617.757
## [5,]    0.0336 0.7554   5 -3617.823

par(mfrow = c(1,2))

coef_aud = coefficients(bma_aud)
coef_aud

## 
##  Marginal Posterior Summaries of Coefficients: 
##                 post mean   post SD     post p(B != 0)
## Intercept        6.235e+01   3.949e-01   1.000e+00    
## runtime         -2.826e-02   3.125e-02   5.183e-01    
## imdb_rating      1.496e+01   7.215e-01   1.000e+00    
## imdb_num_votes   1.951e-07   1.262e-06   5.658e-02    
## critics_score    6.506e-02   2.942e-02   9.056e-01

plot(coef_aud, subset = c(1,5), ask=FALSE)

confint(coef_aud)

##                       2.5  %      97.5  %          beta
## Intercept       6.157266e+01 6.311475e+01  6.234769e+01
## runtime        -8.220001e-02 0.000000e+00 -2.825721e-02
## imdb_rating     1.373318e+01 1.657685e+01  1.495818e+01
## imdb_num_votes -2.189901e-08 2.021179e-06  1.951166e-07
## critics_score   0.000000e+00 1.069073e-01  6.506226e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

Prediction with BMP

BPM_pred_aud =  predict(bma_aud, estimator="BPM", se.fit=TRUE)
bma_aud$namesx[BPM_pred_aud$bestmodel+1]

## [1] "Intercept"     "runtime"       "imdb_rating"   "critics_score"

Highest probability model vs Median Probability Model (MPM)

MPM_pred_aud =  predict(bma_aud, estimator="MPM")
bma_aud$namesx[MPM_pred_aud$bestmodel+1]

## [1] "Intercept"     "runtime"       "imdb_rating"   "critics_score"

opt = which.max(BPM_pred_aud$fit)
t(movies_no_na[opt, ])

##                  [,1]
## runtime           202
## imdb_rating         9
## imdb_num_votes 749783
## critics_score      97
## audience_score     97

ci_aud = confint(BPM_pred_aud, parm="pred") #95% credible interval  
ci_aud[opt,]

##    2.5  %   97.5  %      pred 
##  77.42143 117.65459  97.53801

exp(ci_aud[opt,])#expotential interval

##       2.5  %      97.5  %         pred 
## 4.204341e+33 1.249513e+51 2.292025e+42

For BMA method, the interval would be

BMA_pred_aud =  predict(bma_aud, estimator="BMA", se.fit=TRUE)
ci_bma_aud = confint(BMA_pred_aud, estimator="BMA") # 95% credible interval
opt_bma = which.max(BMA_pred_aud$fit)
exp(ci_bma_aud[opt_bma,])

##       2.5  %      97.5  %         pred 
## 4.509762e+34 1.083399e+52 2.306718e+43

** Regression with BMA**

      library(MASS)
      
      

full_model <- bas.lm(data=na.omit(movies), audience_score ~ feature_film + drama + runtime +     mpaa_rating_R + thtr_rel_year + oscar_season + summer_season + imdb_rating + imdb_num_votes
               + critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win
               + best_dir_win + top200_box, prior = 'BIC', modelprior = uniform())
full_model

## 
## Call:
## bas.lm(formula = audience_score ~ feature_film + drama + runtime +     mpaa_rating_R + thtr_rel_year + oscar_season + summer_season +     imdb_rating + imdb_num_votes + critics_score + best_pic_nom +     best_pic_win + best_actor_win + best_actress_win + best_dir_win +     top200_box, data = na.omit(movies), prior = "BIC", modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##           Intercept      feature_filmyes             dramayes  
##             1.00000              0.05876              0.04509  
##             runtime     mpaa_rating_Ryes        thtr_rel_year  
##             0.51400              0.16498              0.08089  
##     oscar_seasonyes     summer_seasonyes          imdb_rating  
##             0.06526              0.07935              1.00000  
##      imdb_num_votes        critics_score      best_pic_nomyes  
##             0.06242              0.92016              0.13201  
##     best_pic_winyes    best_actor_winyes  best_actress_winyes  
##             0.04077              0.11565              0.14770  
##     best_dir_winyes        top200_boxyes  
##             0.06701              0.04876

coef_aud = coefficients(full_model)
confint(coef_aud)

##                          2.5  %      97.5  %          beta
## Intercept           61.41378718 6.299937e+01  6.221002e+01
## feature_filmyes     -0.74118331 7.187630e-03 -9.162435e-02
## dramayes             0.00000000 0.000000e+00  1.956484e-02
## runtime             -0.09000381 0.000000e+00 -3.042190e-02
## mpaa_rating_Ryes    -1.96884341 0.000000e+00 -2.409288e-01
## thtr_rel_year       -0.04830590 0.000000e+00 -3.878883e-03
## oscar_seasonyes     -0.80215522 3.540429e-03 -6.293171e-02
## summer_seasonyes    -0.01004381 9.938248e-01  8.660919e-02
## imdb_rating         13.58058810 1.655107e+01  1.490680e+01
## imdb_num_votes       0.00000000 2.348443e-06  2.466254e-07
## critics_score        0.00000000 1.116010e-01  6.951964e-02
## best_pic_nomyes      0.00000000 4.901124e+00  5.130600e-01
## best_pic_winyes      0.00000000 0.000000e+00 -6.383140e-03
## best_actor_winyes   -2.23086473 1.104825e-05 -2.123363e-01
## best_actress_winyes -2.94829437 0.000000e+00 -3.323225e-01
## best_dir_winyes     -1.32813192 0.000000e+00 -1.188402e-01
## top200_boxyes       -0.02269258 5.192369e-02  8.972019e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

summary(full_model)

##      Intercept feature_filmyes dramayes runtime mpaa_rating_Ryes
## [1,]         1               0        0       1                0
## [2,]         1               0        0       0                0
## [3,]         1               0        0       0                0
## [4,]         1               0        0       1                1
## [5,]         1               0        0       1                0
##      thtr_rel_year oscar_seasonyes summer_seasonyes imdb_rating
## [1,]             0               0                0           1
## [2,]             0               0                0           1
## [3,]             0               0                0           1
## [4,]             0               0                0           1
## [5,]             0               0                0           1
##      imdb_num_votes critics_score best_pic_nomyes best_pic_winyes
## [1,]              0             1               0               0
## [2,]              0             1               0               0
## [3,]              0             1               0               0
## [4,]              0             1               0               0
## [5,]              0             1               1               0
##      best_actor_winyes best_actress_winyes best_dir_winyes top200_boxyes
## [1,]                 0                   0               0             0
## [2,]                 0                   0               0             0
## [3,]                 0                   1               0             0
## [4,]                 0                   0               0             0
## [5,]                 0                   0               0             0
##             BF PostProbs     R2 dim   logmarg
## [1,] 1.0000000    0.1558 0.7483   4 -3434.752
## [2,] 0.8715404    0.1358 0.7455   3 -3434.889
## [3,] 0.2048238    0.0319 0.7470   4 -3436.338
## [4,] 0.2039916    0.0318 0.7496   5 -3436.342
## [5,] 0.1851908    0.0289 0.7495   5 -3436.438

image(full_model,rotate=FALSE)

Reduced_model <- lm(data=na.omit(movies), audience_score ~ imdb_rating + critics_score + runtime)

plot(Reduced_model)

final_reduced <- bas.lm(data=na.omit(movies), audience_score ~ imdb_rating + critics_score + runtime,
                 prior= 'BIC', modelprior = uniform())

coef_reduced = coefficients(final_reduced)

confint(coef_reduced)

##                  2.5  %    97.5  %        beta
## Intercept     61.430560 63.0316820 62.21001616
## imdb_rating   13.542469 16.5125165 14.89079350
## critics_score  0.000000  0.1122759  0.07128748
## runtime       -0.088141  0.0000000 -0.03138428
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

summary(final_reduced)

##      Intercept imdb_rating critics_score runtime           BF PostProbs
## [1,]         1           1             1       1 1.000000e+00    0.4982
## [2,]         1           1             1       0 8.715404e-01    0.4342
## [3,]         1           1             0       1 8.330447e-02    0.0415
## [4,]         1           1             0       0 5.252575e-02    0.0262
## [5,]         1           0             1       0 2.820632e-92    0.0000
##          R2 dim   logmarg
## [1,] 0.7483   4 -3434.752
## [2,] 0.7455   3 -3434.889
## [3,] 0.7436   3 -3437.237
## [4,] 0.7405   2 -3437.698
## [5,] 0.4921   2 -3645.553

image(final_reduced,rotate=FALSE)

plot(final_reduced)

Part 5: Prediction

I will do the prediction for’Kubo and the two strings’, the movie is released in this summer. Since at this moment there is no information about wether this movie will be nominated with oscar award, I will suppose that this movie will not be nominated as best pic and it will not win any oscar award, like best pic, directory, actor or actress.

The movie reviews references on Rotten Tomatoes and IMDB are:

http://www.imdb.com/title/tt4302938/

https://www.rottentomatoes.com/m/kubo_and_the_two_strings_2016

KUBO AND THE TWO STRINGS(2016)

http://www.imdb.com/title/tt4302938/

https://www.rottentomatoes.com/m/kubo_and_the_two_strings_2016

The 4th film created by Laika studios. TOOK FIVE YEARS TO PRODUCE.Release date 19 August(USA).

In the epic fantasy, scruffy, kindhearted Kubo ekes out a humble living while devotedly caring for his mother in their sleepy shoreside village. It is a quiet existence - until a spirit from the past catches up with him to enforce an age-old vendetta. Suddenly on the run from gods and monsters, Kubo’s chance for survival rests on finding the magical suit of armor once worn by his fallen father, the greatest samurai the world has ever known. Summoning courage, Kubo embarks on a thrilling odyssey as he faces his family’s history, navigates the elements, and bravely fights for the earth and the stars. Critics Consensus: Kubo and the Two Strings matches its incredible animation with an absorbing – and bravely melancholy – story that has something to offer audiences of all ages.PG RATING

ROTTEN TOMATO audience rating 91,critics score 96 average rating 8.3/10 viewers vote 14541, fresh 133 ,rotten 5 ,IMDB RATINGS:RUNTIME 101 RATING 8.4/10,ANIMATION ADVENTURE,FAMILY CLASS AS DRAMA,TOP RATED FILM.

   df2 <- data.frame(runtime=101,mpaa_rating="R",thtr_rel_year=2016,imdb_rating=8.3,
                      imdb_num_votes=14541,critics_score=96,audience_score=93,
                      best_pic_win="no",best_actor_win="no",best_actress_win="no",
                      best_dir_win="no",top200_box="yes",mpaa_rating_R="yes",oscar_season="no",
                      drama="no",feature_film="yes",summer_season="yes")

        df<-rbind(df,df2)
        df2<-tail(df,1)
        names(df2)

##  [1] "runtime"          "mpaa_rating"      "thtr_rel_year"   
##  [4] "imdb_rating"      "imdb_num_votes"   "critics_score"   
##  [7] "audience_score"   "best_pic_win"     "best_actor_win"  
## [10] "best_actress_win" "best_dir_win"     "top200_box"      
## [13] "mpaa_rating_R"    "oscar_season"     "drama"           
## [16] "feature_film"     "summer_season"

   #new_movie <- data.frame(audience_score=93,runtime=101, imdb_rating=8.3, critics_score=96)
   #new_movie
   #    audience_score runtime imdb_rating critics_score
   #  1   93     101         8.3            96
   #BMA_pred_aud =  predict(bma_aud, newdata= df2, estimator="BMA", se.fit=T)
   # predict(score_step,df2, interval="predict")
   #      fit       lwr      upr
   #   1 90.92774 71.13529 110.7202
   # predict(final_model, new_movie)
   # 92.56143

Diagnostics and Validation of Models

There are several R packages which are used to quantify and validate modeling and predictions. Some well known packages are: MCMCPackage,BMA,BAS,MPM,JAGS,STAN and BayesValidate.Some of the packages were utilized to run modeling and prediction of ‘movies’ popularity(BMA BMS,and MCMC). Another package for hypotheses test and ploting called “inference” was also used.An algorithm by Harold Jeffreys is also often used for evaluation purposes. This is Bayesian advanced topic and is currently out of scope of this report.

Part 6: Conclusion

In this project, I studied the audience score for a movie and how it depends on other variables in the sample. Using Bayesian model average many models can be constructed to perform a better prediction.

The fact that imdb_rating has the highest posterior probability, and that two of our five newly created variables which seem to have predictive power are not included in the best model.

A few personal notes can be added like the Drama films are more popular, feature/non-feature films are different and length of runtime has no bearing on the popularity of a movie.The critics score and IMDB rating are similar. One of these could be ignored.

The used approach was unable to show our full sentiments “prior” knowledge. Perhaps some other (other than IMDB and Rotten Tomatoes) ratings with more priors density variables could be used to utilize the power of Bayesian prediction. A knowledgable prior can be better.

Bayesian Modeling and Movies Popularity Prediction

Tahir Hussain

30th Aug 2016