Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)

Load data

load("movies.Rdata")

Part 1: Data

The data set is comprised of 651 randomly sampled movies produced and released before 2016.

The results are not possible for causality. Why not? Since there is no random assignment of sample subjects, the data cannot be used to answer causation questions.

Since the sample is big, It is possible to generalize with great cautions.

We must use, according to the course guidelines, some variables and others you must transform.

Below, see the entire process of transformation and selection of variables:

 movies <- mutate(movies, mpaa_rating_R = ifelse(mpaa_rating=="R","yes", "no"))
 movies <- mutate(movies,oscar_season=ifelse(thtr_rel_month %in% c(10, 11, 12), "yes", "no")) 
 movies <- mutate(movies,drama=ifelse(genre=='Drama', 'yes', 'no'))
 movies<-mutate(movies, feature_film=ifelse(title_type=='Feature Film', 'yes', 'no'))
 movies<-mutate(movies, summer_season=ifelse(thtr_rel_month %in% c(5,6,7,8), 'yes', 'no'))
 movies$feature_film<-as.factor(movies$feature_film)
 movies$drama<-as.factor(movies$drama)
 movies$mpaa_rating_R<-as.factor(movies$mpaa_rating_R)
 movies$oscar_season<-as.factor(movies$oscar_season)
 movies$summer_season<-as.factor(movies$summer_season)

  vars <- names(movies)%in% c("feature_film","drama","runtime","mpaa_rating",
                      "mpaa_rating_R","thtr_rel_year","oscar_season","summer_season","imdb_rating",
                      "imdb_num_votes","critics_score","best_pic_win","best_actor_win",
                      "best_actress_win","best_dir_win","top200_box","audience_score")
                                                               
 subset <- movies[vars]

Part 2: Data manipulation

As seen above, it is called ‘subset’, to be selected to follow the analysis. A function ‘summary’ helps us to see that we select the same variables requested not exercise by COUSERA / DUKE

 summary(subset)

##     runtime       mpaa_rating  thtr_rel_year   imdb_rating    imdb_num_votes  
##  Min.   : 39.0   G      : 19   Min.   :1970   Min.   :1.900   Min.   :   180  
##  1st Qu.: 92.0   NC-17  :  2   1st Qu.:1990   1st Qu.:5.900   1st Qu.:  4546  
##  Median :103.0   PG     :118   Median :2000   Median :6.600   Median : 15116  
##  Mean   :105.8   PG-13  :133   Mean   :1998   Mean   :6.493   Mean   : 57533  
##  3rd Qu.:115.8   R      :329   3rd Qu.:2007   3rd Qu.:7.300   3rd Qu.: 58301  
##  Max.   :267.0   Unrated: 50   Max.   :2014   Max.   :9.000   Max.   :893008  
##  NA's   :1                                                                    
##  critics_score    audience_score  best_pic_win best_actor_win best_actress_win
##  Min.   :  1.00   Min.   :11.00   no :644      no :558        no :579         
##  1st Qu.: 33.00   1st Qu.:46.00   yes:  7      yes: 93        yes: 72         
##  Median : 61.00   Median :65.00                                               
##  Mean   : 57.69   Mean   :62.36                                               
##  3rd Qu.: 83.00   3rd Qu.:80.00                                               
##  Max.   :100.00   Max.   :97.00                                               
##                                                                               
##  best_dir_win top200_box mpaa_rating_R oscar_season drama     feature_film
##  no :608      no :636    no :322       no :460      no :346   no : 60     
##  yes: 43      yes: 15    yes:329       yes:191      yes:305   yes:591     
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##  summer_season
##  no :443      
##  yes:208      
##               
##               
##               
##               
##

To finalize, is necessary to do more adjusted in data, using function ‘arrange’. The dplyr function arrange() can be used to reorder (or sort) rows by one or more variables. Instead of using the function desc(), you can prepend the sorting variable by a minus sign to indicate descending order, as follow. If the data contain missing values, they will always come at the end.

 df <- arrange(subset,audience_score,critics_score,runtime,imdb_rating,imdb_num_votes,feature_film,
         drama,mpaa_rating,mpaa_rating_R,thtr_rel_year,oscar_season,summer_season,best_pic_win,
         best_actor_win, best_actress_win, best_dir_win,top200_box)

 arrange(df, desc(audience_score))

## # A tibble: 651 x 17
##    runtime mpaa_rating thtr_rel_year imdb_rating imdb_num_votes critics_score
##      <dbl> <fct>               <dbl>       <dbl>          <int>         <dbl>
##  1     202 R                    1974         9           749783            97
##  2     106 Unrated              2002         8.5           2362            96
##  3     121 R                    1991         7.5           2551            38
##  4     133 R                    1993         8.1         103378            94
##  5     154 PG-13                1985         7.8          58668            88
##  6     113 R                    2000         8.5         806911            92
##  7     126 R                    1997         8.3         572236            97
##  8     137 R                    1986         8.4         466400            98
##  9      94 Unrated              2010         7.4           1935           100
## 10      94 R                    1996         8.2         448434            89
## # ... with 641 more rows, and 11 more variables: audience_score <dbl>,
## #   best_pic_win <fct>, best_actor_win <fct>, best_actress_win <fct>,
## #   best_dir_win <fct>, top200_box <fct>, mpaa_rating_R <fct>,
## #   oscar_season <fct>, drama <fct>, feature_film <fct>, summer_season <fct>

With this last adjustment, ‘df’ becomes the name of our base to be used.

Remember that the specific modeling task you need to complete is as follows: Develop a Bayesian regression model to predict audience_score from the following explanatory variables.

Part 3: Exploratory data analysis

The first thing to do is to analyze ‘audience score’. Our variable to be explained in this activity.

We’ve seen it before during the transformation process, let’s review the basic data below:

summary(df$audience_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   46.00   65.00   62.36   80.00   97.00

Looking at the above, we can see that it is slightly rigth skewed. A histogram of this variable is also a good idea.

hist(df$audience_score)

Confirm that analyse. Next, you will observe the exploratory analysis of all explanatory variables in relation to the ‘audience score’.

3.1 EDA with feature_film

 ggplot(df, aes(x=feature_film, y=audience_score, fill=feature_film))+ geom_boxplot()

Does being a feature film influence public criticism? Apparently so, as short films, as we see above, seem to get better grades in general. Two possible outliers are found in short films below 25.

3.2 EDA with runtime

ggplot(df, aes(x = runtime  , y = audience_score)) +
  geom_point()

## Warning: Removed 1 rows containing missing values (geom_point).

There does not seem to be an association between the variables, the film time does not seem to clearly increase or decrease the audience’s rating.

3.3 EDA with critics score and imbd_rating

ggplot(df, aes(x = critics_score  , y = audience_score)) +
  geom_point()

At first contact, there seems to be a positive association. That is, the better the specialized criticism, the better the reception of the public.

df %>%
  summarize(cor(critics_score, audience_score))

## # A tibble: 1 x 1
##   `cor(critics_score, audience_score)`
##                                  <dbl>
## 1                                0.704

The 0.70 correlation seems to point in the same direction. Let’s see below, if another critical measure has the same meaning (imdb_rating).

ggplot(df, aes(x =imdb_rating  , y = audience_score)) +
  geom_point()

df %>%
  summarize(cor(imdb_rating, audience_score))

## # A tibble: 1 x 1
##   `cor(imdb_rating, audience_score)`
##                                <dbl>
## 1                              0.865

The relationship is even stronger. (0.84) A suspicion that should be analyzed in the future is if the two variables analyzed here (critics_score and imdb_rating) are collinear, or can disturb the models, the tests will be able to verify this more accurately. However, below a simple correlation test makes us pay attention to this fact.

df %>%
  summarize(cor(imdb_rating, critics_score))

## # A tibble: 1 x 1
##   `cor(imdb_rating, critics_score)`
##                               <dbl>
## 1                             0.765

Yellow sign for possible risk.

3.4 EDA with thtr_rel_year

For the variable year of release of the film, we chose to treat it as a factorial variable.

ggplot(df, aes(x=as.factor(thtr_rel_year), y=audience_score, fill=thtr_rel_year))+ geom_boxplot()

The launch year does seem to have some influence. This can be seen in the graph above, in which some years appear to have a long distribution, others are more concentrated in fewer grades. Another factor that deserves attention is that the median values are extremely variable over the years. It is not yet possible to say with certainty, but it is possible to speculate that we had “good” and “bad” years according to the American public.

3.5 EDA with OSCAR WINS

In this section 3.5 we briefly analyze them, the relationship of the variable to be explained with those of nomination to oscar. Which are:

Won Oscar´s best film? (Yes or no)
Won Oscar´s best actor? (Yes or no)
Won OScar´s best actress? (Yes or no)
Won OScar´s best director? (yes or no)

ggplot(df, aes(x=best_pic_win, y=audience_score, fill=best_pic_win))+ geom_boxplot()

ggplot(df, aes(x=best_actor_win, y=audience_score, fill=best_actor_win))+ geom_boxplot()

ggplot(df, aes(x=best_actress_win, y=audience_score, fill=best_actress_win))+ geom_boxplot()

ggplot(df, aes(x=best_dir_win, y=audience_score, fill=best_dir_win))+ geom_boxplot()

The relationship seems to occur strongly only if the film is nominated as the best film. In other cases, not so visually, although in the case of the director there seems to be a slight predisposition that the films with the most nominees for the award have better grades.

3.6 EDA with number of voters on IMDB

ggplot(df, aes(x =imdb_num_votes  , y = audience_score)) +
  geom_point()

Does the number of voters on IMDB influence? Apparently not much. What does the correlation data tell us?

df %>%
  summarize(cor(imdb_num_votes, audience_score))

## # A tibble: 1 x 1
##   `cor(imdb_num_votes, audience_score)`
##                                   <dbl>
## 1                                 0.290

0.28 is a low number, but it proved to be a predictor that should be investigated in the Bayesian tests that will appear in part 4 of this report.

3.7 EDA with TOP 200 box

The ‘by’ function is very interesting, it allows you to observe statistical data such as median, average, min, max and the quartiles of films that are in the top 200 box and those that are not.

by(df$audience_score, df$top200_box, summary)

## df$top200_box: no
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   46.00   65.00   62.08   79.00   97.00 
## ------------------------------------------------------------ 
## df$top200_box: yes
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34.00   71.00   81.00   74.53   83.50   92.00

Does being part of the 200-film BOX have any influence? Apparently, yes. The numbers indicate that being in the TOP 200 box raises the rating.See especially the differences in value of the first quartile.

3.8 EDA with MPAA rating

MPAA rating of the movie (G, PG, PG-13, R, Unrated) is the rate a film’s suitability for certain audiences based on its content. The classifications are:

G - General Audiences

PG - Parental Guidance Suggested

PG-13 - Parents Strongly Cautioned

R - Restricted

UNRATED - NC-17 - Adults Only

A plot view is a good way for this initial exploration

ggplot(df, aes(x=mpaa_rating, y=audience_score, fill=mpaa_rating))+ geom_boxplot()

At first glance, Unrated films seem to be the ones that get the best score.

For this exercise, as you may remember, we created a variable mpaa_rating_R with “yes” if mpaa_rating is R, “no” otherwise. Following the analysis:

ggplot(df, aes(x=mpaa_rating_R, y=audience_score, fill=mpaa_rating_R))+ geom_boxplot()

Seems there ano no association.The fact that Unrated is in the NO category, may have mixed up this data. But this is just a suspicion.

3.9 EDA with OSCAR_SEASON

by(df$audience_score, df$oscar_season, summary)

## df$oscar_season: no
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   46.00   64.00   61.81   79.00   96.00 
## ------------------------------------------------------------ 
## df$oscar_season: yes
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   47.50   69.00   63.69   81.00   97.00

The numbers are close, so at first glance, there seems to be no association.

3.10 EDA with Summer_season

by(df$audience_score, df$summer_season, summary)

## df$summer_season: no
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   46.00   66.00   62.62   80.00   97.00 
## ------------------------------------------------------------ 
## df$summer_season: yes
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   44.75   65.00   61.81   78.00   94.00

The numbers are close, so at first glance, there does not seem to be any association as with the ´Oscar_Season´ variable.

3.11 EDA with DRAMA or NOt DRAMA

Our last variable above:

ggplot(df, aes(x=drama, y=audience_score, fill=drama))+ geom_boxplot()

by(df$audience_score, df$drama, summary)

## df$drama: no
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   41.00   61.00   59.73   79.00   97.00 
## ------------------------------------------------------------ 
## df$drama: yes
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   52.00   70.00   65.35   80.00   95.00

There seems to be an association between the film of genre drama having better grades with the public. The median value displayed on the plot and seen in the by function also indicates this.

Part 4: Modeling

NOTE: The purpose of this document is to complete the data analysis project required during week 5 of the Bayesian Statistics course by Duke University (Coursera.). BMA will be used to find the best model. Standard statistical practice ignores model uncertainty. Data analysts typically select a model from some class of models and then proceed as if the selected model had generated the data. This approach ignores the uncertainty in model selection, leading to over-confident inferences. Bayesian model averaging (BMA) provides a coherent mechanism for accounting for this model uncertainty when deriving parameter estimates. This work uses stepwise backward elimination process to construct final model. Starting with full model and removing variables.

This will be done until the BIC cannot be lowered further.

The BIC in statistics is the BAYESIAN INFORMATION CRITERIA, which we learned to use in this course unit.

The first step is to present the complete model below:

 full_m <- lm(audience_score ~ ., data = df)
  score_step <- MASS::stepAIC(full_m, trace = FALSE)
  score_step$anova

## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## audience_score ~ runtime + mpaa_rating + thtr_rel_year + imdb_rating + 
##     imdb_num_votes + critics_score + best_pic_win + best_actor_win + 
##     best_actress_win + best_dir_win + top200_box + mpaa_rating_R + 
##     oscar_season + drama + feature_film + summer_season
## 
## Final Model:
## audience_score ~ runtime + thtr_rel_year + imdb_rating + imdb_num_votes + 
##     critics_score + best_actress_win
## 
## 
##                Step Df   Deviance Resid. Df Resid. Dev      AIC
## 1                                       630   63172.60 3014.825
## 2   - mpaa_rating_R  0   0.000000       630   63172.60 3014.825
## 3     - mpaa_rating  5 563.905967       635   63736.51 3010.601
## 4    - best_pic_win  1   1.948927       636   63738.46 3008.621
## 5    - oscar_season  1   9.743905       637   63748.20 3006.721
## 6      - top200_box  1  16.676369       638   63764.88 3004.891
## 7    - best_dir_win  1  90.517834       639   63855.39 3003.813
## 8  - best_actor_win  1 119.151581       640   63974.55 3003.024
## 9   - summer_season  1 168.616474       641   64143.16 3002.735
## 10          - drama  1 167.988533       642   64311.15 3002.435
## 11   - feature_film  1 166.871295       643   64478.02 3002.120

   vars1 <- names(movies) %in% c("audience_score","thtr_rel_year","critics_score","imdb_rating",
                                "imdb_num_votes","best_actress_win")
                                                          
  df1 <- movies[vars1] 
 
  df1 <- na.omit(df1)

Now, we will use BMA, BMP and MPM with our reduced dataset created above. This is df1.

movies_no_na = na.omit(df1)

bm = bas.lm(audience_score ~ ., data = movies_no_na,
                   prior = "BIC", 
                   modelprior = uniform())
bm

## 
## Call:
## bas.lm(formula = audience_score ~ ., data = movies_no_na, prior = "BIC", 
##     modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##           Intercept        thtr_rel_year          imdb_rating  
##             1.00000              0.07369              1.00000  
##      imdb_num_votes        critics_score  best_actress_winyes  
##             0.04217              0.92100              0.17944

In our model generated above, imdb_rating seems to be the best predictor, critics_score is also good value. The others don’t seem to be as impactful.

summary(bm)

##                     P(B != 0 | Y)    model 1       model 2       model 3
## Intercept              1.00000000     1.0000     1.0000000  1.000000e+00
## thtr_rel_year          0.07368732     0.0000     0.0000000  0.000000e+00
## imdb_rating            1.00000000     1.0000     1.0000000  1.000000e+00
## imdb_num_votes         0.04216649     0.0000     0.0000000  0.000000e+00
## critics_score          0.92100167     1.0000     1.0000000  0.000000e+00
## best_actress_winyes    0.17944346     0.0000     1.0000000  0.000000e+00
## BF                             NA     1.0000     0.2178048  8.469154e-02
## PostProbs                      NA     0.6727     0.1465000  5.700000e-02
## R2                             NA     0.7524     0.7537000  7.480000e-01
## dim                            NA     3.0000     4.0000000  2.000000e+00
## logmarg                        NA -3621.0579 -3622.5820102 -3.623527e+03
##                           model 4       model 5
## Intercept            1.000000e+00  1.000000e+00
## thtr_rel_year        1.000000e+00  0.000000e+00
## imdb_rating          1.000000e+00  1.000000e+00
## imdb_num_votes       0.000000e+00  1.000000e+00
## critics_score        1.000000e+00  1.000000e+00
## best_actress_winyes  0.000000e+00  0.000000e+00
## BF                   7.534253e-02  4.304024e-02
## PostProbs            5.070000e-02  2.900000e-02
## R2                   7.529000e-01  7.524000e-01
## dim                  4.000000e+00  4.000000e+00
## logmarg             -3.623644e+03 -3.624203e+03

It is possible to infer that from the list above it is possible that model 1 and model 2 together account for about 81.8% of being the best model. Model 1 alone obtained 67.2% of this percentage. Model 1, does not obtain the best R2, but obtains a great value from this measure.

We forward to the end with the ‘confint’ function so that we can observe the confidence interval.

Reduced_model <- lm(data=na.omit(movies), audience_score ~ imdb_rating + imdb_num_votes + best_actress_win + critics_score + thtr_rel_year )

final_reduced <- bas.lm(data=na.omit(movies), audience_score ~ imdb_rating + imdb_num_votes + best_actress_win +critics_score + thtr_rel_year,prior= 'BIC', modelprior = uniform())

coef_reduced = coefficients(final_reduced)

confint(coef_reduced)

##                            2.5%      97.5%          beta
## Intercept           61.42260768 63.0393841  6.221002e+01
## imdb_rating         13.36671545 16.1519246  1.468989e+01
## imdb_num_votes       0.00000000  0.0000000  8.880979e-08
## best_actress_winyes -3.42381004  0.0000000 -4.628500e-01
## critics_score        0.00000000  0.1138073  7.328937e-02
## thtr_rel_year       -0.02327671  0.0000000 -2.525369e-03
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

Most of the variables, including the newly created variables, were been eliminated from the full model to create the reduced model. By looking at the table directly above this, we can see that the highest rated model contains the intercept, imdb_rating, imdb_num_vote and critics score. Imdb_rating is the single biggest predictor of popularity.

However, it is worth testing a simple model, with the best correlation variable in the EDA, which is imdb_rating. See this new model, created in the ‘lm’ function, below:

m_audience_imdb<- lm(audience_score ~ imdb_rating, data = df)

The model created above was named ‘m_audience_imdb’. To verify the data of this model, we will use the functions ‘tidy’ and ‘confint’ below.

confint(m_audience_imdb)

##                 2.5 %    97.5 %
## (Intercept) -47.07704 -37.57969
## imdb_rating  15.40207  16.84480

For each additional Imdb_rating, there is a 95% chance that audience_critics will increase by 15.4 to 16.8.

Final note: It is worth noting that we chose to do this simple regression in order to separately analyze the variable that in Bayesian diagnoses proved to be more predictive. * * *

Part 5: Prediction

Our chosen film was the same from Unit 3, ‘joker’. For this step we will use the model reduced and m_audience_imdb

#We first created a reduced_model_2 without the ‘year’, because the movie is from 2019.
reduced_model_2 <- lm(data=na.omit(movies), audience_score ~ imdb_rating + imdb_num_votes + best_actress_win + critics_score)

Source: https://www.imdb.com/title/tt7286456/?ref_=fn_al_tt_1

AND

https://www.rottentomatoes.com/m/joker_2019

newdata <- data.frame(audience_score=88, imdb_rating=8.5, imdb_num_votes=844507,critics_score=68, best_actress_win="no")

joker <- round(predict(reduced_model_2, newdata), digit = 0)

c(joker , newdata$audience_score)

##  1    
## 95 88

The predicted value of 95 is greater than the real of 88. The prediction is higher then expected. We also contruct a prediction interval around this prediction which will provide the accuracy of the prediction.

predict(reduced_model_2, newdata, interval = 'prediction', level = 0.95)

##        fit      lwr      upr
## 1 94.62117 73.82273 115.4196

We are 95% confident that the Joker movie will have audience_score range from 73.8 to 115.41 (the highest possible score is 100) . And that proved to be satisfactory as it is within the range.

We also tested the simple regression model.

newdata2 <- data.frame(audience_score=88, imdb_rating=8.5)

joker_2 <- round(predict(m_audience_imdb, newdata2), digit = 0)

c(joker_2 , newdata2$audience_score)

##  1    
## 95 88

predict(m_audience_imdb, newdata2, interval = 'prediction', level = 0.95)

##        fit      lwr      upr
## 1 94.72084 74.70333 114.7384

The prediction values of the chosen film did not vary from one model to another.

Part 6: Conclusion

As an exercise it was great, however, expert opinions could improve this data. The Bayesian statistics proposal is excellent because it allows changing the priors. Another relevant final note is that the two proposed models (simple and final_reduced) show a somewhat accurate prediction rate, but it is not dead accurate.

The prediction values of “joker” did not vary from the simple model to another the reduced_model_2, which sheds light on the fact of the great predictive capacity, at least in this case, of the variable imdb_rating.

It was my first experience with Bayesian methods, I feel that I need more experience to have more audacity in data analysis. Always open to criticism, thank you!