library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)
load("movies.Rdata")
The data set is comprised of 651 randomly sampled movies produced and released before 2016.
The results are not possible for causality. Why not? Since there is no random assignment of sample subjects, the data cannot be used to answer causation questions.
Since the sample is big, It is possible to generalize with great cautions.
We must use, according to the course guidelines, some variables and others you must transform.
Below, see the entire process of transformation and selection of variables:
movies <- mutate(movies, mpaa_rating_R = ifelse(mpaa_rating=="R","yes", "no"))
movies <- mutate(movies,oscar_season=ifelse(thtr_rel_month %in% c(10, 11, 12), "yes", "no"))
movies <- mutate(movies,drama=ifelse(genre=='Drama', 'yes', 'no'))
movies<-mutate(movies, feature_film=ifelse(title_type=='Feature Film', 'yes', 'no'))
movies<-mutate(movies, summer_season=ifelse(thtr_rel_month %in% c(5,6,7,8), 'yes', 'no'))
movies$feature_film<-as.factor(movies$feature_film)
movies$drama<-as.factor(movies$drama)
movies$mpaa_rating_R<-as.factor(movies$mpaa_rating_R)
movies$oscar_season<-as.factor(movies$oscar_season)
movies$summer_season<-as.factor(movies$summer_season)
vars <- names(movies)%in% c("feature_film","drama","runtime","mpaa_rating",
"mpaa_rating_R","thtr_rel_year","oscar_season","summer_season","imdb_rating",
"imdb_num_votes","critics_score","best_pic_win","best_actor_win",
"best_actress_win","best_dir_win","top200_box","audience_score")
subset <- movies[vars]
As seen above, it is called ‘subset’, to be selected to follow the analysis. A function ‘summary’ helps us to see that we select the same variables requested not exercise by COUSERA / DUKE
summary(subset)
## runtime mpaa_rating thtr_rel_year imdb_rating imdb_num_votes
## Min. : 39.0 G : 19 Min. :1970 Min. :1.900 Min. : 180
## 1st Qu.: 92.0 NC-17 : 2 1st Qu.:1990 1st Qu.:5.900 1st Qu.: 4546
## Median :103.0 PG :118 Median :2000 Median :6.600 Median : 15116
## Mean :105.8 PG-13 :133 Mean :1998 Mean :6.493 Mean : 57533
## 3rd Qu.:115.8 R :329 3rd Qu.:2007 3rd Qu.:7.300 3rd Qu.: 58301
## Max. :267.0 Unrated: 50 Max. :2014 Max. :9.000 Max. :893008
## NA's :1
## critics_score audience_score best_pic_win best_actor_win best_actress_win
## Min. : 1.00 Min. :11.00 no :644 no :558 no :579
## 1st Qu.: 33.00 1st Qu.:46.00 yes: 7 yes: 93 yes: 72
## Median : 61.00 Median :65.00
## Mean : 57.69 Mean :62.36
## 3rd Qu.: 83.00 3rd Qu.:80.00
## Max. :100.00 Max. :97.00
##
## best_dir_win top200_box mpaa_rating_R oscar_season drama feature_film
## no :608 no :636 no :322 no :460 no :346 no : 60
## yes: 43 yes: 15 yes:329 yes:191 yes:305 yes:591
##
##
##
##
##
## summer_season
## no :443
## yes:208
##
##
##
##
##
To finalize, is necessary to do more adjusted in data, using function ‘arrange’. The dplyr function arrange() can be used to reorder (or sort) rows by one or more variables. Instead of using the function desc(), you can prepend the sorting variable by a minus sign to indicate descending order, as follow. If the data contain missing values, they will always come at the end.
df <- arrange(subset,audience_score,critics_score,runtime,imdb_rating,imdb_num_votes,feature_film,
drama,mpaa_rating,mpaa_rating_R,thtr_rel_year,oscar_season,summer_season,best_pic_win,
best_actor_win, best_actress_win, best_dir_win,top200_box)
arrange(df, desc(audience_score))
## # A tibble: 651 x 17
## runtime mpaa_rating thtr_rel_year imdb_rating imdb_num_votes critics_score
## <dbl> <fct> <dbl> <dbl> <int> <dbl>
## 1 202 R 1974 9 749783 97
## 2 106 Unrated 2002 8.5 2362 96
## 3 121 R 1991 7.5 2551 38
## 4 133 R 1993 8.1 103378 94
## 5 154 PG-13 1985 7.8 58668 88
## 6 113 R 2000 8.5 806911 92
## 7 126 R 1997 8.3 572236 97
## 8 137 R 1986 8.4 466400 98
## 9 94 Unrated 2010 7.4 1935 100
## 10 94 R 1996 8.2 448434 89
## # ... with 641 more rows, and 11 more variables: audience_score <dbl>,
## # best_pic_win <fct>, best_actor_win <fct>, best_actress_win <fct>,
## # best_dir_win <fct>, top200_box <fct>, mpaa_rating_R <fct>,
## # oscar_season <fct>, drama <fct>, feature_film <fct>, summer_season <fct>
With this last adjustment, ‘df’ becomes the name of our base to be used.
Remember that the specific modeling task you need to complete is as follows: Develop a Bayesian regression model to predict audience_score from the following explanatory variables.
The first thing to do is to analyze ‘audience score’. Our variable to be explained in this activity.
We’ve seen it before during the transformation process, let’s review the basic data below:
summary(df$audience_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 46.00 65.00 62.36 80.00 97.00
Looking at the above, we can see that it is slightly rigth skewed. A histogram of this variable is also a good idea.
hist(df$audience_score)
Confirm that analyse. Next, you will observe the exploratory analysis of all explanatory variables in relation to the ‘audience score’.
ggplot(df, aes(x=feature_film, y=audience_score, fill=feature_film))+ geom_boxplot()
Does being a feature film influence public criticism? Apparently so, as short films, as we see above, seem to get better grades in general. Two possible outliers are found in short films below 25.
ggplot(df, aes(x = runtime , y = audience_score)) +
geom_point()
## Warning: Removed 1 rows containing missing values (geom_point).
There does not seem to be an association between the variables, the film time does not seem to clearly increase or decrease the audience’s rating.
ggplot(df, aes(x = critics_score , y = audience_score)) +
geom_point()
At first contact, there seems to be a positive association. That is, the better the specialized criticism, the better the reception of the public.
df %>%
summarize(cor(critics_score, audience_score))
## # A tibble: 1 x 1
## `cor(critics_score, audience_score)`
## <dbl>
## 1 0.704
The 0.70 correlation seems to point in the same direction. Let’s see below, if another critical measure has the same meaning (imdb_rating).
ggplot(df, aes(x =imdb_rating , y = audience_score)) +
geom_point()
df %>%
summarize(cor(imdb_rating, audience_score))
## # A tibble: 1 x 1
## `cor(imdb_rating, audience_score)`
## <dbl>
## 1 0.865
The relationship is even stronger. (0.84) A suspicion that should be analyzed in the future is if the two variables analyzed here (critics_score and imdb_rating) are collinear, or can disturb the models, the tests will be able to verify this more accurately. However, below a simple correlation test makes us pay attention to this fact.
df %>%
summarize(cor(imdb_rating, critics_score))
## # A tibble: 1 x 1
## `cor(imdb_rating, critics_score)`
## <dbl>
## 1 0.765
Yellow sign for possible risk.
For the variable year of release of the film, we chose to treat it as a factorial variable.
ggplot(df, aes(x=as.factor(thtr_rel_year), y=audience_score, fill=thtr_rel_year))+ geom_boxplot()
The launch year does seem to have some influence. This can be seen in the graph above, in which some years appear to have a long distribution, others are more concentrated in fewer grades. Another factor that deserves attention is that the median values are extremely variable over the years. It is not yet possible to say with certainty, but it is possible to speculate that we had “good” and “bad” years according to the American public.
In this section 3.5 we briefly analyze them, the relationship of the variable to be explained with those of nomination to oscar. Which are:
Won Oscar´s best film? (Yes or no)
Won Oscar´s best actor? (Yes or no)
Won OScar´s best actress? (Yes or no)
Won OScar´s best director? (yes or no)
ggplot(df, aes(x=best_pic_win, y=audience_score, fill=best_pic_win))+ geom_boxplot()
ggplot(df, aes(x=best_actor_win, y=audience_score, fill=best_actor_win))+ geom_boxplot()
ggplot(df, aes(x=best_actress_win, y=audience_score, fill=best_actress_win))+ geom_boxplot()
ggplot(df, aes(x=best_dir_win, y=audience_score, fill=best_dir_win))+ geom_boxplot()
The relationship seems to occur strongly only if the film is nominated as the best film. In other cases, not so visually, although in the case of the director there seems to be a slight predisposition that the films with the most nominees for the award have better grades.
ggplot(df, aes(x =imdb_num_votes , y = audience_score)) +
geom_point()
Does the number of voters on IMDB influence? Apparently not much. What does the correlation data tell us?
df %>%
summarize(cor(imdb_num_votes, audience_score))
## # A tibble: 1 x 1
## `cor(imdb_num_votes, audience_score)`
## <dbl>
## 1 0.290
0.28 is a low number, but it proved to be a predictor that should be investigated in the Bayesian tests that will appear in part 4 of this report.
The ‘by’ function is very interesting, it allows you to observe statistical data such as median, average, min, max and the quartiles of films that are in the top 200 box and those that are not.
by(df$audience_score, df$top200_box, summary)
## df$top200_box: no
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 46.00 65.00 62.08 79.00 97.00
## ------------------------------------------------------------
## df$top200_box: yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34.00 71.00 81.00 74.53 83.50 92.00
Does being part of the 200-film BOX have any influence? Apparently, yes. The numbers indicate that being in the TOP 200 box raises the rating.See especially the differences in value of the first quartile.
MPAA rating of the movie (G, PG, PG-13, R, Unrated) is the rate a film’s suitability for certain audiences based on its content. The classifications are:
G - General Audiences
PG - Parental Guidance Suggested
PG-13 - Parents Strongly Cautioned
R - Restricted
UNRATED - NC-17 - Adults Only
A plot view is a good way for this initial exploration
ggplot(df, aes(x=mpaa_rating, y=audience_score, fill=mpaa_rating))+ geom_boxplot()
At first glance, Unrated films seem to be the ones that get the best score.
For this exercise, as you may remember, we created a variable mpaa_rating_R with “yes” if mpaa_rating is R, “no” otherwise. Following the analysis:
ggplot(df, aes(x=mpaa_rating_R, y=audience_score, fill=mpaa_rating_R))+ geom_boxplot()
Seems there ano no association.The fact that Unrated is in the NO category, may have mixed up this data. But this is just a suspicion.
by(df$audience_score, df$oscar_season, summary)
## df$oscar_season: no
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 46.00 64.00 61.81 79.00 96.00
## ------------------------------------------------------------
## df$oscar_season: yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 47.50 69.00 63.69 81.00 97.00
The numbers are close, so at first glance, there seems to be no association.
by(df$audience_score, df$summer_season, summary)
## df$summer_season: no
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 46.00 66.00 62.62 80.00 97.00
## ------------------------------------------------------------
## df$summer_season: yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 44.75 65.00 61.81 78.00 94.00
The numbers are close, so at first glance, there does not seem to be any association as with the ´Oscar_Season´ variable.
Our last variable above:
ggplot(df, aes(x=drama, y=audience_score, fill=drama))+ geom_boxplot()
by(df$audience_score, df$drama, summary)
## df$drama: no
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 41.00 61.00 59.73 79.00 97.00
## ------------------------------------------------------------
## df$drama: yes
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 52.00 70.00 65.35 80.00 95.00
There seems to be an association between the film of genre drama having better grades with the public. The median value displayed on the plot and seen in the by function also indicates this.
NOTE: The purpose of this document is to complete the data analysis project required during week 5 of the Bayesian Statistics course by Duke University (Coursera.). BMA will be used to find the best model. Standard statistical practice ignores model uncertainty. Data analysts typically select a model from some class of models and then proceed as if the selected model had generated the data. This approach ignores the uncertainty in model selection, leading to over-confident inferences. Bayesian model averaging (BMA) provides a coherent mechanism for accounting for this model uncertainty when deriving parameter estimates. This work uses stepwise backward elimination process to construct final model. Starting with full model and removing variables.
This will be done until the BIC cannot be lowered further.
The BIC in statistics is the BAYESIAN INFORMATION CRITERIA, which we learned to use in this course unit.
The first step is to present the complete model below:
full_m <- lm(audience_score ~ ., data = df)
score_step <- MASS::stepAIC(full_m, trace = FALSE)
score_step$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## audience_score ~ runtime + mpaa_rating + thtr_rel_year + imdb_rating +
## imdb_num_votes + critics_score + best_pic_win + best_actor_win +
## best_actress_win + best_dir_win + top200_box + mpaa_rating_R +
## oscar_season + drama + feature_film + summer_season
##
## Final Model:
## audience_score ~ runtime + thtr_rel_year + imdb_rating + imdb_num_votes +
## critics_score + best_actress_win
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 630 63172.60 3014.825
## 2 - mpaa_rating_R 0 0.000000 630 63172.60 3014.825
## 3 - mpaa_rating 5 563.905967 635 63736.51 3010.601
## 4 - best_pic_win 1 1.948927 636 63738.46 3008.621
## 5 - oscar_season 1 9.743905 637 63748.20 3006.721
## 6 - top200_box 1 16.676369 638 63764.88 3004.891
## 7 - best_dir_win 1 90.517834 639 63855.39 3003.813
## 8 - best_actor_win 1 119.151581 640 63974.55 3003.024
## 9 - summer_season 1 168.616474 641 64143.16 3002.735
## 10 - drama 1 167.988533 642 64311.15 3002.435
## 11 - feature_film 1 166.871295 643 64478.02 3002.120
vars1 <- names(movies) %in% c("audience_score","thtr_rel_year","critics_score","imdb_rating",
"imdb_num_votes","best_actress_win")
df1 <- movies[vars1]
df1 <- na.omit(df1)
Now, we will use BMA, BMP and MPM with our reduced dataset created above. This is df1.
movies_no_na = na.omit(df1)
bm = bas.lm(audience_score ~ ., data = movies_no_na,
prior = "BIC",
modelprior = uniform())
bm
##
## Call:
## bas.lm(formula = audience_score ~ ., data = movies_no_na, prior = "BIC",
## modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept thtr_rel_year imdb_rating
## 1.00000 0.07369 1.00000
## imdb_num_votes critics_score best_actress_winyes
## 0.04217 0.92100 0.17944
In our model generated above, imdb_rating seems to be the best predictor, critics_score is also good value. The others don’t seem to be as impactful.
summary(bm)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.0000000 1.000000e+00
## thtr_rel_year 0.07368732 0.0000 0.0000000 0.000000e+00
## imdb_rating 1.00000000 1.0000 1.0000000 1.000000e+00
## imdb_num_votes 0.04216649 0.0000 0.0000000 0.000000e+00
## critics_score 0.92100167 1.0000 1.0000000 0.000000e+00
## best_actress_winyes 0.17944346 0.0000 1.0000000 0.000000e+00
## BF NA 1.0000 0.2178048 8.469154e-02
## PostProbs NA 0.6727 0.1465000 5.700000e-02
## R2 NA 0.7524 0.7537000 7.480000e-01
## dim NA 3.0000 4.0000000 2.000000e+00
## logmarg NA -3621.0579 -3622.5820102 -3.623527e+03
## model 4 model 5
## Intercept 1.000000e+00 1.000000e+00
## thtr_rel_year 1.000000e+00 0.000000e+00
## imdb_rating 1.000000e+00 1.000000e+00
## imdb_num_votes 0.000000e+00 1.000000e+00
## critics_score 1.000000e+00 1.000000e+00
## best_actress_winyes 0.000000e+00 0.000000e+00
## BF 7.534253e-02 4.304024e-02
## PostProbs 5.070000e-02 2.900000e-02
## R2 7.529000e-01 7.524000e-01
## dim 4.000000e+00 4.000000e+00
## logmarg -3.623644e+03 -3.624203e+03
It is possible to infer that from the list above it is possible that model 1 and model 2 together account for about 81.8% of being the best model. Model 1 alone obtained 67.2% of this percentage. Model 1, does not obtain the best R2, but obtains a great value from this measure.
We forward to the end with the ‘confint’ function so that we can observe the confidence interval.
Reduced_model <- lm(data=na.omit(movies), audience_score ~ imdb_rating + imdb_num_votes + best_actress_win + critics_score + thtr_rel_year )
final_reduced <- bas.lm(data=na.omit(movies), audience_score ~ imdb_rating + imdb_num_votes + best_actress_win +critics_score + thtr_rel_year,prior= 'BIC', modelprior = uniform())
coef_reduced = coefficients(final_reduced)
confint(coef_reduced)
## 2.5% 97.5% beta
## Intercept 61.42260768 63.0393841 6.221002e+01
## imdb_rating 13.36671545 16.1519246 1.468989e+01
## imdb_num_votes 0.00000000 0.0000000 8.880979e-08
## best_actress_winyes -3.42381004 0.0000000 -4.628500e-01
## critics_score 0.00000000 0.1138073 7.328937e-02
## thtr_rel_year -0.02327671 0.0000000 -2.525369e-03
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
Most of the variables, including the newly created variables, were been eliminated from the full model to create the reduced model. By looking at the table directly above this, we can see that the highest rated model contains the intercept, imdb_rating, imdb_num_vote and critics score. Imdb_rating is the single biggest predictor of popularity.
However, it is worth testing a simple model, with the best correlation variable in the EDA, which is imdb_rating. See this new model, created in the ‘lm’ function, below:
m_audience_imdb<- lm(audience_score ~ imdb_rating, data = df)
The model created above was named ‘m_audience_imdb’. To verify the data of this model, we will use the functions ‘tidy’ and ‘confint’ below.
confint(m_audience_imdb)
## 2.5 % 97.5 %
## (Intercept) -47.07704 -37.57969
## imdb_rating 15.40207 16.84480
For each additional Imdb_rating, there is a 95% chance that audience_critics will increase by 15.4 to 16.8.
Final note: It is worth noting that we chose to do this simple regression in order to separately analyze the variable that in Bayesian diagnoses proved to be more predictive. * * *
Our chosen film was the same from Unit 3, ‘joker’. For this step we will use the model reduced and m_audience_imdb
#We first created a reduced_model_2 without the ‘year’, because the movie is from 2019.
reduced_model_2 <- lm(data=na.omit(movies), audience_score ~ imdb_rating + imdb_num_votes + best_actress_win + critics_score)
Source: https://www.imdb.com/title/tt7286456/?ref_=fn_al_tt_1
AND
https://www.rottentomatoes.com/m/joker_2019
newdata <- data.frame(audience_score=88, imdb_rating=8.5, imdb_num_votes=844507,critics_score=68, best_actress_win="no")
joker <- round(predict(reduced_model_2, newdata), digit = 0)
c(joker , newdata$audience_score)
## 1
## 95 88
The predicted value of 95 is greater than the real of 88. The prediction is higher then expected. We also contruct a prediction interval around this prediction which will provide the accuracy of the prediction.
predict(reduced_model_2, newdata, interval = 'prediction', level = 0.95)
## fit lwr upr
## 1 94.62117 73.82273 115.4196
We are 95% confident that the Joker movie will have audience_score range from 73.8 to 115.41 (the highest possible score is 100) . And that proved to be satisfactory as it is within the range.
We also tested the simple regression model.
newdata2 <- data.frame(audience_score=88, imdb_rating=8.5)
joker_2 <- round(predict(m_audience_imdb, newdata2), digit = 0)
c(joker_2 , newdata2$audience_score)
## 1
## 95 88
predict(m_audience_imdb, newdata2, interval = 'prediction', level = 0.95)
## fit lwr upr
## 1 94.72084 74.70333 114.7384
The prediction values of the chosen film did not vary from one model to another.
As an exercise it was great, however, expert opinions could improve this data. The Bayesian statistics proposal is excellent because it allows changing the priors. Another relevant final note is that the two proposed models (simple and final_reduced) show a somewhat accurate prediction rate, but it is not dead accurate.
The prediction values of “joker” did not vary from the simple model to another the reduced_model_2, which sheds light on the fact of the great predictive capacity, at least in this case, of the variable imdb_rating.
It was my first experience with Bayesian methods, I feel that I need more experience to have more audacity in data analysis. Always open to criticism, thank you!