library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.2.2
## Warning: package 'readr' was built under R version 4.2.1
## Warning: package 'dplyr' was built under R version 4.2.2
## Warning: package 'forcats' was built under R version 4.2.1
library(statsr)
## Warning: package 'statsr' was built under R version 4.2.2
## Warning: package 'BayesFactor' was built under R version 4.2.1
library(BAS)
## Warning: package 'BAS' was built under R version 4.2.2
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.2.1
set.seed(17463)
load("movies.Rdata")
The objective: Describe how the observations in the sample are collected, and the implications of this data collection method on the scope of inference (generalizability / causality).
Result: Data are a random sample of movies produced and released before 2016, rated at the Rotten Tomatoes and IMBD, therefore results are generalizable to worldwide movies produced and released before 2016, which were rated and reviewed at these resources. As it is expected that movies from local production (e.g, alternative, not promoted worldwide) are not evaluated at these sites, results from upcoming analysis should not be applied to them. At the same time, results from regression model should be extrapolated very cautiously to movies which were released 2016 and afterwards. As this is observational research, causal association cannot be established (this could be possible in an experiment, which this study is not). Generally, causality is difficult to be established in epidemiology studies. However, associations might be explored in this type of research. Generally, observational research might be subject of bias and confounding.
Create new variables using the mutate function in the dplyr package
following these guidelines: Create new variable based on
title_type: New variable should be called
feature_film with levels yes (movies that are feature
films) and no. Create new variable based on genre: New
variable should be called drama with levels yes (movies
that are dramas) and no. Create new variable based on
mpaa_rating: New variable should be called
mpaa_rating_R with levels yes (movies that are R rated) and
no. Create two new variables based on thtr_rel_month: New
variable called oscar_season with levels yes (if movie is
released in November, October, or December) and no New variable called
summer_season with levels yes (if movie is released in May,
June, July, or August) and no ¨
movies_new<-movies %>%
mutate(feature_film = title_type == "Feature Film", feature_film = if_else(feature_film == TRUE, "yes", "no")) %>%
mutate(drama = genre == "Drama", drama = if_else(drama == TRUE, "yes", "no")) %>%
mutate(mpaa_rating_R = mpaa_rating == "R", mpaa_rating_R = if_else(mpaa_rating_R == TRUE, "yes", "no")) %>%
mutate(oscar_season = thtr_rel_month == 10| thtr_rel_month==11|thtr_rel_month==12, oscar_season = if_else (oscar_season == TRUE, "yes", "no")) %>%
mutate(summer_season = thtr_rel_month == 5|thtr_rel_month==6|thtr_rel_month==7|thtr_rel_month==8, summer_season = if_else (summer_season == TRUE, "yes","no")) %>%
glimpse()
## Rows: 651
## Columns: 37
## $ title <chr> "Filly Brown", "The Dish", "Waiting for Guffman", "Th…
## $ title_type <fct> Feature Film, Feature Film, Feature Film, Feature Fil…
## $ genre <fct> Drama, Drama, Comedy, Drama, Horror, Documentary, Dra…
## $ runtime <dbl> 80, 101, 84, 139, 90, 78, 142, 93, 88, 119, 127, 108,…
## $ mpaa_rating <fct> R, PG-13, R, PG, R, Unrated, PG-13, R, Unrated, Unrat…
## $ studio <fct> Indomina Media Inc., Warner Bros. Pictures, Sony Pict…
## $ thtr_rel_year <dbl> 2013, 2001, 1996, 1993, 2004, 2009, 1986, 1996, 2012,…
## $ thtr_rel_month <dbl> 4, 3, 8, 10, 9, 1, 1, 11, 9, 3, 6, 12, 1, 9, 6, 8, 3,…
## $ thtr_rel_day <dbl> 19, 14, 21, 1, 10, 15, 1, 8, 7, 2, 19, 18, 4, 23, 20,…
## $ dvd_rel_year <dbl> 2013, 2001, 2001, 2001, 2005, 2010, 2003, 2004, 2013,…
## $ dvd_rel_month <dbl> 7, 8, 8, 11, 4, 4, 2, 3, 1, 8, 5, 9, 7, 2, 3, 12, 8, …
## $ dvd_rel_day <dbl> 30, 28, 21, 6, 19, 20, 18, 2, 21, 14, 1, 23, 9, 13, 2…
## $ imdb_rating <dbl> 5.5, 7.3, 7.6, 7.2, 5.1, 7.8, 7.2, 5.5, 7.5, 6.6, 6.8…
## $ imdb_num_votes <int> 899, 12285, 22381, 35096, 2386, 333, 5016, 2272, 880,…
## $ critics_rating <fct> Rotten, Certified Fresh, Certified Fresh, Certified F…
## $ critics_score <dbl> 45, 96, 91, 80, 33, 91, 57, 17, 90, 83, 89, 67, 80, 2…
## $ audience_rating <fct> Upright, Upright, Upright, Upright, Spilled, Upright,…
## $ audience_score <dbl> 73, 81, 91, 76, 27, 86, 76, 47, 89, 66, 75, 46, 89, 5…
## $ best_pic_nom <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, n…
## $ best_pic_win <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, n…
## $ best_actor_win <fct> no, no, no, yes, no, no, no, yes, no, no, yes, no, ye…
## $ best_actress_win <fct> no, no, no, no, no, no, no, no, no, no, no, no, yes, …
## $ best_dir_win <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no, …
## $ top200_box <fct> no, no, no, no, no, no, no, no, no, no, yes, no, no, …
## $ director <chr> "Michael D. Olmos", "Rob Sitch", "Christopher Guest",…
## $ actor1 <chr> "Gina Rodriguez", "Sam Neill", "Christopher Guest", "…
## $ actor2 <chr> "Jenni Rivera", "Kevin Harrington", "Catherine O'Hara…
## $ actor3 <chr> "Lou Diamond Phillips", "Patrick Warburton", "Parker …
## $ actor4 <chr> "Emilio Rivera", "Tom Long", "Eugene Levy", "Richard …
## $ actor5 <chr> "Joseph Julian Soria", "Genevieve Mooy", "Bob Balaban…
## $ imdb_url <chr> "http://www.imdb.com/title/tt1869425/", "http://www.i…
## $ rt_url <chr> "//www.rottentomatoes.com/m/filly_brown_2012/", "//ww…
## $ feature_film <chr> "yes", "yes", "yes", "yes", "yes", "no", "yes", "yes"…
## $ drama <chr> "yes", "yes", "no", "yes", "no", "no", "yes", "yes", …
## $ mpaa_rating_R <chr> "yes", "no", "yes", "no", "yes", "no", "no", "yes", "…
## $ oscar_season <chr> "no", "no", "no", "yes", "no", "no", "no", "yes", "no…
## $ summer_season <chr> "no", "no", "yes", "no", "no", "no", "no", "no", "no"…
Perform exploratory data analysis (EDA) of the relationship between audience_score and the new variables constructed in the previous part. Your EDA should contain numerical summaries and visualizations. This might mean you initially create a lot more visualizations and summary statistics than what you finally choose to include in your paper. Each R output and plot should be accompanied by a brief interpretation.
EDA: Explanatory variables:
feature_film,drama,
mpaa_rating_R, oscar_season,
summer_season. Response variable:
audience_score.
At first, only individual selected variables will be presented: Continuous variables will be described using summary statistics with median, mean. Categorical variables will be described using frequencies and proportions.
describe (movies_new$audience_score)
## movies_new$audience_score
## n missing distinct Info Mean Gmd .05 .10
## 651 0 84 1 62.36 23.14 28.0 34.0
## .25 .50 .75 .90 .95
## 46.0 65.0 80.0 87.0 89.5
##
## lowest : 11 13 14 15 17, highest: 93 94 95 96 97
describe (movies_new$feature_film)
## movies_new$feature_film
## n missing distinct
## 651 0 2
##
## Value no yes
## Frequency 60 591
## Proportion 0.092 0.908
describe (movies_new$drama)
## movies_new$drama
## n missing distinct
## 651 0 2
##
## Value no yes
## Frequency 346 305
## Proportion 0.531 0.469
describe (movies_new$mpaa_rating_R)
## movies_new$mpaa_rating_R
## n missing distinct
## 651 0 2
##
## Value no yes
## Frequency 322 329
## Proportion 0.495 0.505
describe (movies_new$oscar_season)
## movies_new$oscar_season
## n missing distinct
## 651 0 2
##
## Value no yes
## Frequency 460 191
## Proportion 0.707 0.293
describe (movies_new$summer_season)
## movies_new$summer_season
## n missing distinct
## 651 0 2
##
## Value no yes
## Frequency 443 208
## Proportion 0.68 0.32
Distributions of audience score is as follows:
movies_new %>%
ggplot(aes(x=audience_score))+geom_histogram(binwidth = 10) + labs(title = "Audience score distribution", x = "Audience score")
Summary statistics - comment: “Audience_score” has nearly uniform distribution with mean = 62.36 and median = 65.00 being close to each other. Most of the movies were feature films (90.8%), not released in theaters in Oscar or summer season (70.7% and 68% respectively) and half of the movies were dramas and received “R” Mpaa rating.
Relationships between response and category variables
movies_new %>%
ggplot(aes(x=feature_film, y=audience_score)) + geom_boxplot() + labs(title = "Audience score in feature film ", x = "Feature film", y = "Audience score")
movies_new %>%
ggplot(aes(x=drama, y=audience_score ))+geom_boxplot()+ labs(title = "Audience score in drama film ", x = "Drama", y = "Audience score")
movies_new %>%
ggplot(aes(x=mpaa_rating_R, y=audience_score))+geom_boxplot()+ labs(title = "Audience score in films with R MPAA rating", x = "R MPAA rated movie", y = "Audience score")
movies_new %>%
ggplot(aes(x=oscar_season, y=audience_score))+geom_boxplot()+ labs(title = "Audience score in movies released in Oscar season ", x = "Oscar season", y = "Audience score")
movies_new %>%
ggplot(aes(x=summer_season, y=audience_score))+geom_boxplot()+ labs(title = "Audience score in movies released in summer season", x = "Summer season", y = "Audience score")
Comment Median of audience score in feature films is below median of films which were shorter. Distribution of audience score in feature film category is left skewed and variance is lower compared to shorter films. Drama films have slightly higher median of audience score compared to “non-drama” films. Audience score is distributed normally in non drama movies and slightly left skewed in drama movies. Distribution of audience score is normal with close medians in movies getting/not getting Mpaa rating R and similarly, the fact, that movies were released in Oscar/summer season does not seem to influence audience score much.
Develop a Bayesian regression model to predict
audience_score from the following explanatory variables.
Note that some of these variables are in the original dataset provided,
and others are new variables you constructed
earlier:feature_film,drama,
runtime, mpaa_rating_R,
thtr_rel_year, oscar_season,
summer_season, imdb_rating,
imdb_num_votes, critics_score,
best_pic_nom, best_pic_win,
best_actor_win, best_actress_win,
best_dir_win, top200_box. Complete Bayesian
model selection and report the final model. Also perform model
diagnostics and interpret coefficients of your final model in context of
the data.
At first linear relationships between continuous explanatory variables and response variable will be investigated. There are following assumptions to linear regression models: + Each numerical explanatory variable has linear relationship to the response variable + Normal distribution of residuals with a mean of 0 + Nearly constant variability of residuals + The residuals are independent
Interaction between continuous exploratory variables and the response variable (audience score):
movies_new %>%
ggplot(aes(x=runtime, y=audience_score))+geom_point()+labs(title = "Relationship between audience score and runtime (in minutes)", x = "Runtime (in minutes)", y = "Audience score")
## Warning: Removed 1 rows containing missing values (geom_point).
movies_new %>%
ggplot(aes(x=imdb_rating, y=audience_score))+geom_point()+labs(title = "Relationship between audience score and rating on IMDB", x = "Rating on IMDB", y = "Audience score")
movies_new %>%
ggplot(aes(x=imdb_num_votes, y=audience_score))+geom_point()+labs(title = "Relationship between audience score and number of votes on IMDB", x = "Number of votes on IMDB", y = "Audience score")
movies_new %>%
ggplot(aes(x=critics_score, y=audience_score))+geom_point()+labs(title = "Relationship between audience score and critics score on Rotten Tomatoes", x = "Critics score", y = "Audience score")
Comment: Strong linear relationship can be seen between following variables: “imdb_rating” and “critics_score”, respectively, and audience score as the response variable. The other variables show some linear trend in relationship to response variable, however this is weak.
Linear regression models with each of the explanatory continuous variable provide deeper insight into the relationships:
m1<-lm(audience_score~runtime, data=movies_new)
summary(m1)
##
## Call:
## lm(formula = audience_score ~ runtime, data = movies_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.641 -15.626 3.008 17.080 34.950
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.4203 4.3256 9.807 < 2e-16 ***
## runtime 0.1883 0.0402 4.684 3.43e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.92 on 648 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.03275, Adjusted R-squared: 0.03125
## F-statistic: 21.94 on 1 and 648 DF, p-value: 3.431e-06
m2<-lm(audience_score~imdb_rating, data=movies_new)
summary(m2)
##
## Call:
## lm(formula = audience_score ~ imdb_rating, data = movies_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.800 -6.567 0.649 5.689 52.896
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -42.3284 2.4183 -17.50 <2e-16 ***
## imdb_rating 16.1234 0.3674 43.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.16 on 649 degrees of freedom
## Multiple R-squared: 0.748, Adjusted R-squared: 0.7476
## F-statistic: 1926 on 1 and 649 DF, p-value: < 2.2e-16
m3<-lm(audience_score~imdb_num_votes, data=movies_new)
summary(m3)
##
## Call:
## lm(formula = audience_score ~ imdb_num_votes, data = movies_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.707 -15.085 1.866 16.245 36.521
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.936e+01 8.534e-01 69.552 < 2e-16 ***
## imdb_num_votes 5.227e-05 6.776e-06 7.714 4.6e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.37 on 649 degrees of freedom
## Multiple R-squared: 0.08399, Adjusted R-squared: 0.08258
## F-statistic: 59.51 on 1 and 649 DF, p-value: 4.602e-14
m4<-lm(audience_score~critics_score, data=movies_new)
summary(m4)
##
## Call:
## lm(formula = audience_score ~ critics_score, data = movies_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.043 -9.571 0.504 10.422 43.544
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.43551 1.27561 26.21 <2e-16 ***
## critics_score 0.50144 0.01984 25.27 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.37 on 649 degrees of freedom
## Multiple R-squared: 0.496, Adjusted R-squared: 0.4952
## F-statistic: 638.7 on 1 and 649 DF, p-value: < 2.2e-16
Comment: As can be seen, each explanatory continuous variable shows statistically significant relationship to response variable (audience score at Rotten Tomatoes).
As mentioned previously, before developing multiple regression model, assumptions for individual linear regression models should be checked. Therefore following diagnostic plots were conducted: Residuals versus Fitted values - to check linearity, Normal Q-Q plot checking assumption of normal distributions of residuals and Scale-location plot checking constant variance in residuals (homoscedasticity/heteroscedasticity).
plot(m1, which = 1)
plot(m1, which = 2)
plot(m1, which = 3)
plot(m2, which = 1)
plot(m2, which = 2)
plot(m2, which = 3)
plot(m3, which = 1)
plot(m3, which = 2)
plot(m3, which = 3)
plot(m4, which = 1)
plot(m4, which = 2)
plot(m4, which = 3)
Comment: In terms of “imdb_rating” and “critics_score”, Residuals versus fitted, Normal Q-Q and Scale location plots show linear relationship, nearly normal distribution of residuals and pattern of homoscedasticity. In terms of “runtime” and “imdb_num_votes” variables, there is a violation to linear relationship and assumption of normal distribution of residuals. Therefore, log transformation of these variables will be performed later.
To inspect collinearity, correlation matrix is produced:
pairs(~runtime + imdb_rating + imdb_num_votes + critics_score, data=movies_new)
Comment: As can be seen, there is a strong linear relationship between imbd_rating and critics_score variable. As this means, that both variables explain the same variance, one of them should be excluded. Based on the linear model assumptions check (see above), it is decided to keep “critics score” variable.
Comment Model selection will be performed using Bayesian Information Criterion (BIC) for coefficients. At the same time, the same prior probability will be considered for all models. The model with the smallest BIC will be selected. This method tends to select parsimonious model.
aud_score_BIC<-bas.lm(audience_score ~ feature_film + drama + log(runtime)+ mpaa_rating_R + thtr_rel_year + oscar_season + summer_season + log(imdb_num_votes) + critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box, data = na.omit(movies_new), prior = "BIC", modelprior = uniform())
Using summary function, we can see the model with the posterior probability of the first five models and posterior inclusion probability of individual coefficients.
summary(aud_score_BIC)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.0000000 1.0000000
## feature_filmyes 0.99999920 1.0000 1.0000000 1.0000000
## dramayes 0.91463714 1.0000 1.0000000 1.0000000
## log(runtime) 0.04056141 0.0000 0.0000000 0.0000000
## mpaa_rating_Ryes 0.03969448 0.0000 0.0000000 0.0000000
## thtr_rel_year 0.92625510 1.0000 1.0000000 1.0000000
## oscar_seasonyes 0.03970086 0.0000 0.0000000 0.0000000
## summer_seasonyes 0.07993610 0.0000 0.0000000 0.0000000
## log(imdb_num_votes) 1.00000000 1.0000 1.0000000 1.0000000
## critics_score 1.00000000 1.0000 1.0000000 1.0000000
## best_pic_nomyes 0.13070264 0.0000 1.0000000 0.0000000
## best_pic_winyes 0.04202052 0.0000 0.0000000 0.0000000
## best_actor_winyes 0.09389569 0.0000 0.0000000 1.0000000
## best_actress_winyes 0.09008581 0.0000 0.0000000 0.0000000
## best_dir_winyes 0.07095996 0.0000 0.0000000 0.0000000
## top200_boxyes 0.04444803 0.0000 0.0000000 0.0000000
## BF NA 1.0000 0.1268991 0.1046471
## PostProbs NA 0.4295 0.0545000 0.0449000
## R2 NA 0.5503 0.5519000 0.5517000
## dim NA 6.0000 7.0000000 7.0000000
## logmarg NA -3620.7848 -3622.8491552 -3623.0419543
## model 4 model 5
## Intercept 1.000000e+00 1.0000000
## feature_filmyes 1.000000e+00 1.0000000
## dramayes 1.000000e+00 1.0000000
## log(runtime) 0.000000e+00 0.0000000
## mpaa_rating_Ryes 0.000000e+00 0.0000000
## thtr_rel_year 1.000000e+00 1.0000000
## oscar_seasonyes 0.000000e+00 0.0000000
## summer_seasonyes 0.000000e+00 1.0000000
## log(imdb_num_votes) 1.000000e+00 1.0000000
## critics_score 1.000000e+00 1.0000000
## best_pic_nomyes 0.000000e+00 0.0000000
## best_pic_winyes 0.000000e+00 0.0000000
## best_actor_winyes 0.000000e+00 0.0000000
## best_actress_winyes 1.000000e+00 0.0000000
## best_dir_winyes 0.000000e+00 0.0000000
## top200_boxyes 0.000000e+00 0.0000000
## BF 9.956471e-02 0.0836377
## PostProbs 4.280000e-02 0.0359000
## R2 5.516000e-01 0.5513000
## dim 7.000000e+00 7.0000000
## logmarg -3.623092e+03 -3623.2660533
As can be seen, the first model has posterior probability of approx. 43% which is much higher than prior probability of the model (1/2^15).
To visualize which explanatory variables are included in the model with the highest probability, image function will be used.
image(aud_score_BIC, rotate = F)
The model with highest posterior probability includes following variables: feature film, drama, thtr_rel_year, imbd_num_votes, critics score
To create 95% credible intervals for beta coefficients:
coef_aud_sc_BIC<-coefficients(aud_score_BIC)
coef_aud_sc_BIC %>%
confint()
## 2.5% 97.5% beta
## Intercept 61.091017368 63.224453225 62.210016155
## feature_filmyes -23.081760037 -11.893883969 -17.408692467
## dramayes 0.000000000 5.834507194 3.596834334
## log(runtime) 0.000000000 0.000000000 -0.003016658
## mpaa_rating_Ryes 0.000000000 0.000000000 0.009497066
## thtr_rel_year -0.276125738 0.000000000 -0.172885604
## oscar_seasonyes 0.000000000 0.000000000 0.005458698
## summer_seasonyes -1.426765143 0.001153714 -0.117036836
## log(imdb_num_votes) 2.406953935 4.067062422 3.215958779
## critics_score 0.366423855 0.458337818 0.414414892
## best_pic_nomyes -0.001662418 6.688480990 0.669006704
## best_pic_winyes 0.000000000 0.000000000 -0.072263885
## best_actor_winyes -2.415769366 0.000000000 -0.203299757
## best_actress_winyes -2.491499648 0.000000000 -0.211064335
## best_dir_winyes -2.210858954 0.000000000 -0.178838749
## top200_boxyes 0.000000000 0.000000000 -0.086601494
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
Comment: There is a 95% chance that feature film category lowers audience score by -23.08 to -11.89 points. For each additional point in critics score there is a 95% chance of increase in audience score by 0.37 to 0.46 points. Number of “imdb_num_votes” was log transformed, therefore there is 95% chance that each 2.72 votes increase audience score by 2.41 to 4.07 points. The other variables contain 0 value in the 95% credible interval.
``
plot(aud_score_BIC, which = 1)
plot(aud_score_BIC, which = 2)
plot(aud_score_BIC, which = 3)
plot(aud_score_BIC, which = 4)
Residuals lie around the dash line y = 0 and have nearly constant variance, observations 308, 357 and 440 might be potential outliers. The highest cumulative probability was reached with about 30,000 unique models. Models with the highest Bayes factors have around 6 predictors. In the last plot, importance of different predictors can be seen. Variables: Feature film, Drama, Year the movie is released in theaters, Number of votes on IMDB and Critics score on Rotten Tomatoes have marginal posterior probability greater than 0.5, which means these are considered to be important for predicting audience score on Rotten Tomatoes.
Pick a movie from 2016 (a new movie that is not in the sample) and do a prediction for this movie using your the model you developed and the “predict” function in R.
Comment For the purpose of this exercise was selected: The Magnificent Seven
entry<-data.frame(feature_film = "yes", drama = "no", runtime = 128, mpaa_rating_R = "no", thtr_rel_year = 2016, oscar_season = "no", summer_season = "no", imdb_rating = 6.8, imdb_num_votes = 217289, critics_score = 63, best_pic_nom = "no", best_pic_win = "no", best_actor_win = "no", best_actress_win = "no", best_dir_win ="no", top200_box = "no")
predicted_score<-predict(aud_score_BIC, newdata=entry,estimator="BMA", se.fit=TRUE)
confint(predicted_score)
## 2.5% 97.5% pred
## [1,] 39.29069 93.10702 66.48134
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
Comment: The model predicts 66.5 and there is a 95% chance that the score lies in the interval from 39.3 to 93.1. As the current audience score is 71, the model is underestimating true value.
Based on the model using BIC as the prior distribution for regression coefficients a Bayesian multiple regression model predicting audience score based on several characteristics was developed. Using BIC as a prior should tend to select parsimonious model. The other strength of the research is that this is a random sample of a population and therefore, there should not be a selection bias. However, as this is observational research, no final causal conclusions should be drawn. Usually this should confirmed with other research, using different data set or perform an experiment evaluating effect of some action (intervention,e.g. feature film - movies could be randomly selected to be made as feature film and after some time the audience score would be measured and compared with shorter movies (not feature films). The population of films would have some defined characteristics, groups - feature film yes/no would be comparable and therefore outcome could be assigned to an intervention). Of course, this is a hypothetical situation, however this reflects difference between observational and experimental research. A model developed was able to predict audience score with some underestimation. Therefore, another models could have been tested whether its prediction capacity is better. There is still a limitation that we predict outcome with data which are beyond data included in the model, therefore if some characteristic behavior changed dramatically in 2016, it would not be reflected by the model.