Characteristics of movies predicting audience popularity on Rotten Tomatoes

Setup

Load packages

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.2.2

## Warning: package 'readr' was built under R version 4.2.1

## Warning: package 'dplyr' was built under R version 4.2.2

## Warning: package 'forcats' was built under R version 4.2.1

library(statsr)

## Warning: package 'statsr' was built under R version 4.2.2

## Warning: package 'BayesFactor' was built under R version 4.2.1

library(BAS)

## Warning: package 'BAS' was built under R version 4.2.2

library(Hmisc)

## Warning: package 'Hmisc' was built under R version 4.2.1

set.seed(17463)

Load data

load("movies.Rdata")

Part 1: Data

The objective: Describe how the observations in the sample are collected, and the implications of this data collection method on the scope of inference (generalizability / causality).

Result: Data are a random sample of movies produced and released before 2016, rated at the Rotten Tomatoes and IMBD, therefore results are generalizable to worldwide movies produced and released before 2016, which were rated and reviewed at these resources. As it is expected that movies from local production (e.g, alternative, not promoted worldwide) are not evaluated at these sites, results from upcoming analysis should not be applied to them. At the same time, results from regression model should be extrapolated very cautiously to movies which were released 2016 and afterwards. As this is observational research, causal association cannot be established (this could be possible in an experiment, which this study is not). Generally, causality is difficult to be established in epidemiology studies. However, associations might be explored in this type of research. Generally, observational research might be subject of bias and confounding.

Part 2: Data manipulation

Create new variables using the mutate function in the dplyr package following these guidelines: Create new variable based on title_type: New variable should be called feature_film with levels yes (movies that are feature films) and no. Create new variable based on genre: New variable should be called drama with levels yes (movies that are dramas) and no. Create new variable based on mpaa_rating: New variable should be called mpaa_rating_R with levels yes (movies that are R rated) and no. Create two new variables based on thtr_rel_month: New variable called oscar_season with levels yes (if movie is released in November, October, or December) and no New variable called summer_season with levels yes (if movie is released in May, June, July, or August) and no ¨

movies_new<-movies %>% 
  mutate(feature_film = title_type == "Feature Film", feature_film = if_else(feature_film == TRUE, "yes", "no")) %>% 
  mutate(drama = genre == "Drama", drama = if_else(drama == TRUE, "yes", "no")) %>% 
  mutate(mpaa_rating_R = mpaa_rating == "R", mpaa_rating_R = if_else(mpaa_rating_R == TRUE, "yes", "no")) %>%
  mutate(oscar_season = thtr_rel_month == 10| thtr_rel_month==11|thtr_rel_month==12, oscar_season = if_else (oscar_season == TRUE, "yes", "no")) %>% 
  mutate(summer_season = thtr_rel_month == 5|thtr_rel_month==6|thtr_rel_month==7|thtr_rel_month==8, summer_season = if_else (summer_season == TRUE, "yes","no")) %>% 
  glimpse()

## Rows: 651
## Columns: 37
## $ title            <chr> "Filly Brown", "The Dish", "Waiting for Guffman", "Th…
## $ title_type       <fct> Feature Film, Feature Film, Feature Film, Feature Fil…
## $ genre            <fct> Drama, Drama, Comedy, Drama, Horror, Documentary, Dra…
## $ runtime          <dbl> 80, 101, 84, 139, 90, 78, 142, 93, 88, 119, 127, 108,…
## $ mpaa_rating      <fct> R, PG-13, R, PG, R, Unrated, PG-13, R, Unrated, Unrat…
## $ studio           <fct> Indomina Media Inc., Warner Bros. Pictures, Sony Pict…
## $ thtr_rel_year    <dbl> 2013, 2001, 1996, 1993, 2004, 2009, 1986, 1996, 2012,…
## $ thtr_rel_month   <dbl> 4, 3, 8, 10, 9, 1, 1, 11, 9, 3, 6, 12, 1, 9, 6, 8, 3,…
## $ thtr_rel_day     <dbl> 19, 14, 21, 1, 10, 15, 1, 8, 7, 2, 19, 18, 4, 23, 20,…
## $ dvd_rel_year     <dbl> 2013, 2001, 2001, 2001, 2005, 2010, 2003, 2004, 2013,…
## $ dvd_rel_month    <dbl> 7, 8, 8, 11, 4, 4, 2, 3, 1, 8, 5, 9, 7, 2, 3, 12, 8, …
## $ dvd_rel_day      <dbl> 30, 28, 21, 6, 19, 20, 18, 2, 21, 14, 1, 23, 9, 13, 2…
## $ imdb_rating      <dbl> 5.5, 7.3, 7.6, 7.2, 5.1, 7.8, 7.2, 5.5, 7.5, 6.6, 6.8…
## $ imdb_num_votes   <int> 899, 12285, 22381, 35096, 2386, 333, 5016, 2272, 880,…
## $ critics_rating   <fct> Rotten, Certified Fresh, Certified Fresh, Certified F…
## $ critics_score    <dbl> 45, 96, 91, 80, 33, 91, 57, 17, 90, 83, 89, 67, 80, 2…
## $ audience_rating  <fct> Upright, Upright, Upright, Upright, Spilled, Upright,…
## $ audience_score   <dbl> 73, 81, 91, 76, 27, 86, 76, 47, 89, 66, 75, 46, 89, 5…
## $ best_pic_nom     <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, n…
## $ best_pic_win     <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, n…
## $ best_actor_win   <fct> no, no, no, yes, no, no, no, yes, no, no, yes, no, ye…
## $ best_actress_win <fct> no, no, no, no, no, no, no, no, no, no, no, no, yes, …
## $ best_dir_win     <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no, …
## $ top200_box       <fct> no, no, no, no, no, no, no, no, no, no, yes, no, no, …
## $ director         <chr> "Michael D. Olmos", "Rob Sitch", "Christopher Guest",…
## $ actor1           <chr> "Gina Rodriguez", "Sam Neill", "Christopher Guest", "…
## $ actor2           <chr> "Jenni Rivera", "Kevin Harrington", "Catherine O'Hara…
## $ actor3           <chr> "Lou Diamond Phillips", "Patrick Warburton", "Parker …
## $ actor4           <chr> "Emilio Rivera", "Tom Long", "Eugene Levy", "Richard …
## $ actor5           <chr> "Joseph Julian Soria", "Genevieve Mooy", "Bob Balaban…
## $ imdb_url         <chr> "http://www.imdb.com/title/tt1869425/", "http://www.i…
## $ rt_url           <chr> "//www.rottentomatoes.com/m/filly_brown_2012/", "//ww…
## $ feature_film     <chr> "yes", "yes", "yes", "yes", "yes", "no", "yes", "yes"…
## $ drama            <chr> "yes", "yes", "no", "yes", "no", "no", "yes", "yes", …
## $ mpaa_rating_R    <chr> "yes", "no", "yes", "no", "yes", "no", "no", "yes", "…
## $ oscar_season     <chr> "no", "no", "no", "yes", "no", "no", "no", "yes", "no…
## $ summer_season    <chr> "no", "no", "yes", "no", "no", "no", "no", "no", "no"…

Part 3: Exploratory data analysis

Perform exploratory data analysis (EDA) of the relationship between audience_score and the new variables constructed in the previous part. Your EDA should contain numerical summaries and visualizations. This might mean you initially create a lot more visualizations and summary statistics than what you finally choose to include in your paper. Each R output and plot should be accompanied by a brief interpretation.

EDA: Explanatory variables: feature_film,drama, mpaa_rating_R, oscar_season, summer_season. Response variable: audience_score.

At first, only individual selected variables will be presented: Continuous variables will be described using summary statistics with median, mean. Categorical variables will be described using frequencies and proportions.

describe (movies_new$audience_score)

## movies_new$audience_score 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      651        0       84        1    62.36    23.14     28.0     34.0 
##      .25      .50      .75      .90      .95 
##     46.0     65.0     80.0     87.0     89.5 
## 
## lowest : 11 13 14 15 17, highest: 93 94 95 96 97

describe (movies_new$feature_film)

## movies_new$feature_film 
##        n  missing distinct 
##      651        0        2 
##                       
## Value         no   yes
## Frequency     60   591
## Proportion 0.092 0.908

describe (movies_new$drama)

## movies_new$drama 
##        n  missing distinct 
##      651        0        2 
##                       
## Value         no   yes
## Frequency    346   305
## Proportion 0.531 0.469

describe (movies_new$mpaa_rating_R)

## movies_new$mpaa_rating_R 
##        n  missing distinct 
##      651        0        2 
##                       
## Value         no   yes
## Frequency    322   329
## Proportion 0.495 0.505

describe (movies_new$oscar_season)

## movies_new$oscar_season 
##        n  missing distinct 
##      651        0        2 
##                       
## Value         no   yes
## Frequency    460   191
## Proportion 0.707 0.293

describe (movies_new$summer_season)

## movies_new$summer_season 
##        n  missing distinct 
##      651        0        2 
##                     
## Value        no  yes
## Frequency   443  208
## Proportion 0.68 0.32

Distributions of audience score is as follows:

movies_new %>% 
  ggplot(aes(x=audience_score))+geom_histogram(binwidth = 10) + labs(title = "Audience score distribution", x = "Audience score")

Summary statistics - comment: “Audience_score” has nearly uniform distribution with mean = 62.36 and median = 65.00 being close to each other. Most of the movies were feature films (90.8%), not released in theaters in Oscar or summer season (70.7% and 68% respectively) and half of the movies were dramas and received “R” Mpaa rating.

Relationships between response and category variables

movies_new %>% 
  ggplot(aes(x=feature_film, y=audience_score)) + geom_boxplot() + labs(title = "Audience score in feature film ", x = "Feature film", y = "Audience score")

movies_new %>% 
  ggplot(aes(x=drama, y=audience_score ))+geom_boxplot()+ labs(title = "Audience score in drama film ", x = "Drama", y = "Audience score")

movies_new %>% 
  ggplot(aes(x=mpaa_rating_R, y=audience_score))+geom_boxplot()+ labs(title = "Audience score in films with R MPAA rating", x = "R MPAA rated movie", y = "Audience score")

movies_new %>% 
  ggplot(aes(x=oscar_season, y=audience_score))+geom_boxplot()+ labs(title = "Audience score in movies released in Oscar season ", x = "Oscar season", y = "Audience score")

movies_new %>% 
  ggplot(aes(x=summer_season, y=audience_score))+geom_boxplot()+ labs(title = "Audience score in movies released in summer season", x = "Summer season", y = "Audience score")

Comment Median of audience score in feature films is below median of films which were shorter. Distribution of audience score in feature film category is left skewed and variance is lower compared to shorter films. Drama films have slightly higher median of audience score compared to “non-drama” films. Audience score is distributed normally in non drama movies and slightly left skewed in drama movies. Distribution of audience score is normal with close medians in movies getting/not getting Mpaa rating R and similarly, the fact, that movies were released in Oscar/summer season does not seem to influence audience score much.

Part 4: Modeling

Develop a Bayesian regression model to predict audience_score from the following explanatory variables. Note that some of these variables are in the original dataset provided, and others are new variables you constructed earlier:feature_film,drama, runtime, mpaa_rating_R, thtr_rel_year, oscar_season, summer_season, imdb_rating, imdb_num_votes, critics_score, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box. Complete Bayesian model selection and report the final model. Also perform model diagnostics and interpret coefficients of your final model in context of the data.

Simple linear regression

At first linear relationships between continuous explanatory variables and response variable will be investigated. There are following assumptions to linear regression models: + Each numerical explanatory variable has linear relationship to the response variable + Normal distribution of residuals with a mean of 0 + Nearly constant variability of residuals + The residuals are independent

Interaction between continuous exploratory variables and the response variable (audience score):

movies_new %>% 
  ggplot(aes(x=runtime, y=audience_score))+geom_point()+labs(title = "Relationship between audience score and runtime (in minutes)", x = "Runtime (in minutes)", y = "Audience score")

## Warning: Removed 1 rows containing missing values (geom_point).

movies_new %>% 
  ggplot(aes(x=imdb_rating, y=audience_score))+geom_point()+labs(title = "Relationship between audience score and rating on IMDB", x = "Rating on IMDB", y = "Audience score")

movies_new %>% 
  ggplot(aes(x=imdb_num_votes, y=audience_score))+geom_point()+labs(title = "Relationship between audience score and number of votes on IMDB", x = "Number of votes on IMDB", y = "Audience score")

movies_new %>% 
  ggplot(aes(x=critics_score, y=audience_score))+geom_point()+labs(title = "Relationship between audience score and critics score on Rotten Tomatoes", x = "Critics score", y = "Audience score")

Comment: Strong linear relationship can be seen between following variables: “imdb_rating” and “critics_score”, respectively, and audience score as the response variable. The other variables show some linear trend in relationship to response variable, however this is weak.

Linear regression models with each of the explanatory continuous variable provide deeper insight into the relationships:

m1<-lm(audience_score~runtime, data=movies_new)
summary(m1)

## 
## Call:
## lm(formula = audience_score ~ runtime, data = movies_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.641 -15.626   3.008  17.080  34.950 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  42.4203     4.3256   9.807  < 2e-16 ***
## runtime       0.1883     0.0402   4.684 3.43e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.92 on 648 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.03275,    Adjusted R-squared:  0.03125 
## F-statistic: 21.94 on 1 and 648 DF,  p-value: 3.431e-06

m2<-lm(audience_score~imdb_rating, data=movies_new)
summary(m2)

## 
## Call:
## lm(formula = audience_score ~ imdb_rating, data = movies_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.800  -6.567   0.649   5.689  52.896 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -42.3284     2.4183  -17.50   <2e-16 ***
## imdb_rating  16.1234     0.3674   43.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.16 on 649 degrees of freedom
## Multiple R-squared:  0.748,  Adjusted R-squared:  0.7476 
## F-statistic:  1926 on 1 and 649 DF,  p-value: < 2.2e-16

m3<-lm(audience_score~imdb_num_votes, data=movies_new)
summary(m3)

## 
## Call:
## lm(formula = audience_score ~ imdb_num_votes, data = movies_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.707 -15.085   1.866  16.245  36.521 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    5.936e+01  8.534e-01  69.552  < 2e-16 ***
## imdb_num_votes 5.227e-05  6.776e-06   7.714  4.6e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.37 on 649 degrees of freedom
## Multiple R-squared:  0.08399,    Adjusted R-squared:  0.08258 
## F-statistic: 59.51 on 1 and 649 DF,  p-value: 4.602e-14

m4<-lm(audience_score~critics_score, data=movies_new)
summary(m4)

## 
## Call:
## lm(formula = audience_score ~ critics_score, data = movies_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.043  -9.571   0.504  10.422  43.544 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   33.43551    1.27561   26.21   <2e-16 ***
## critics_score  0.50144    0.01984   25.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.37 on 649 degrees of freedom
## Multiple R-squared:  0.496,  Adjusted R-squared:  0.4952 
## F-statistic: 638.7 on 1 and 649 DF,  p-value: < 2.2e-16

Comment: As can be seen, each explanatory continuous variable shows statistically significant relationship to response variable (audience score at Rotten Tomatoes).

Linear Regression Model diagnostics

As mentioned previously, before developing multiple regression model, assumptions for individual linear regression models should be checked. Therefore following diagnostic plots were conducted: Residuals versus Fitted values - to check linearity, Normal Q-Q plot checking assumption of normal distributions of residuals and Scale-location plot checking constant variance in residuals (homoscedasticity/heteroscedasticity).

plot(m1, which = 1)

plot(m1, which = 2)

plot(m1, which = 3)

plot(m2, which = 1)

plot(m2, which = 2)

plot(m2, which = 3)

plot(m3, which = 1)

plot(m3, which = 2)

plot(m3, which = 3)

plot(m4, which = 1)

plot(m4, which = 2)

plot(m4, which = 3)

Comment: In terms of “imdb_rating” and “critics_score”, Residuals versus fitted, Normal Q-Q and Scale location plots show linear relationship, nearly normal distribution of residuals and pattern of homoscedasticity. In terms of “runtime” and “imdb_num_votes” variables, there is a violation to linear relationship and assumption of normal distribution of residuals. Therefore, log transformation of these variables will be performed later.

To inspect collinearity, correlation matrix is produced:

pairs(~runtime + imdb_rating + imdb_num_votes + critics_score, data=movies_new)

Comment: As can be seen, there is a strong linear relationship between imbd_rating and critics_score variable. As this means, that both variables explain the same variance, one of them should be excluded. Based on the linear model assumptions check (see above), it is decided to keep “critics score” variable.

Bayesian multiple regression

Data transformation

Log transformation “runtime” and “imdb_num_votes”.
Exclude “imdb_rating” from the final model due to collinearity with critics score.

Comment Model selection will be performed using Bayesian Information Criterion (BIC) for coefficients. At the same time, the same prior probability will be considered for all models. The model with the smallest BIC will be selected. This method tends to select parsimonious model.

aud_score_BIC<-bas.lm(audience_score ~ feature_film + drama + log(runtime)+ mpaa_rating_R + thtr_rel_year + oscar_season + summer_season + log(imdb_num_votes) + critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box, data = na.omit(movies_new), prior = "BIC", modelprior = uniform())

Using summary function, we can see the model with the posterior probability of the first five models and posterior inclusion probability of individual coefficients.

summary(aud_score_BIC)

##                     P(B != 0 | Y)    model 1       model 2       model 3
## Intercept              1.00000000     1.0000     1.0000000     1.0000000
## feature_filmyes        0.99999920     1.0000     1.0000000     1.0000000
## dramayes               0.91463714     1.0000     1.0000000     1.0000000
## log(runtime)           0.04056141     0.0000     0.0000000     0.0000000
## mpaa_rating_Ryes       0.03969448     0.0000     0.0000000     0.0000000
## thtr_rel_year          0.92625510     1.0000     1.0000000     1.0000000
## oscar_seasonyes        0.03970086     0.0000     0.0000000     0.0000000
## summer_seasonyes       0.07993610     0.0000     0.0000000     0.0000000
## log(imdb_num_votes)    1.00000000     1.0000     1.0000000     1.0000000
## critics_score          1.00000000     1.0000     1.0000000     1.0000000
## best_pic_nomyes        0.13070264     0.0000     1.0000000     0.0000000
## best_pic_winyes        0.04202052     0.0000     0.0000000     0.0000000
## best_actor_winyes      0.09389569     0.0000     0.0000000     1.0000000
## best_actress_winyes    0.09008581     0.0000     0.0000000     0.0000000
## best_dir_winyes        0.07095996     0.0000     0.0000000     0.0000000
## top200_boxyes          0.04444803     0.0000     0.0000000     0.0000000
## BF                             NA     1.0000     0.1268991     0.1046471
## PostProbs                      NA     0.4295     0.0545000     0.0449000
## R2                             NA     0.5503     0.5519000     0.5517000
## dim                            NA     6.0000     7.0000000     7.0000000
## logmarg                        NA -3620.7848 -3622.8491552 -3623.0419543
##                           model 4       model 5
## Intercept            1.000000e+00     1.0000000
## feature_filmyes      1.000000e+00     1.0000000
## dramayes             1.000000e+00     1.0000000
## log(runtime)         0.000000e+00     0.0000000
## mpaa_rating_Ryes     0.000000e+00     0.0000000
## thtr_rel_year        1.000000e+00     1.0000000
## oscar_seasonyes      0.000000e+00     0.0000000
## summer_seasonyes     0.000000e+00     1.0000000
## log(imdb_num_votes)  1.000000e+00     1.0000000
## critics_score        1.000000e+00     1.0000000
## best_pic_nomyes      0.000000e+00     0.0000000
## best_pic_winyes      0.000000e+00     0.0000000
## best_actor_winyes    0.000000e+00     0.0000000
## best_actress_winyes  1.000000e+00     0.0000000
## best_dir_winyes      0.000000e+00     0.0000000
## top200_boxyes        0.000000e+00     0.0000000
## BF                   9.956471e-02     0.0836377
## PostProbs            4.280000e-02     0.0359000
## R2                   5.516000e-01     0.5513000
## dim                  7.000000e+00     7.0000000
## logmarg             -3.623092e+03 -3623.2660533

As can be seen, the first model has posterior probability of approx. 43% which is much higher than prior probability of the model (1/2^15).

To visualize which explanatory variables are included in the model with the highest probability, image function will be used.

image(aud_score_BIC, rotate = F)

The model with highest posterior probability includes following variables: feature film, drama, thtr_rel_year, imbd_num_votes, critics score

To create 95% credible intervals for beta coefficients:

coef_aud_sc_BIC<-coefficients(aud_score_BIC)
coef_aud_sc_BIC %>% 
  confint()

##                              2.5%         97.5%          beta
## Intercept            61.091017368  63.224453225  62.210016155
## feature_filmyes     -23.081760037 -11.893883969 -17.408692467
## dramayes              0.000000000   5.834507194   3.596834334
## log(runtime)          0.000000000   0.000000000  -0.003016658
## mpaa_rating_Ryes      0.000000000   0.000000000   0.009497066
## thtr_rel_year        -0.276125738   0.000000000  -0.172885604
## oscar_seasonyes       0.000000000   0.000000000   0.005458698
## summer_seasonyes     -1.426765143   0.001153714  -0.117036836
## log(imdb_num_votes)   2.406953935   4.067062422   3.215958779
## critics_score         0.366423855   0.458337818   0.414414892
## best_pic_nomyes      -0.001662418   6.688480990   0.669006704
## best_pic_winyes       0.000000000   0.000000000  -0.072263885
## best_actor_winyes    -2.415769366   0.000000000  -0.203299757
## best_actress_winyes  -2.491499648   0.000000000  -0.211064335
## best_dir_winyes      -2.210858954   0.000000000  -0.178838749
## top200_boxyes         0.000000000   0.000000000  -0.086601494
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

Comment: There is a 95% chance that feature film category lowers audience score by -23.08 to -11.89 points. For each additional point in critics score there is a 95% chance of increase in audience score by 0.37 to 0.46 points. Number of “imdb_num_votes” was log transformed, therefore there is 95% chance that each 2.72 votes increase audience score by 2.41 to 4.07 points. The other variables contain 0 value in the 95% credible interval.

Model diagnostics, cumulative sampled probability, model complexity and marginal inclusion probabilty

plot(aud_score_BIC, which = 1)

plot(aud_score_BIC, which = 2)

plot(aud_score_BIC, which = 3)

plot(aud_score_BIC, which = 4)

Residuals lie around the dash line y = 0 and have nearly constant variance, observations 308, 357 and 440 might be potential outliers. The highest cumulative probability was reached with about 30,000 unique models. Models with the highest Bayes factors have around 6 predictors. In the last plot, importance of different predictors can be seen. Variables: Feature film, Drama, Year the movie is released in theaters, Number of votes on IMDB and Critics score on Rotten Tomatoes have marginal posterior probability greater than 0.5, which means these are considered to be important for predicting audience score on Rotten Tomatoes.

Part 5: Prediction

Pick a movie from 2016 (a new movie that is not in the sample) and do a prediction for this movie using your the model you developed and the “predict” function in R.

Comment For the purpose of this exercise was selected: The Magnificent Seven

entry<-data.frame(feature_film = "yes", drama = "no", runtime = 128, mpaa_rating_R = "no", thtr_rel_year = 2016, oscar_season = "no", summer_season = "no", imdb_rating = 6.8, imdb_num_votes = 217289, critics_score = 63, best_pic_nom = "no", best_pic_win = "no", best_actor_win = "no", best_actress_win = "no", best_dir_win ="no", top200_box = "no")

predicted_score<-predict(aud_score_BIC, newdata=entry,estimator="BMA", se.fit=TRUE)
confint(predicted_score)

##          2.5%    97.5%     pred
## [1,] 39.29069 93.10702 66.48134
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

Comment: The model predicts 66.5 and there is a 95% chance that the score lies in the interval from 39.3 to 93.1. As the current audience score is 71, the model is underestimating true value.

Part 6: Conclusion

Based on the model using BIC as the prior distribution for regression coefficients a Bayesian multiple regression model predicting audience score based on several characteristics was developed. Using BIC as a prior should tend to select parsimonious model. The other strength of the research is that this is a random sample of a population and therefore, there should not be a selection bias. However, as this is observational research, no final causal conclusions should be drawn. Usually this should confirmed with other research, using different data set or perform an experiment evaluating effect of some action (intervention,e.g. feature film - movies could be randomly selected to be made as feature film and after some time the audience score would be measured and compared with shorter movies (not feature films). The population of films would have some defined characteristics, groups - feature film yes/no would be comparable and therefore outcome could be assigned to an intervention). Of course, this is a hypothetical situation, however this reflects difference between observational and experimental research. A model developed was able to predict audience score with some underestimation. Therefore, another models could have been tested whether its prediction capacity is better. There is still a limitation that we predict outcome with data which are beyond data included in the model, therefore if some characteristic behavior changed dramatically in 2016, it would not be reflected by the model.

Characteristics of movies predicting audience popularity on Rotten Tomatoes - Bayesian Multiple Regression model

Petra Matoulková

2023-04-12