Setup

This project is for a Bayesian regression of the movies dataset. The packages needed are as follows:

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.3
library(statsr)
library(BAS)
## Warning: package 'BAS' was built under R version 3.4.3

And the data itself:

#load("movies")
load("movie.Rdata")
head(movies)
## # A tibble: 6 x 32
##                  title   title_type       genre runtime mpaa_rating
##                  <chr>       <fctr>      <fctr>   <dbl>      <fctr>
## 1          Filly Brown Feature Film       Drama      80           R
## 2             The Dish Feature Film       Drama     101       PG-13
## 3  Waiting for Guffman Feature Film      Comedy      84           R
## 4 The Age of Innocence Feature Film       Drama     139          PG
## 5          Malevolence Feature Film      Horror      90           R
## 6          Old Partner  Documentary Documentary      78     Unrated
## # ... with 27 more variables: studio <fctr>, thtr_rel_year <dbl>,
## #   thtr_rel_month <dbl>, thtr_rel_day <dbl>, dvd_rel_year <dbl>,
## #   dvd_rel_month <dbl>, dvd_rel_day <dbl>, imdb_rating <dbl>,
## #   imdb_num_votes <int>, critics_rating <fctr>, critics_score <dbl>,
## #   audience_rating <fctr>, audience_score <dbl>, best_pic_nom <fctr>,
## #   best_pic_win <fctr>, best_actor_win <fctr>, best_actress_win <fctr>,
## #   best_dir_win <fctr>, top200_box <fctr>, director <chr>, actor1 <chr>,
## #   actor2 <chr>, actor3 <chr>, actor4 <chr>, actor5 <chr>,
## #   imdb_url <chr>, rt_url <chr>
dim(movies)
## [1] 651  32

Part 1: Data

The data frame has 651 rows and 32 columns indicating the 651 randomly sampled movies and 32 different features of those movies. There is no random assignment or contol procedures implemented in this data so there is *no basis for making a causal relationship or inference. However the ransom sampling does mean the results are generalizable. Possible sources of bias are lurking variables such as both coinciding and oppositional scores given by the public in response to the critics or how one culture view a movies as opposed to anothr. A infamous case of this was Bridges of Madison County which received different responses in America as opposed to Japan for cultural reasons. Extending this model and its results should take such factors into account.


Part 2: Data manipulation

To proceed we will need new columns for the variables of interest and the regressian model. The first is feature_film which says if a movie is a feature film or not. Next are drama and mpaa_rating_R which respectively indicate if a movie is a drama and rated R. The next two are calendar based; oscar_season and summer_season which cover the Oscar months of November through December and the summer months Of May to August. This is five new categories for the data set.

movies <- movies %>%
  mutate(feature_film = ifelse(title_type == "Feature Film", "yes", "no"))

movies <- movies %>%
  mutate(drama = ifelse(genre == "Drama", "yes", "no"))

movies <- movies %>%
  mutate(mpaa_rating_R = ifelse(mpaa_rating == "R", "yes", "no"))

movies <- movies %>%
  mutate(oscar_season = ifelse(thtr_rel_month >= 10, "yes", "no"))

movies <- movies %>%
  mutate(summer_season = ifelse(thtr_rel_month >=5 & thtr_rel_month <=8, "yes", "no"))

With the new colums added we can see the data columns have increased to 37.

dim(movies)
## [1] 651  37

Part 3: Exploratory data analysis

Our question of interest is: What makes a movie popular? For this model the audience score will be the response variable. Therefore before the regression analysis begins we should examine the data and the relationship between audience score and our new variables as well as the nature of the varaibles themselves.

Before any examination, the data must be cleaned. It must be determined if any NA’s exist in the data.

any(is.na(movies))
## [1] TRUE

There are so they will be taken out and accompanying rows.

movies <- na.omit(movies)
any(is.na(movies))
## [1] FALSE
dim(movies)
## [1] 619  37

Cleaning has cost us 32 rows or abour 5% of our data so the set is mostly intact. We can begin with the EDA.

Starting with audience score:

ggplot(data = movies, aes(x = audience_score)) +
  geom_histogram(binwidth = 6.16, color = "blue", fill = "yellow") +
  labs(title = "Audience Score") 

Above we can see that the audience scores are nearly bimodal and skewed to the left so we wouldnt expect normality to be exact.

qqnorm(movies$audience_score)

As we can see above the normality plot fails to conform to normality though it adheres for portions of the distribution.

summary(movies$audience_score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   46.00   65.00   62.21   80.00   97.00
boxplot(movies$audience_score)

ggplot(data = movies, aes(x = audience_score, fill= audience_score)) +
  geom_bar()+
  coord_flip()

The boxplot, barplot and the summary indicate encouraging news. The socres are higher than the midpoint of possible scores, with a full 25% higher than 80, the third quartile. This is positive information because while it may not signal a popular movie it does demonstarte that is not difficult to produce a likeable movie. And likeable creates revenue.

ggplot(data = movies, aes(x = feature_film, fill= feature_film)) +
  geom_bar()+
  coord_flip()+
   labs(title = "Feature Film Count") 

Here we have our first indication of a possible plan of action for popularity. With so many movies being of the feature film category it seems a good place for labeling of the movie to ensure good ratings.

ggplot(data = movies, aes(x = drama, fill= drama)) +
  geom_bar()+
  coord_flip()+
  labs(title = "Drama Count")

Drama is different. There is almost a 50-50 split but this is slightly encouraging as well since drama is one category of several and almost half the movies made fall into this one section. While the close 1/2 ratio means that the other genres hold an important percentage it also means that drama is a significant category.

ggplot(data = movies, aes(x = mpaa_rating_R, fill= mpaa_rating_R)) +
  geom_bar()+
  coord_flip()+
  labs(title = "R Rating Count")

The R rating ##like drama## comprises about half the movies produced. As such both categories, being for adults, imply that revenue for a movie house lay within the grownup sector and should be directed there. A clearer look will show this:

ggplot(movies, aes(x = genre, fill = genre)) + 
  geom_bar() + 
  coord_flip() +
  labs(title = ("Movie Genres"))

ggplot(movies, aes(x = mpaa_rating, fill = mpaa_rating)) + 
  geom_bar() + 
  coord_flip() +
  labs(title = ("MPAA Rating"))

Its clear from the four plots above of the importance of rated R and drama movies on the general population of movies made.

ggplot(data = movies, aes(x = oscar_season, fill= oscar_season)) +
  geom_bar()+
  coord_flip()+
  labs(title = "Oscar Season Count")

Both Oscar season and summer season ##NO’s## are approximately 2/3rds of their respective leaving 1/3 each for yes meaning that combined for the year they are almost 2/3 of annually made movied fall in these two seasons, from May to August and then from October to November. ##Clearly these are the months to focus on.##

ggplot(data = movies, aes(x = summer_season, fill= summer_season)) +
  geom_bar()+
  coord_flip()+
  labs(title = "Summer Season Count")

ggplot(movies, aes(x = thtr_rel_month, fill = thtr_rel_month, colour = "yellow")) + 
  geom_histogram(binwidth = 1, color = "blue", fill = "green") + 
  labs(title = ("Theater Release Month"))

There are prominent peaks during the middle of the year(summer) and the end(oscar season). Agaon points of consideration for the relaese of a movie. With some counts for additional clarity.

movies%>%
  group_by(thtr_rel_month)%>%
  count(thtr_rel_month)
## # A tibble: 12 x 2
## # Groups:   thtr_rel_month [12]
##    thtr_rel_month     n
##             <dbl> <int>
##  1              1    64
##  2              2    32
##  3              3    49
##  4              4    43
##  5              5    43
##  6              6    71
##  7              7    45
##  8              8    42
##  9              9    51
## 10             10    65
## 11             11    48
## 12             12    66

The next step is to ##examine the relationship between audience score and the new variables##

ggplot(aes(y = audience_score, x = feature_film, fill = feature_film), data = movies) +
  geom_boxplot() +
  labs(title = "Feature Film vs Audience Score") +
  coord_flip()

Here is some surprising information. It appears that on average feature films do NOT score as highly as non feature films. This being the case our earlier decision regarding feature films has to be modified and feature films should NOT be considered an wim because of their numbers or perhaps in spite of it.

ggplot(aes(y = audience_score, x = drama, fill = drama), data = movies) +
  geom_boxplot() +
  labs(title = "Drama vs Audience Score") +
  coord_flip()

Dramas however are consistent with our previous intuition and they do in fact score higher than other genres when combined and even regardless of how many, have a smaller spread in the second to third quartiles scores as well. This suggests making a good drama with its higher median and smaller IQR is a more certain endeavor.

ggplot(aes(y = audience_score, x = mpaa_rating_R, fill = mpaa_rating_R), data = movies) +
  geom_boxplot() +
  labs(title = "MPAA Rating R vs Audience Score") +
  coord_flip()

R and other ratings (as a group) have the same median and same variability implyomg there may not be an inherent advantage to this rating except that all things being equal it wikk because its flooding the market garner some of the higher scores.

ggplot(aes(y = audience_score, x = summer_season, fill = summer_season), data = movies) +
  geom_boxplot() +
  labs(title = "Summer Season vs Audience Score") +
  coord_flip()

Here is another surprise. With all the fanfare of the summer “blockbuster” one might expect there to be higher scores in the summer. However that seems to not be the case. As with rated R movies summer movies fare no better on averagethan their other year counterparts.

ggplot(aes(y = audience_score, x = oscar_season, fill = oscar_season), data = movies) +
  geom_boxplot() +
  labs(title = "Oscar Season vs Audience Score") +
  coord_flip()

With Oscar season often bringing the very best of movies the expectation would be greater popularity. That is indeed that case. However the degree is not as great as one might suspect. Oscar season movies only do slightly better thsn others. again surprising.

Part 4: Modeling

While all of this is informative a probablistic approach would give a more palatable way of assessing the risk involved in a movie venture. The interplay of the variables mentioned as well as other variables in the table are better assessed through a regression analysis.

This is a multiple regression model that we will use Bayesian techniques on to determine the best model for prediction. All the predictor variables are listed in the regression equation with audience score as the response variable. We will let the prior probability depend on n and for the expected value of n over the parameter g to equal one. This is the Zellners-Siow prior. Assuming or having no knowledge of variable impact on the response variable a uniform distribution for the model prior will be chosen. The model sampling and averaging will be done by Markov Chain Monte Carlo. The regression model will go by the name bayes.movie.

Creating and running the model:

bayes.movie = bas.lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + thtr_rel_month + oscar_season + summer_season + imdb_rating + imdb_num_votes + critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box, data = movies, prior = "ZS-null", modelprior = uniform(), method = "MCMC" )
## Warning in bas.lm(audience_score ~ feature_film + drama + runtime +
## mpaa_rating_R + : We recommend using the implementation using the Jeffreys-
## Zellner-Siow prior (prior='JZS') which uses numerical integration rahter
## than the Laplace approximation

We can examine the marginal posterior inclusion probabilities of the various models sampled and used by MCMC. The graph below indicates that most of the models have low pip except for two or three. There are two models generated and they are fairly similar. This reveals that MCMC has run long enough on the unique models generated.

diagnostics(bayes.movie)

The residual graph below shows an interesting outcome. The residuals are heteroscedastic revealing a difference in the variance of some of the variables under BAyesian modeling averaging.Regression is based on the assumption that the variance of the features in the regression model have the same variance (homoscedasticty). This is not the case with this set of variables. Consequently they may be non-linear terms or different scale factors in our feature set. This could complicate the regression or force a overall simplification. Notice three values are numbered that would show the possiblity of outliers.

plot(bayes.movie, which = 1)

The model cumulative probabilities below show us that that each additional model sampled very quickly converges to one so the models dont add all that much from very early over the *almost 6000 models generated.

plot(bayes.movie, which = 2)

The above information taken together seems to indicate that some variables may have greater impact that others and overshadow the rest. With the possibility of multicollinearity (intervariable correlation) and nonconstant variance (heteroscedasticity) it appears that the final models may be simpler than the original regression equation might indicate.

Below we have the first indication of this in the model complexity graph. The models with the highest marginal probabilities, here expressed as the log of their values, all two to three variables.

plot(bayes.movie, which = 3)

The marginal posterior inclusion probabilities, shown below, are the probabilities that a particular feature is in the model with the best predictive power. From the graph below its cear that imdb rating, critics score and possibily runtime are the most important features for predicting audience score.

plot(bayes.movie, which = 4)

The model rank graph shows the relative strength of each models validity from the strongest model at the far left and the corresponding features that go with it. As in the PIP model above, the intercept, imdb rating and critics score are consistent throughout as variables to be included with runtime coming in by the second model. *The image model rank model shows no obvious correlation regarding multicollinearity.

image(bayes.movie, rotate = FALSE)

Part 5: Prediction

I ran a prediction under the estimator BMA [predict(bayes.movie, estimator = “BMA”)]and the output was unweildly but it did state that the best model contained the intercept, runtime, imdb score and critics score.

lm_movies <- lm(audience_score ~ runtime +imdb_rating + critics_score , data = movies)
summary(lm_movies)
## 
## Call:
## lm(formula = audience_score ~ runtime + imdb_rating + critics_score, 
##     data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.996  -6.706   0.738   5.435  52.550 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -32.90825    3.35451  -9.810  < 2e-16 ***
## runtime        -0.05791    0.02238  -2.588 0.009891 ** 
## imdb_rating    14.95043    0.60249  24.814  < 2e-16 ***
## critics_score   0.07531    0.02227   3.381 0.000769 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.15 on 615 degrees of freedom
## Multiple R-squared:  0.7483, Adjusted R-squared:  0.747 
## F-statistic: 609.4 on 3 and 615 DF,  p-value: < 2.2e-16

The first thing of note for this model is that the R-squared is 0.747 which is rather goos as it indicates that about 75% of the variance is explained by these varaibles and is higher than if runtime were omitted. SEE BELOW

lm_movies_a <- lm(audience_score ~ runtime +imdb_rating + critics_score , data = movies)
summary(lm_movies)
## 
## Call:
## lm(formula = audience_score ~ runtime + imdb_rating + critics_score, 
##     data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.996  -6.706   0.738   5.435  52.550 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -32.90825    3.35451  -9.810  < 2e-16 ***
## runtime        -0.05791    0.02238  -2.588 0.009891 ** 
## imdb_rating    14.95043    0.60249  24.814  < 2e-16 ***
## critics_score   0.07531    0.02227   3.381 0.000769 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.15 on 615 degrees of freedom
## Multiple R-squared:  0.7483, Adjusted R-squared:  0.747 
## F-statistic: 609.4 on 3 and 615 DF,  p-value: < 2.2e-16

Runtime was possibly a questionable variable as it only had a marginal inclusion probability of about 0.5 but coefficient of determination, R-squared, reveals that it is significant enough to be in the model.

I am going to use the movie ARRIVAL for the prediction model and to see how it fares. Arrival had a critics score of 94%, imdb rating of 8, runtime of 116 minutes and audience score, our response variable of 82%.

arrival<- data.frame(critics_score = 0.94, imdb_rating = 8.0, runtime = 116)
predict(lm_movies, arrival)
##      1 
## 80.049

The model yields 80% which is very good. The next and final step is to create the confidence interval.

predict(lm_movies, arrival, interval = "prediction", level = 0.95)
##      fit      lwr      upr
## 1 80.049 59.71363 100.3844

As we can see this confidence interval tells us ##** we are 95% confident that a drama that is liked 94% by the crirics with a runtime of 116 minutes and rating of 8 will be scored by the audience between 60% and 100% on average.**

Part 6: Conclusion

The model worked well for this movie. At this point it should be applied to others in an effort to collect data on its efficacy and then use more statistical models to examine the results. Care should be taken to moderate over confidence as there are many permutations of the variables and their quantities. As such there is room for variance that would render the model inaccurate and perhaps highly so.