Bayesian modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)


getwd()

## [1] "C:/Albert/Coursera Statitics with R/Bayesian Statistics/Wk5"

Load data

load("eaca_movies.Rdata")
mov<-data.frame(movies)

Abstract Summary

The research question was stated as: What attributes make a movie popular? In the statitical analysis of this question, a subset of the original data plus five new derived predictors were evaluated to see what influence they have, if any, on the audience score,

A Bayesian Information Criterion backwards elimination process was performed on the model that eliminated twelve predictors. Bayesian posterior diagnostics were performed and the coeeficients were interpreted.

The prediction of a single movie, ‘Rogue One: A Star Wars Story’ was performed using the model with the prediction being the same as the audience score. The model was deemed good with certain limitations.

Part 1: Data

The data set is comprised of 651 randomly sampled movies produced and released before 2016. Some of these variables are only there for informational purposes and do not make any sense to include in a statistical analysis.

IMDb

The Internet Movie Database (abbreviated IMDb) is an online database of information related to films, television programs and video games, including cast, production crew, fictional characters, biographies, plot summaries, trivia and reviews, operated by IMDb.com, Inc., a subsidiary of Amazon.com.

Actors and crew can post their own résumé and upload photos of themselves for a yearly fee. U.S. users can view over 6,000 movies and television shows from CBS, Sony, and various independent filmmakers.

Launched in 1990 by computer programmer Col Needham, the company was incorporated in the UK as Internet Movie Database Ltd in 1996 with revenue generated through advertising, licensing and partnerships. In 1998 it became a subsidiary of Amazon.com, who were then able to use it as an advertising resource for selling DVDs and videotapes.

The information in IMDb comes from various sources. While we actively gather information from and verify items with studios and filmmakers, the bulk of our information is submitted by people in the industry and visitors like you.[1]

Rotten Tomatoes

Rotten Tomatoes is an American review aggregator website for film and television.

The company was launched in August 1998 by Senh Duong and since January 2010 has been owned by Flixster, which itself was acquired in 2011 by Warner Bros. In February 2016, Rotten Tomatoes and its parent site Flixster were sold to Comcast’s Fandango. Warner Bros. retained a minority stake in the merged entities, including Fandango.[2] Since 2007, the website’s editor-in-chief has been Matt Atchity.[4] The name, Rotten Tomatoes, derives from the practice of audiences’ throwing rotten tomatoes when disapproving of a poor stage performance.[2]

Generalizability and Causality

Generalization is allowable since the data represent 651 randomly sampled movies released between 1970 to 2014 in the Unites States. The data taken from both the IMDb and Rotten Tomatoes database. However, since the study does not make use of random assignment, causality cannot be inferred from this study. The sample may be biased since it is limited to entries contributed by members.

Part 2: Data manipulation

Create new variable based on title_type: New variable should be called feature_film with levels yes (movies that are feature films) and no.
Create new variable based on genre: New variable should be called drama with levels yes (movies that are dramas) and no.
Create new variable based on mpaa_rating: New variable should be called mpaa_rating_R with levels yes (movies that are R rated) and no.
Create two new variables based on thtr_rel_month: New variable called oscar_season with levels yes (if movie is released in November, October, or December) and no.
New variable called summer_season with levels yes (if movie is released in May, June, July, or August) and no.

mov<-mutate(mov,feature_film=ifelse(mov$title_type == 'Feature Film','yes','no'))
mov$feature_film <- as.factor(mov$feature_film)
ff_T <- filter(mov,feature_film=='yes')
ff_F <- filter(mov,feature_film=='no')
mov<-mutate(mov,drama=ifelse(mov$genre == 'Drama','yes','no'))
genD_T <- filter(mov,drama=='yes')
genD_F <- filter(mov,drama=='no')
mov<-mutate(mov,mpaa_rating_R=ifelse(mov$mpaa_rating == 'R','yes','no'))

Rr_T <- filter(mov,mpaa_rating_R=='yes')
Rr_F <- filter(mov,mpaa_rating_R=='no')
mov<-mutate(mov,summer_season=ifelse(mov$thtr_rel_month == 5|
                                     mov$thtr_rel_month == 6|
                                     mov$thtr_rel_month == 7|
                                     mov$thtr_rel_month == 8,'yes','no'))
ss_T <- filter(mov,summer_season=='yes')
ss_F <- filter(mov,summer_season=='no')

mov<-mutate(mov,oscar_season=ifelse(mov$thtr_rel_month == 10| 
                                    mov$thtr_rel_month == 11|
                                    mov$thtr_rel_month == 12,'yes','no'))
os_T <- filter(mov,oscar_season=='yes')
os_F <- filter(mov,oscar_season=='no')

mov$drama <- as.factor(mov$drama)
mov$mpaa_rating_R <- as.factor(mov$mpaa_rating_R)
mov$summer_season <- as.factor(mov$summer_season)
mov$oscar_season <- as.factor(mov$oscar_season)

## Subsets

sSet<-select(mov,audience_score,feature_film,drama,summer_season,oscar_season,mpaa_rating_R)

mSet<-select(mov,audience_score,feature_film,drama,summer_season,oscar_season,mpaa_rating_R,
             runtime,thtr_rel_year,imdb_rating,imdb_num_votes,critics_score,best_pic_nom,
             best_pic_win,best_actor_win,best_actress_win,best_dir_win,top200_box)

Part 3: Exploratory data analysis

Preliminary

Therr are 651 movies being statistically analyzed. The audience score ranges from 11 to 97 based on a 100 point scale. The Interquartile range for the audience score, (IQR), is 34, with mean 62.36 and median 65.

The newly added items tally up as follows:

feature_film - yes 591 no 60
drama - yes 305 no 346
mpaa_rating_R - yes 322 no 329
summer_season - yes 208 no 443
oscar_season - yes 191 no 460

nrow(sSet);summary(sSet)

## [1] 651

##  audience_score  feature_film drama     summer_season oscar_season
##  Min.   :11.00   no : 60      no :346   no :443       no :460     
##  1st Qu.:46.00   yes:591      yes:305   yes:208       yes:191     
##  Median :65.00                                                    
##  Mean   :62.36                                                    
##  3rd Qu.:80.00                                                    
##  Max.   :97.00                                                    
##  mpaa_rating_R
##  no :322      
##  yes:329      
##               
##               
##               
##

################################# FEATURE FILM ##########################

p1<-ggplot(mov, aes(x=feature_film, y=audience_score,fill=summer_season)) + geom_boxplot() +
    stat_summary(fun.y=mean, geom="point", shape=4, size=5)+
    stat_summary(fun.y=sd, geom="point", shape=1, size=2,color='blue')  +
   coord_flip()

p2<-ggplot(mov, aes(x=feature_film, y=audience_score,fill=oscar_season)) + geom_boxplot() +
    stat_summary(fun.y=mean, geom="point", shape=4, size=5)+
    stat_summary(fun.y=sd, geom="point", shape=1, size=2,color='blue')  +
   coord_flip()

b1<-ggplot(ff_T, aes(x = audience_score, y = feature_film)) +
    geom_bar(stat='identity',colour='white')
b2<-ggplot(ff_F, aes(x = audience_score, y = feature_film)) +
    geom_bar(stat='identity',colour='white')

####################### DRAMA ####################################

p3<- ggplot(mov, aes(x=drama, y=audience_score,fill=summer_season)) + geom_boxplot() +
    stat_summary(fun.y=mean, geom="point", shape=4, size=3,color='black')  +
   stat_summary(fun.y=sd, geom="point", shape=1, size=2,color='blue')  +
   coord_flip()
 

p4<- ggplot(mov, aes(x=drama, y=audience_score,fill=oscar_season)) + geom_boxplot() +
    stat_summary(fun.y=mean, geom="point", shape=4, size=3)+
    stat_summary(fun.y=sd, geom="point", shape=1, size=2,color='blue')  +
   coord_flip()

b3<-ggplot(genD_T, aes(x = audience_score, y = drama)) +
    geom_bar(stat='identity')
b4<-ggplot(genD_F, aes(x = audience_score, y = drama)) +
    geom_bar(stat='identity')


####################### MPAA RATING R ####################################

p5<- ggplot(mov, aes(x=mpaa_rating_R, y=audience_score,fill=summer_season)) + geom_boxplot() +
    stat_summary(fun.y=mean, geom="point", shape=4, size=5)+
     stat_summary(fun.y=sd, geom="point", shape=1, size=2,color='blue')  +
    coord_flip()
p6<- ggplot(mov, aes(x=mpaa_rating_R, y=audience_score,fill=oscar_season)) + geom_boxplot() +
    stat_summary(fun.y=mean, geom="point", shape=4, size=5) +
    stat_summary(fun.y=sd, geom="point", shape=1, size=2,color='blue')  +
   coord_flip()

b5<-ggplot(Rr_T, aes(x = audience_score, y = mpaa_rating_R)) +
    geom_bar(stat='identity',color='white')
b6<-ggplot(Rr_F, aes(x = audience_score, y = mpaa_rating_R)) +
    geom_bar(stat='identity',color='white')

Feature Film Summary Statistics

Below are two sets of four feature film boxplots and one set of feature film bar charts. The first set of boxplots correspond to the summer season and the second to the oscar season, as explained in the instructions. The abscissa for all of the boxplots is the audience score, the key indicator of a popular movie. The bar charts will be explained below.

The boxplots are plotted sideways to reveal the information in a form much like a distribution. Each boxplot has the following characteristics:

box width corresponds to the IQR
box center bar is the median value
left and right box whisker corresponds to the min and max values, correspondingly
X in the box is the mean
the blue o corresponds to the box standard deviation

The first box is blue and corresponds to a feature film released during the summer season. The IQR is near 30, the median is 63, the mean is 60 and the standard deviation is about 20. The left whisker (min) is 11 and the right whisker(max) is 93.

The second box, colored pink, has the same mean and standard deviation as the first box. This box corresponds to a feature film not during the summer season. The IRQ is slightly larger say 35 and the median is slightly lower say 61. The left whisker is close to 13 and the right whisker is larger at about 96.

The next two boxplots depict films other than feature films for the summer season (blue) and other than summer season (pink). Immediately, it can be seen that both the medians and means are larger and the IRQs and standard deviations are smaller. This means that non-feature films on the average, have much higher popularity (audience scores) than feature films. There are also two outlier points, one occurs during the oscar season and the other occurs in the months that are neither oscar nor summer season. There are two distinct means and standard deviations for the summer season and other than summer summer season. The values for these statistics can be discerned from the boxplots.

The next four box plots display the same information as the first four with summer season being replaced with the oscar season.

Finally, a set of bar charts are provided for both the feature film movies and the other than feature film movies. This allows the reader to more clearly recognize the similarity between the boxplots and the bar chart distribution. It is clear that boxplots contain much more statistical information than the bar charts.

multiplot(p1, p2, cols=1);multiplot(b1, b2, cols=2)

## Loading required package: grid

Drama Summary Statistics

Using the same type of analysis described in the feature film statitics above, the Drama audience scores behave as follows:

The higest median, median and lowest sd, for audience scores, occurs for drama movies in oscar season (70,65.10)
The smallest mean and median, for audience scorse, occurs for non-drama films other than summer season (58,57)
The largest standard deviation for audience score occurs for non-drama films (20)
The smallest IQR for audience scores occurs for the drama movies during summer season (19)
The largest IQR for audience score occurs for non-drama movies during oscar season (38)
There are no audience score outliers for these boxplots

multiplot(p3, p4, cols=1)#;multiplot(b3, b4, cols=2)

R Rating Summary Statistics

Using the same type of analysis described in the feature film statitics above, the R rating audience scores behave as follows:

The higest median and median, for audience scores, occurs for R movies in summer season (73,68)
The smallest median for audience scorse, occurs for non-R films in summer season (60)
The smallest mean for audience scorse, occurs for non-R films in summer season (58)
The standard deviation for audience score are roughly the same (18)
The smallest IQR for audience scores occurs for the non-R movies during summer season (20)
The largest IQR for audience score occurs for non-drama movies during oscar season (41)
There are no audience score outliers for these boxplots

multiplot(p5, p6, cols=1)#;multiplot(b5, b6, cols=2)

Part 4: Modeling

The model is built using Bayesian Information Criterion (BIC), which uses goodness of fit along with the number of parameters to evaluate the data. The list of columns given in the description of the task are evaluated on the basis of smallest BIC value. The column that has the smallest BIC value when eliminated from the rest of the list is selected for permanent removal from the list.

The algorithm below uses the elimination described above to select the least set of columns required for best modeling results. During each iteration, current data is printed indicating the reduction in BIC for the given column.

m_mov = lm(audience_score ~ . , data = mSet)

for (k in 1:16){
     x<-names(mSet)
     x<-x[2:length(x)]
     mb<-lm(audience_score~.,data=mSet)
     min<-1000000
     ind<-0
     col<-''
     for(i in x){
          j<-which(colnames(mSet)==i)
          ss<- mSet[-c(j)]
          m <- lm(audience_score ~ .,data=ss)
          b<-BIC(m)
          if( b < min ){
               min <- b
               ind <- j
               col<- i
          }
     }
     if(b>BIC(mb))break
    
     h<-sprintf("[%0.0f] BIC previous (%0.0f), BIC new (%0.0f) Column to be reemoved(%s)",k,BIC(mb),min,col)
     print(h)
     mSet<-mSet[-c(ind)]

}

## [1] "[1] BIC previous (4934), BIC new (4928) Column to be reemoved(top200_box)"
## [1] "[2] BIC previous (4928), BIC new (4922) Column to be reemoved(oscar_season)"
## [1] "[3] BIC previous (4922), BIC new (4916) Column to be reemoved(best_pic_win)"
## [1] "[4] BIC previous (4916), BIC new (4910) Column to be reemoved(best_dir_win)"
## [1] "[5] BIC previous (4910), BIC new (4905) Column to be reemoved(summer_season)"
## [1] "[6] BIC previous (4905), BIC new (4900) Column to be reemoved(feature_film)"
## [1] "[7] BIC previous (4900), BIC new (4895) Column to be reemoved(drama)"
## [1] "[8] BIC previous (4895), BIC new (4890) Column to be reemoved(imdb_num_votes)"
## [1] "[9] BIC previous (4890), BIC new (4886) Column to be reemoved(thtr_rel_year)"
## [1] "[10] BIC previous (4886), BIC new (4881) Column to be reemoved(best_actor_win)"
## [1] "[11] BIC previous (4881), BIC new (4878) Column to be reemoved(best_actress_win)"
## [1] "[12] BIC previous (4878), BIC new (4874) Column to be reemoved(best_pic_nom)"

Bayesian Posterior Probability Diagnostics

Viewing the posteriror probability output below, There is a 36 % chance that either of the top two models are the true model. This means that between the two there is a total 72 % chance of the true model.

The model chosen has four parameters, other than intercept, namely, MPAA_rating_R, runtime, imdb_rating and critics score. Therefore there are \(p^2\) (16) number of models for the four predictors. These are depicted in the image below.

The selected model used is most similar to the first block of the image below with the black square removed. The models range from best to worst as the columns move left to right. The lower horizontal axis is the log of the posterior probabilities in the table (X -1000).

mSet_no_na = na.omit(mSet)


bma_as = bas.lm(audience_score ~ ., data = mSet_no_na,
                   prior = "BIC", 
                   modelprior = uniform())
                  
bma_as

## 
## Call:
## bas.lm(formula = audience_score ~ ., data = mSet_no_na, prior = "BIC",     modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##        Intercept  mpaa_rating_Ryes           runtime       imdb_rating  
##           1.0000            0.2008            0.5077            1.0000  
##    critics_score  
##           0.9014

round(summary(bma_as),3)

##      Intercept mpaa_rating_Ryes runtime imdb_rating critics_score    BF
## [1,]         1                0       1           1             1 1.000
## [2,]         1                0       0           1             1 0.997
## [3,]         1                1       0           1             1 0.252
## [4,]         1                1       1           1             1 0.239
## [5,]         1                0       1           1             0 0.126
##      PostProbs    R2 dim   logmarg
## [1,]     0.362 0.755   4 -3615.279
## [2,]     0.361 0.752   3 -3615.282
## [3,]     0.091 0.754   4 -3616.657
## [4,]     0.087 0.756   5 -3616.710
## [5,]     0.046 0.751   3 -3617.354

image(bma_as,rotate=F)

Interpretation of Coefficients

The posterior mean and standard deviations, for the intercept and the four predictors, are provided in the first two columns of the output below. The third column post \(p(B != 0)\) is the probability that the coefficient is non-zero. Notice that for both the intercept and imdb rating the event is certain. The critics score has a 90% chance of not being zero, the runtime 50% and mpaa rating R is only 20%.

The plots below graphically depict the coeeficient statistics provided in the table. Note that the spikes indicate the likelihood of a zero probability. The height oof spikes are scaled to their probabilities. Since the intercept and imbd rating there are no spikes, again denoting non-zero probability

coef_as = coefficients(bma_as)
coef_as

## 
##  Marginal Posterior Summaries of Coefficients: 
##                   post mean  post SD   post p(B != 0)
## Intercept         62.34769    0.39478   1.00000      
## mpaa_rating_Ryes  -0.30490    0.70385   0.20075      
## runtime           -0.02737    0.03086   0.50765      
## imdb_rating       14.97684    0.72427   1.00000      
## critics_score      0.06445    0.02961   0.90135

par(mfrow=c(2,3))

plot(coef_as)
par(mfrow=c(1,1))

Part 5: Prediction

The motion picture ‘Rogue One: A Star Wars Story’, starring Felicity Jones, will be used for the prediction movie. The audience score, as delineated in the data frame below, is 88. The values for the fields of the data frame were taken from references [4][5].

The predicted result 88, is the same as the audience score.

rogueOne<-data.frame(audience_score=as.numeric(88),
                  feature_film=as.factor('yes'),
                  drama=as.factor('no'),
                  summer_season=as.factor('no'),
                  oscar_season=as.factor('yes'),
                  mpaa_rating_R=as.factor('no'),
                  runtime=as.numeric(133),
                  thtr_rel_year=as.numeric(2016),
                  imdb_rating=as.numeric(8.1),
                  imdb_num_votes=as.integer(239631),
                  critics_score=as.numeric(85),
                  best_pic_nom=as.factor('no'),
                  best_pic_win=as.factor('no'),
                  best_actor_win=as.factor('no'),
                  best_actress_win=as.factor('no'),
                  best_dir_win=as.factor('no'),
                  top200_box=as.factor('yes') )




BayesPred<-predict(m,rogueOne)
round(BayesPred)

##  1 
## 88

Part 6: Conclusion

The reasech question was analyzed using methods outlined in project. An Exploratory Data Analysis was performed for the five derived variables. A Bayesian Information criterion backwards elimination produced a model that was scrutinized using Bayesian Posterior Probabilities. The model was used to predict some recent movies.

Although the prediction for ‘Rogue One: A Star Wars Story’ behaved perfectly, there are some limitations to this model. Other movies, such as ‘Sully’ and ‘Jackie’, were predicted to have worse results than if no elimination results were not used in model building.