Setup

Load packages

library(tidyverse)
library(statsr)
library(GGally)

Load data

load("movies.Rdata")

Introduction

This report is presented to Paramount Studios to illuminate what makes a movie popular. The primary conclusion is that popular movies are closely correlated with popular directors. Much more discussion is below.

Part 1: Data

Data Overview

The data consist of 651 rows of randomly sampled movies from IMDb and/or Rotten Tomatoes, including one movie that is listed twice. The exact method of sampling is unknown, but it is known to be random. The sample appears biased to include more recent movies. The data do not include a single film released before 1970. With that bias in mind, this report focuses on “contemporary movies,” or those with theatrical release in year 2000 or later. With the caveats already mentioned, the conclusions herein are generalizable to all contemporary movies listed on IMDb and Rotten Tomatoes.

This study is purely observational and one cannot draw conclusions about causality, although a topic like “what makes a movie popular” surely tempts one to think in causal terms.

Data Cleaning

These are minor glitches in the data that are repaired to prepare for data analyses to come.

A duplicate movie to be removed:

# Remove duplicate movies ("Man on Wire")
movie_pop <- 
  distinct(movies, imdb_url, .keep_all=TRUE)

Missing and inconsistent studio names to be reconciled (e.g., “Paramount” and “Paramount Studio”). This is simply a rough cut at cleaning the names. A full correction of studio names is beyond the scope of this project:

# Convert from factor to character
movie_pop$studio <- as.character(movie_pop$studio)

# Fill in missing studio names
movie_pop$studio[movie_pop$title=="Dirty Sanchez: The Movie"] <- "Vertigo Films"
movie_pop$studio[movie_pop$title=="Inbred"] <- "Anchor Bay"
movie_pop$studio[movie_pop$title=="The Man Who Sued God"] <- "Buena Vista"

# Clean studio names with manual corrections stored in file
movie_pop <- left_join(movie_pop, read.csv("studio_corrections.csv"), by="studio")

## Warning: Column `studio` joining character vector and factor, coercing into
## character vector

# Just a sample of the cleaning -- all the versions of Warner Bros
movie_pop %>% group_by(studio,studio_cleaned) %>% summarise(n=n()) %>% filter(grepl("Warner",studio)==TRUE)

## # A tibble: 7 x 3
## # Groups:   studio [7]
##                        studio  studio_cleaned     n
##                         <chr>          <fctr> <int>
## 1        Warner Bros Pictures Warner Brothers     1
## 2                Warner Bros. Warner Brothers     7
## 3       Warner Bros. Pictures Warner Brothers    30
## 4           Warner Home Video Warner Brothers    19
## 5          Warner Independent Warner Brothers     1
## 6 Warner Independent Pictures Warner Brothers     3
## 7      Warners Bros. Pictures Warner Brothers     1

Duplicate actor names in the cast of a single movie (e.g., “Traudl Junge” is both actor1 and actor2 for Blind Spot: Hitler’s Secretary):

# Tall list of actors x movies
movies_tall <- movie_pop %>%
  gather(actor1,actor2,actor3,actor4,actor5,key=role,value=actor)

# Flag duplicates within cast of each movie
movies_tall$dupcast <- duplicated(movies_tall %>% select(title,actor))

# Reassemble wide list with cleaned up cast
movie_pop <- 
  spread(movies_tall %>% filter(dupcast==FALSE),key=role,value=actor) %>% select(-dupcast)

Different movies with identical titles. Release year is added to the titles of these movies so that each movie title is unique:

# Resolve other duplicate title names by adding year to title
movie_pop[duplicated(movie_pop$title),]$title <-
  paste(movie_pop[duplicated(movie_pop$title),]$title,
        movie_pop[duplicated(movie_pop$title),]$thtr_rel_year)

Missing runtime

movie_pop[movie_pop$title=="The End of America",]$runtime <- 74

Box office: the ultimate measure of popularity

The most obvious measure of popularity of a movie is its box office or gross revenue. Unfortunately, the movies data do not include box office. To help choose a proxy variable for box office, revenue numbers are found manually for a random sample of 30 of the 650 movies. The values are stored in the variable box_office, which represents how many thousands of dollars a movie made. For example, The Hangover made $277,322,000. In the Exploratory Data Analysis section, several variables available in the movies data are analyzed for correlation with box office based on these 30 randomly selected films. It will be seen that imdb_num_votes provides the best proxy, with a log-log transformation.

# Reproducible randomized list of movies. we then manually find box office for 30
set.seed(7575672)
movie_pop$rnd <- runif(nrow(movie_pop))
write.csv(movie_pop %>% arrange(rnd) %>% select(title,studio,thtr_rel_year),
          "movies_for_box_office.csv")

# Manually add box office figures and read them back in and join to movie_pop
movie_pop <- left_join(movie_pop, read.csv("movies_with_box_office.csv"), by="title")

## Warning: Column `title` joining character vector and factor, coercing into
## character vector

movie_pop %>% filter(!is.na(box_office)) %>% select(title,thtr_rel_year,box_office) %>% arrange(desc(box_office))

## # A tibble: 28 x 3
##                               title thtr_rel_year box_office
##                               <chr>         <dbl>      <int>
##  1                     The Hangover          2009     277322
##  2                Beverly Hills Cop          1984     234760
##  3          While You Were Sleeping          1995      81057
##  4                         Spy Game          2001      62362
##  5 Star Trek V - The Final Frontier          1989      52210
##  6                 The Longest Yard          1974      43008
##  7                   The Main Event          1979      42800
##  8                             Taps          1981      35856
##  9                       Robin Hood          1973      32056
## 10                    Micki + Maude          1984      26080
## # ... with 18 more rows

Part 2: Research question

What makes a movie popular? More specifically this report is interested in box office, via some proxy variable in movies that is (hopefully) highly correlated to box office. With the data provided, this report seeks to find the best proxy for box office, and then to construct the best possible linear model that explains that proxy variable.

Spoiler: Ultimately we arrive at a model with a response variable log_votes (log of imdb_num_votes) and an R² of 0.72. The most important explanatory variable by far is the historical performance of the director at getting IMDb votes for their previous movies. This linear model only works when the director has known work prior to the movie in question, which is only true about one fifth of the time given our data. For the other four fifths, when the director’s previous experience is unknown, an alternative linear model is derived with an R² of 0.60.

Part 3: Exploratory data analysis

Basic inventory

The movies dataset has 650 unique observations of 32 variables, which are documented here.

The data include a very small number of TV movies and NC-17 rated features, which are not relevant to our theatrical focus and are therefore removed. Variable title_type is ignored from here forth because its remaining information is redundant with genre.

# Omit TV movies and NC-17s because so few and not of interest
movie_pop %>% count(title_type)

## # A tibble: 3 x 2
##     title_type     n
##         <fctr> <int>
## 1  Documentary    54
## 2 Feature Film   591
## 3     TV Movie     5

movie_pop %>% count(mpaa_rating)

## # A tibble: 6 x 2
##   mpaa_rating     n
##        <fctr> <int>
## 1           G    19
## 2       NC-17     2
## 3          PG   118
## 4       PG-13   132
## 5           R   329
## 6     Unrated    50

movie_pop <- movie_pop %>% 
  filter(title_type!="TV Movie",mpaa_rating!="NC-17") %>% 
  droplevels()

Box office

Looking at the random sample of box office numbers gathered in the preceding section, one sees the distrubtion of box office is extremely right skewed. A log scale provides a more uniform distribution that will prove more useful:

movie_pop %>% ggplot(aes(x=box_office)) + geom_histogram(bins = 20) + ggtitle("Box office distribution\nBased on random sample of 30 movies")

## Warning: Removed 615 rows containing non-finite values (stat_bin).

movie_pop %>% ggplot(aes(x=log2(box_office))) + geom_histogram(bins = 20) + ggtitle("Logged distribution of box office")

## Warning: Removed 615 rows containing non-finite values (stat_bin).

Year of theatrical release

There is a distinct upward trend in the number of movies released each year, and none released before 1970. Perhaps this is a bias built into IMDb and Rotten Tomatoes, or maybe built into the sampling method. In any case, the analysis will accomodate this bias by focusing on “contemporary movies”, those released since 2000. The full cleaned dataset will be kept in movie_40years for some auxiliary analyses to come.

movie_pop %>% ggplot(aes(x=thtr_rel_year)) + geom_histogram(binwidth = 1) + ggtitle("Number of movies released per year")

# Save full 40 year history and filter movie_pop to last 15 years or so
movie_40years <- movie_pop
movie_pop <- movie_pop %>% filter(thtr_rel_year>1999)

Proxy for box office = IMDb Votes

The number of people who voted on a movie, just like box office, is extremely right skewed. This makes sense because neither votes nor box office can be less than zero, and both can grow exponentially for blockbuster movies. Once again a log scale provides a useful distribution:

movie_pop %>% ggplot(aes(x=imdb_num_votes)) + geom_histogram(bins=20) + ggtitle("IMDb votes distribution")

# Log(IMDb_num_votes) is roughly normal
movie_pop %>% ggplot(aes(x=log2(imdb_num_votes))) + geom_histogram(bins=20) + ggtitle("Logged IMDb votes distribution")

# Track log of IMDb votes with new variable
movie_pop <- movie_pop %>% mutate(log_votes=log2(imdb_num_votes))
movie_40years <- movie_40years %>% mutate(log_votes=log2(imdb_num_votes))

In the table and chart below it is seen that the log-log correlation of votes and box office is very good (R=0.92) and will provide the sought-after proxy.

Note that without a log-log analysis, there is still a correlation between votes and box office, but it is subject to one extremely influential point which makes the linear model suspect. In the next subsection, other movie scores in the data are evaluated as possible proxies for box office and found to be quite inferior. This report uses log_votes as a proxy for log(box_office).

# Log-log correlation is excellent
cor(movie_pop$log_votes, log2(movie_pop$box_office), use = "complete.obs")

## [1] 0.9229588

movie_pop %>% ggplot(aes(x=log_votes,y=log2(box_office))) + geom_point(size=2)  + geom_smooth(alpha=0.3,method="lm") + ggtitle("Log(box office) vs Log(votes)")

## Warning: Removed 323 rows containing non-finite values (stat_smooth).

## Warning: Removed 323 rows containing missing values (geom_point).

movie_pop %>% ggplot(aes(x=imdb_num_votes,y=box_office)) + geom_point(size=2) + geom_smooth(alpha=0.3,method="lm") + ggtitle("Box office vs Votes\nAffected by extremely influential point") + ylab("Box office (000s)")

## Warning: Removed 323 rows containing non-finite values (stat_smooth).

## Warning: Removed 323 rows containing missing values (geom_point).

Movie ratings and scores

On Rotten Tomatoes there is a distinction between “score,” which is numeric, and “rating,” which is categorical and derived mostly from the associated score. The numeric scores offer richer information and so the report uses these instead of categorical ratings.

As shown below, none of these movie score variables comes close to log_votes as a proxy for box_office. It is not shown here but in the un-logged case the correlations are just as bad. Note that it doesn’t make sense to use the log of a review score, because a review score has a strict maximum and cannot grow exponentially for any blockbusters.

movie_pop %>% ggplot(aes(x=audience_score,y=log2(box_office))) + geom_point() + geom_smooth(method="lm") + ggtitle("Log(box office) vs. Audience score\nR=0.41")

## Warning: Removed 323 rows containing non-finite values (stat_smooth).

## Warning: Removed 323 rows containing missing values (geom_point).

movie_pop %>% ggplot(aes(x=critics_score,y=log2(box_office))) + geom_point() + geom_smooth(method="lm")  + ggtitle("Log(box office) vs. Critics score\nR=0.20")

## Warning: Removed 323 rows containing non-finite values (stat_smooth).

## Warning: Removed 323 rows containing missing values (geom_point).

movie_pop %>% ggplot(aes(x=imdb_rating,y=log2(box_office))) + geom_point() + geom_smooth(method="lm")  + ggtitle("Log(box office) vs. IMDb rating\nR=0.40")

## Warning: Removed 323 rows containing non-finite values (stat_smooth).

## Warning: Removed 323 rows containing missing values (geom_point).

cor(movie_pop$audience_score, movie_pop$box_office, use = "complete.obs")

## [1] 0.4131753

cor(movie_pop$critics_score, movie_pop$box_office, use = "complete.obs")

## [1] 0.2028081

cor(movie_pop$imdb_rating, movie_pop$box_office, use = "complete.obs")

## [1] 0.3979792

Possible explanatory variables for the linear model

With log_votes established as the response variable, the other variables in the data are evaluated as possible explanatory variables to include in the linear model.

Movie reviews and scores

The chart below shows that Imdb_rating, critics_score, and audience_score are all highly correlated with each other, but not so much with log_votes. We will consider only imdb_rating for possible inclusion in the linear model, excluding the others because of colinearity.

ggpairs(movie_pop,columns=c(13,16,18,36))

Oscar connections and top-200 box office

The data include best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, and top200_box. These are categorical variables, and ANOVA is used to evaluate their promise as explanatory variables for the linear model. The p-value is especially small for a best picture nomination, indicating excellent potential as an explanatory variable. Potential is also good for including a director or actor with oscar. A best actress oscar seems to have less effect than a best actor oscar.

We will omit best_pic_win from consideration because it is colinear with best_pic_nom. We will omit top200_box because it would make for a self-referential and unhelpful approach to explaining box office. The others are possible candidates for the linear model.

summary(aov(movie_pop$log_votes ~ movie_pop$best_pic_nom))

##                         Df Sum Sq Mean Sq F value   Pr(>F)    
## movie_pop$best_pic_nom   1  102.8  102.83   16.62 5.73e-05 ***
## Residuals              331 2048.1    6.19                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(aov(movie_pop$log_votes ~ movie_pop$best_pic_win))

##                         Df Sum Sq Mean Sq F value Pr(>F)  
## movie_pop$best_pic_win   1   37.8   37.80   5.921 0.0155 *
## Residuals              331 2113.1    6.38                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(aov(movie_pop$log_votes ~ movie_pop$best_actor_win))

##                           Df Sum Sq Mean Sq F value   Pr(>F)    
## movie_pop$best_actor_win   1   83.1   83.09    13.3 0.000308 ***
## Residuals                331 2067.8    6.25                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(aov(movie_pop$log_votes ~ movie_pop$best_actress_win))

##                             Df Sum Sq Mean Sq F value Pr(>F)  
## movie_pop$best_actress_win   1   40.9   40.95   6.423 0.0117 *
## Residuals                  331 2110.0    6.37                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(aov(movie_pop$log_votes ~ movie_pop$best_dir_win))

##                         Df Sum Sq Mean Sq F value   Pr(>F)    
## movie_pop$best_dir_win   1   73.1   73.12   11.65 0.000722 ***
## Residuals              331 2077.8    6.28                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(aov(movie_pop$log_votes ~ movie_pop$top200_box))

##                       Df Sum Sq Mean Sq F value  Pr(>F)    
## movie_pop$top200_box   1   77.9   77.91   12.44 0.00048 ***
## Residuals            331 2073.0    6.26                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Genre, rating, and month of release

The data include genre, mpaa_rating, and thtr_rel_month (converted from number to factor). Again, these are categorical variables, and ANOVA is used to evaluate their promise as explanatory variables. We include all as candidates for the linear model, even though the month receives a weak (large) p-value.

summary(aov(movie_pop$log_votes ~ movie_pop$genre))

##                  Df Sum Sq Mean Sq F value Pr(>F)    
## movie_pop$genre  10  750.6   75.06   17.26 <2e-16 ***
## Residuals       322 1400.3    4.35                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(aov(movie_pop$log_votes ~ movie_pop$mpaa_rating))

##                        Df Sum Sq Mean Sq F value Pr(>F)    
## movie_pop$mpaa_rating   4  598.5  149.61   31.61 <2e-16 ***
## Residuals             328 1552.5    4.73                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(aov(movie_pop$log_votes ~ movie_pop$thtr_rel_month))

##                           Df Sum Sq Mean Sq F value Pr(>F)
## movie_pop$thtr_rel_month   1   16.7  16.747   2.597  0.108
## Residuals                331 2134.2   6.448

# ggpairs(movie_pop,columns=c(3,4,5,36))

Studio activity

The data include the studio for each movie, tidied up in the variable studio_cleaned. Below is a summary of the log_votes for recent films from the most active studios.

movie_pop %>% group_by(studio_cleaned) %>% filter(n()>7) %>% ggplot(aes(x=studio_cleaned,y=log_votes,group=studio_cleaned)) + geom_boxplot(varwidth = TRUE) + coord_flip() + ggtitle("Log_votes for active studios")

Notice above how the varwidth option adjusts the width of each boxplot according to the number of movies it represents. It appears that the more active studios typically produce movies with more log_votes. A new variable studio_category is therefore created to capture studio activity, with three ranges, small, medium, and large, defined based on the full 40-year dataset. In the calculation below, each “large” studio gets a studio_category to itself, while other studios get studio_category of “medium” or “small.”"

Actor and director activity

Along with studio activity, the historical activity of directors and actors will also be calculated and considered as explanatory for movie popularity. The calculation begins with a transformed “tall” version of the data, where each row records the participation of a specific individual or organization in a specific film. The variable entity names the individual/organization, and the variable role names their mode of participation as “actor”, “director”, or “studio”.

# Tall list of actors x movies
movies_tall <- movie_40years %>%
  gather(actor1,actor2,actor3,actor4,actor5,key=role,value=actor,na.rm=TRUE) %>% 
  select(title, thtr_rel_year, log_votes, 
         critics_score, audience_score, imdb_rating,          
         actor) %>%
  mutate(role="actor")
names(movies_tall)[names(movies_tall)=="actor"] <- "entity"

# Add directors
dir_movies <- movie_40years %>%
  select(title, thtr_rel_year, log_votes, 
         critics_score, audience_score, imdb_rating,
         director) %>%
  filter(!is.na(director)) %>%
  mutate(role="director")
names(dir_movies)[names(dir_movies)=="director"] <- "entity"
movies_tall <- rbind(movies_tall,dir_movies)

# Add studios
studio_movies <- movie_40years %>%
  filter(!is.na(studio_cleaned)) %>%
  mutate(role="studio") %>%
  select(title, thtr_rel_year, log_votes, 
         critics_score, audience_score, imdb_rating,
         studio_cleaned, role)
names(studio_movies)[names(studio_movies)=="studio_cleaned"] <- "entity"
movies_tall <- rbind(movies_tall,studio_movies)

With the tall form of data it is then easy to count how many times a person has acted or directed, or how many times a studio has released a movie. These statistics are then joined to the original “wide” data frame movie_pop:

# Count # films for each actor
actor_filmcount <- movies_tall %>% 
  filter(role=="actor") %>%
  select(entity) %>%
  group_by(entity) %>%
  summarise(n=n())

# Join # films (in our dataset) for each cast member
colnames(actor_filmcount) <- c("actor1","actor1_exp")
movie_pop <- left_join(movie_pop, actor_filmcount, by="actor1")
colnames(actor_filmcount) <- c("actor2","actor2_exp")
movie_pop <- left_join(movie_pop, actor_filmcount, by="actor2")
colnames(actor_filmcount) <- c("actor3","actor3_exp")
movie_pop <- left_join(movie_pop, actor_filmcount, by="actor3")
colnames(actor_filmcount) <- c("actor4","actor4_exp")
movie_pop <- left_join(movie_pop, actor_filmcount, by="actor4")
colnames(actor_filmcount) <- c("actor5","actor5_exp")
movie_pop <- left_join(movie_pop, actor_filmcount, by="actor5")

# Add total cast experience
movie_pop <- movie_pop %>% group_by(title) %>%
  mutate(cast_exp=sum(c(actor1_exp,
                        actor2_exp,
                        actor3_exp,
                        actor4_exp,
                        actor5_exp),
                      na.rm=TRUE))

# Count director experience 
director_films <- movies_tall %>% 
  filter(role=="director") %>%
  select(entity) %>%
  group_by(entity) %>%
  summarise(n=n())

colnames(director_films) <- c("director","dir_exp")
movie_pop <- left_join(movie_pop, director_films, by="director")

# Count studio experience
studio_films <-  movies_tall %>% 
  filter(role=="studio") %>%
  select(entity) %>%
  group_by(entity) %>%
  summarise(n=n())

At the same time, studio_category is calculated and joined to movie_pop:

colnames(studio_films) <- c("studio_cleaned","studio_cleaned_n")
movie_pop <- 
  left_join(movie_pop,studio_films,by="studio_cleaned") %>%
  mutate(studio_category=ifelse(studio_cleaned_n>25,
                                studio_cleaned,
                                ifelse(studio_cleaned_n>7,
                                       "Medium",
                                       "Small")))

## Warning: Column `studio_cleaned` joining factor and character vector,
## coercing into character vector

The above calculations produce these values to be considered as possible explanatory variables:

cast_exp: Let actor_{i_exp} be the total number of movies that include actor i in their cast. Then cast_exp is the sum of all actor_{i_exp} (i=1,2,3,4,5)
dir_exp: total number of movies with this director
studio_cleaned_n: total number of movies with this studio
studio_category: If studio in >25 movies then that studio is large and gets its own name as a category. If studio in 8-24 movies then “medium”, otherwise “small”.

The one new categorical variable is studio_category which is evaluated using ANOVA below. The p-value is very small, indicating strong potential as an explanatory variable:

summary(aov(movie_pop$log_votes ~ movie_pop$studio_category))

##                            Df Sum Sq Mean Sq F value   Pr(>F)    
## movie_pop$studio_category   7  469.8   67.12   12.97 9.93e-15 ***
## Residuals                 325 1681.1    5.17                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The other new variables are numeric and are evaluated below, along with runtime and thtr_rel_year, which are the only remaining numeric variable not yet evaluated as possible explanatory elements of the model.

ggpairs(movie_pop,columns=c(4,7,42,43,44,36))

Categorical versions of cast experience and runtime

Eyeballing (in the gallery above) the scatterplot of log_votes vs cast_exp, it appears less a steady linear relationship and more a step function from rookie cast to veteran cast. Similarly, the scatterplot of log_votes vs runtime suggests three categories: short, average, and long. New categorical variables are created reflecting these observations. Their explanatory promise is tested with ANOVA with positive results.

movie_pop <- movie_pop %>% 
  mutate(length=ifelse(runtime>=135,"Long",
                       ifelse(runtime>=90,"Avg","Short")),
         cast_experience=ifelse(cast_exp>=8,"Veteran","Rookie"))
                                
summary(aov(movie_pop$log_votes ~ movie_pop$length))

##                   Df Sum Sq Mean Sq F value   Pr(>F)    
## movie_pop$length   2  254.6  127.28   22.15 9.42e-10 ***
## Residuals        330 1896.4    5.75                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(aov(movie_pop$log_votes ~ movie_pop$cast_experience))

##                            Df Sum Sq Mean Sq F value   Pr(>F)    
## movie_pop$cast_experience   1  357.4   357.4   65.96 9.21e-15 ***
## Residuals                 331 1793.5     5.4                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Historical scores

Historical scores measure the average performance (via log_votes, critics_score, audience_score, or imdb_rating) for an individual or organization in previous movies. This is probably one of the most obvious real-life measurements a producer would consider when staffing/casting: Who is available today with a good track record of previous work in popular movies?

Historical scores are more complicated to compute than the numbers above, because each individual’s historical average changes each time they are in another movie. The calculation below uses a for-loop to calculate average scores for each individual or organization at each moment preceding a specific film they are in.

The calculation estimates historical scores based on what is included in the movies data. This is obviously missing most of real-life history for whatever individual we want to analyze. We are confident proceeding this way because the whatever historical data does turn up will be representative, due to the random sampling used to generate the original data.

Historical scores, when they are available, will prove to be powerful explanatory variables. However, historical scores are unfortunately not available in the majority of cases. As a result there will be two distinct linear models: one model with historical scores for when the associated individuals have a supporting history, and another model without historical scores for when the individuals do not have a supporting history.

Looking to the code that follows, the four functions below use the movies_tall table to calculate historical scores for a specified entity-name in a specific role (e.g., actor, director, studio) for a specific movie in a specific year.

# Function to calc entity's avg votes prior to this film
a_vote_hist <- function(this_entity,this_role,this_film,this_year) {
  return(movies_tall %>% 
           filter(entity==this_entity,
                  role==this_role,
                  thtr_rel_year<=this_year,
                  title!=this_film) %>% 
           summarise(avg_votes=mean(log_votes)) %>%
           as.numeric())
  }  

# function to calc entity's critic score prior to this film
a_critics_hist <- function(this_entity,this_role,this_film,this_year) {
  return(movies_tall %>% 
           filter(entity==this_entity,
                  role==this_role,
                  thtr_rel_year<=this_year,
                  title!=this_film) %>% 
           summarise(avg_critics_score=mean(critics_score)) %>%
           as.numeric())
  }
 
# Function to calc entity's audience score prior to this film
a_aud_hist <- function(this_entity,this_role,this_film,this_year) {
  return(movies_tall %>% 
           filter(entity==this_entity,
                  role==this_role,
                  thtr_rel_year<=this_year,
                  title!=this_film) %>% 
           summarise(avg_aud_score=mean(audience_score)) %>%
           as.numeric())
  }

# Function to calc entity's imdb score prior to this film
a_imdb_hist <- function(this_entity,this_role,this_film,this_year) {
  return(movies_tall %>% 
           filter(entity==this_entity,
                  role==this_role,
                  thtr_rel_year<=this_year,
                  title!=this_film) %>% 
           summarise(avg_imdb_rating
                     =mean(imdb_rating)) %>%
           as.numeric())
}

This for loop below calculates a comprehensive set of historical scores.

# Iterate over all entity x film pairs and calc stats for entity's prior films
# Save our lengthy calculation in CSV
calculate_entity_movie_rolling_stats <- function() {
  movies_tall <- movies_tall %>%
    mutate(prior_films=0,
           critic_hist=NA,
           aud_hist=NA,
           imdb_hist=NA,
           vote_hist=NA)

  for (i in 1:nrow(movies_tall)) {
    row <- movies_tall[i,]
    this_entity <- row$entity
    this_rol <- row$role
    this_film <- row$title
    this_year <- row$thtr_rel_year
    movies_tall[i,]$prior_films <- 
      nrow(movies_tall %>% 
             filter(entity==this_entity,
                    role==this_rol,
                    title!=this_film,
                    thtr_rel_year<=this_year))
    movies_tall[i,]$critic_hist <- a_critics_hist(this_entity,this_rol,this_film,this_year)
    movies_tall[i,]$aud_hist <- a_aud_hist(this_entity,this_rol,this_film,this_year)
    movies_tall[i,]$imdb_hist <- a_imdb_hist(this_entity,this_rol,this_film,this_year)
    movies_tall[i,]$vote_hist <- a_vote_hist(this_entity,this_rol,this_film,this_year)
    }
  write.csv(movies_tall %>% arrange(role,entity,thtr_rel_year),file="entity_movies.csv")
  }

The above calculation takes a long time. The results are stored in a CSV file above, and then below they are read back in. To save knitting time it’s usually fine to comment out the calculation (represented by the one-line function call below) and re-use the existing CSV file.

# Actor movies has history of previous stats
# calculate_entity_movie_rolling_stats()

The code below joins the results of the above calculation (historical scores) with the “wide” movie_pop data frame.

movies_tall <- read.csv("entity_movies.csv") %>%
  select(entity,role,title,prior_films,critic_hist,aud_hist,imdb_hist,vote_hist)

# Join historical info for each cast member actor1, actor2, actor3, actor4, actor5
colnames(movies_tall) <-
  c("actor1","role","title","prior_films1","critic_hist1","aud_hist1","imdb_hist1","vote_hist1")
movie_pop <- 
  left_join(movie_pop,
            movies_tall %>% filter(role=="actor") %>% select(-role),
            by=c("title","actor1"))

## Warning: Column `title` joining character vector and factor, coercing into
## character vector

## Warning: Column `actor1` joining character vector and factor, coercing into
## character vector

colnames(movies_tall) <-
  c("actor2","role","title","prior_films2","critic_hist2","aud_hist2","imdb_hist2","vote_hist2")
movie_pop <- 
  left_join(movie_pop,
            movies_tall %>% filter(role=="actor") %>% select(-role),
            by=c("title","actor2"))

## Warning: Column `title` joining character vector and factor, coercing into
## character vector

## Warning: Column `actor2` joining character vector and factor, coercing into
## character vector

colnames(movies_tall) <-
  c("actor3","role","title","prior_films3","critic_hist3","aud_hist3","imdb_hist3","vote_hist3")
movie_pop <-  
  left_join(movie_pop,
            movies_tall %>% filter(role=="actor") %>% select(-role),
            by=c("title","actor3"))

## Warning: Column `title` joining character vector and factor, coercing into
## character vector

## Warning: Column `actor3` joining character vector and factor, coercing into
## character vector

colnames(movies_tall) <- 
  c("actor4","role","title","prior_films4","critic_hist4","aud_hist4","imdb_hist4","vote_hist4")
movie_pop <-  
  left_join(movie_pop,
            movies_tall %>% filter(role=="actor") %>% select(-role),
            by=c("title","actor4"))

## Warning: Column `title` joining character vector and factor, coercing into
## character vector

## Warning: Column `actor4` joining character vector and factor, coercing into
## character vector

colnames(movies_tall) <-
  c("actor5","role","title","prior_films5","critic_hist5","aud_hist5","imdb_hist5","vote_hist5")
movie_pop <-  
  left_join(movie_pop,
            movies_tall %>% filter(role=="actor") %>% select(-role),
            by=c("title","actor5"))

## Warning: Column `title` joining character vector and factor, coercing into
## character vector

## Warning: Column `actor5` joining character vector and factor, coercing into
## character vector

# Add director and studio also

colnames(movies_tall) <-
  c("director","role","title","prior_films_dir","critic_hist_dir","aud_hist_dir","imdb_hist_dir","vote_hist_dir")
movie_pop <-  
  left_join(movie_pop,
            movies_tall %>% filter(role=="director") %>% select(-role),
            by=c("title","director"))

## Warning: Column `title` joining character vector and factor, coercing into
## character vector

## Warning: Column `director` joining character vector and factor, coercing
## into character vector

colnames(movies_tall) <-
  c("studio_cleaned","role","title","prior_films_studio","critic_hist_studio","aud_hist_studio","imdb_hist_studio","vote_hist_studio")
movie_pop <-  
  left_join(movie_pop,
            movies_tall %>% filter(role=="studio") %>% select(-role),
            by=c("title","studio_cleaned"))

## Warning: Column `title` joining character vector and factor, coercing into
## character vector

## Warning: Column `studio_cleaned` joining character vector and factor,
## coercing into character vector

colnames(movies_tall) <- 
  c("entity","role","title","prior_films","critic_hist","aud_hist","imdb_hist","vote_hist")

Finally for each movie we summarise the historical scores of its entire cast, which is done below:

movie_pop <- movie_pop %>% ungroup() %>% group_by(title) %>%
  mutate(
    cast_prior_films=sum(c(prior_films1,
                           prior_films2,
                           prior_films3,
                           prior_films4,
                           prior_films5),
                         na.rm=TRUE),
    cast_critic_hist=mean(c(critic_hist1,
                            critic_hist2,
                            critic_hist3,
                            critic_hist4,
                            critic_hist5),
                          na.rm=TRUE),
    cast_aud_hist=mean(c(aud_hist1,
                         aud_hist2,
                         aud_hist3,
                         aud_hist4,
                         aud_hist5),
                       na.rm=TRUE),
    cast_imdb_hist=mean(c(imdb_hist1,
                         imdb_hist2,
                         imdb_hist3,
                         imdb_hist4,
                         imdb_hist5),
                       na.rm=TRUE),
    cast_vote_hist=ifelse(cast_prior_films>0,
                          max(c(vote_hist1,
                                vote_hist2,
                                vote_hist3,
                                vote_hist4,
                                vote_hist5),
                              na.rm=TRUE),
                          NA))

Evaluation of historical scores as explanatory variables

For most movies, their actors and director have no known work history in the data. An alternative linear model will be developed for those cases. But first, the movies that do have director histories are put in movie_pop_exp for developing our primary linear model. Movies without a director history go into movie_pop_inexp for the secondary linear model.

movie_pop_exp <- movie_pop %>% filter(prior_films_dir>0)
movie_pop_inexp <- anti_join(movie_pop, movie_pop_exp, by="title")

Of all the historical scores calculated above, the director’s vote history has by far the best correlation with the movie’s log_votes. See the gallery of charts below. The correlation factor is 0.55. No other historical score has a correlation above 0.40. The charts for the other historical scores (related to actors and studios) are omitted for brevity. We will include this one historical score, vote_hist_dir, and ignore the others in order to minimize the dependence of the linear model on the thin historical record.

ggpairs(movie_pop_exp,columns=c(73,74,75,76,77,36))

Part 4: Modeling

Primary model (with historical scores)

Based on exploratory data analysis, there are 11 candidate explanatory variables for our primary linear model. The full linear model below includes all 11.

model_full <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating +
                   thtr_rel_month +
                   best_actor_win +
                   best_actress_win +
                   best_dir_win,
                 data=movie_pop_exp)
summary(model_full)

## 
## Call:
## lm(formula = log_votes ~ vote_hist_dir + cast_experience + studio_category + 
##     imdb_rating + genre + length + mpaa_rating + thtr_rel_month + 
##     best_actor_win + best_actress_win + best_dir_win, data = movie_pop_exp)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.323 -0.794  0.000  0.624  2.582 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                     8.21979    2.40765   3.414  0.00163 **
## vote_hist_dir                   0.32779    0.10873   3.015  0.00476 **
## cast_experienceVeteran         -0.55262    0.87861  -0.629  0.53346   
## studio_categoryMedium          -1.16650    1.13828  -1.025  0.31249   
## studio_categoryMGM             -1.18922    1.37151  -0.867  0.39180   
## studio_categoryParamount       -0.56191    1.15126  -0.488  0.62853   
## studio_categorySmall            0.08453    1.13831   0.074  0.94123   
## studio_categorySony            -0.90019    1.15652  -0.778  0.44159   
## studio_categoryUniversal       -0.18671    1.22599  -0.152  0.87983   
## studio_categoryWarner Brothers  0.85092    1.11642   0.762  0.45106   
## imdb_rating                     0.54100    0.30407   1.779  0.08390 . 
## genreAnimation                  0.16002    2.16752   0.074  0.94157   
## genreArt House & International  2.29722    2.91656   0.788  0.43621   
## genreComedy                     0.53768    1.23859   0.434  0.66688   
## genreDocumentary               -2.21255    1.91344  -1.156  0.25538   
## genreDrama                     -0.95356    1.10086  -0.866  0.39228   
## genreHorror                    -0.68077    1.65729  -0.411  0.68374   
## genreMystery & Suspense        -1.54460    1.11483  -1.386  0.17467   
## genreOther                      1.27069    2.49371   0.510  0.61356   
## genreScience Fiction & Fantasy -1.54064    2.42830  -0.634  0.52991   
## lengthLong                      0.11027    0.64420   0.171  0.86508   
## lengthShort                    -1.34112    0.92538  -1.449  0.15616   
## mpaa_ratingPG-13                1.47745    1.03111   1.433  0.16077   
## mpaa_ratingR                    1.08984    1.06434   1.024  0.31288   
## mpaa_ratingUnrated             -3.63740    2.08623  -1.744  0.09002 . 
## thtr_rel_month                 -0.03714    0.07331  -0.507  0.61563   
## best_actor_winyes               0.44773    0.62207   0.720  0.47647   
## best_actress_winyes            -0.40237    0.64126  -0.627  0.53442   
## best_dir_winyes                 0.31043    0.64504   0.481  0.63333   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.51 on 35 degrees of freedom
## Multiple R-squared:  0.7301, Adjusted R-squared:  0.5142 
## F-statistic: 3.382 on 28 and 35 DF,  p-value: 0.0003906

The R² of 0.73 is promising, but is this a valid linear model? The analysis below confirms that it is.

Checking linear model assumptions

Linear relationship: the residual scatterplots for both numerical variables show appropriate random scatter around 0.

plot(model_full$residuals ~ movie_pop_exp$vote_hist_dir)

plot(model_full$residuals ~ movie_pop_exp$imdb_rating)

Normal distribution: The histogram and Q-Q plot show an essentially normal distribution of residuals.

hist(model_full$residuals)

qqnorm(model_full$residuals)
qqline(model_full$residuals)

Constant variability of residuals: Plotting residuals vs fitted values does not show a fan shape. We do see small residuals at both left and right extremes. However, there are so few left/right extreme points involved that we are not too concerned.

plot(model_full$residuals ~ model_full$fitted.values)

plot(abs(model_full$residuals) ~ model_full$fitted.values)

Independent Residuals: Plotting residuals by index does not reveal any time pattern.

plot(model_full$residuals)

Model selection

Starting with the validated full linear model, a parsimonious model is developed via backward elimination based on adjusted R². This method is used despite the extra work, because adjusted R² is more reliable than p-values as a means of variable elimination.

A more complete report of each backward step is included in an appendix. Here are the highlights of backward elimination:

Step 1: Eliminating best_dir_win produces the best result, adjusted R²=0.525
Step 2: Eliminating thtr_rel_month produces the best result, adjusted R²=0.535
Step 3: Eliminating best_actress_win produces the best result, adjusted R²=0.541
Step 4: Eliminaring best_actor_win produces the best result, adjusted R²=0.548
Step 5: Eliminating cast_experience produces the best result, adjusted R²=0.553
Step 6: No matter what is eliminated, adjusted R² gets worse. The parsimonious model comes from Step 5.

Parsimonious linear model (with historical scores)

The final model has R²=0.72 and adjusted R²=0.55. It is shown below. The vote history of the director has by far the most influence on the prediction. The only other parameter with a stronger p-value is the intercept, which applies universally to all points and does not play a part in differentiating popular vs unpopular movies.

The conclusion of this report has more discussion that interprets this linear model.

model_pars <- lm(log_votes ~
                   vote_hist_dir + 
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating,
                 data=movie_pop_exp)
summary(model_pars)

## 
## Call:
## lm(formula = log_votes ~ vote_hist_dir + studio_category + imdb_rating + 
##     genre + length + mpaa_rating, data = movie_pop_exp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3892 -0.8246  0.0000  0.7124  2.6204 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      8.0633     2.1499   3.750 0.000559 ***
## vote_hist_dir                    0.3402     0.1023   3.327 0.001892 ** 
## studio_categoryMedium           -1.1198     1.0368  -1.080 0.286622    
## studio_categoryMGM              -1.3082     1.2839  -1.019 0.314338    
## studio_categoryParamount        -0.3386     1.0694  -0.317 0.753200    
## studio_categorySmall             0.2337     1.0765   0.217 0.829264    
## studio_categorySony             -0.9159     1.0709  -0.855 0.397494    
## studio_categoryUniversal        -0.1107     1.1400  -0.097 0.923099    
## studio_categoryWarner Brothers   1.0153     1.0460   0.971 0.337546    
## imdb_rating                      0.4469     0.2708   1.650 0.106780    
## genreAnimation                  -0.8098     1.8884  -0.429 0.670329    
## genreArt House & International   2.8790     2.6268   1.096 0.279636    
## genreComedy                      0.1330     1.0249   0.130 0.897400    
## genreDocumentary                -1.9512     1.6183  -1.206 0.235016    
## genreDrama                      -1.1082     0.9655  -1.148 0.257876    
## genreHorror                     -0.7647     1.4407  -0.531 0.598488    
## genreMystery & Suspense         -1.8899     1.0091  -1.873 0.068392 .  
## genreOther                       1.3774     2.0667   0.666 0.508938    
## genreScience Fiction & Fantasy  -0.8319     2.1733  -0.383 0.703916    
## lengthLong                       0.1105     0.5899   0.187 0.852295    
## lengthShort                     -1.1941     0.8181  -1.460 0.152203    
## mpaa_ratingPG-13                 1.4749     0.9506   1.552 0.128641    
## mpaa_ratingR                     1.1079     0.9993   1.109 0.274210    
## mpaa_ratingUnrated              -3.8535     1.9068  -2.021 0.050020 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.449 on 40 degrees of freedom
## Multiple R-squared:  0.7161, Adjusted R-squared:  0.5529 
## F-statistic: 4.387 on 23 and 40 DF,  p-value: 2.18e-05

Alternate linear model without historical scores

The development of the secondary linear model (without historical scores) is in an appendix. It has R²=0.60 and adjusted R²=0.55.

model_wo_hist <- lm(log_votes ~
                      dir_exp + 
                      cast_experience +
                      studio_category +
                      imdb_rating +
                      genre + 
                      length + 
                      mpaa_rating,
                    data=movie_pop_inexp)
summary(model_wo_hist)

## 
## Call:
## lm(formula = log_votes ~ dir_exp + cast_experience + studio_category + 
##     imdb_rating + genre + length + mpaa_rating, data = movie_pop_inexp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5158 -1.0173  0.0590  0.9685  3.8182 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     10.3493     1.3484   7.675 4.05e-13 ***
## dir_exp                          0.5945     0.3905   1.523 0.129183    
## cast_experienceVeteran           0.2988     0.2456   1.216 0.224991    
## studio_categoryMedium           -0.6614     0.4496  -1.471 0.142539    
## studio_categoryMGM              -0.9008     0.8948  -1.007 0.315070    
## studio_categoryParamount        -0.1374     0.6008  -0.229 0.819350    
## studio_categorySmall            -1.5138     0.4550  -3.327 0.001014 ** 
## studio_categorySony             -0.1704     0.5754  -0.296 0.767432    
## studio_categoryUniversal         0.6434     0.6814   0.944 0.345996    
## studio_categoryWarner Brothers  -0.3156     0.5439  -0.580 0.562232    
## imdb_rating                      0.7352     0.1260   5.834 1.72e-08 ***
## genreAnimation                  -0.3487     1.0283  -0.339 0.734802    
## genreArt House & International  -2.4741     0.7128  -3.471 0.000614 ***
## genreComedy                     -1.3042     0.4389  -2.971 0.003263 ** 
## genreDocumentary                -4.4226     0.5534  -7.991 5.45e-14 ***
## genreDrama                      -1.6727     0.3911  -4.278 2.72e-05 ***
## genreHorror                     -0.0521     0.6846  -0.076 0.939402    
## genreMusical & Performing Arts  -3.0844     0.8208  -3.758 0.000215 ***
## genreMystery & Suspense         -0.9133     0.5099  -1.791 0.074514 .  
## genreOther                      -0.4517     1.0675  -0.423 0.672587    
## genreScience Fiction & Fantasy   0.6188     1.3043   0.474 0.635643    
## lengthLong                       1.2191     0.6236   1.955 0.051725 .  
## lengthShort                     -0.6063     0.2918  -2.077 0.038815 *  
## mpaa_ratingPG                    0.5761     0.9861   0.584 0.559634    
## mpaa_ratingPG-13                 1.3881     1.0163   1.366 0.173256    
## mpaa_ratingR                     0.9015     0.9993   0.902 0.367864    
## mpaa_ratingUnrated              -0.4289     0.9913  -0.433 0.665664    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.691 on 242 degrees of freedom
## Multiple R-squared:  0.5971, Adjusted R-squared:  0.5538 
## F-statistic: 13.79 on 26 and 242 DF,  p-value: < 2.2e-16

Part 5: Prediction

The linear models are tested as predictors with a few movies released in 2016 (and therefore not in the dataset). Two movies are included whose directors have histories in the existing movies data. Joel Coen has an average vote history of 18.5 and Clint Eastwood has an average vote history of 15.1:

m2016 <- read.csv("2016 movie data frame.csv")
(m2016 %>% select(title,imdb_num_votes,log_votes))

##           title imdb_num_votes log_votes
## 1 Hail, Caesar!          94326  16.52537
## 2         Sully         157111  17.26142

The predict function shows fitted values for these two movies. The model overestimates the log_votes of Hail, Caesar! by about 2, which means it overestimates actual un-logged votes by a factor of 4 (2²=4). The model does better with Clint Eastwood’s Sully, coming within 0.66 of the correct log_votes, or for actual votes within a factor of 1.6 (2^0.66=1.58).

In both cases the 95% prediction intervals comfortably include the actual log_votes.

predict(model_pars,m2016,interval = "predict", level=0.95)

##        fit      lwr      upr
## 1 18.67311 15.20599 22.14022
## 2 17.92329 14.69984 21.14674

Repeating the above steps for the linear model without historical scores, the predictions are also OK and the 95% prediction intervals comfortably include the actual values.

m2016_inexp <- read.csv("2016 movie dir wo hist.csv")
(m2016_inexp %>% select(title,imdb_num_votes,log_votes))

##                      title imdb_num_votes log_votes
## 1         God's Not Dead 2           8132  12.98939
## 2 The Purge: Election Year          60262  15.87896

predict(model_wo_hist,m2016_inexp,interval = "predict", level=0.95)

##        fit       lwr      upr
## 1 10.90038  7.354509 14.44624
## 2 16.30568 12.698367 19.91300

Part 6: Conclusion

The results of model_pars (the final primary model, which uses historical scores) suggests the following rough guide to making a popular movie:

Start with a director with good track record of making popular movies.
Make sure the movie runs at least 90 minutes and is rated either PG-13 or R.

Once those two items are handled, the other things that really matter are the genre and the studio. Below a very simple model uses just these variables and achieves an R² of 0.62. Admittedly the pool of movies used to demonstrate this model is small (49 movies) but it suggests that director, genre, and studio drive a great deal of what makes movies popular.

std_movies <- movie_pop_exp %>% filter(mpaa_rating=="R" | mpaa_rating=="PG-13",
                                       runtime>=90)
model_simple <- lm(log_votes ~
                     vote_hist_dir +
                     studio_category +
                     genre,
                   data=std_movies
                     )
summary(model_simple)

## 
## Call:
## lm(formula = log_votes ~ vote_hist_dir + studio_category + genre, 
##     data = std_movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2749 -0.7472  0.0000  0.6881  3.4299 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    13.64743    1.79729   7.593 9.75e-09 ***
## vote_hist_dir                   0.29051    0.09627   3.018  0.00488 ** 
## studio_categoryMedium          -2.42486    1.07642  -2.253  0.03105 *  
## studio_categoryMGM             -0.96154    1.38905  -0.692  0.49364    
## studio_categoryParamount       -1.37392    1.16176  -1.183  0.24541    
## studio_categorySmall           -0.03535    1.12372  -0.031  0.97509    
## studio_categorySony            -2.59932    1.24640  -2.085  0.04484 *  
## studio_categoryUniversal       -0.79320    1.17304  -0.676  0.50363    
## studio_categoryWarner Brothers  0.03355    1.14400   0.029  0.97678    
## genreComedy                     1.00106    1.10993   0.902  0.37364    
## genreDocumentary               -0.36718    1.60013  -0.229  0.81992    
## genreDrama                     -0.74851    0.88306  -0.848  0.40275    
## genreHorror                     0.35366    1.60565   0.220  0.82703    
## genreMystery & Suspense        -2.13799    0.99295  -2.153  0.03871 *  
## genreOther                      3.05407    1.66228   1.837  0.07519 .  
## genreScience Fiction & Fantasy -0.97943    1.97819  -0.495  0.62380    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.368 on 33 degrees of freedom
## Multiple R-squared:  0.6165, Adjusted R-squared:  0.4422 
## F-statistic: 3.537 on 15 and 33 DF,  p-value: 0.001206

Limitations of this report

This report relies on a relatively small number of movies. In particular, the available data support only a very sparse record of historical scores.

Future research

It would be interesting to develop a larger random sample that is explicitly designed to support analysis of director historical performance, since the current study made do with a very thin historical record. At the same time, developing a random sample more focused on contemporary movies would be helpful to firm up forward-looking research. Another future tack could be to acquire a larger but more tightly constrained random sample of movies: Movie shorts, unrated films, and perhaps some other categories of offbeat movies could be excluded from the sample, so that all the movies studied demonstrate a reasonable attempt at achieving popularity via votes and/or box office. Finally, including box office in the data would be a big step forward.

Appendix: Linear model summaries

Primary model (with historical scores): backward elimination

Step 1

Eliminating best_dir_win produces the best result, adjusted R2=0.525

model_1 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating +
                   thtr_rel_month +
                   best_actor_win +
                   best_actress_win,
                 data=movie_pop_exp)
("best_dir_win")

## [1] "best_dir_win"

summary(model_1)$adj.r.squared

## [1] 0.5246087

model_1 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating +
                   thtr_rel_month +
                   best_actor_win +
                   best_dir_win,
                 data=movie_pop_exp)
("best_actress_win")

## [1] "best_actress_win"

summary(model_1)$adj.r.squared

## [1] 0.5224213

model_1 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating +
                   thtr_rel_month +
                   best_actress_win +
                   best_dir_win,
                 data=movie_pop_exp)
("best_actor_win")

## [1] "best_actor_win"

summary(model_1)$adj.r.squared

## [1] 0.520744

model_1 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating +
                   best_actor_win +
                   best_actress_win +
                   best_dir_win,
                 data=movie_pop_exp)
("thtr_rel_month")

## [1] "thtr_rel_month"

summary(model_1)$adj.r.squared

## [1] 0.5242713

model_1 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   length +
                   thtr_rel_month +
                   best_actor_win +
                   best_actress_win +
                   best_dir_win,
                 data=movie_pop_exp)
("mpaa rating")

## [1] "mpaa rating"

summary(model_1)$adj.r.squared

## [1] 0.4588938

model_1 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   mpaa_rating +
                   thtr_rel_month +
                   best_actor_win +
                   best_actress_win +
                   best_dir_win,
                 data=movie_pop_exp)
("length")

## [1] "length"

summary(model_1)$adj.r.squared

## [1] 0.5127774

model_1 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   imdb_rating + 
                   length + 
                   mpaa_rating +
                   thtr_rel_month +
                   best_actor_win +
                   best_actress_win +
                   best_dir_win,
                 data=movie_pop_exp)
("genre")

## [1] "genre"

summary(model_1)$adj.r.squared

## [1] 0.4585008

model_1 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   genre + 
                   length + 
                   mpaa_rating +
                   thtr_rel_month +
                   best_actor_win +
                   best_actress_win +
                   best_dir_win,
                 data=movie_pop_exp)
("imdb rating")

## [1] "imdb rating"

summary(model_1)$adj.r.squared

## [1] 0.4850213

model_1 <- lm(log_votes ~
                   vote_hist_dir +
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating +
                   thtr_rel_month +
                   best_actor_win +
                   best_actress_win +
                   best_dir_win,
                 data=movie_pop_exp)
("cast exp")

## [1] "cast exp"

summary(model_1)$adj.r.squared

## [1] 0.522396

model_1 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating +
                   thtr_rel_month +
                   best_actor_win +
                   best_actress_win +
                   best_dir_win,
                 data=movie_pop_exp)
("studio category")

## [1] "studio category"

summary(model_1)$adj.r.squared

## [1] 0.5096701

model_1 <- lm(log_votes ~
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating +
                   thtr_rel_month +
                   best_actor_win +
                   best_actress_win +
                   best_dir_win,
                 data=movie_pop_exp)
("vote hist dir")

## [1] "vote hist dir"

summary(model_1)$adj.r.squared

## [1] 0.4050974

Step 2

Eliminating thtr_rel_month produces the best result, adjusted R²=0.535

model_2 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating +
                   thtr_rel_month +
                   best_actor_win,
                 data=movie_pop_exp)
("best_actress_win")

## [1] "best_actress_win"

summary(model_2)$adj.r.squared

## [1] 0.5328133

model_2 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating +
                   thtr_rel_month +
                   best_actress_win,
                 data=movie_pop_exp)
("best_actor_win")

## [1] "best_actor_win"

summary(model_2)$adj.r.squared

## [1] 0.5312404

model_2 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating +
                   best_actor_win +
                   best_actress_win,
                 data=movie_pop_exp)
("thtr_rel_month +")

## [1] "thtr_rel_month +"

summary(model_2)$adj.r.squared

## [1] 0.5345116

model_2 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   thtr_rel_month +
                   best_actor_win +
                   best_actress_win,
                 data=movie_pop_exp)
("mpaa_rating +")

## [1] "mpaa_rating +"

summary(model_2)$adj.r.squared

## [1] 0.4588629

model_2 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   mpaa_rating +
                   thtr_rel_month +
                   best_actor_win +
                   best_actress_win,
                 data=movie_pop_exp)
("length + ")

## [1] "length + "

summary(model_2)$adj.r.squared

## [1] 0.5209983

model_2 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   length + 
                   mpaa_rating +
                   thtr_rel_month +
                   best_actor_win +
                   best_actress_win,
                 data=movie_pop_exp)
("genre +")

## [1] "genre +"

summary(model_2)$adj.r.squared

## [1] 0.467094

model_2 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   genre + 
                   length + 
                   mpaa_rating +
                   thtr_rel_month +
                   best_actor_win +
                   best_actress_win,
                 data=movie_pop_exp)
("imdb_rating +")

## [1] "imdb_rating +"

summary(model_2)$adj.r.squared

## [1] 0.4970808

model_2 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating +
                   thtr_rel_month +
                   best_actor_win +
                   best_actress_win,
                 data=movie_pop_exp)
("studio_category +")

## [1] "studio_category +"

summary(model_2)$adj.r.squared

## [1] 0.5068384

model_2 <- lm(log_votes ~
                   vote_hist_dir + 
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating +
                   thtr_rel_month +
                   best_actor_win +
                   best_actress_win,
                 data=movie_pop_exp)
("cast_experience +")

## [1] "cast_experience +"

summary(model_2)$adj.r.squared

## [1] 0.5307835

model_2 <- lm(log_votes ~
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating +
                   thtr_rel_month +
                   best_actor_win +
                   best_actress_win,
                 data=movie_pop_exp)
("vote_hist_dir +")

## [1] "vote_hist_dir +"

summary(model_2)$adj.r.squared

## [1] 0.4116467

Step 3

Eliminating best_actress_win produces the best result, adjusted R²=0.541

model_3 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating +
                   best_actor_win,
                 data=movie_pop_exp)
("best_actress_win,")

## [1] "best_actress_win,"

summary(model_3)$adj.r.squared

## [1] 0.5411207

model_3 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating +
                   best_actress_win,
                 data=movie_pop_exp)
("best_actor_win +")

## [1] "best_actor_win +"

summary(model_3)$adj.r.squared

## [1] 0.5389272

model_3 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   best_actor_win +
                   best_actress_win,
                 data=movie_pop_exp)
("mpaa_rating +")

## [1] "mpaa_rating +"

summary(model_3)$adj.r.squared

## [1] 0.4715956

model_3 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   mpaa_rating +
                   best_actor_win +
                   best_actress_win,
                 data=movie_pop_exp)
("length +")

## [1] "length +"

summary(model_3)$adj.r.squared

## [1] 0.5326567

model_3 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   length + 
                   mpaa_rating +
                   best_actor_win +
                   best_actress_win,
                 data=movie_pop_exp)
("genre +")

## [1] "genre +"

summary(model_3)$adj.r.squared

## [1] 0.4757784

model_3 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   genre + 
                   length + 
                   mpaa_rating +
                   best_actor_win +
                   best_actress_win,
                 data=movie_pop_exp)
("imdb_rating +")

## [1] "imdb_rating +"

summary(model_3)$adj.r.squared

## [1] 0.5020417

model_3 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating +
                   best_actor_win +
                   best_actress_win,
                 data=movie_pop_exp)
("studio_category +")

## [1] "studio_category +"

summary(model_3)$adj.r.squared

## [1] 0.5180172

model_3 <- lm(log_votes ~
                   vote_hist_dir + 
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating +
                   best_actor_win +
                   best_actress_win,
                 data=movie_pop_exp)
("cast_experience +")

## [1] "cast_experience +"

summary(model_3)$adj.r.squared

## [1] 0.5371279

model_3 <- lm(log_votes ~
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating +
                   best_actor_win +
                   best_actress_win,
                 data=movie_pop_exp)
("vote_hist_dir +")

## [1] "vote_hist_dir +"

summary(model_3)$adj.r.squared

## [1] 0.4156992

Step 4

Eliminaring best_actor_win produces the best result, adjusted R²=0.548

model_4 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating,
                 data=movie_pop_exp)
("best_actor_win,")

## [1] "best_actor_win,"

summary(model_4)$adj.r.squared

## [1] 0.5481989

model_4 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   best_actor_win,
                 data=movie_pop_exp)
("mpaa_rating +")

## [1] "mpaa_rating +"

summary(model_4)$adj.r.squared

## [1] 0.474786

model_4 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   mpaa_rating +
                   best_actor_win,
                 data=movie_pop_exp)
("length +")

## [1] "length +"

summary(model_4)$adj.r.squared

## [1] 0.5409035

model_4 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   length + 
                   mpaa_rating +
                   best_actor_win,
                 data=movie_pop_exp)
("genre +")

## [1] "genre +"

summary(model_4)$adj.r.squared

## [1] 0.4831262

model_4 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   genre + 
                   length + 
                   mpaa_rating +
                   best_actor_win,
                 data=movie_pop_exp)
("imdb_rating +")

## [1] "imdb_rating +"

summary(model_4)$adj.r.squared

## [1] 0.5128765

model_4 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating +
                   best_actor_win,
                 data=movie_pop_exp)
("studio_category +")

## [1] "studio_category +"

summary(model_4)$adj.r.squared

## [1] 0.5224998

model_4 <- lm(log_votes ~
                   vote_hist_dir + 
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating +
                   best_actor_win,
                 data=movie_pop_exp)
("cast_experience +")

## [1] "cast_experience +"

summary(model_4)$adj.r.squared

## [1] 0.5454214

model_4 <- lm(log_votes ~
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating +
                   best_actor_win,
                 data=movie_pop_exp)
("vote_hist_dir +")

## [1] "vote_hist_dir +"

summary(model_4)$adj.r.squared

## [1] 0.4272995

Step 5

Eliminating cast_experience produces the best result, adjusted R²=0.553

model_5 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   length,
                 data=movie_pop_exp)
("mpaa_rating,")

## [1] "mpaa_rating,"

summary(model_5)$adj.r.squared

## [1] 0.4727739

model_5 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   mpaa_rating,
                 data=movie_pop_exp)
("length +")

## [1] "length +"

summary(model_5)$adj.r.squared

## [1] 0.5407633

model_5 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   length + 
                   mpaa_rating,
                 data=movie_pop_exp)
("genre +")

## [1] "genre +"

summary(model_5)$adj.r.squared

## [1] 0.4787957

model_5 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   studio_category +
                   genre + 
                   length + 
                   mpaa_rating,
                 data=movie_pop_exp)
("imdb_rating +")

## [1] "imdb_rating +"

summary(model_5)$adj.r.squared

## [1] 0.5236966

model_5 <- lm(log_votes ~
                   vote_hist_dir + 
                   cast_experience +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating,
                 data=movie_pop_exp)
("studio_category +")

## [1] "studio_category +"

summary(model_5)$adj.r.squared

## [1] 0.5098173

model_5 <- lm(log_votes ~
                   vote_hist_dir + 
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating,
                 data=movie_pop_exp)
("cast_experience +")

## [1] "cast_experience +"

summary(model_5)$adj.r.squared

## [1] 0.552903

model_5 <- lm(log_votes ~
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating,
                 data=movie_pop_exp)
("vote_hist_dir + ")

## [1] "vote_hist_dir + "

summary(model_5)$adj.r.squared

## [1] 0.4372893

Step 6

No matter what variable is eliminated, adjusted R² gets worse. The final parsimonious model is the one produced by Step 5.

model_6 <- lm(log_votes ~
                   vote_hist_dir + 
                   studio_category +
                   imdb_rating +
                   genre + 
                   length,
                 data=movie_pop_exp)
("mpaa_rating,")

## [1] "mpaa_rating,"

summary(model_6)$adj.r.squared

## [1] 0.4845038

model_6 <- lm(log_votes ~
                   vote_hist_dir + 
                   studio_category +
                   imdb_rating +
                   genre + 
                   mpaa_rating,
                 data=movie_pop_exp)
("length +")

## [1] "length +"

summary(model_6)$adj.r.squared

## [1] 0.5507098

model_6 <- lm(log_votes ~
                   vote_hist_dir + 
                   studio_category +
                   imdb_rating +
                   length + 
                   mpaa_rating,
                 data=movie_pop_exp)
("genre + ")

## [1] "genre + "

summary(model_6)$adj.r.squared

## [1] 0.4894325

model_6 <- lm(log_votes ~
                   vote_hist_dir + 
                   studio_category +
                   genre + 
                   length + 
                   mpaa_rating,
                 data=movie_pop_exp)
("imdb_rating +")

## [1] "imdb_rating +"

summary(model_6)$adj.r.squared

## [1] 0.5341204

model_6 <- lm(log_votes ~
                   vote_hist_dir + 
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating,
                 data=movie_pop_exp)
("studio_category +")

## [1] "studio_category +"

summary(model_6)$adj.r.squared

## [1] 0.5123088

model_6 <- lm(log_votes ~
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating,
                 data=movie_pop_exp)
("vote_hist_dir + ")

## [1] "vote_hist_dir + "

summary(model_6)$adj.r.squared

## [1] 0.4431179

Secondary model (without historical scores)

The secondary model uses dir_exp (total number of movies with this director) instead of vote_hist_dir. Starting with a full model, we proceed with backward elimination relying on p-values, mainly for variety in methodology and ease of implementation.

Full model

model_full <- lm(log_votes ~
                   dir_exp + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating +
                   thtr_rel_month +
                   best_actor_win +
                   best_actress_win +
                   best_dir_win,
                 data=movie_pop_inexp)
summary(model_full)

## 
## Call:
## lm(formula = log_votes ~ dir_exp + cast_experience + studio_category + 
##     imdb_rating + genre + length + mpaa_rating + thtr_rel_month + 
##     best_actor_win + best_actress_win + best_dir_win, data = movie_pop_inexp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4629 -1.0745  0.0485  0.9389  3.9987 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    10.28406    1.36195   7.551 9.20e-13 ***
## dir_exp                         0.59691    0.39395   1.515 0.131046    
## cast_experienceVeteran          0.20830    0.26079   0.799 0.425230    
## studio_categoryMedium          -0.64807    0.45338  -1.429 0.154196    
## studio_categoryMGM             -0.91844    0.90136  -1.019 0.309264    
## studio_categoryParamount       -0.08764    0.60649  -0.145 0.885224    
## studio_categorySmall           -1.51814    0.45852  -3.311 0.001074 ** 
## studio_categorySony            -0.20925    0.58236  -0.359 0.719678    
## studio_categoryUniversal        0.58232    0.69040   0.843 0.399826    
## studio_categoryWarner Brothers -0.31894    0.54868  -0.581 0.561599    
## imdb_rating                     0.73004    0.12710   5.744 2.82e-08 ***
## genreAnimation                 -0.42979    1.03651  -0.415 0.678767    
## genreArt House & International -2.50054    0.72345  -3.456 0.000648 ***
## genreComedy                    -1.34288    0.44407  -3.024 0.002768 ** 
## genreDocumentary               -4.45626    0.55867  -7.977 6.32e-14 ***
## genreDrama                     -1.71599    0.39794  -4.312 2.37e-05 ***
## genreHorror                    -0.06973    0.68985  -0.101 0.919570    
## genreMusical & Performing Arts -3.06940    0.82511  -3.720 0.000249 ***
## genreMystery & Suspense        -0.95553    0.52204  -1.830 0.068445 .  
## genreOther                     -0.44205    1.07354  -0.412 0.680883    
## genreScience Fiction & Fantasy  0.52895    1.32251   0.400 0.689545    
## lengthLong                      1.18196    0.63033   1.875 0.061998 .  
## lengthShort                    -0.58759    0.29398  -1.999 0.046773 *  
## mpaa_ratingPG                   0.55333    0.99419   0.557 0.578351    
## mpaa_ratingPG-13                1.39112    1.02180   1.361 0.174661    
## mpaa_ratingR                    0.88961    1.00549   0.885 0.377184    
## mpaa_ratingUnrated             -0.43318    0.99682  -0.435 0.664274    
## thtr_rel_month                  0.01995    0.03275   0.609 0.543028    
## best_actor_winyes               0.33917    0.35559   0.954 0.341138    
## best_actress_winyes             0.13245    0.40596   0.326 0.744504    
## best_dir_winyes                 0.29049    0.92506   0.314 0.753781    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.699 on 238 degrees of freedom
## Multiple R-squared:    0.6,  Adjusted R-squared:  0.5496 
## F-statistic:  11.9 on 30 and 238 DF,  p-value: < 2.2e-16

Step one

Remove best dir win

model_1 <- lm(log_votes ~
                   dir_exp + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating +
                   thtr_rel_month +
                   best_actor_win +
                   best_actress_win,
                 data=movie_pop_inexp)
summary(model_1)

## 
## Call:
## lm(formula = log_votes ~ dir_exp + cast_experience + studio_category + 
##     imdb_rating + genre + length + mpaa_rating + thtr_rel_month + 
##     best_actor_win + best_actress_win, data = movie_pop_inexp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4480 -1.0747  0.0477  0.9346  4.0073 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    10.25456    1.35614   7.562 8.52e-13 ***
## dir_exp                         0.60418    0.39252   1.539 0.125074    
## cast_experienceVeteran          0.21043    0.26021   0.809 0.419499    
## studio_categoryMedium          -0.64013    0.45182  -1.417 0.157849    
## studio_categoryMGM             -0.91080    0.89933  -1.013 0.312207    
## studio_categoryParamount       -0.08543    0.60530  -0.141 0.887886    
## studio_categorySmall           -1.52040    0.45760  -3.323 0.001032 ** 
## studio_categorySony            -0.19462    0.57940  -0.336 0.737237    
## studio_categoryUniversal        0.60215    0.68621   0.878 0.381096    
## studio_categoryWarner Brothers -0.32225    0.54755  -0.589 0.556734    
## imdb_rating                     0.73214    0.12669   5.779 2.33e-08 ***
## genreAnimation                 -0.43158    1.03454  -0.417 0.676930    
## genreArt House & International -2.50303    0.72204  -3.467 0.000625 ***
## genreComedy                    -1.34580    0.44314  -3.037 0.002654 ** 
## genreDocumentary               -4.45715    0.55761  -7.993 5.60e-14 ***
## genreDrama                     -1.71718    0.39717  -4.324 2.26e-05 ***
## genreHorror                    -0.07265    0.68849  -0.106 0.916046    
## genreMusical & Performing Arts -3.06954    0.82355  -3.727 0.000242 ***
## genreMystery & Suspense        -0.93475    0.51685  -1.809 0.071779 .  
## genreOther                     -0.44319    1.07151  -0.414 0.679530    
## genreScience Fiction & Fantasy  0.53702    1.31976   0.407 0.684440    
## lengthLong                      1.17074    0.62813   1.864 0.063569 .  
## lengthShort                    -0.58565    0.29336  -1.996 0.047027 *  
## mpaa_ratingPG                   0.54917    0.99223   0.553 0.580457    
## mpaa_ratingPG-13                1.38455    1.01966   1.358 0.175791    
## mpaa_ratingR                    0.89089    1.00358   0.888 0.375588    
## mpaa_ratingUnrated             -0.43145    0.99492  -0.434 0.664934    
## thtr_rel_month                  0.02081    0.03257   0.639 0.523444    
## best_actor_winyes               0.35449    0.35157   1.008 0.314328    
## best_actress_winyes             0.15207    0.40037   0.380 0.704410    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.696 on 239 degrees of freedom
## Multiple R-squared:  0.5998, Adjusted R-squared:  0.5513 
## F-statistic: 12.35 on 29 and 239 DF,  p-value: < 2.2e-16

Step two

Remove best actress win

model_2 <- lm(log_votes ~
                   dir_exp + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating +
                   thtr_rel_month +
                   best_actor_win,
                 data=movie_pop_inexp)
summary(model_2)

## 
## Call:
## lm(formula = log_votes ~ dir_exp + cast_experience + studio_category + 
##     imdb_rating + genre + length + mpaa_rating + thtr_rel_month + 
##     best_actor_win, data = movie_pop_inexp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4536 -1.0221  0.0370  0.9254  3.9767 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    10.26564    1.35341   7.585 7.28e-13 ***
## dir_exp                         0.59433    0.39097   1.520 0.129790    
## cast_experienceVeteran          0.23437    0.25200   0.930 0.353280    
## studio_categoryMedium          -0.64254    0.45097  -1.425 0.155516    
## studio_categoryMGM             -0.91714    0.89757  -1.022 0.307903    
## studio_categoryParamount       -0.08770    0.60419  -0.145 0.884716    
## studio_categorySmall           -1.53096    0.45593  -3.358 0.000913 ***
## studio_categorySony            -0.20906    0.57712  -0.362 0.717491    
## studio_categoryUniversal        0.58712    0.68385   0.859 0.391438    
## studio_categoryWarner Brothers -0.31906    0.54651  -0.584 0.559889    
## imdb_rating                     0.73126    0.12644   5.783 2.27e-08 ***
## genreAnimation                 -0.43254    1.03269  -0.419 0.675704    
## genreArt House & International -2.48030    0.71828  -3.453 0.000655 ***
## genreComedy                    -1.32937    0.44023  -3.020 0.002804 ** 
## genreDocumentary               -4.43803    0.55434  -8.006 5.09e-14 ***
## genreDrama                     -1.69736    0.39303  -4.319 2.30e-05 ***
## genreHorror                    -0.06775    0.68714  -0.099 0.921535    
## genreMusical & Performing Arts -3.06304    0.82191  -3.727 0.000242 ***
## genreMystery & Suspense        -0.90627    0.51047  -1.775 0.077106 .  
## genreOther                     -0.44623    1.06957  -0.417 0.676903    
## genreScience Fiction & Fantasy  0.54008    1.31738   0.410 0.682199    
## lengthLong                      1.18951    0.62507   1.903 0.058235 .  
## lengthShort                    -0.59178    0.29239  -2.024 0.044084 *  
## mpaa_ratingPG                   0.54319    0.99034   0.548 0.583862    
## mpaa_ratingPG-13                1.37682    1.01764   1.353 0.177343    
## mpaa_ratingR                    0.89135    1.00179   0.890 0.374489    
## mpaa_ratingUnrated             -0.43569    0.99308  -0.439 0.661254    
## thtr_rel_month                  0.02095    0.03251   0.644 0.519883    
## best_actor_winyes               0.34697    0.35038   0.990 0.323051    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.693 on 240 degrees of freedom
## Multiple R-squared:  0.5996, Adjusted R-squared:  0.5529 
## F-statistic: 12.83 on 28 and 240 DF,  p-value: < 2.2e-16

Step three

Remove thtr rel month

model_3 <- lm(log_votes ~
                   dir_exp + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating +
                   best_actor_win,
                 data=movie_pop_inexp)
summary(model_3)

## 
## Call:
## lm(formula = log_votes ~ dir_exp + cast_experience + studio_category + 
##     imdb_rating + genre + length + mpaa_rating + best_actor_win, 
##     data = movie_pop_inexp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4790 -1.0082  0.0508  0.9802  3.8935 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    10.32740    1.34837   7.659 4.53e-13 ***
## dir_exp                         0.58990    0.39043   1.511 0.132128    
## cast_experienceVeteran          0.24344    0.25130   0.969 0.333666    
## studio_categoryMedium          -0.66083    0.44953  -1.470 0.142855    
## studio_categoryMGM             -0.88215    0.89484  -0.986 0.325215    
## studio_categoryParamount       -0.07974    0.60333  -0.132 0.894961    
## studio_categorySmall           -1.53442    0.45535  -3.370 0.000876 ***
## studio_categorySony            -0.20724    0.57641  -0.360 0.719511    
## studio_categoryUniversal        0.59553    0.68289   0.872 0.384040    
## studio_categoryWarner Brothers -0.34400    0.54447  -0.632 0.528109    
## imdb_rating                     0.73665    0.12601   5.846 1.63e-08 ***
## genreAnimation                 -0.38443    1.02874  -0.374 0.708960    
## genreArt House & International -2.43359    0.71374  -3.410 0.000762 ***
## genreComedy                    -1.31266    0.43893  -2.991 0.003073 ** 
## genreDocumentary               -4.42577    0.55334  -7.998 5.28e-14 ***
## genreDrama                     -1.67484    0.39100  -4.284 2.66e-05 ***
## genreHorror                    -0.03707    0.68465  -0.054 0.956868    
## genreMusical & Performing Arts -3.06688    0.82089  -3.736 0.000233 ***
## genreMystery & Suspense        -0.91063    0.50981  -1.786 0.075322 .  
## genreOther                     -0.42337    1.06768  -0.397 0.692065    
## genreScience Fiction & Fantasy  0.65116    1.30447   0.499 0.618111    
## lengthLong                      1.18644    0.62429   1.900 0.058565 .  
## lengthShort                    -0.59653    0.29194  -2.043 0.042110 *  
## mpaa_ratingPG                   0.59321    0.98609   0.602 0.548021    
## mpaa_ratingPG-13                1.39207    1.01613   1.370 0.171968    
## mpaa_ratingR                    0.92333    0.99935   0.924 0.356447    
## mpaa_ratingUnrated             -0.41388    0.99130  -0.418 0.676677    
## best_actor_winyes               0.36141    0.34924   1.035 0.301774    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.691 on 241 degrees of freedom
## Multiple R-squared:  0.5989, Adjusted R-squared:  0.5539 
## F-statistic: 13.33 on 27 and 241 DF,  p-value: < 2.2e-16

Step four

Remove best actor win

model_3 <- lm(log_votes ~
                   dir_exp + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating,
                 data=movie_pop_inexp)
summary(model_3)

## 
## Call:
## lm(formula = log_votes ~ dir_exp + cast_experience + studio_category + 
##     imdb_rating + genre + length + mpaa_rating, data = movie_pop_inexp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5158 -1.0173  0.0590  0.9685  3.8182 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     10.3493     1.3484   7.675 4.05e-13 ***
## dir_exp                          0.5945     0.3905   1.523 0.129183    
## cast_experienceVeteran           0.2988     0.2456   1.216 0.224991    
## studio_categoryMedium           -0.6614     0.4496  -1.471 0.142539    
## studio_categoryMGM              -0.9008     0.8948  -1.007 0.315070    
## studio_categoryParamount        -0.1374     0.6008  -0.229 0.819350    
## studio_categorySmall            -1.5138     0.4550  -3.327 0.001014 ** 
## studio_categorySony             -0.1704     0.5754  -0.296 0.767432    
## studio_categoryUniversal         0.6434     0.6814   0.944 0.345996    
## studio_categoryWarner Brothers  -0.3156     0.5439  -0.580 0.562232    
## imdb_rating                      0.7352     0.1260   5.834 1.72e-08 ***
## genreAnimation                  -0.3487     1.0283  -0.339 0.734802    
## genreArt House & International  -2.4741     0.7128  -3.471 0.000614 ***
## genreComedy                     -1.3042     0.4389  -2.971 0.003263 ** 
## genreDocumentary                -4.4226     0.5534  -7.991 5.45e-14 ***
## genreDrama                      -1.6727     0.3911  -4.278 2.72e-05 ***
## genreHorror                     -0.0521     0.6846  -0.076 0.939402    
## genreMusical & Performing Arts  -3.0844     0.8208  -3.758 0.000215 ***
## genreMystery & Suspense         -0.9133     0.5099  -1.791 0.074514 .  
## genreOther                      -0.4517     1.0675  -0.423 0.672587    
## genreScience Fiction & Fantasy   0.6188     1.3043   0.474 0.635643    
## lengthLong                       1.2191     0.6236   1.955 0.051725 .  
## lengthShort                     -0.6063     0.2918  -2.077 0.038815 *  
## mpaa_ratingPG                    0.5761     0.9861   0.584 0.559634    
## mpaa_ratingPG-13                 1.3881     1.0163   1.366 0.173256    
## mpaa_ratingR                     0.9015     0.9993   0.902 0.367864    
## mpaa_ratingUnrated              -0.4289     0.9913  -0.433 0.665664    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.691 on 242 degrees of freedom
## Multiple R-squared:  0.5971, Adjusted R-squared:  0.5538 
## F-statistic: 13.79 on 26 and 242 DF,  p-value: < 2.2e-16

Step five

The adjusted R² got worse in step 4, so the final model is the result of step 3, repeated below:

model_inexp_pars <- lm(log_votes ~
                   dir_exp + 
                   cast_experience +
                   studio_category +
                   imdb_rating +
                   genre + 
                   length + 
                   mpaa_rating +
                   best_actor_win,
                 data=movie_pop_inexp)
summary(model_inexp_pars)

## 
## Call:
## lm(formula = log_votes ~ dir_exp + cast_experience + studio_category + 
##     imdb_rating + genre + length + mpaa_rating + best_actor_win, 
##     data = movie_pop_inexp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4790 -1.0082  0.0508  0.9802  3.8935 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    10.32740    1.34837   7.659 4.53e-13 ***
## dir_exp                         0.58990    0.39043   1.511 0.132128    
## cast_experienceVeteran          0.24344    0.25130   0.969 0.333666    
## studio_categoryMedium          -0.66083    0.44953  -1.470 0.142855    
## studio_categoryMGM             -0.88215    0.89484  -0.986 0.325215    
## studio_categoryParamount       -0.07974    0.60333  -0.132 0.894961    
## studio_categorySmall           -1.53442    0.45535  -3.370 0.000876 ***
## studio_categorySony            -0.20724    0.57641  -0.360 0.719511    
## studio_categoryUniversal        0.59553    0.68289   0.872 0.384040    
## studio_categoryWarner Brothers -0.34400    0.54447  -0.632 0.528109    
## imdb_rating                     0.73665    0.12601   5.846 1.63e-08 ***
## genreAnimation                 -0.38443    1.02874  -0.374 0.708960    
## genreArt House & International -2.43359    0.71374  -3.410 0.000762 ***
## genreComedy                    -1.31266    0.43893  -2.991 0.003073 ** 
## genreDocumentary               -4.42577    0.55334  -7.998 5.28e-14 ***
## genreDrama                     -1.67484    0.39100  -4.284 2.66e-05 ***
## genreHorror                    -0.03707    0.68465  -0.054 0.956868    
## genreMusical & Performing Arts -3.06688    0.82089  -3.736 0.000233 ***
## genreMystery & Suspense        -0.91063    0.50981  -1.786 0.075322 .  
## genreOther                     -0.42337    1.06768  -0.397 0.692065    
## genreScience Fiction & Fantasy  0.65116    1.30447   0.499 0.618111    
## lengthLong                      1.18644    0.62429   1.900 0.058565 .  
## lengthShort                    -0.59653    0.29194  -2.043 0.042110 *  
## mpaa_ratingPG                   0.59321    0.98609   0.602 0.548021    
## mpaa_ratingPG-13                1.39207    1.01613   1.370 0.171968    
## mpaa_ratingR                    0.92333    0.99935   0.924 0.356447    
## mpaa_ratingUnrated             -0.41388    0.99130  -0.418 0.676677    
## best_actor_winyes               0.36141    0.34924   1.035 0.301774    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.691 on 241 degrees of freedom
## Multiple R-squared:  0.5989, Adjusted R-squared:  0.5539 
## F-statistic: 13.33 on 27 and 241 DF,  p-value: < 2.2e-16

Modeling and predicting what makes movies popular