Proxy for box office = IMDb Votes
The number of people who voted on a movie, just like box office, is extremely right skewed. This makes sense because neither votes nor box office can be less than zero, and both can grow exponentially for blockbuster movies. Once again a log scale provides a useful distribution:
movie_pop %>% ggplot(aes(x=imdb_num_votes)) + geom_histogram(bins=20) + ggtitle("IMDb votes distribution")

# Log(IMDb_num_votes) is roughly normal
movie_pop %>% ggplot(aes(x=log2(imdb_num_votes))) + geom_histogram(bins=20) + ggtitle("Logged IMDb votes distribution")

# Track log of IMDb votes with new variable
movie_pop <- movie_pop %>% mutate(log_votes=log2(imdb_num_votes))
movie_40years <- movie_40years %>% mutate(log_votes=log2(imdb_num_votes))
In the table and chart below it is seen that the log-log correlation of votes and box office is very good (R=0.92) and will provide the sought-after proxy.
Note that without a log-log analysis, there is still a correlation between votes and box office, but it is subject to one extremely influential point which makes the linear model suspect. In the next subsection, other movie scores in the data are evaluated as possible proxies for box office and found to be quite inferior. This report uses log_votes as a proxy for log(box_office).
# Log-log correlation is excellent
cor(movie_pop$log_votes, log2(movie_pop$box_office), use = "complete.obs")
## [1] 0.9229588
movie_pop %>% ggplot(aes(x=log_votes,y=log2(box_office))) + geom_point(size=2) + geom_smooth(alpha=0.3,method="lm") + ggtitle("Log(box office) vs Log(votes)")
## Warning: Removed 323 rows containing non-finite values (stat_smooth).
## Warning: Removed 323 rows containing missing values (geom_point).

movie_pop %>% ggplot(aes(x=imdb_num_votes,y=box_office)) + geom_point(size=2) + geom_smooth(alpha=0.3,method="lm") + ggtitle("Box office vs Votes\nAffected by extremely influential point") + ylab("Box office (000s)")
## Warning: Removed 323 rows containing non-finite values (stat_smooth).
## Warning: Removed 323 rows containing missing values (geom_point).

Possible explanatory variables for the linear model
With log_votes established as the response variable, the other variables in the data are evaluated as possible explanatory variables to include in the linear model.
Movie reviews and scores
The chart below shows that Imdb_rating, critics_score, and audience_score are all highly correlated with each other, but not so much with log_votes. We will consider only imdb_rating for possible inclusion in the linear model, excluding the others because of colinearity.
ggpairs(movie_pop,columns=c(13,16,18,36))

Oscar connections and top-200 box office
The data include best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, and top200_box. These are categorical variables, and ANOVA is used to evaluate their promise as explanatory variables for the linear model. The p-value is especially small for a best picture nomination, indicating excellent potential as an explanatory variable. Potential is also good for including a director or actor with oscar. A best actress oscar seems to have less effect than a best actor oscar.
We will omit best_pic_win from consideration because it is colinear with best_pic_nom. We will omit top200_box because it would make for a self-referential and unhelpful approach to explaining box office. The others are possible candidates for the linear model.
summary(aov(movie_pop$log_votes ~ movie_pop$best_pic_nom))
## Df Sum Sq Mean Sq F value Pr(>F)
## movie_pop$best_pic_nom 1 102.8 102.83 16.62 5.73e-05 ***
## Residuals 331 2048.1 6.19
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(movie_pop$log_votes ~ movie_pop$best_pic_win))
## Df Sum Sq Mean Sq F value Pr(>F)
## movie_pop$best_pic_win 1 37.8 37.80 5.921 0.0155 *
## Residuals 331 2113.1 6.38
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(movie_pop$log_votes ~ movie_pop$best_actor_win))
## Df Sum Sq Mean Sq F value Pr(>F)
## movie_pop$best_actor_win 1 83.1 83.09 13.3 0.000308 ***
## Residuals 331 2067.8 6.25
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(movie_pop$log_votes ~ movie_pop$best_actress_win))
## Df Sum Sq Mean Sq F value Pr(>F)
## movie_pop$best_actress_win 1 40.9 40.95 6.423 0.0117 *
## Residuals 331 2110.0 6.37
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(movie_pop$log_votes ~ movie_pop$best_dir_win))
## Df Sum Sq Mean Sq F value Pr(>F)
## movie_pop$best_dir_win 1 73.1 73.12 11.65 0.000722 ***
## Residuals 331 2077.8 6.28
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(movie_pop$log_votes ~ movie_pop$top200_box))
## Df Sum Sq Mean Sq F value Pr(>F)
## movie_pop$top200_box 1 77.9 77.91 12.44 0.00048 ***
## Residuals 331 2073.0 6.26
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Genre, rating, and month of release
The data include genre, mpaa_rating, and thtr_rel_month (converted from number to factor). Again, these are categorical variables, and ANOVA is used to evaluate their promise as explanatory variables. We include all as candidates for the linear model, even though the month receives a weak (large) p-value.
summary(aov(movie_pop$log_votes ~ movie_pop$genre))
## Df Sum Sq Mean Sq F value Pr(>F)
## movie_pop$genre 10 750.6 75.06 17.26 <2e-16 ***
## Residuals 322 1400.3 4.35
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(movie_pop$log_votes ~ movie_pop$mpaa_rating))
## Df Sum Sq Mean Sq F value Pr(>F)
## movie_pop$mpaa_rating 4 598.5 149.61 31.61 <2e-16 ***
## Residuals 328 1552.5 4.73
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(movie_pop$log_votes ~ movie_pop$thtr_rel_month))
## Df Sum Sq Mean Sq F value Pr(>F)
## movie_pop$thtr_rel_month 1 16.7 16.747 2.597 0.108
## Residuals 331 2134.2 6.448
# ggpairs(movie_pop,columns=c(3,4,5,36))
Studio activity
The data include the studio for each movie, tidied up in the variable studio_cleaned. Below is a summary of the log_votes for recent films from the most active studios.
movie_pop %>% group_by(studio_cleaned) %>% filter(n()>7) %>% ggplot(aes(x=studio_cleaned,y=log_votes,group=studio_cleaned)) + geom_boxplot(varwidth = TRUE) + coord_flip() + ggtitle("Log_votes for active studios")

Notice above how the varwidth option adjusts the width of each boxplot according to the number of movies it represents. It appears that the more active studios typically produce movies with more log_votes. A new variable studio_category is therefore created to capture studio activity, with three ranges, small, medium, and large, defined based on the full 40-year dataset. In the calculation below, each “large” studio gets a studio_category to itself, while other studios get studio_category of “medium” or “small.”"
Actor and director activity
Along with studio activity, the historical activity of directors and actors will also be calculated and considered as explanatory for movie popularity. The calculation begins with a transformed “tall” version of the data, where each row records the participation of a specific individual or organization in a specific film. The variable entity names the individual/organization, and the variable role names their mode of participation as “actor”, “director”, or “studio”.
# Tall list of actors x movies
movies_tall <- movie_40years %>%
gather(actor1,actor2,actor3,actor4,actor5,key=role,value=actor,na.rm=TRUE) %>%
select(title, thtr_rel_year, log_votes,
critics_score, audience_score, imdb_rating,
actor) %>%
mutate(role="actor")
names(movies_tall)[names(movies_tall)=="actor"] <- "entity"
# Add directors
dir_movies <- movie_40years %>%
select(title, thtr_rel_year, log_votes,
critics_score, audience_score, imdb_rating,
director) %>%
filter(!is.na(director)) %>%
mutate(role="director")
names(dir_movies)[names(dir_movies)=="director"] <- "entity"
movies_tall <- rbind(movies_tall,dir_movies)
# Add studios
studio_movies <- movie_40years %>%
filter(!is.na(studio_cleaned)) %>%
mutate(role="studio") %>%
select(title, thtr_rel_year, log_votes,
critics_score, audience_score, imdb_rating,
studio_cleaned, role)
names(studio_movies)[names(studio_movies)=="studio_cleaned"] <- "entity"
movies_tall <- rbind(movies_tall,studio_movies)
With the tall form of data it is then easy to count how many times a person has acted or directed, or how many times a studio has released a movie. These statistics are then joined to the original “wide” data frame movie_pop:
# Count # films for each actor
actor_filmcount <- movies_tall %>%
filter(role=="actor") %>%
select(entity) %>%
group_by(entity) %>%
summarise(n=n())
# Join # films (in our dataset) for each cast member
colnames(actor_filmcount) <- c("actor1","actor1_exp")
movie_pop <- left_join(movie_pop, actor_filmcount, by="actor1")
colnames(actor_filmcount) <- c("actor2","actor2_exp")
movie_pop <- left_join(movie_pop, actor_filmcount, by="actor2")
colnames(actor_filmcount) <- c("actor3","actor3_exp")
movie_pop <- left_join(movie_pop, actor_filmcount, by="actor3")
colnames(actor_filmcount) <- c("actor4","actor4_exp")
movie_pop <- left_join(movie_pop, actor_filmcount, by="actor4")
colnames(actor_filmcount) <- c("actor5","actor5_exp")
movie_pop <- left_join(movie_pop, actor_filmcount, by="actor5")
# Add total cast experience
movie_pop <- movie_pop %>% group_by(title) %>%
mutate(cast_exp=sum(c(actor1_exp,
actor2_exp,
actor3_exp,
actor4_exp,
actor5_exp),
na.rm=TRUE))
# Count director experience
director_films <- movies_tall %>%
filter(role=="director") %>%
select(entity) %>%
group_by(entity) %>%
summarise(n=n())
colnames(director_films) <- c("director","dir_exp")
movie_pop <- left_join(movie_pop, director_films, by="director")
# Count studio experience
studio_films <- movies_tall %>%
filter(role=="studio") %>%
select(entity) %>%
group_by(entity) %>%
summarise(n=n())
At the same time, studio_category is calculated and joined to movie_pop:
colnames(studio_films) <- c("studio_cleaned","studio_cleaned_n")
movie_pop <-
left_join(movie_pop,studio_films,by="studio_cleaned") %>%
mutate(studio_category=ifelse(studio_cleaned_n>25,
studio_cleaned,
ifelse(studio_cleaned_n>7,
"Medium",
"Small")))
## Warning: Column `studio_cleaned` joining factor and character vector,
## coercing into character vector
The above calculations produce these values to be considered as possible explanatory variables:
-
cast_exp: Let actori_exp be the total number of movies that include actor i in their cast. Then cast_exp is the sum of all actori_exp (i=1,2,3,4,5)
-
dir_exp: total number of movies with this director
-
studio_cleaned_n: total number of movies with this studio
-
studio_category: If studio in >25 movies then that studio is large and gets its own name as a category. If studio in 8-24 movies then “medium”, otherwise “small”.
The one new categorical variable is studio_category which is evaluated using ANOVA below. The p-value is very small, indicating strong potential as an explanatory variable:
summary(aov(movie_pop$log_votes ~ movie_pop$studio_category))
## Df Sum Sq Mean Sq F value Pr(>F)
## movie_pop$studio_category 7 469.8 67.12 12.97 9.93e-15 ***
## Residuals 325 1681.1 5.17
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The other new variables are numeric and are evaluated below, along with runtime and thtr_rel_year, which are the only remaining numeric variable not yet evaluated as possible explanatory elements of the model.
ggpairs(movie_pop,columns=c(4,7,42,43,44,36))

Categorical versions of cast experience and runtime
Eyeballing (in the gallery above) the scatterplot of log_votes vs cast_exp, it appears less a steady linear relationship and more a step function from rookie cast to veteran cast. Similarly, the scatterplot of log_votes vs runtime suggests three categories: short, average, and long. New categorical variables are created reflecting these observations. Their explanatory promise is tested with ANOVA with positive results.
movie_pop <- movie_pop %>%
mutate(length=ifelse(runtime>=135,"Long",
ifelse(runtime>=90,"Avg","Short")),
cast_experience=ifelse(cast_exp>=8,"Veteran","Rookie"))
summary(aov(movie_pop$log_votes ~ movie_pop$length))
## Df Sum Sq Mean Sq F value Pr(>F)
## movie_pop$length 2 254.6 127.28 22.15 9.42e-10 ***
## Residuals 330 1896.4 5.75
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(movie_pop$log_votes ~ movie_pop$cast_experience))
## Df Sum Sq Mean Sq F value Pr(>F)
## movie_pop$cast_experience 1 357.4 357.4 65.96 9.21e-15 ***
## Residuals 331 1793.5 5.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Historical scores
Historical scores measure the average performance (via log_votes, critics_score, audience_score, or imdb_rating) for an individual or organization in previous movies. This is probably one of the most obvious real-life measurements a producer would consider when staffing/casting: Who is available today with a good track record of previous work in popular movies?
Historical scores are more complicated to compute than the numbers above, because each individual’s historical average changes each time they are in another movie. The calculation below uses a for-loop to calculate average scores for each individual or organization at each moment preceding a specific film they are in.
The calculation estimates historical scores based on what is included in the movies data. This is obviously missing most of real-life history for whatever individual we want to analyze. We are confident proceeding this way because the whatever historical data does turn up will be representative, due to the random sampling used to generate the original data.
Historical scores, when they are available, will prove to be powerful explanatory variables. However, historical scores are unfortunately not available in the majority of cases. As a result there will be two distinct linear models: one model with historical scores for when the associated individuals have a supporting history, and another model without historical scores for when the individuals do not have a supporting history.
Looking to the code that follows, the four functions below use the movies_tall table to calculate historical scores for a specified entity-name in a specific role (e.g., actor, director, studio) for a specific movie in a specific year.
# Function to calc entity's avg votes prior to this film
a_vote_hist <- function(this_entity,this_role,this_film,this_year) {
return(movies_tall %>%
filter(entity==this_entity,
role==this_role,
thtr_rel_year<=this_year,
title!=this_film) %>%
summarise(avg_votes=mean(log_votes)) %>%
as.numeric())
}
# function to calc entity's critic score prior to this film
a_critics_hist <- function(this_entity,this_role,this_film,this_year) {
return(movies_tall %>%
filter(entity==this_entity,
role==this_role,
thtr_rel_year<=this_year,
title!=this_film) %>%
summarise(avg_critics_score=mean(critics_score)) %>%
as.numeric())
}
# Function to calc entity's audience score prior to this film
a_aud_hist <- function(this_entity,this_role,this_film,this_year) {
return(movies_tall %>%
filter(entity==this_entity,
role==this_role,
thtr_rel_year<=this_year,
title!=this_film) %>%
summarise(avg_aud_score=mean(audience_score)) %>%
as.numeric())
}
# Function to calc entity's imdb score prior to this film
a_imdb_hist <- function(this_entity,this_role,this_film,this_year) {
return(movies_tall %>%
filter(entity==this_entity,
role==this_role,
thtr_rel_year<=this_year,
title!=this_film) %>%
summarise(avg_imdb_rating
=mean(imdb_rating)) %>%
as.numeric())
}
This for loop below calculates a comprehensive set of historical scores.
# Iterate over all entity x film pairs and calc stats for entity's prior films
# Save our lengthy calculation in CSV
calculate_entity_movie_rolling_stats <- function() {
movies_tall <- movies_tall %>%
mutate(prior_films=0,
critic_hist=NA,
aud_hist=NA,
imdb_hist=NA,
vote_hist=NA)
for (i in 1:nrow(movies_tall)) {
row <- movies_tall[i,]
this_entity <- row$entity
this_rol <- row$role
this_film <- row$title
this_year <- row$thtr_rel_year
movies_tall[i,]$prior_films <-
nrow(movies_tall %>%
filter(entity==this_entity,
role==this_rol,
title!=this_film,
thtr_rel_year<=this_year))
movies_tall[i,]$critic_hist <- a_critics_hist(this_entity,this_rol,this_film,this_year)
movies_tall[i,]$aud_hist <- a_aud_hist(this_entity,this_rol,this_film,this_year)
movies_tall[i,]$imdb_hist <- a_imdb_hist(this_entity,this_rol,this_film,this_year)
movies_tall[i,]$vote_hist <- a_vote_hist(this_entity,this_rol,this_film,this_year)
}
write.csv(movies_tall %>% arrange(role,entity,thtr_rel_year),file="entity_movies.csv")
}
The above calculation takes a long time. The results are stored in a CSV file above, and then below they are read back in. To save knitting time it’s usually fine to comment out the calculation (represented by the one-line function call below) and re-use the existing CSV file.
# Actor movies has history of previous stats
# calculate_entity_movie_rolling_stats()
The code below joins the results of the above calculation (historical scores) with the “wide” movie_pop data frame.
movies_tall <- read.csv("entity_movies.csv") %>%
select(entity,role,title,prior_films,critic_hist,aud_hist,imdb_hist,vote_hist)
# Join historical info for each cast member actor1, actor2, actor3, actor4, actor5
colnames(movies_tall) <-
c("actor1","role","title","prior_films1","critic_hist1","aud_hist1","imdb_hist1","vote_hist1")
movie_pop <-
left_join(movie_pop,
movies_tall %>% filter(role=="actor") %>% select(-role),
by=c("title","actor1"))
## Warning: Column `title` joining character vector and factor, coercing into
## character vector
## Warning: Column `actor1` joining character vector and factor, coercing into
## character vector
colnames(movies_tall) <-
c("actor2","role","title","prior_films2","critic_hist2","aud_hist2","imdb_hist2","vote_hist2")
movie_pop <-
left_join(movie_pop,
movies_tall %>% filter(role=="actor") %>% select(-role),
by=c("title","actor2"))
## Warning: Column `title` joining character vector and factor, coercing into
## character vector
## Warning: Column `actor2` joining character vector and factor, coercing into
## character vector
colnames(movies_tall) <-
c("actor3","role","title","prior_films3","critic_hist3","aud_hist3","imdb_hist3","vote_hist3")
movie_pop <-
left_join(movie_pop,
movies_tall %>% filter(role=="actor") %>% select(-role),
by=c("title","actor3"))
## Warning: Column `title` joining character vector and factor, coercing into
## character vector
## Warning: Column `actor3` joining character vector and factor, coercing into
## character vector
colnames(movies_tall) <-
c("actor4","role","title","prior_films4","critic_hist4","aud_hist4","imdb_hist4","vote_hist4")
movie_pop <-
left_join(movie_pop,
movies_tall %>% filter(role=="actor") %>% select(-role),
by=c("title","actor4"))
## Warning: Column `title` joining character vector and factor, coercing into
## character vector
## Warning: Column `actor4` joining character vector and factor, coercing into
## character vector
colnames(movies_tall) <-
c("actor5","role","title","prior_films5","critic_hist5","aud_hist5","imdb_hist5","vote_hist5")
movie_pop <-
left_join(movie_pop,
movies_tall %>% filter(role=="actor") %>% select(-role),
by=c("title","actor5"))
## Warning: Column `title` joining character vector and factor, coercing into
## character vector
## Warning: Column `actor5` joining character vector and factor, coercing into
## character vector
# Add director and studio also
colnames(movies_tall) <-
c("director","role","title","prior_films_dir","critic_hist_dir","aud_hist_dir","imdb_hist_dir","vote_hist_dir")
movie_pop <-
left_join(movie_pop,
movies_tall %>% filter(role=="director") %>% select(-role),
by=c("title","director"))
## Warning: Column `title` joining character vector and factor, coercing into
## character vector
## Warning: Column `director` joining character vector and factor, coercing
## into character vector
colnames(movies_tall) <-
c("studio_cleaned","role","title","prior_films_studio","critic_hist_studio","aud_hist_studio","imdb_hist_studio","vote_hist_studio")
movie_pop <-
left_join(movie_pop,
movies_tall %>% filter(role=="studio") %>% select(-role),
by=c("title","studio_cleaned"))
## Warning: Column `title` joining character vector and factor, coercing into
## character vector
## Warning: Column `studio_cleaned` joining character vector and factor,
## coercing into character vector
colnames(movies_tall) <-
c("entity","role","title","prior_films","critic_hist","aud_hist","imdb_hist","vote_hist")
Finally for each movie we summarise the historical scores of its entire cast, which is done below:
movie_pop <- movie_pop %>% ungroup() %>% group_by(title) %>%
mutate(
cast_prior_films=sum(c(prior_films1,
prior_films2,
prior_films3,
prior_films4,
prior_films5),
na.rm=TRUE),
cast_critic_hist=mean(c(critic_hist1,
critic_hist2,
critic_hist3,
critic_hist4,
critic_hist5),
na.rm=TRUE),
cast_aud_hist=mean(c(aud_hist1,
aud_hist2,
aud_hist3,
aud_hist4,
aud_hist5),
na.rm=TRUE),
cast_imdb_hist=mean(c(imdb_hist1,
imdb_hist2,
imdb_hist3,
imdb_hist4,
imdb_hist5),
na.rm=TRUE),
cast_vote_hist=ifelse(cast_prior_films>0,
max(c(vote_hist1,
vote_hist2,
vote_hist3,
vote_hist4,
vote_hist5),
na.rm=TRUE),
NA))
Evaluation of historical scores as explanatory variables
For most movies, their actors and director have no known work history in the data. An alternative linear model will be developed for those cases. But first, the movies that do have director histories are put in movie_pop_exp for developing our primary linear model. Movies without a director history go into movie_pop_inexp for the secondary linear model.
movie_pop_exp <- movie_pop %>% filter(prior_films_dir>0)
movie_pop_inexp <- anti_join(movie_pop, movie_pop_exp, by="title")
Of all the historical scores calculated above, the director’s vote history has by far the best correlation with the movie’s log_votes. See the gallery of charts below. The correlation factor is 0.55. No other historical score has a correlation above 0.40. The charts for the other historical scores (related to actors and studios) are omitted for brevity. We will include this one historical score, vote_hist_dir, and ignore the others in order to minimize the dependence of the linear model on the thin historical record.
ggpairs(movie_pop_exp,columns=c(73,74,75,76,77,36))
