Introduction

For this week, I am using IMDB movie data that I found on Kaggle. I want to see what makes a hit movie (one with high revenue), specifically during an economic downturn, so I look at how a film’s rating, director, year, runtime, and genre affected average film revenue from 2006 to 2016. Please keep in mind that the ‘Great Recession’ began around December 2007 and that recovery began from 2010 onward. The recession period that I look at runs from 2008 through 2011. Source: https://www.bls.gov/opub/mlr/2018/article/great-recession-great-recovery.htm.

I speculate that the film industry does better during a time of poor economic growth than at other times, regardless of ratings. Since film is a form of escapism, I expect to see trends similar to those of the Great Depression of the 1930s and 40s. I also expect the successful film genres and film directors during the recession to differ from those in non-recession years. I therefore expect the recession years to show stronger revenue than the pre- or post-recession years. Per the instructions, I will let the figures speak for themselves.

Loading in packages and data

library(ggplot2)
library(ggthemes)
library(ggrepel)
library(tidyverse)
library(readr)
library(viridis)
library(dplyr)

IMDB <- read_csv("C:/Users/abbys/Downloads/IMDB-Movie-Data.csv")
options(dplyr.show_progress = FALSE)
head(IMDB)
## # A tibble: 6 x 12
##    Rank Title Genre Description Director Actors  Year `Runtime (Minut~
##   <dbl> <chr> <chr> <chr>       <chr>    <chr>  <dbl>            <dbl>
## 1     1 Guar~ Acti~ A group of~ James G~ Chris~  2014              121
## 2     2 Prom~ Adve~ Following ~ Ridley ~ Noomi~  2012              124
## 3     3 Split Horr~ Three girl~ M. Nigh~ James~  2016              117
## 4     4 Sing  Anim~ In a city ~ Christo~ Matth~  2016              108
## 5     5 Suic~ Acti~ A secret g~ David A~ Will ~  2016              123
## 6     6 The ~ Acti~ European m~ Yimou Z~ Matt ~  2016              103
## # ... with 4 more variables: Rating <dbl>, Votes <dbl>, `Revenue
## #   (Millions)` <dbl>, Metascore <dbl>
dim(IMDB)
## [1] 1000   12

Cleaning up the data

imdb1<-IMDB%>%
  select(Year,`Runtime (Minutes)`,Rating,`Revenue (Millions)`,Director,Genre, Title)%>%
  filter(!is.na(Year),
         !is.na(`Runtime (Minutes)`),
         !is.na(Rating),
         !is.na(`Revenue (Millions)`),
         !is.na(Director),
         !is.na(Genre),
         !is.na(Title))%>%
  mutate("Revenue"=`Revenue (Millions)`)

dim(imdb1)
## [1] 872   8
head(imdb1)
## # A tibble: 6 x 8
##    Year `Runtime (Minut~ Rating `Revenue (Milli~ Director Genre Title
##   <dbl>            <dbl>  <dbl>            <dbl> <chr>    <chr> <chr>
## 1  2014              121    8.1            333.  James G~ Acti~ Guar~
## 2  2012              124    7              126.  Ridley ~ Adve~ Prom~
## 3  2016              117    7.3            138.  M. Nigh~ Horr~ Split
## 4  2016              108    7.2            270.  Christo~ Anim~ Sing 
## 5  2016              123    6.2            325.  David A~ Acti~ Suic~
## 6  2016              103    6.1             45.1 Yimou Z~ Acti~ The ~
## # ... with 1 more variable: Revenue <dbl>

Which years had the highest average film revenue from 2006 to 2016?

a<-imdb1%>%
  group_by(Year)%>%
  summarise(Mean_Rev=mean(Revenue))
ggplot(data = a, aes(x = reorder(Year, Mean_Rev), y = Mean_Rev, fill = Mean_Rev)) +
  geom_bar(stat = "identity") +
  scale_fill_viridis(name = "Average Revenue", option = "C") +
  coord_flip() +
  labs(title = 'Average Revenue (In Millions)', subtitle = 'By Year', x = "Year", y = "Average Revenue") +
  geom_text(aes(label = round(Mean_Rev, digits = 2)), color = "yellow", size = 2.75, hjust = -.2) +
  theme_dark()
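
To compare the recession window with the surrounding years directly, one option is to label each film by period and average within each label. The sketch below runs on a small made-up table rather than the IMDB data; the `toy` values and the `Period` and `n_films` columns are assumptions introduced for illustration, and the 2008-2011 window follows the definition in the introduction.

```r
library(dplyr)

# Toy stand-in for imdb1 (values are made up for illustration only)
toy <- tibble(
  Year    = c(2006, 2007, 2008, 2009, 2010, 2011, 2012, 2016),
  Revenue = c(50, 70, 100, 120, 80, 100, 90, 110)
)

# Label each film by period, using the 2008-2011 recession window
period_means <- toy %>%
  mutate(Period = case_when(
    Year < 2008  ~ "Pre-recession",
    Year <= 2011 ~ "Recession",
    TRUE         ~ "Post-recession"
  )) %>%
  group_by(Period) %>%
  summarise(Mean_Rev = mean(Revenue), n_films = n())

period_means
```

Keeping the film count next to each mean helps when the periods contain very different numbers of films, since an average over a handful of titles is easy to over-read.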

Rating vs Average Revenue for the decade:

b<-imdb1%>%
  group_by(Rating)%>%
  summarize(Mean_Rev=mean(Revenue))
gg <- ggplot(b, aes(y = Mean_Rev, x = Rating)) +
  geom_point(color = 'purple') +
  geom_line(color = 'blue') +
  theme_calc() +
  stat_smooth(method = "lm", color = "cyan") +
  labs(title = 'Average Revenue (In Millions) By IMDB Rating Using LM Model')
library(plotly)
ggplotly(gg)
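
The smoothed line in the plot is a straight-line fit, so its slope can also be read off numerically. The following is a minimal sketch on made-up numbers, not the values in `b`; `toy_b` and `fit` are hypothetical names introduced here.

```r
# Toy stand-in for the rating-level averages in `b` (made-up values)
toy_b <- data.frame(
  Rating   = c(5, 6, 7, 8),
  Mean_Rev = c(40, 60, 80, 100)
)

# Fit the same kind of straight-line model that stat_smooth(method = "lm") draws
fit <- lm(Mean_Rev ~ Rating, data = toy_b)

# Slope: estimated change in average revenue (millions) per one-point rating increase
coef(fit)["Rating"]
```

On the real data, `summary(fit)` would additionally report standard errors and R-squared, which indicate how much of the variation in average revenue the rating actually accounts for.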

Year vs Average Rating: Were movies better during the recession than other years?

bb<-imdb1%>%
  group_by(Year)%>%
  summarize(Mean_Rating=mean(Rating))
gg2 <- ggplot(bb, aes(y = Mean_Rating, x = Year)) +
  geom_point(color = 'purple') +
  geom_line(color = 'pink') +
  theme_calc() +
  stat_smooth(method = "lm", color = "orange") +
  labs(title = 'Average IMDB Rating By Year Using LM Model')
library(plotly)
ggplotly(gg2)

Movie length vs Revenue: Which film lengths were the most financially successful during the recession and in the pre/post-recession years?

c<-imdb1%>%
  mutate(Runtime=ifelse(`Runtime (Minutes)`>=66&`Runtime (Minutes)`<=120,"1-2 Hours",
                 ifelse(`Runtime (Minutes)`>120&`Runtime (Minutes)`<=180,"2-3 Hours","3+ Hours"
                               )))%>%
  group_by(Year,Runtime)%>%
  summarize(Mean_Rev=mean(Revenue))%>%
  ungroup()%>%
  mutate(Year=as.integer(Year))
c
## # A tibble: 24 x 3
##     Year Runtime   Mean_Rev
##    <int> <chr>        <dbl>
##  1  2006 1-2 Hours     73.7
##  2  2006 2-3 Hours    109. 
##  3  2007 1-2 Hours     77.8
##  4  2007 2-3 Hours    103. 
##  5  2007 3+ Hours      25.0
##  6  2008 1-2 Hours     80.0
##  7  2008 2-3 Hours    161. 
##  8  2009 1-2 Hours     81.6
##  9  2009 2-3 Hours    167. 
## 10  2010 1-2 Hours     88.8
## # ... with 14 more rows
library(gganimate)
## Warning: package 'gganimate' was built under R version 3.5.3
library(gapminder)
## Warning: package 'gapminder' was built under R version 3.5.3
library(gifski)
## Warning: package 'gifski' was built under R version 3.5.3
# Runtime is binned in hours above, so the x-axis label does not mention minutes
ggplot(c, mapping = aes(x = Runtime, y = Mean_Rev)) +
  geom_col(fill = 'purple') +
  labs(title = 'Average Revenue Per Film Length: 2006-2016',
       subtitle = 'Date: {frame_time}',
       x = 'Film length',
       y = 'Average Revenue') +
  theme_bw() +
  transition_time(Year)
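
In the nested `ifelse()` above, any runtime below 66 minutes would fall through to the "3+ Hours" label. A hedged alternative is base R's `cut()`, which makes every bin explicit. The sketch below uses made-up runtimes; the "Under 1 Hour" bin and the 60-minute lower boundary are assumptions of this example rather than part of the original analysis.

```r
# Made-up runtimes in minutes, for illustration only
runtime_min <- c(60, 66, 120, 121, 180, 181)

# Right-closed bins: (0,60], (60,120], (120,180], (180,Inf]
runtime_bin <- cut(
  runtime_min,
  breaks = c(0, 60, 120, 180, Inf),
  labels = c("Under 1 Hour", "1-2 Hours", "2-3 Hours", "3+ Hours")
)

table(runtime_bin)
```

Because `cut()` returns an ordered set of factor levels, the bars would also appear in length order on a plot without any extra sorting.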

Directors with the highest average film revenue outside the recession (2006-2007, 2012-2016):

decade<-imdb1%>%
  select(Director,Revenue,Year)%>%
  mutate(Year = sjmisc::rec(Year, rec = "2006=2006; 2007=2007; 2012=2012; 2013=2013; 2014=2014; 2015=2015; 2016=2016"))%>%
  filter(!is.na(Year))%>%
  group_by(Director)%>%
  summarize(Mean_Rev=mean(Revenue))%>%
  filter(Mean_Rev>=84.1725)#75th quantile
decade
## # A tibble: 120 x 2
##    Director                    Mean_Rev
##    <chr>                          <dbl>
##  1 Adam McKay                     109. 
##  2 Alan Taylor                    148. 
##  3 Alejandro González Iñárritu     86.8
##  4 Alessandro Carloni             144. 
##  5 Alfonso Cuarón                 155. 
##  6 Andrew Stanton                 280. 
##  7 Angelina Jolie                 116. 
##  8 Anthony Russo                  334. 
##  9 Barry Sonnenfeld                99.3
## 10 Baz Luhrmann                   145. 
## # ... with 110 more rows
ggplot(data = decade, aes(x = (Director), y = Mean_Rev, fill = Mean_Rev)) +
  geom_bar(stat = "identity") +
  scale_fill_viridis(name = "Average Revenue", option = "C") +
  coord_flip() +
  labs(title = 'Top 25% Most Profitable Film Directors on Average', subtitle = ' 2006-2007,2012-2016', x = "Director", y = "Average Revenue") +
  geom_text(aes(label = round(Mean_Rev, digits = 2)), color = "royal blue", size = 5, hjust = -.25) +
  theme_gdocs()
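
The non-recession filter and the 75th-percentile cutoff can also be written without recoding years or typing the threshold by hand. This is a sketch on made-up data, not a reproduction of the figures above: the text does not say which distribution the 84.1725 cutoff came from, so taking the quantile over the director means is an assumption, and `toy`, `non_recession_years`, and `cutoff` are hypothetical names.

```r
library(dplyr)

# Toy stand-in for imdb1 (made-up values)
toy <- tibble(
  Director = c("A", "A", "B", "C", "D", "E"),
  Year     = c(2006, 2009, 2007, 2013, 2015, 2010),
  Revenue  = c(100, 900, 20, 60, 40, 500)
)

non_recession_years <- c(2006, 2007, 2012:2016)

director_means <- toy %>%
  filter(Year %in% non_recession_years) %>%  # drop 2008-2011 rows first
  group_by(Director) %>%
  summarise(Mean_Rev = mean(Revenue))

# Compute the cutoff from the data instead of typing it in
cutoff <- quantile(director_means$Mean_Rev, probs = 0.75)

top_directors <- director_means %>%
  filter(Mean_Rev >= cutoff)

top_directors
```

Deriving the cutoff from the filtered data keeps the threshold in step with the rows actually being summarised if the year window or the cleaning rules change.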

Directors with the highest average revenue during the Great Recession

d<-imdb1%>%
  select(Director,Revenue,Year)%>%
  group_by(Director)%>%
  filter(Year>=2008&Year<2012)%>%
  summarize(Mean_Rev=mean(Revenue))
d
## # A tibble: 179 x 2
##    Director        Mean_Rev
##    <chr>              <dbl>
##  1 Adam McKay       110.   
##  2 Albert Hughes     94.8  
##  3 Alexander Payne   82.6  
##  4 Alexandre Aja     25    
##  5 Allen Coulter     19.1  
##  6 Andrew Jarecki     0.580
##  7 Andrew Niccol     37.6  
##  8 Andrew Stanton   224.   
##  9 Andy Tennant      70.2  
## 10 Anne Fletcher    164.   
## # ... with 169 more rows
library(wordcloud2) 
## Warning: package 'wordcloud2' was built under R version 3.5.3
wordcloud2(d, size=.3)

Top 50% Most Profitable Film Genres Not During the Recession

e<-imdb1%>%
  select(Genre,Year,Revenue)%>%
   mutate(Year = sjmisc::rec(Year, rec = "2006=2006; 2007=2007;2012=2012;2013=2013;2014=2014;2015=2015;2016=2016"))%>%
  filter(!is.na(Year))%>%
  group_by(Genre)%>%
  summarize(MeanRev=mean(Revenue))%>%
  filter(MeanRev>=median(MeanRev))
e
## # A tibble: 86 x 2
##    Genre                     MeanRev
##    <chr>                       <dbl>
##  1 Action                      132. 
##  2 Action,Adventure            224. 
##  3 Action,Adventure,Comedy      99.4
##  4 Action,Adventure,Drama       76.2
##  5 Action,Adventure,Family     113. 
##  6 Action,Adventure,Fantasy    216. 
##  7 Action,Adventure,Horror     202. 
##  8 Action,Adventure,Mystery    150. 
##  9 Action,Adventure,Sci-Fi     215. 
## 10 Action,Adventure,Thriller   157. 
## # ... with 76 more rows
ggplot(data = e, aes(x = reorder(Genre, MeanRev), y = MeanRev, fill = MeanRev)) +
  geom_bar(stat = "identity") +
  scale_fill_viridis(name = "Average Revenue", option = "C") +
  coord_flip() +
  labs(title = 'Top 50% Most Profitable Film Genres on Average', subtitle = ' 2006-2007,2012-2016', x = "Genre", y = "Average Revenue") +
  geom_text(aes(label = round(MeanRev, digits = 2)), color = "royal blue", size = 3, hjust = -.25) +
  theme_gdocs()
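
The `Genre` column holds comma-separated combinations such as "Action,Adventure,Sci-Fi", so the grouping above treats every combination as its own genre. If per-genre averages were wanted instead, one possible approach is to split the combinations into one row per film and genre first. This is a sketch on made-up data, assuming `tidyr` is available; it is an alternative view and not what the chart above shows.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for imdb1 (made-up values)
toy <- tibble(
  Title   = c("Film 1", "Film 2", "Film 3"),
  Genre   = c("Action,Adventure", "Action,Comedy", "Comedy"),
  Revenue = c(200, 100, 50)
)

genre_means <- toy %>%
  separate_rows(Genre, sep = ",") %>%  # one row per film-genre pair
  group_by(Genre) %>%
  summarise(MeanRev = mean(Revenue), n_films = n()) %>%
  arrange(desc(MeanRev))

genre_means
```

Note that a film with several genres then counts once toward each of them, so the per-genre averages are not independent of one another.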

Compare to the Rest of the Decade: Highest Grossing Films During the Recession By Genre

d1<-imdb1%>%
  filter(Year>=2008&Year<=2011)
q75<-quantile(d1$Revenue)
q75
##     0%    25%    50%    75%   100% 
##   0.02  25.00  75.28 143.62 760.51
d2<-d1%>%
  select(Title,Genre,Revenue,Year)%>%
  filter(Year>=2008&Year<=2011)%>%
  group_by(Title,Genre,Year)%>%
  summarize(Mean_Rev=mean(Revenue))%>%
  filter(Mean_Rev>=143.62)
d2
## # A tibble: 55 x 4
## # Groups:   Title, Genre [55]
##    Title                            Genre                     Year Mean_Rev
##    <chr>                            <chr>                    <dbl>    <dbl>
##  1 2012                             Action,Adventure,Sci-Fi   2009     166.
##  2 Alice in Wonderland              Adventure,Family,Fantasy  2010     334.
##  3 Avatar                           Action,Adventure,Fantasy  2009     761.
##  4 Bridesmaids                      Comedy,Romance            2011     169.
##  5 Captain America: The First Aven~ Action,Adventure,Sci-Fi   2011     177.
##  6 Cars 2                           Animation,Adventure,Com~  2011     191.
##  7 Clash of the Titans              Action,Adventure,Fantasy  2010     163.
##  8 Despicable Me                    Animation,Adventure,Com~  2010     252.
##  9 Fast & Furious                   Action,Crime,Thriller     2009     155.
## 10 Fast Five                        Action,Crime,Thriller     2011     210.
## # ... with 45 more rows
library(ggalluvial)
ggplot(data=d2, mapping=aes(axis1 = Title, axis2 = Year, axis3 = Genre)) +
  scale_x_discrete(limits = c("Title", "Year", "Genre"), expand = c(.1, .05)) +
  geom_alluvium(aes(fill = Mean_Rev,colour=Mean_Rev)) +
  geom_stratum() + geom_text(stat = "stratum", label.strata = TRUE) +
  theme_minimal()+labs(title = 'Top 25% of Films By Average Revenue(In Millions),Year, and By Genre During Recession')

Conclusion

From my analysis, I see that the recession years did show revenue trends in the film industry that differed from the pre- and post-recession years in terms of genres, directors, ratings, and runtimes. For a more detailed analysis, I would need a dataset that records films by month as well as by year. It would also be interesting to compare how the film industry performed over the long run, including the Great Depression, using inflation-adjusted revenues.