For this week, I am using IMDB movie data that I found on Kaggle. I want to see what makes a hit movie (one with high revenue), specifically during an economic downturn, so I look at how a film's rating, director, year, runtime, and genre affected average film revenue from 2006 to 2016. Please keep in mind that the Great Recession began around December 2007 and that recovery began from 2010 onward. The recession period that I look at runs from 2008 through 2011. Source: https://www.bls.gov/opub/mlr/2018/article/great-recession-great-recovery.htm
I speculate that the film industry does better during a period of poor economic growth than at other times, regardless of ratings. Since film is a form of escapism, I expect to see trends similar to those of the Great Depression of the 1930s. I also expect the successful film genres and directors during the recession to differ from those in non-recession years. In short, I expect average revenue to be higher during the recession than before or after it. Per the instructions, I will let the figures speak for themselves.
library(ggplot2)
library(ggthemes)
library(ggrepel)
library(tidyverse)
library(readr)
library(viridis)
library(dplyr)
IMDB <- read_csv("C:/Users/abbys/Downloads/IMDB-Movie-Data.csv")
options(dplyr.show_progress = FALSE)
head(IMDB)
## # A tibble: 6 x 12
## Rank Title Genre Description Director Actors Year `Runtime (Minut~
## <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 1 Guar~ Acti~ A group of~ James G~ Chris~ 2014 121
## 2 2 Prom~ Adve~ Following ~ Ridley ~ Noomi~ 2012 124
## 3 3 Split Horr~ Three girl~ M. Nigh~ James~ 2016 117
## 4 4 Sing Anim~ In a city ~ Christo~ Matth~ 2016 108
## 5 5 Suic~ Acti~ A secret g~ David A~ Will ~ 2016 123
## 6 6 The ~ Acti~ European m~ Yimou Z~ Matt ~ 2016 103
## # ... with 4 more variables: Rating <dbl>, Votes <dbl>, `Revenue
## # (Millions)` <dbl>, Metascore <dbl>
dim(IMDB)
## [1] 1000 12
imdb1<-IMDB%>%
select(Year,`Runtime (Minutes)`,Rating,`Revenue (Millions)`,Director,Genre, Title)%>%
filter(!is.na(Year),
!is.na(`Runtime (Minutes)`),
!is.na(Rating),
!is.na(`Revenue (Millions)`),
!is.na(Director),
!is.na(Genre),
!is.na(Title))%>%
mutate("Revenue"=`Revenue (Millions)`)
dim(imdb1)
## [1] 872 8
head(imdb1)
## # A tibble: 6 x 8
## Year `Runtime (Minut~ Rating `Revenue (Milli~ Director Genre Title
## <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 2014 121 8.1 333. James G~ Acti~ Guar~
## 2 2012 124 7 126. Ridley ~ Adve~ Prom~
## 3 2016 117 7.3 138. M. Nigh~ Horr~ Split
## 4 2016 108 7.2 270. Christo~ Anim~ Sing
## 5 2016 123 6.2 325. David A~ Acti~ Suic~
## 6 2016 103 6.1 45.1 Yimou Z~ Acti~ The ~
## # ... with 1 more variable: Revenue <dbl>
a<-imdb1%>%
group_by(Year)%>%
summarise(Mean_Rev=mean(Revenue))
ggplot(data=a,aes(x=reorder(Year,Mean_Rev),y=Mean_Rev,fill=Mean_Rev))+geom_bar(stat = "identity")+scale_fill_viridis(name = "Average Revenue", option = "C")+coord_flip()+labs(title = 'Average Revenue(In Millions)',subtitle='By Year',x="Year",y="Average Revenue")+geom_text(aes(label=round(Mean_Rev,digits=2)), color="yellow", size=2.75,hjust=-.2)+theme_dark()
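The chart above ranks individual years, while the hypothesis is about periods. One way to compare periods directly is to label each year and summarise by label. This is only a minimal sketch: `recession_period()` and `by_period` are hypothetical names that are not part of the original analysis, the 2008-2011 window is the one defined in the introduction, and the `demo` tibble holds made-up placeholder values standing in for `imdb1`.

```r
library(dplyr)

# Hypothetical helper: label a year using the 2008-2011 recession window.
recession_period <- function(year) {
  case_when(
    is.na(year)  ~ NA_character_,
    year < 2008  ~ "Pre-recession",
    year <= 2011 ~ "Recession",
    TRUE         ~ "Post-recession"
  )
}

# Made-up placeholder rows standing in for imdb1 (Year and Revenue only).
demo <- tibble(
  Year    = c(2006, 2007, 2009, 2010, 2013, 2016),
  Revenue = c(10, 30, 50, 70, 20, 40)
)

# One row per period with the film count, mean revenue, and median revenue.
by_period <- demo %>%
  mutate(Period = recession_period(Year)) %>%
  group_by(Period) %>%
  summarise(Films = n(), Mean_Rev = mean(Revenue), Median_Rev = median(Revenue))

by_period
```

Swapping `demo` for `imdb1` would apply the same summary to the real data. Reporting the median next to the mean is a deliberate choice here, since a few very large releases can pull a period's mean upward.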
b<-imdb1%>%
group_by(Rating)%>%
summarize(Mean_Rev=mean(Revenue))
gg<-ggplot(b, aes(y = Mean_Rev, x = Rating) ) + geom_point(color='purple') + geom_line(color='blue') +theme_calc() +stat_smooth(method = "lm",color="cyan")+labs(title = 'Average Revenue(In Millions) By IMDB Ratings Using LM Model')
library(plotly)
ggplotly(gg)
### Year vs Average Rating: Were movies better during the recession than in other years?
bb<-imdb1%>%
group_by(Year)%>%
summarize(Mean_Rating=mean(Rating))
gg2<-ggplot(bb, aes(y = Mean_Rating, x = Year) ) + geom_point(color='purple') + geom_line(color='pink') +theme_calc() +stat_smooth(method = "lm",color="orange")+labs(title = 'Average IMDB Rating By Year Using LM Model')
library(plotly)
ggplotly(gg2)
### Movie Length vs Revenue: Which film lengths were the most financially successful during the recession and before and after it?
c<-imdb1%>%
mutate(Runtime=case_when( # check the longest bucket first so short films cannot fall through to "3+ Hours"
`Runtime (Minutes)`>180~"3+ Hours",
`Runtime (Minutes)`>120~"2-3 Hours",
`Runtime (Minutes)`>=60~"1-2 Hours",
TRUE~"Under 1 Hour"))%>%
group_by(Year,Runtime)%>%
summarize(Mean_Rev=mean(Revenue))%>%
ungroup()%>%
mutate(Year=as.integer(Year))
c
## # A tibble: 24 x 3
## Year Runtime Mean_Rev
## <int> <chr> <dbl>
## 1 2006 1-2 Hours 73.7
## 2 2006 2-3 Hours 109.
## 3 2007 1-2 Hours 77.8
## 4 2007 2-3 Hours 103.
## 5 2007 3+ Hours 25.0
## 6 2008 1-2 Hours 80.0
## 7 2008 2-3 Hours 161.
## 8 2009 1-2 Hours 81.6
## 9 2009 2-3 Hours 167.
## 10 2010 1-2 Hours 88.8
## # ... with 14 more rows
library(gganimate)
library(gapminder)
library(gifski)
ggplot(c, mapping=aes(x = Runtime,y = Mean_Rev))+ geom_col(fill='purple')+ labs(title = 'Average Revenue Per Film Length: 2006-2016',
subtitle='Year: {frame_time}',
x = 'Film length',
y = 'Average Revenue')+theme_bw()+ transition_time(Year)
### Directors with the highest average film revenue outside the recession (2006-2007, 2012-2016)
decade<-imdb1%>%
select(Director,Revenue,Year)%>%
filter(Year %in% c(2006,2007,2012:2016))%>% # keep only the non-recession years
group_by(Director)%>%
summarize(Mean_Rev=mean(Revenue))%>%
filter(Mean_Rev>=84.1725)#75th quantile
decade
## # A tibble: 120 x 2
## Director Mean_Rev
## <chr> <dbl>
## 1 Adam McKay 109.
## 2 Alan Taylor 148.
## 3 Alejandro González Iñárritu 86.8
## 4 Alessandro Carloni 144.
## 5 Alfonso Cuarón 155.
## 6 Andrew Stanton 280.
## 7 Angelina Jolie 116.
## 8 Anthony Russo 334.
## 9 Barry Sonnenfeld 99.3
## 10 Baz Luhrmann 145.
## # ... with 110 more rows
ggplot(data=decade,aes(x=(Director),y=Mean_Rev,fill=Mean_Rev))+geom_bar(stat = "identity")+scale_fill_viridis(name = "Average Revenue", option = "C")+coord_flip()+labs(title = 'Top 25% Most Profitable Film Directors on Average',subtitle=' 2006-2007,2012-2016',x="Director",y="Average Revenue")+geom_text(aes(label=round(Mean_Rev,digits=2)), color="royal blue", size=5,hjust=-.25)+theme_gdocs()
d<-imdb1%>%
select(Director,Revenue,Year)%>%
group_by(Director)%>%
filter(Year>=2008&Year<2012)%>%
summarize(Mean_Rev=mean(Revenue))
d
## # A tibble: 179 x 2
## Director Mean_Rev
## <chr> <dbl>
## 1 Adam McKay 110.
## 2 Albert Hughes 94.8
## 3 Alexander Payne 82.6
## 4 Alexandre Aja 25
## 5 Allen Coulter 19.1
## 6 Andrew Jarecki 0.580
## 7 Andrew Niccol 37.6
## 8 Andrew Stanton 224.
## 9 Andy Tennant 70.2
## 10 Anne Fletcher 164.
## # ... with 169 more rows
library(wordcloud2)
wordcloud2(d, size=.3)
### Top 50% Most Profitable Film Genres Outside the Recession
e<-imdb1%>%
select(Genre,Year,Revenue)%>%
filter(Year %in% c(2006,2007,2012:2016))%>% # keep only the non-recession years
group_by(Genre)%>%
summarize(MeanRev=mean(Revenue))%>%
filter(MeanRev>=median(MeanRev))
e
## # A tibble: 86 x 2
## Genre MeanRev
## <chr> <dbl>
## 1 Action 132.
## 2 Action,Adventure 224.
## 3 Action,Adventure,Comedy 99.4
## 4 Action,Adventure,Drama 76.2
## 5 Action,Adventure,Family 113.
## 6 Action,Adventure,Fantasy 216.
## 7 Action,Adventure,Horror 202.
## 8 Action,Adventure,Mystery 150.
## 9 Action,Adventure,Sci-Fi 215.
## 10 Action,Adventure,Thriller 157.
## # ... with 76 more rows
ggplot(data=e,aes(x=reorder(Genre,MeanRev),y=MeanRev,fill=MeanRev))+geom_bar(stat = "identity")+scale_fill_viridis(name = "Average Revenue", option = "C")+coord_flip()+labs(title = 'Top 50% Most Profitable Film Genres on Average',subtitle=' 2006-2007,2012-2016',x="Genre",y="Average Revenue")+geom_text(aes(label=round(MeanRev,digits=2)), color="royal blue", size=3,hjust=-.25)+theme_gdocs()
d1<-imdb1%>%
filter(Year>=2008&Year<=2011)
q75<-quantile(d1$Revenue)
q75
## 0% 25% 50% 75% 100%
## 0.02 25.00 75.28 143.62 760.51
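The 75% value printed above can also be pulled out programmatically rather than copied by hand. A minimal sketch, assuming the goal is simply to keep rows at or above the 75th percentile: `cutoff` and `top_quartile` are hypothetical names, and the `demo` tibble reuses the five summary values printed above purely as stand-in revenues, not as real film records.

```r
library(dplyr)

# Placeholder revenues (in millions) standing in for d1$Revenue.
demo <- tibble(
  Title   = c("A", "B", "C", "D", "E"),
  Revenue = c(0.02, 25, 75.28, 143.62, 760.51)
)

# probs = 0.75 returns only the 75th percentile;
# names = FALSE drops the "75%" label so the result is a plain number.
cutoff <- quantile(demo$Revenue, probs = 0.75, names = FALSE)

# Keep the rows at or above the cutoff.
top_quartile <- demo %>%
  filter(Revenue >= cutoff)

top_quartile
```

Computing the cutoff from the data keeps the filter in step with the data if the year range or the input file changes.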
d2<-d1%>%
select(Title,Genre,Revenue,Year)%>%
filter(Year>=2008&Year<=2011)%>%
group_by(Title,Genre,Year)%>%
summarize(Mean_Rev=mean(Revenue))%>%
filter(Mean_Rev>=143.62) # 75th percentile of 2008-2011 revenue, taken from q75 above
d2
## # A tibble: 55 x 4
## # Groups: Title, Genre [55]
## Title Genre Year Mean_Rev
## <chr> <chr> <dbl> <dbl>
## 1 2012 Action,Adventure,Sci-Fi 2009 166.
## 2 Alice in Wonderland Adventure,Family,Fantasy 2010 334.
## 3 Avatar Action,Adventure,Fantasy 2009 761.
## 4 Bridesmaids Comedy,Romance 2011 169.
## 5 Captain America: The First Aven~ Action,Adventure,Sci-Fi 2011 177.
## 6 Cars 2 Animation,Adventure,Com~ 2011 191.
## 7 Clash of the Titans Action,Adventure,Fantasy 2010 163.
## 8 Despicable Me Animation,Adventure,Com~ 2010 252.
## 9 Fast & Furious Action,Crime,Thriller 2009 155.
## 10 Fast Five Action,Crime,Thriller 2011 210.
## # ... with 45 more rows
library(ggalluvial)
ggplot(data=d2, mapping=aes(axis1 = Title, axis2 = Year, axis3 = Genre)) +
scale_x_discrete(limits = c("Title", "Year", "Genre"), expand = c(.1, .05)) +
geom_alluvium(aes(fill = Mean_Rev,colour=Mean_Rev)) +
geom_stratum() + geom_text(stat = "stratum", label.strata = TRUE) +
theme_minimal()+labs(title = 'Top 25% of Films By Average Revenue(In Millions),Year, and By Genre During Recession')
From my analysis, I see that revenue trends in the film industry during the recession did differ from those in the pre- and post-recession years in terms of genres, directors, ratings, and runtimes. For a more detailed analysis, I would need a dataset that separates films by month as well as by year. It would also be interesting to compare how the film industry has done over the long run, including during the Great Depression, with revenues adjusted for inflation, to find more trends.
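The inflation adjustment mentioned above amounts to rescaling each film's revenue by the ratio of a price index in a base year to the index in the film's release year. A rough sketch in base R, where `adjust_for_inflation()` is a hypothetical helper and `toy_index` holds placeholder numbers rather than real CPI figures. A real analysis would substitute a published index such as the one from the BLS.

```r
# Hypothetical helper: convert nominal revenue into base-year dollars.
# `index` is a named numeric vector of price-index values keyed by year.
adjust_for_inflation <- function(revenue, year, index, base_year) {
  idx <- index[as.character(year)]
  if (anyNA(idx)) stop("No index value for at least one year")
  unname(revenue * index[[as.character(base_year)]] / idx)
}

# Placeholder index values for illustration only, NOT real CPI data.
toy_index <- c("2006" = 100, "2011" = 110, "2016" = 120)

# Two films with the same nominal revenue, released in different years.
adjusted <- adjust_for_inflation(
  revenue   = c(50, 50),
  year      = c(2006, 2011),
  index     = toy_index,
  base_year = 2016
)

adjusted
```

With these placeholder values, the earlier film ends up with the larger adjusted revenue. This is the effect that would matter when comparing recession years with later ones.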