The TMDB 5000 Movie Dataset (source: Kaggle) contains data on nearly 5000 movies collected from The Movie Database (TMDb). It was prepared with a view to predicting the success of a movie before its release from factors such as budget, genre, language, production house, cast and crew.
The data comes in 2 files. The movies file contains 4803 observations of 20 variables, while the credits file contains 4803 observations of 4 variables. We also merge the two into a single data set for use in the later analysis.
library(jsonlite)
library(readxl)
library(readr)
library(dplyr)
library(tidyr)
library(viridis)
library(ggcorrplot)
library(scales)
library(treemapify)
library(wordcloud)
library(tm)
library(SnowballC)
library(ggrepel)
library(ggplot2) # attached explicitly: theme_set() and ggplot() are used below
theme_set(theme_classic())
credits <- read_csv("C:/Users/Devanshu/Downloads/tmdb-5000-movie-dataset/tmdb_5000_credits.csv")
movies <- read_csv("C:/Users/Devanshu/Downloads/tmdb-5000-movie-dataset/tmdb_5000_movies.csv")
overall <- inner_join(movies,credits, by=c("id"="movie_id", "title"))
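The join above assumes that every movie appears exactly once in both files. A minimal sketch of how that assumption could be checked is shown below on two small made-up tables (the rows and the `_toy` names are illustrative, not taken from the dataset); the key specification is the same as in the real join.

```r
library(dplyr)

# Toy stand-ins for the two files; all values are invented for illustration
movies_toy  <- tibble(id = c(1, 2, 3), title = c("A", "B", "C"), budget = c(10, 20, 30))
credits_toy <- tibble(movie_id = c(1, 2, 3), title = c("A", "B", "C"), cast = c("[]", "[]", "[]"))

# Same keys as the real join: id matches movie_id, title matches title
overall_toy <- inner_join(movies_toy, credits_toy, by = c("id" = "movie_id", "title"))

# One row per movie means no key was dropped or duplicated by the join
stopifnot(nrow(overall_toy) == nrow(movies_toy))
stopifnot(!any(duplicated(overall_toy$id)))
dim(overall_toy)
```

Running the same two checks on `overall` and `movies` would confirm that the merged table still has one row per movie.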
#Dropping the variable homepage (by name rather than by column position)
movies <- select(movies, -homepage)
#Convert budget and revenue in millions
movies$budget <- movies$budget/1000000
movies$revenue <- movies$revenue/1000000
summary(movies)
## budget genres id keywords
## Min. : 0.00 Length:4803 Min. : 5 Length:4803
## 1st Qu.: 0.79 Class :character 1st Qu.: 9014 Class :character
## Median : 15.00 Mode :character Median : 14629 Mode :character
## Mean : 29.05 Mean : 57166
## 3rd Qu.: 40.00 3rd Qu.: 58611
## Max. :380.00 Max. :459488
##
## original_language original_title overview
## Length:4803 Length:4803 Length:4803
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## popularity production_companies production_countries
## Min. : 0.000 Length:4803 Length:4803
## 1st Qu.: 4.668 Class :character Class :character
## Median : 12.922 Mode :character Mode :character
## Mean : 21.492
## 3rd Qu.: 28.314
## Max. :875.581
##
## release_date revenue runtime spoken_languages
## Min. :1916-09-04 Min. : 0.00 Min. : 0 Length:4803
## 1st Qu.:1999-07-14 1st Qu.: 0.00 1st Qu.: 94 Class :character
## Median :2005-10-03 Median : 19.17 Median :104 Mode :character
## Mean :2002-12-27 Mean : 82.26 Mean :107
## 3rd Qu.:2011-02-16 3rd Qu.: 92.92 3rd Qu.:118
## Max. :2017-02-03 Max. :2787.97 Max. :338
## NA's :1 NA's :80
## status tagline title vote_average
## Length:4803 Length:4803 Length:4803 Min. : 0.000
## Class :character Class :character Class :character 1st Qu.: 5.600
## Mode :character Mode :character Mode :character Median : 6.200
## Mean : 6.092
## 3rd Qu.: 6.800
## Max. :10.000
##
## vote_count
## Min. : 0.0
## 1st Qu.: 54.0
## Median : 235.0
## Mean : 690.2
## 3rd Qu.: 737.0
## Max. :13752.0
##
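The summary reports few missing values: one for release_date and 80 for runtime. Note that summary() does not count missing values in character columns. There are also some anomalies. The minimum values of budget, revenue and runtime are zero, which is counterintuitive and suggests that zero is used as a placeholder for unknown values.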
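Missing values and suspicious zeros can be counted directly rather than read off the summary. The sketch below uses a small invented data frame (`toy` and its values are assumptions for illustration); replacing `toy` with `movies` would give the counts for the real data.

```r
# Toy data frame with invented values; replace with `movies` for the real check
toy <- data.frame(
  budget  = c(0, 15, 40, 0),
  revenue = c(0, 19.2, 92.9, 5),
  runtime = c(94, NA, 118, 0),
  tagline = c("x", NA, NA, "y"),
  stringsAsFactors = FALSE
)

# Missing values per column, including character columns
na_counts <- colSums(is.na(toy))

# Zeros per numeric column, ignoring missing values
num_cols <- sapply(toy, is.numeric)
zero_counts <- colSums(toy[, num_cols] == 0, na.rm = TRUE)

na_counts
zero_counts
```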
We first look at the distributions of the variables of most interest: budget, revenue, popularity rating, runtime, average rating on TMDb and number of votes on TMDb.
par(mfrow=c(2,3))
hist(movies$budget,col = 'blue',breaks=40,main='Distribution of movie budget',xlab = 'budget (in million $)')
hist(movies$revenue,col = 'blue',breaks=40,main='Distribution of movie revenue',xlab = 'revenue (in million $)',xlim=c(0,1000))
hist(movies$runtime,col = 'blue',breaks=40,main='Distribution of movie runtime',xlab = 'runtime (in minutes)',xlim=c(0,250))
hist(movies$vote_average,col = 'blue',breaks=40,main='Distribution of movie rating',xlab = 'rating (out of 10)',xlim=c(0,10))
hist(movies$vote_count,col = 'blue',breaks=80,main='Distribution of votes',xlab = 'number of votes')
hist(movies$popularity,col = 'blue',breaks=40,main='Distribution of popularity rating',xlab = 'popularity rating')
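Runtime and rating are the closest to a normal distribution. The other four variables are strongly right-skewed, which is consistent with their means lying well above their medians in the summary.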
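The shape of a distribution can also be summarised with a single number, the sample skewness. The sketch below defines a small helper (`skewness` is a hypothetical function written here, not part of the packages loaded earlier) and applies it to simulated data, not to the movie data.

```r
# Sample skewness: third standardized moment (0 for a symmetric distribution)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(1)
symmetric_like <- rnorm(1000, mean = 107, sd = 20)     # simulated, roughly symmetric
skewed_like    <- rlnorm(1000, meanlog = 3, sdlog = 1) # simulated, long right tail

skewness(symmetric_like)  # close to 0
skewness(skewed_like)     # clearly positive
```

Applied to the real columns, for example `skewness(movies$runtime)` and `skewness(movies$budget)`, values near zero would support the visual impression of approximate normality.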
We are interested in how investment in a movie relates to its revenue, and whether the runtime of a movie is related to its revenue. We draw scatter plots to explore these questions.
#Budget vs revenue
ggplot(movies,aes(x=budget,y=revenue)) + geom_point() + geom_smooth(se=FALSE) + labs(title='Revenue vs. budget',x='budget(in million $)',y='revenue(in million $)')
Revenue tends to rise with budget, so higher investment is associated with higher returns.
#Runtime vs revenue
ggplot(data=movies,aes(x=runtime,y=revenue))+ geom_point(alpha=0.5,color= 'blue') + geom_smooth(color='red',se=FALSE) + labs(title='Revenue vs. Runtime',x='runtime(in minutes)',y='revenue(in million $)')
We observe that movies with a runtime close to 150 minutes tend to register higher returns.
#Budget vs popularity
ggplot(movies,aes(x=budget,y=popularity)) + geom_point() + geom_smooth(se=FALSE) + labs(title='Popularity vs. Budget',x='budget(in million $)',y='popularity rating')
## `geom_smooth()` using method = 'gam'
As budget increases, the popularity seems to show an increasing trend.
We are also interested in seeing the relationship of all the important variables with one another. The correlation matrix tells us the strength of all relationships.
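# Keep only the variables of interest, then drop rows with a missing value in any of them
# (na.omit() on the full table would also drop rows that are missing unrelated fields)
corr.vars <- c('budget','popularity','revenue','runtime','vote_average','vote_count')
movies.corr <- na.omit(movies[,corr.vars])
##Correlation matrix
cor.movies <- cor(movies.corr)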
ggcorrplot(cor.movies,hc.order = TRUE,lab=TRUE) + ggtitle("Correlation between important predictors")
Movies with higher revenue also tend to receive more votes on TMDb. Similarly, budget and revenue have a strong positive correlation. The budget of a movie has a negligible correlation with its average rating.
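Because several of these variables have long right tails, a few very large values can dominate the Pearson correlation. As a hedged illustration, the toy vectors below (invented values, not movie data) show how the rank-based Spearman correlation can be compared with Pearson using the `method` argument of `cor()`.

```r
# Toy vectors with one extreme pair, invented for illustration
x <- c(1, 2, 3, 4, 100)
y <- c(2, 1, 4, 3, 500)

pearson  <- cor(x, y, method = "pearson")   # driven almost entirely by the extreme pair
spearman <- cor(x, y, method = "spearman")  # depends only on the ranks

c(pearson = pearson, spearman = spearman)
```

The same comparison could be made on the real data by passing `method = "spearman"` when computing the correlation matrix.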
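#Opening JSON for genre
df3 <- overall %>%
  filter(nchar(genres)>2) %>%
  mutate(
    file3 = lapply(genres, fromJSON)
  ) %>%
  # names_sep prefixes the unnested columns (file3_id, file3_name) so they cannot clash with the movie id
  unnest(file3, names_sep = "_") %>%
  select(id, title, genre=file3_name )
# slice() with no indices returns zero rows in dplyr >= 1.1.0, so assign the table directly
genre <- df3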
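To see what the parsing step above works with, it helps to look at a single cell. Each entry of the genres column is a JSON array of objects; the sketch below parses one such string (the cell contents are written by hand for illustration) with `fromJSON()` from jsonlite.

```r
library(jsonlite)

# One cell in the same format as the genres column (hand-written for illustration)
cell <- '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# An array of objects is simplified to a data frame with columns id and name
parsed <- fromJSON(cell)
parsed$name

# A movie without genres holds the two-character string "[]",
# which is why rows are kept only when nchar(genres) > 2
nchar("[]")
```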
###Boxplot
df.box <- overall %>% inner_join(genre,by=c("id","title")) %>% group_by(genre)
df.box$revenue <- df.box$revenue/1000000
ggplot(df.box,aes(x=genre,y=revenue))+ geom_boxplot(fill="blue") + labs(y='revenue(in million $)',title='Revenue by Genre') + theme(axis.text.x = element_text(angle=90, vjust=0.6))
We can see that the Animation, Adventure and Fantasy genres have the highest median revenues. Documentaries and movies in the Foreign genre tend to earn the least. A movie that belongs to several genres is counted once in each of them.
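The medians that a boxplot displays can also be tabulated, which makes the ranking of genres easier to read. This is a minimal sketch on an invented table (`toy` and its values are assumptions); the same call on `df.box` would give the medians behind the plot.

```r
# Toy table with invented revenues (in million $)
toy <- data.frame(
  genre   = c("Animation", "Animation", "Documentary", "Documentary", "Drama"),
  revenue = c(300, 100, 1, 3, 20)
)

# Median revenue per genre, sorted from highest to lowest
med <- aggregate(revenue ~ genre, data = toy, FUN = median)
med <- med[order(-med$revenue), ]
med
```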
#Top 20 costliest movies
movies.cost <- movies[order(-movies$budget),] %>% head(n=20)
ggplot(movies.cost,aes(x=reorder(title,budget),y=budget)) + geom_point(size=2, alpha=0.6,color="blue") + geom_segment(aes(x=title,xend=title,y=min(budget),yend=max(budget)),linetype="dashed",size=0.2)+labs(title="Top 20 costliest movies of all time ",y="budget(in million $)",x="movie title")+ coord_flip()
‘Pirates of the Caribbean: On Stranger Tides’ is the costliest movie in the dataset, with a budget of $380 million.
#Top 20 grossing movies by revenue
movies.rev <- movies[order(-movies$revenue),] %>% head(n=20)
ggplot(movies.rev, aes(x=reorder(title,revenue), y=revenue)) +geom_bar(stat="identity", width=.5,fill="tomato3") + labs(title="Top 20 grossing movies of all time",y="revenue(in million $)",x="movie title")+coord_flip()
‘Avatar’ is the highest-earning movie in the dataset, with a revenue of about $2788 million.
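#Reading JSON for production houses
df4 <- overall %>%
  filter(nchar(production_companies)>2) %>%
  mutate(
    file4 = lapply(production_companies, fromJSON)
  ) %>%
  # names_sep prefixes the unnested columns (file4_name, file4_id) so they cannot clash with the movie id
  unnest(file4, names_sep = "_") %>%
  select(id, title, production_company=file4_name )
# slice() with no indices returns zero rows in dplyr >= 1.1.0, so assign the table directly
company <- df4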
#Top spending production houses
prod.sp <- overall %>% inner_join(company,by=c("id","title")) %>% select(production_company,budget)
prod.2 <- aggregate(prod.sp$budget,by=list(production_company=prod.sp$production_company),FUN=sum)
prod.g <- prod.2[order(-prod.2$x),] %>% rename(budget=x) %>% head(10)
prod.g$budget <- prod.g$budget/1000000
ggplot(prod.g, aes(x=reorder(production_company,budget), y=budget)) +
geom_bar(stat="identity", width=.5, fill="brown") +
labs(title="Top 10 production houses by production investment",y="budget(in million $)",x="production house")+coord_flip()
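‘Imagine Entertainment’ has the highest total production budget. Note that the full budget of a movie is counted once for every company credited on it, so co-productions are counted more than once.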
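Since the production company table holds one row per movie and company, summing budgets per company attributes the whole budget of a co-production to each partner. The sketch below shows the effect on invented data (the studio names and figures are hypothetical).

```r
# Two toy movies; movie 1 is co-produced by two studios (all values invented)
movies_toy  <- data.frame(id = c(1, 2), budget = c(100, 50))
company_toy <- data.frame(
  id = c(1, 1, 2),
  production_company = c("Studio A", "Studio B", "Studio A")
)

# One row per movie and company, as in the analysis above
joined <- merge(movies_toy, company_toy, by = "id")
by_company <- aggregate(budget ~ production_company, data = joined, FUN = sum)
by_company

# The per-company totals add up to more than the total actually spent
sum(by_company$budget)
sum(movies_toy$budget)
```

The totals are therefore best read as the combined budget of the movies a company was involved in, not as the amount the company itself spent.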
#Top earning production houses
prod.rev <- overall %>% inner_join(company,by=c("id","title")) %>% select(production_company,revenue)
prod.1 <- aggregate(prod.rev$revenue,by=list(production_company=prod.rev$production_company),FUN=sum)
prod.f <- prod.1[order(-prod.1$x),] %>% rename(revenue=x) %>% head(10)
prod.f$revenue <- prod.f$revenue/1000000
ggplot(prod.f, aes(x=reorder(production_company,revenue), y=revenue)) + geom_point(size=4) +
geom_segment(aes(x=production_company,xend=production_company,y=0,yend=revenue)) +
labs(title="Top 10 production houses by revenue",y="revenue(in million $)",x="production house")+coord_flip()
‘Warner Bros.’ has earned the highest total revenue from the movies it has produced.
It is important for a production house to understand whether it is making a profit on the production cost incurred. We create a variable called ‘roi’, the return on investment, defined as the ratio of revenue to budget. We then divide the movies into 2 categories based on roi: if roi is less than 1, the movie makes a ‘loss on investment’; otherwise it makes a ‘gain on investment’.
We first want to see which movies made on a budget greater than $10 million have the highest ROI.
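The roi definition and the gain/loss labels described above can be tried out on a few invented rows first (the titles and figures in `toy` are hypothetical). The third row shows why movies with a recorded budget of zero need care: their ratio is infinite and would be labelled as a gain.

```r
# Toy rows with invented budgets and revenues (in million $)
toy <- data.frame(
  title   = c("Hit", "Flop", "No budget recorded"),
  budget  = c(20, 50, 0),
  revenue = c(100, 20, 5)
)

# Return on investment as the ratio of revenue to budget
toy$roi <- toy$revenue / toy$budget

# roi below 1 is a loss, anything else a gain (an infinite roi also counts as a gain)
toy$Profit <- ifelse(toy$roi >= 1, "gain on investment", "loss on investment")
toy
```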
#Top 10 ROI
movies.upd <- movies %>% mutate(roi=revenue/budget)
rev.0 <- movies.upd %>% filter(budget>10)
toproi <- rev.0[order(-rev.0$roi),] %>% head(10)
ggplot(toproi, aes(x=reorder(title,roi), y=roi)) +
geom_bar(width=.6, fill="blue",stat="identity") +
labs(title="Top 10 movies based on return on investment(Budget>10 million $)",y="Return on Investment",x="movie title")+coord_flip()
‘E.T. the Extra-Terrestrial’ and ‘Star Wars’ have the highest ROI for movies made on a budget of more than $10 Million.
movies.upd$Profit <- ifelse(movies.upd$roi>=1,c("gain on investment"),c("loss on investment"))
# Keep movies with a recorded budget and revenue; a zero budget would give an infinite roi
plot1<- movies.upd %>% filter(budget>0, revenue>0)
ggplot(plot1,aes(x=Profit,y=vote_average))+ geom_boxplot(fill="brown") + labs(title="Average Movie Rating compared to Return on Investment",y='Average Movie Rating',x='Return on Investment') + theme(axis.text.x = element_text(angle=0, vjust=0.6))
As expected, movies that make a gain on investment are also rated higher on average by TMDb users.