About TMDB Dataset

The TMDB 5000 Movie Dataset (source: Kaggle) was prepared by collecting movie-related data for nearly 5000 movies from The Movie Database (TMDb). It was assembled with a view to predicting the success of a movie before its release, considering factors such as budget, genre, language, production house, cast, crew and many more.

Image credits: http://lostseoulmovie.com/wp-content/uploads/2017/05/IMDB_Logo.png

TMDB consists of two datasets: the movies dataset contains 4803 observations of 20 variables, and the credits dataset contains 4803 observations of 4 variables. We also merge the two into a single data set, which we use for parts of the analysis later on.

library(jsonlite)
library(readxl)
library(readr)
library(dplyr)
library(tidyr)
library(viridis)
library(ggcorrplot)
library(scales)
library(treemapify)
library(wordcloud)
library(tm)
library(SnowballC)
library(ggrepel)
library(ggplot2) #loaded explicitly: provides ggplot(), theme_set() and the geoms used below
theme_set(theme_classic())
credits <- read_csv("C:/Users/Devanshu/Downloads/tmdb-5000-movie-dataset/tmdb_5000_credits.csv")
movies <- read_csv("C:/Users/Devanshu/Downloads/tmdb-5000-movie-dataset/tmdb_5000_movies.csv")
overall <- inner_join(movies,credits, by=c("id"="movie_id", "title"))
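
Because inner_join() silently drops rows that have no match, it can be worth checking how many rows survive the merge. The following is a minimal sketch on hypothetical toy tables (movies_toy and credits_toy are invented for illustration and are not the real data); it uses the same join specification as above.

```r
library(dplyr)

# Hypothetical miniature versions of the two tables (not the real data)
movies_toy  <- tibble(id = c(1, 2, 3), title = c("A", "B", "C"), budget = c(10, 20, 30))
credits_toy <- tibble(movie_id = c(1, 2, 4), title = c("A", "B", "D"), cast = c("[]", "[]", "[]"))

# Same join specification as above: id matches movie_id, and title matches title
joined <- inner_join(movies_toy, credits_toy, by = c("id" = "movie_id", "title"))

# Rows of movies_toy without a partner in credits_toy are dropped by inner_join()
unmatched <- anti_join(movies_toy, credits_toy, by = c("id" = "movie_id", "title"))

nrow(joined)     # rows kept by the join
nrow(unmatched)  # rows lost by the join
```

Applied to the real tables, comparing nrow(overall) with nrow(movies) would show whether any movie was lost in the merge.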

Summary Statistics of movies dataset

#Dropping the variable homepage (by name rather than by column position)
movies <- select(movies, -homepage)

#Convert budget and revenue in millions
movies$budget <- movies$budget/1000000
movies$revenue <- movies$revenue/1000000
summary(movies)
##      budget          genres                id           keywords        
##  Min.   :  0.00   Length:4803        Min.   :     5   Length:4803       
##  1st Qu.:  0.79   Class :character   1st Qu.:  9014   Class :character  
##  Median : 15.00   Mode  :character   Median : 14629   Mode  :character  
##  Mean   : 29.05                      Mean   : 57166                     
##  3rd Qu.: 40.00                      3rd Qu.: 58611                     
##  Max.   :380.00                      Max.   :459488                     
##                                                                         
##  original_language  original_title       overview        
##  Length:4803        Length:4803        Length:4803       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##    popularity      production_companies production_countries
##  Min.   :  0.000   Length:4803          Length:4803         
##  1st Qu.:  4.668   Class :character     Class :character    
##  Median : 12.922   Mode  :character     Mode  :character    
##  Mean   : 21.492                                            
##  3rd Qu.: 28.314                                            
##  Max.   :875.581                                            
##                                                             
##   release_date           revenue           runtime    spoken_languages  
##  Min.   :1916-09-04   Min.   :   0.00   Min.   :  0   Length:4803       
##  1st Qu.:1999-07-14   1st Qu.:   0.00   1st Qu.: 94   Class :character  
##  Median :2005-10-03   Median :  19.17   Median :104   Mode  :character  
##  Mean   :2002-12-27   Mean   :  82.26   Mean   :107                     
##  3rd Qu.:2011-02-16   3rd Qu.:  92.92   3rd Qu.:118                     
##  Max.   :2017-02-03   Max.   :2787.97   Max.   :338                     
##  NA's   :1                              NA's   :80                      
##     status            tagline             title            vote_average   
##  Length:4803        Length:4803        Length:4803        Min.   : 0.000  
##  Class :character   Class :character   Class :character   1st Qu.: 5.600  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.200  
##                                                           Mean   : 6.092  
##                                                           3rd Qu.: 6.800  
##                                                           Max.   :10.000  
##                                                                           
##    vote_count     
##  Min.   :    0.0  
##  1st Qu.:   54.0  
##  Median :  235.0  
##  Mean   :  690.2  
##  3rd Qu.:  737.0  
##  Max.   :13752.0  
## 

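The summary reports 80 missing values for runtime and one for release_date; note that summary() does not report missing values for character columns. There are also some anomalies: the minimum values of budget, revenue and runtime are zero, which is counterintuitive and most likely indicates values that were not recorded.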
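
One possible way to handle such zeros is to count them and recode them as missing, so that they do not pull averages down. This is only a sketch on a hypothetical toy data frame (toy and its values are invented); whether zeros in the real data should be treated as missing is an assumption, not something the dataset documents here.

```r
# Hypothetical toy data standing in for the movies table
toy <- data.frame(
  budget  = c(0, 15, 40),
  revenue = c(0, 0, 92.9),
  runtime = c(104, 0, NA)
)

# Count zeros per column (existing NA entries are ignored)
zero_counts <- colSums(toy == 0, na.rm = TRUE)

# Recode zeros as NA; which() skips NA comparisons, so existing NAs are untouched
toy_clean <- toy
toy_clean[] <- lapply(toy, function(x) replace(x, which(x == 0), NA))

zero_counts
colMeans(toy_clean, na.rm = TRUE)  # means computed over recorded values only
```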

Distribution of Important Variables

We want to see how the variables of interest are distributed: budget, revenue, popularity rating, runtime, average movie rating on TMDb and number of votes on TMDb.

par(mfrow=c(2,3))
hist(movies$budget,col = 'blue',breaks=40,main='Distribution of movie budget',xlab = 'budget (in million $)')
hist(movies$revenue,col = 'blue',breaks=40,main='Distribution of movie revenue',xlab = 'revenue (in million $)',xlim=c(0,1000))
hist(movies$runtime,col = 'blue',breaks=40,main='Distribution of movie runtime',xlab = 'runtime (in minutes)',xlim=c(0,250))
hist(movies$vote_average,col = 'blue',breaks=40,main='Distribution of movie rating',xlab = 'rating (out of 10)',xlim=c(0,10))
hist(movies$vote_count,col = 'blue',breaks=80,main='Distribution of votes',xlab = 'number of votes against a review')
hist(movies$popularity,col = 'blue',breaks=40,main='Distribution of popularity rating',xlab = 'popularity rating')

Movie runtime and movie rating appear closest to a normal distribution.
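
The summary statistics above show means well above medians for budget, revenue, vote count and popularity, which suggests right skew. A log transform is one common way to make such variables easier to inspect. The sketch below uses simulated values (x is hypothetical, drawn from a log-normal distribution) rather than the real data.

```r
set.seed(1)

# Hypothetical right-skewed values, simulated for illustration only
x <- rlnorm(1000, meanlog = 3, sdlog = 1)

# Raw scale versus log scale, side by side
par(mfrow = c(1, 2))
hist(x, col = 'blue', breaks = 40, main = 'Raw values', xlab = 'x')
hist(log1p(x), col = 'blue', breaks = 40, main = 'Log-transformed values', xlab = 'log(1 + x)')

# Gap between mean and median before and after the transform
gap_raw <- mean(x) - median(x)
gap_log <- mean(log1p(x)) - median(log1p(x))
c(raw = gap_raw, log = gap_log)
```

log1p() is used instead of log() so that zero values, which occur in budget and revenue, remain defined.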

Relationship between Important Variables

We are interested in how the investment in a movie relates to its revenue, and whether the runtime of a movie is related to its revenue. We draw scatter plots to explore such questions.

#Budget vs revenue
ggplot(movies,aes(x=budget,y=revenue)) + geom_point() + geom_smooth(se=FALSE) + labs(title='Revenue vs. budget',x='budget(in million $)',y='revenue(in million $)')

Revenue tends to increase with budget, so investment appears to be positively related to returns.

#Runtime vs revenue
ggplot(data=movies,aes(x=runtime,y=revenue)) +
  geom_point(alpha=0.5,color='blue') +
  geom_smooth(color='red',se=FALSE) +
  labs(title='Revenue vs. Runtime',x='runtime(in minutes)',y='revenue(in million $)')

We observe that movies with a runtime close to 150 minutes tend to register higher revenue.

#Budget vs popularity
ggplot(movies,aes(x=budget,y=popularity)) + geom_point() + geom_smooth(se=FALSE) + labs(title='Popularity vs. Budget',x='budget(in million $)',y='popularity rating')
## `geom_smooth()` using method = 'gam'

As budget increases, the popularity seems to show an increasing trend.

We are also interested in seeing the relationship of all the important variables with one another. The correlation matrix tells us the strength of all relationships.

# Keep the six numeric variables, then omit rows with a missing value in any of them
corr.vars <- c('budget','popularity','revenue','runtime','vote_average','vote_count')
movies.corr <- na.omit(movies[,corr.vars])

#Correlation matrix
cor.movies <- cor(movies.corr)
ggcorrplot(cor.movies,hc.order = TRUE,lab=TRUE) + ggtitle("Correlation between important predictors")

Movies with higher revenue also tend to receive more votes on TMDb. Similarly, budget and revenue have a strong positive correlation. The budget of a movie shows negligible correlation with its average TMDb rating.
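
How missing values are handled affects which rows enter the correlation matrix. As a small illustration on hypothetical toy data (the values are invented), dropping incomplete rows with na.omit() before calling cor() gives the same result as asking cor() to use complete observations only.

```r
# Hypothetical toy data with one missing runtime
toy <- data.frame(
  budget  = c(10, 20, 30, 40, 50),
  revenue = c(12, 35, 33, 80, 95),
  runtime = c(90, NA, 110, 100, 120)
)

# Option 1: drop incomplete rows first, then correlate
cor_a <- cor(na.omit(toy))

# Option 2: let cor() perform the same casewise deletion
cor_b <- cor(toy, use = "complete.obs")

all.equal(cor_a, cor_b)
round(cor_a, 2)
```

Either way, only rows that are complete for the selected columns are used, so the selection of columns should happen before missing rows are dropped.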

Some interesting Data Visualizations

Movies from which Genre are the Highest Earning?

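#Opening JSON for genre
df3 <- overall %>%
  filter(nchar(genres)>2) %>%   #skip empty JSON arrays ("[]")
  mutate(
    #keep only the name field so that unnest() does not clash with the movie id column
    file3 = lapply(genres, function(x) fromJSON(x)["name"])
  ) %>%
  unnest(file3) %>%
  select(id, title, genre=name )

genre <- df3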

#Boxplot
df.box <- overall %>% inner_join(genre,by=c("id","title")) %>% group_by(genre)
df.box$revenue <- df.box$revenue/1000000
ggplot(df.box,aes(x=genre,y=revenue))+ geom_boxplot(fill="blue") + labs(y='revenue(in million $)',title='Revenue by Genre') + theme(axis.text.x = element_text(angle=90, vjust=0.6))

We can see that the median revenues of Animation, Adventure and Fantasy genres are the highest. Documentaries and foreign movies tend to earn the least.
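
The genres column stores each movie's genres as a JSON array, which is why the code above passes it through fromJSON() before unnesting. The sketch below shows what that parsing step returns for a single value; genres_json is a hypothetical string written in the same format, not a row copied from the data.

```r
library(jsonlite)

# Hypothetical value in the format of the genres column
genres_json <- '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# An array of objects is parsed into a data frame with one row per genre
parsed <- fromJSON(genres_json)
parsed$name

# An empty array parses to an empty list; such rows are skipped above
# by the nchar(genres) > 2 filter, since "[]" has exactly two characters
empty <- fromJSON("[]")
length(empty)
```

Because a movie can list several genres, each movie contributes one row per genre after unnesting, so the same revenue appears in several boxes of the boxplot.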

Top 20 Costliest Movies ever made

#Top 20 costliest movies
movies.cost <- movies[order(-movies$budget),] %>% head(n=20)
ggplot(movies.cost,aes(x=reorder(title,budget),y=budget)) + geom_point(size=2, alpha=0.6,color="blue") + geom_segment(aes(x=title,xend=title,y=min(budget),yend=max(budget)),linetype="dashed",size=0.2)+labs(title="Top 20 costliest movies of all time ",y="budget(in million $)",x="movie title")+ coord_flip()

‘Pirates of the Caribbean: On Stranger Tides’ is the costliest movie in the dataset.

Top 20 Grossing Movies

#Top 20 grossing movies by revenue
movies.rev <- movies[order(-movies$revenue),] %>% head(n=20)
ggplot(movies.rev, aes(x=reorder(title,revenue), y=revenue)) +geom_bar(stat="identity", width=.5,fill="tomato3") + labs(title="Top 20 grossing movies of all time",y="revenue(in million $)",x="movie title")+coord_flip()

‘Avatar’ is the highest-earning movie in the dataset.

Top spending Production Houses

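#Reading JSON for production houses
df4 <- overall %>%
  filter(nchar(production_companies)>2) %>%   #skip empty JSON arrays ("[]")
  mutate(
    #keep only the name field so that unnest() does not clash with the movie id column
    file4 = lapply(production_companies, function(x) fromJSON(x)["name"])
  ) %>%
  unnest(file4) %>%
  select(id, title, production_company=name )

company <- df4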

#Top spending production houses   
prod.sp <- overall %>% inner_join(company,by=c("id","title")) %>% select(production_company,budget) 
prod.2 <- aggregate(prod.sp$budget,by=list(production_company=prod.sp$production_company),FUN=sum)
prod.g <- prod.2[order(-prod.2$x),] %>% rename(budget=x) %>% head(10)
prod.g$budget <- prod.g$budget/1000000

ggplot(prod.g, aes(x=reorder(production_company,budget), y=budget)) + 
  geom_bar(stat="identity", width=.5, fill="brown") + 
  labs(title="Top 10 production houses by production investment",y="budget(in million $)",x="production house")+coord_flip()

‘Imagine Entertainment’ has spent the highest amount on movie production.
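
The aggregate, order and head steps used above can also be written as a single dplyr pipeline. The sketch below runs on a hypothetical toy table (the companies and budgets are invented) and is an alternative formulation, not the code that produced the chart.

```r
library(dplyr)

# Hypothetical company-level rows, one per (movie, production company) pair
toy <- tibble(
  production_company = c("X", "X", "Y", "Z", "Y"),
  budget = c(100, 50, 200, 30, 10)
)

# Sum budgets per company, sort in descending order, keep the top two
top2 <- toy %>%
  group_by(production_company) %>%
  summarise(budget = sum(budget)) %>%
  arrange(desc(budget)) %>%
  head(2)

top2
```

Note that in the analysis above a movie with several production companies is joined once per company, so its full budget is counted towards each of them.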

Top Earning Production Houses

#Top earning production houses
prod.rev <- overall %>% inner_join(company,by=c("id","title")) %>% select(production_company,revenue) 
prod.1 <- aggregate(prod.rev$revenue,by=list(production_company=prod.rev$production_company),FUN=sum)
prod.f <- prod.1[order(-prod.1$x),] %>% rename(revenue=x) %>% head(10)
prod.f$revenue <- prod.f$revenue/1000000

ggplot(prod.f, aes(x=reorder(production_company,revenue), y=revenue)) + geom_point(size=4) +
  geom_segment(aes(x=production_company,xend=production_company,y=0,yend=revenue)) + 
  labs(title="Top 10 production houses by revenue",y="revenue(in million $)",x="production house")+coord_flip()

‘Warner Bros.’ has earned the highest total revenue on the movies it has produced.

Return on Investment

It is important for a production house to understand whether it is making a profit on the production cost incurred. We create a variable called ‘roi’, the return on investment, defined as the ratio of revenue to budget. We then split the movies into two categories based on roi: if roi is less than 1, the movie makes a ‘loss on investment’; otherwise it makes a ‘gain on investment’.
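
As a worked illustration of this definition, the sketch below computes roi and the two categories on a hypothetical toy table (titles and amounts are invented). Treating a zero budget as unknown, so that roi becomes NA instead of Inf or NaN, is an assumption made here for the sketch.

```r
# Hypothetical movies with budget and revenue in million $
toy <- data.frame(
  title   = c("A", "B", "C", "D"),
  budget  = c(20, 50, 0, 15),
  revenue = c(60, 25, 5, 0)
)

# roi = revenue / budget; a zero budget is treated as unknown (assumption)
toy$roi <- ifelse(toy$budget > 0, toy$revenue / toy$budget, NA)

# roi below 1 is a loss on investment, otherwise a gain; NA stays NA
toy$Profit <- ifelse(toy$roi >= 1, "gain on investment", "loss on investment")

toy
```

Movie A returns three times its budget and is labelled a gain, B returns half and is labelled a loss, and C is left unclassified because its budget is not recorded.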

Top 10 movies by Return on Investment

Among movies made on a budget greater than $10 million, we want to see which have the highest ROI.

#Top 10 ROI
movies.upd <- movies %>% mutate(roi=revenue/budget)
rev.0 <- movies.upd %>% filter(budget>10)
toproi <- rev.0[order(-rev.0$roi),] %>% head(10)

ggplot(toproi, aes(x=reorder(title,roi), y=roi)) + 
  geom_bar(width=.6, fill="blue",stat="identity") + 
  labs(title="Top 10 movies based on return on investment(Budget>10 million $)",y="Return on Investment",x="movie title")+coord_flip()

‘E.T. the Extra-Terrestrial’ and ‘Star Wars’ have the highest ROI among movies made on a budget of more than $10 million.

Average Movie Rating based on the Return on Investment

movies.upd$Profit <- ifelse(movies.upd$roi>=1,c("gain on investment"),c("loss on investment"))
plot1<- movies.upd %>% filter(roi>0) %>% group_by(Profit)
ggplot(plot1,aes(x=Profit,y=vote_average))+ geom_boxplot(fill="brown") + labs(title="Average Movie Rating compared to Return on Investment",y='Average Movie Rating',x='Return on Investment') + theme(axis.text.x = element_text(angle=0, vjust=0.6))

As expected, movies that make a profit on investment also tend to be rated higher by users on TMDb.