Every year hundreds of movies get released but not all of them are successful. The aim of the project is to analyze the TMDB movie dataset which has details about the movies, their production cost and revenue generated along with rating inforamtion. We want to come up with insights by analyzing the dataset.
As mentioned on the Kaggle website, the major question we are trying to answer here is what can we say about the success of a movie before it is released? Are there certain companies (Pixar?) that have found a consistent formula? Given that major films costing over $100 million to produce can still flop, this question is more important than ever to the industry.
We are trying to analyze the dataset to find answers to the questions posed in the problem statement. We will start with a univariate analysis and then move ahead with a mutivariate analysis to understand the impact of certain factors in determining the success of the movie. The success of a movie could be measured in terms of the following metrics:
To start with the data analysis, we have used the following R packages:
Here we are checking if the package is installed on your PC. Else, dont worry, we are installing it or you!!
#Checking if the package is installed on your PC. ELse installing it or you!!
packages <- c("readr", "jsonlite","tidyr","dplyr","DT","knitr","ggplot2","ggpubr","lubridate","plotly")
for (i in packages){
if( i %in% rownames(installed.packages()) == FALSE)
{install.packages(i)
}
}
Loading the required libraries.
#Loading the required packages
library(readr)
library(jsonlite)
library(tidyr)
library(dplyr)
library(DT)
library(knitr)
library(ggplot2)
library(ggpubr)
library(lubridate)
library(plotly)
library(kableExtra)
The dataset is obtained from Kaggle.
We have downloaded the dataset from this source and hosted in our custom GitHub profile for creating a robust source of data. This will make sure that we can even have mutiple versions of the data along with corresponding analysis making it easier for code sharing.
The TMDB dataset was generated from The Movie Database API. It has a set of columns that were collated to understand describe the success of movies.
All these details makes the dataset interesting from the perspective of doing an Exploratory Data Analysis to try generating insights.
We perform the data importing from the github profile where we have hosted the data. The url for the data is set to the variable url
and the data is read into the object df
. Then the column names of the dataset are examined are found to be as shown below.
#URL to read the data from
url <- "https://raw.githubusercontent.com/rengalv/Movies-Data-Analysis-Grab-a-Popcorn/master/tmdb_5000_movies.csv"
#Reading the csv file from the URL
movies <- read_csv(url,col_names = TRUE,na = "NA")
#Preview of the data dimensions and column names
dim(movies)
## [1] 4803 20
#Examining the column names in the dataset
colnames(movies)
## [1] "budget" "genres" "homepage"
## [4] "id" "keywords" "original_language"
## [7] "original_title" "overview" "popularity"
## [10] "production_companies" "production_countries" "release_date"
## [13] "revenue" "runtime" "spoken_languages"
## [16] "status" "tagline" "title"
## [19] "vote_average" "vote_count"
When we examine the Structure of the dataset, we find that the columns can be in any of the following datatypes:
We also find that even though some columns have a class as chr
, they are actually in JSON format which needs to be converted to columns with one of the base r datatypes.
The first thing we wanted to do was to remove the duplicate values from the dataset. We did this by checking if there were two rows in the dataset that had the same movie name.
movies <- movies[!duplicated(movies$title), ]
The de-duplicated dataset has the following dimensions:
dim(movies)
## [1] 4800 20
We notice from the dataset that it has columns with data in the JSON format. So, we need to bring those columns to the base datatypes in r so that we can perform analysis.
Following are the columns found to be in JSON format:
We worked on converting each of these columns into separate dataframes.
Since the implementation was replicable for each of the columns in the JSON format, we wrote a function to implement the same. Finally we have 5 new data frames which can then be merged with our base movies
dataset.
#Takes in the dataframe and column name to be trnasformed
json_to_df <- function(df, column){
column_1 <- df[apply(df[,column],1,nchar)>2,] #Checks if there is an entry
list_1 <- lapply(column_1[[column]], fromJSON) #Converts the JSON to a list
values <- data.frame(unlist(lapply(list_1, function(x) paste(x$name,collapse = ",")))) #Collapsing all the values of the list as a key value pair
final_df <- cbind(column_1$id, column_1$title, values) #new data frame with the key and values a s columns
names(final_df) <- c("id", "title", column)
return(final_df)
}
genres_df <- json_to_df(movies, "genres")
keywords_df <- json_to_df(movies, "keywords")
prod_cntry_df <- json_to_df(movies, "production_countries")
prod_cmpny_df <- json_to_df(movies, "production_companies")
spoken_lang_df <- json_to_df(movies, "spoken_languages")
Now that we have created them as separate dataframes, we want to combine all these dataframes with the movies
dataframe to get the final dataset which we will be used for the analysis going forward
For that, we first remove the JSON columns present in the movies
dataset and then combine the new columns we have created for all the JSON columns
#Subset the movies dataframe by removing the JSON columns
movies_1 <- subset(movies, select = -c(genres,keywords,production_companies, production_countries,spoken_languages))
#Join the columns from all the generated dataframes from previous step
movies_new <- movies_1 %>%
full_join(genres_df, by = c("id", "title")) %>%
full_join(keywords_df, by = c("id", "title")) %>%
full_join(prod_cntry_df, by = c("id", "title")) %>%
full_join(prod_cmpny_df, by = c("id", "title")) %>%
full_join(spoken_lang_df, by = c("id", "title"))
#Have a look at the final dataset
glimpse(movies_new)
## Observations: 4,800
## Variables: 20
## $ budget <int> 237000000, 300000000, 245000000, 25000000...
## $ homepage <chr> "http://www.avatarmovie.com/", "http://di...
## $ id <int> 19995, 285, 206647, 49026, 49529, 559, 38...
## $ original_language <chr> "en", "en", "en", "en", "en", "en", "en",...
## $ original_title <chr> "Avatar", "Pirates of the Caribbean: At W...
## $ overview <chr> "In the 22nd century, a paraplegic Marine...
## $ popularity <dbl> 150.43758, 139.08262, 107.37679, 112.3129...
## $ release_date <date> 2009-12-10, 2007-05-19, 2015-10-26, 2012...
## $ revenue <dbl> 2787965087, 961000000, 880674609, 1084939...
## $ runtime <int> 162, 169, 148, 165, 132, 139, 100, 141, 1...
## $ status <chr> "Released", "Released", "Released", "Rele...
## $ tagline <chr> "Enter the World of Pandora.", "At the en...
## $ title <chr> "Avatar", "Pirates of the Caribbean: At W...
## $ vote_average <dbl> 7.2, 6.9, 6.3, 7.6, 6.1, 5.9, 7.4, 7.3, 7...
## $ vote_count <int> 11800, 4500, 4466, 9106, 2124, 3576, 3330...
## $ genres <fct> Action,Adventure,Fantasy,Science Fiction,...
## $ keywords <fct> culture clash,future,space war,space colo...
## $ production_countries <fct> United States of America,United Kingdom, ...
## $ production_companies <fct> Ingenious Film Partners,Twentieth Century...
## $ spoken_languages <fct> English,Español, English, Français,Englis...
size <- dim(movies_new)
We find that there are 4800 observations and 20 columns.
We wanted to check there were how many rows in the data set with complete values for all the columns.
complete_data <- sum(complete.cases(movies_new))
We find that there are 4102 rows with no missing data in the dataset. We did not remove any of the missing values for now. We are planning to look at each column separately and see if we can perform any imputations (if required) while performing the analysis.
The table below is the preview of the final dataset. We have printed the first 100 rows of the dataset.
Each row corresponds to a movie and each column is a feature corresponding to the movie.
movies_new <- select(movies_new, title, everything())
datatable(movies_new , options = list(pageLength = 5, scrollX = "200px"))
The final dataset after performing data cleaning has the following columns. The class of each of the column is also presented below.
col <- data.frame(sapply(movies_new, class)) #Gettting the class of each column
Row_names <- rownames(col) #Getting the row names which ae the columns of our dataset
class <- col[,1]
Data_types <- cbind(Column = Row_names, Class = as.character(class))
Data_types <- Data_types[2:nrow(Data_types),] #Removing the first entry alone because it is for the row number
kable(Data_types) %>%
kable_styling(bootstrap_options = "striped")#Displays the table as shown below
Column | Class |
---|---|
budget | integer |
homepage | character |
id | integer |
original_language | character |
original_title | character |
overview | character |
popularity | numeric |
release_date | Date |
revenue | numeric |
runtime | integer |
status | character |
tagline | character |
vote_average | numeric |
vote_count | integer |
genres | factor |
keywords | factor |
production_countries | factor |
production_companies | factor |
spoken_languages | factor |
We have split our Exploratory Data Analysis into five parts:
Starting from now, we are considering only movies that had a budget and revenue of more than $1M.
revenue_data<- movies_new %>%
filter(movies_new$budget >= 1000000 & movies_new$revenue >= 1000000) %>%
mutate(ROI = (revenue - budget)/budget)
dim(revenue_data)
## [1] 2998 21
We find that there are 2998 Movies satisfying these conditions.
For this analysis we have considered only movies released after 1990.
revenue_data %>%
filter(year(release_date) > 1990) %>% #Filter movies released after 1990
group_by(year = year(release_date), month = month(release_date)) %>%
summarise(count = n()) %>% #Count of Movies released
ggplot(aes(year, as.factor(month)))+
geom_tile(aes(fill=count),colour="white")+
scale_fill_gradient(low="light blue",high = "dark blue") +
xlab("Year of Movie release")+
ylab("Month of Movie Release")+
ggtitle("Heat Map")
The infamous Dump Months of January and Febraury can be clearly seen from the above plot. That is, there will be least number of movies that get released during this time of the year.
In contrast, we find that September has been the month when most movies have been released, This can be creditted to the begining of the school year when most friends get together after the vacation and might be the targeted audience for film makers.
We select a subset of the data for this analysis as shown below.
m <- revenue_data %>%
select(title, runtime, revenue, ROI, vote_average)
dim(m)
## [1] 2998 5
We have plotted the histogram of movie ratings to understand the pattern of number of movies across various ratings.
ggplot(revenue_data, aes(vote_average)) +
geom_histogram(bins = 50) +
geom_vline(xintercept = mean(revenue_data$vote_average,na.rm = TRUE),colour = "blue") +
ylab("Movie Count") +
xlab("IMDB Rating") +
ggtitle("Histogram for IMDB ratings") +
annotate("text", label = "Mean IMDB rating",x = 6.2, y = 50, size = 3, colour = "yellow",angle=90)
We can see that the average IMDB rating is 6.3 and most of the movies fall within the imdb rating of 6 to 7. This also tells us that average quality movies are more in number compared to good quality(rating > 7) or bad quality(rating < 6) movies.
We have plotted the ratings of different movies against their ROI and tried to analyze if there is any pattern or relationship between them.
m %>%
filter(vote_average >5 ) %>% #Filter to show the graph only for vote_average values greater than 5
ggplot(aes(x = vote_average, y = ROI)) +
geom_point(alpha = 0.5, col = "darkgreen") +
theme(axis.text.x=element_text(hjust=1))+
ggtitle("ROI Vs avg_rating")+
xlab("rating ")+
ylab("") +
geom_smooth()
We can see that as the rating increase the ROI is increasing especially within the range of 7 - 8.5, we can see a significant rise in the ROI.
We have also tried to plot the run time against ROI to see if there are any interesting insights.
#Plot for ROI Vs run time
a <- ggplot(m, aes(x = runtime, y = ROI)) +
geom_point(alpha = 0.5, col = "darkgreen") +
ggtitle("ROI Vs Runtime")+
xlab("Runtime (Minutes)")+
ylab("") +
geom_vline(xintercept = c(80,125),colour = c("blue","blue"))
#Zoomed Plot
b <- ggplot(m, aes(x = runtime, y = ROI)) +
geom_point(alpha = 0.5, col = "darkgreen") +
ggtitle("Zoomed-in: 80 - 125 minutes runtime")+
xlab("Runtime (Minutes)")+
ylab("ROI") +
coord_cartesian(xlim = c(80, 125))
#To create two plots in the same window
ggarrange(a, b,
labels = c("A", "B"),
ncol = 2, nrow = 1)
We can see from the plot in the left that there are many movies in the 80 to 120 minutes bucket with significant ROI. Hence we tried to zoom in to see the pattern in the left plot where you can clearly see there are quite a good number of movies with higher ROI.Thus if a production house wants to make a movie with a higher ROI, if the lenght of the movie could be between 80 to 125 minutes, there is a lot more chance of getting better ROI’s.
The first thing that anyone would want to analyse given a movie dataset is to analyze the budget and revenue across all movies and to see if there are movies with exhorbitant budgets and revenues.
budget <- revenue_data %>%
select(title, budget) %>%
arrange(desc(budget)) %>%
head(15)
x <- ggplot(budget, aes(x = reorder(title, -desc(budget)), y = budget/1000000)) +
geom_bar( stat = "identity")+
theme(axis.text.x=element_text(hjust=1))+
ggtitle("Movie Budgets")+
xlab("")+
ylab("Budget (in Millions)") +
coord_flip()
rev <- revenue_data %>%
select(title, revenue) %>%
arrange(desc(revenue)) %>%
head(15)
y <- ggplot(rev, aes(x = (reorder(title, -desc(revenue))), y = revenue/1000000)) +
geom_bar( stat = "identity")+
theme(axis.text.x=element_text(hjust=1))+
ggtitle("Movie Revenues")+
xlab("")+
ylab("Revenue (in Millions)") +
coord_flip()
ggarrange(x, y,
labels = c("A", "B"),
ncol = 1, nrow = 2)
We can see that the movies in the top 15 budget and revenue list are completely different except for one movie - ‘The Dark Knight Rises’ which triggers our interest to dwell further into analyze the ROI of movies with significant revenue or budget.
We create a new metric to measure the success of the movie called the ROI. The metric is computed as follows: \[ROI = (Revenue - Budget)/Budget\] We then wanted to select only the movies with the top 10 ROI. We have the following results:
m_roi <- revenue_data %>%
select(title, budget, revenue, ROI) %>%
arrange(desc(ROI)) %>%
head(15)
ggplot(m_roi, aes(x = reorder(title, -desc(ROI)), y = ROI)) +
geom_bar( stat = "identity")+
theme(axis.text.x=element_text(hjust=1))+
ggtitle("Movie ROI")+
xlab("")+
ylab("ROI") +
coord_flip()
We can clearly see that not even a single movie in the top 15 list of Revenue and Budget is present in the top 15 list of movies with higher ROIs which gives a clue to hypothesize that not all movies with high budget or high revenue end up with higher ROIs.
Finally, we also wanted to try out on presenting a 3D view of the movies that made a profit. We built the following 3D interactive visualization that would give information about all movies, their budget, revenue and also if the movie made a profit(green) or loss(red).
plot_ly(revenue_data, z = ~vote_average, x = ~budget/1000000, y = ~revenue/1000000,
color = ~(revenue > budget), colors = c('#00FF00', '#FF0000'),
size = I(3)) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'Budget(In Million Dollars'),
yaxis = list(title = 'Revenue(In Million Dollars'),
zaxis = list(title = 'IMDB Rating')),
title = "3D Plot: IMDB Rating vs Revenue vs Budget",
showlegend = FALSE)
We find that the rating for a particular movie need not be dependent on the budget of the movie or the revenue from the movie. There are a lot of movies that did not have a profit but still had a pretty high rating. This is evidently seen from thr 3D plot above.
The exploratory analysis that we performed has provided us with a lot of insights about the TMDB dataset. We can see a brief summary of all the anlaysis and key take aways as below.