Introduction

Every year hundreds of movies get released but not all of them are successful. The aim of the project is to analyze the TMDB movie dataset which has details about the movies, their production cost and revenue generated along with rating inforamtion. We want to come up with insights by analyzing the dataset.

Problem Statement

As mentioned on the Kaggle website, the major question we are trying to answer here is what can we say about the success of a movie before it is released? Are there certain companies (Pixar?) that have found a consistent formula? Given that major films costing over $100 million to produce can still flop, this question is more important than ever to the industry.

Analysis and Business Impact proposed

We are trying to analyze the dataset to find answers to the questions posed in the problem statement. We will start with a univariate analysis and then move ahead with a mutivariate analysis to understand the impact of certain factors in determining the success of the movie. The success of a movie could be measured in terms of the following metrics:

  • Return on Investment: Which will be a measure of the Revenue generated
  • Rating by Public: Which will also take into account the number of votes posted

Packages Required

To start with the data analysis, we have used the following R packages:

  • readr : To read the dataset which is in the csv format
  • jsonlite : To extract the columns that are in json format in the dataset
  • tidyr : To perform data cleaning operations
  • dplyr : To manipulate the dataset
  • DT : To display the cleaned dataset in a tabular format
  • knitr : To display a table of column names along with datatypes

Here we are checking if the package is installed on your PC. Else, dont worry, we are installing it or you!!

#Checking if the package is installed on your PC. ELse installing it or you!!
packages <- c("readr", "jsonlite","tidyr","dplyr","DT","knitr")
for (i in packages){
if( i %in% rownames(installed.packages()) == FALSE) 
{install.packages(i)
  }
}

Loading the required libraries.

#Loading the required packages
library(readr)
library(jsonlite)
library(tidyr)
library(dplyr)
library(DT)
library(knitr)

Data Preparation

Data Source

Original Source

The dataset is obtained from Kaggle.

Hosted in Github

We have downloaded the dataset from this source and hosted in our custom GitHub profile for creating a robust source of data. This will make sure that we can even have mutiple versions of the data along with corresponding analysis making it easier for code sharing.

Data Description

The TMDB dataset was generated from The Movie Database API. It has a set of columns that were collated to understand describe the success of movies.

  • There are columns like budget and revnew whoch gives the monetary aspect of the movies.
  • There are columns related to the votes, average rating etc.. which gives the perspective of what was the general reaction to the movies.
  • Other details like when was the movie released, what was the genre of the movie, which was the production house, production country and what all languages did the movie get released are also provided.

All these details makes the dataset interesting from the perspective of doing an Exploratory Data Analysis to try generating insights.

Data Importing

Fetching the Data from GitHub

We perform the data importing from the github profile where we have hosted the data. The url for the data is set to the variable url and the data is read into the object df

#URL to read the data from
url <- "https://raw.githubusercontent.com/rengalv/Movies-Data-Analysis-Grab-a-Popcorn/master/tmdb_5000_movies.csv"

#Reading the csv file from the URL
movies <- read_csv(url,col_names = TRUE,na = "NA")

#Preview of the data dimensions and column names
dim(movies)
## [1] 4803   20
#Examining the column names in the dataset
colnames(movies)
##  [1] "budget"               "genres"               "homepage"            
##  [4] "id"                   "keywords"             "original_language"   
##  [7] "original_title"       "overview"             "popularity"          
## [10] "production_companies" "production_countries" "release_date"        
## [13] "revenue"              "runtime"              "spoken_languages"    
## [16] "status"               "tagline"              "title"               
## [19] "vote_average"         "vote_count"

Looking at the structure of the Dataset

When we examine the Structure of the dataset, we find that the columns can be in any of the following datatypes:

  • Interger
  • Numeric
  • Character
  • Date

We also find that even though some columns have a class as chr, they are actually in JSON format which needs to be converted to columns with one of the base r datatypes.

Data Cleaning

Removing Duplicates

The first thing we wanted to do was to remove the duplicate values from the dataset. We did this by checking if there were two rows in the dataset that had the same movie name.

movies <- movies[!duplicated(movies$title), ]

The de-duplicated dataset has the following dimensions:

dim(movies)
## [1] 4800   20

Working with the JSON Format

We notice from the dataset that it has columns with data in the JSON format. So, we need to bring those columns to the base datatypes in r so that we can perform analysis.

Following are the columns found to be in JSON format:

  • Genres: id, name
  • Keywords: id, name
  • Production Companies: name, id
  • Production Countries: iso_3166_1, name
  • Spoken Languages: iso_639_1, name

We worked on converting each of these columns into separate dataframes.

Since the implementation was replicable for each of the columns in the JSON format, we wrote a function to implement the same. Finally we have 5 new data frames which can then be merged with our base movies dataset.

Function to convert the JSON column to a dataframe
#Takes in the dataframe and column name to be trnasformed
json_to_df <- function(df, column){
  column_1 <- df[apply(df[,column],1,nchar)>2,]  #Checks if there is an entry
   
  list_1 <- lapply(column_1[[column]], fromJSON) #Converts the JSON to a list
  values <- data.frame(unlist(lapply(list_1, function(x) paste(x$name,collapse = ",")))) #Collapsing all the values of the list as a key value pair
  
  final_df <- cbind(column_1$id, column_1$title, values) #new data frame with the key and values a s columns
  names(final_df)  <- c("id", "title", column)
  return(final_df)
  
}
Calling the json_to_df() to generate the dataframes for all the JSON Columns
genres_df <- json_to_df(movies, "genres")
keywords_df <- json_to_df(movies, "keywords")
prod_cntry_df <- json_to_df(movies, "production_countries")
prod_cmpny_df <- json_to_df(movies, "production_companies")
spoken_lang_df <- json_to_df(movies, "spoken_languages")

Merging the dataset

Now that we have created them as separate dataframes, we want to combine all these dataframes with the movies dataframe to get the final dataset which we will be used for the analysis going forward

For that, we first remove the JSON columns present in the movies dataset and then combine the new columns we have created for all the JSON columns

#Subset the movies dataframe by removing the JSON columns
movies_1 <- subset(movies, select =  -c(genres,keywords,production_companies, production_countries,spoken_languages))

#Join the columns from all the generated dataframes from previous step
movies_new <- movies_1 %>%
  full_join(genres_df, by = c("id", "title")) %>%
  full_join(keywords_df, by = c("id", "title")) %>%
  full_join(prod_cntry_df, by = c("id", "title")) %>%
  full_join(prod_cmpny_df, by = c("id", "title")) %>%
  full_join(spoken_lang_df, by = c("id", "title"))

#Have a look at the final dataset
glimpse(movies_new)
## Observations: 4,800
## Variables: 20
## $ budget               <int> 237000000, 300000000, 245000000, 25000000...
## $ homepage             <chr> "http://www.avatarmovie.com/", "http://di...
## $ id                   <int> 19995, 285, 206647, 49026, 49529, 559, 38...
## $ original_language    <chr> "en", "en", "en", "en", "en", "en", "en",...
## $ original_title       <chr> "Avatar", "Pirates of the Caribbean: At W...
## $ overview             <chr> "In the 22nd century, a paraplegic Marine...
## $ popularity           <dbl> 150.43758, 139.08262, 107.37679, 112.3129...
## $ release_date         <date> 2009-12-10, 2007-05-19, 2015-10-26, 2012...
## $ revenue              <dbl> 2787965087, 961000000, 880674609, 1084939...
## $ runtime              <int> 162, 169, 148, 165, 132, 139, 100, 141, 1...
## $ status               <chr> "Released", "Released", "Released", "Rele...
## $ tagline              <chr> "Enter the World of Pandora.", "At the en...
## $ title                <chr> "Avatar", "Pirates of the Caribbean: At W...
## $ vote_average         <dbl> 7.2, 6.9, 6.3, 7.6, 6.1, 5.9, 7.4, 7.3, 7...
## $ vote_count           <int> 11800, 4500, 4466, 9106, 2124, 3576, 3330...
## $ genres               <fct> Action,Adventure,Fantasy,Science Fiction,...
## $ keywords             <fct> culture clash,future,space war,space colo...
## $ production_countries <fct> United States of America,United Kingdom, ...
## $ production_companies <fct> Ingenious Film Partners,Twentieth Century...
## $ spoken_languages     <fct> English,Español, English, Français,Englis...
size <- dim(movies_new)

We find that there are 4800 observations and 20 columns.

Missing values

We wanted to check there were how many rows in the data set with complete values for all the columns.

complete_data <- sum(complete.cases(movies_new))

We find that there are 4102 rows with no missing data in the dataset. We did not remove any of the missing values for now. We are planning to look at each column separately and see if we can perform any imputations (if required) while performing the analysis.

Data Preview

The table below is the preview of the final dataset. We have printed the first 100 rows of the dataset.

Each row corresponds to a movie and each column is a feature corresponding to the movie.

movies_new <- select(movies_new, title, everything())
datatable(head(movies_new,100))

Summary of Data

The final dataset after performing data cleaning has the following columns. The class of each of the column is also presented below.

col <- data.frame(sapply(movies_new, class)) #Gettting the class of each column
Row_names <- rownames(col) #Getting the row names which ae the columns of our dataset
class <- col[,1] 

Data_types <- cbind(Column = Row_names, Class = as.character(class)) 
Data_types <- Data_types[2:nrow(Data_types),] #Removing the first entry alone because it is for the row number

kable(Data_types) #Displays the table as shown below
Column Class
budget integer
homepage character
id integer
original_language character
original_title character
overview character
popularity numeric
release_date Date
revenue numeric
runtime integer
status character
tagline character
vote_average numeric
vote_count integer
genres factor
keywords factor
production_countries factor
production_companies factor
spoken_languages factor

Proposed EDA

Given the dataset, we can perform a plethora of analysis is what we believe. Following are some of the questions we are trying to answer with our exploratory data analysis.

  • How genres contribute to the ROI(Return on Investment): We are planning to compute a parameter called the ROI and comparing the ROI between different genres.
  • Which movies had a wide reach in terms of the languages in which they were released: use a bar plot to visualize the distribution of number of movies between different languages.
  • How has the industry performed in terms of number of movies over the years. We believe there might be more movies released during vacations. We want to understand if there is a seasonality in the number of movies released over the years.
  • How is the distribution of movies with repect to ratings, vote counts and several other factors present in the dataset. We are planning to use scatter plots or boxplots at the basic level to get some feel of the data.
  • Are there unicorn production companies that have better track record in terms of producing more successful movies
  • Does the success of the movie depend on the month of release in different countries? We want to identify factors that determine the sucess of a movie.