Every year hundreds of movies get released but not all of them are successful. The aim of the project is to analyze the TMDB movie dataset which has details about the movies, their production cost and revenue generated along with rating inforamtion. We want to come up with insights by analyzing the dataset.
As mentioned on the Kaggle website, the major question we are trying to answer here is what can we say about the success of a movie before it is released? Are there certain companies (Pixar?) that have found a consistent formula? Given that major films costing over $100 million to produce can still flop, this question is more important than ever to the industry.
We are trying to analyze the dataset to find answers to the questions posed in the problem statement. We will start with a univariate analysis and then move ahead with a mutivariate analysis to understand the impact of certain factors in determining the success of the movie. The success of a movie could be measured in terms of the following metrics:
To start with the data analysis, we have used the following R packages:
Here we are checking if the package is installed on your PC. Else, dont worry, we are installing it or you!!
#Checking if the package is installed on your PC. ELse installing it or you!!
packages <- c("readr", "jsonlite","tidyr","dplyr","DT","knitr")
for (i in packages){
if( i %in% rownames(installed.packages()) == FALSE)
{install.packages(i)
}
}
Loading the required libraries.
#Loading the required packages
library(readr)
library(jsonlite)
library(tidyr)
library(dplyr)
library(DT)
library(knitr)
The dataset is obtained from Kaggle.
We have downloaded the dataset from this source and hosted in our custom GitHub profile for creating a robust source of data. This will make sure that we can even have mutiple versions of the data along with corresponding analysis making it easier for code sharing.
The TMDB dataset was generated from The Movie Database API. It has a set of columns that were collated to understand describe the success of movies.
All these details makes the dataset interesting from the perspective of doing an Exploratory Data Analysis to try generating insights.
We perform the data importing from the github profile where we have hosted the data. The url for the data is set to the variable url and the data is read into the object df
#URL to read the data from
url <- "https://raw.githubusercontent.com/rengalv/Movies-Data-Analysis-Grab-a-Popcorn/master/tmdb_5000_movies.csv"
#Reading the csv file from the URL
movies <- read_csv(url,col_names = TRUE,na = "NA")
#Preview of the data dimensions and column names
dim(movies)
## [1] 4803 20
#Examining the column names in the dataset
colnames(movies)
## [1] "budget" "genres" "homepage"
## [4] "id" "keywords" "original_language"
## [7] "original_title" "overview" "popularity"
## [10] "production_companies" "production_countries" "release_date"
## [13] "revenue" "runtime" "spoken_languages"
## [16] "status" "tagline" "title"
## [19] "vote_average" "vote_count"
When we examine the Structure of the dataset, we find that the columns can be in any of the following datatypes:
We also find that even though some columns have a class as chr, they are actually in JSON format which needs to be converted to columns with one of the base r datatypes.
The first thing we wanted to do was to remove the duplicate values from the dataset. We did this by checking if there were two rows in the dataset that had the same movie name.
movies <- movies[!duplicated(movies$title), ]
The de-duplicated dataset has the following dimensions:
dim(movies)
## [1] 4800 20
We notice from the dataset that it has columns with data in the JSON format. So, we need to bring those columns to the base datatypes in r so that we can perform analysis.
Following are the columns found to be in JSON format:
We worked on converting each of these columns into separate dataframes.
Since the implementation was replicable for each of the columns in the JSON format, we wrote a function to implement the same. Finally we have 5 new data frames which can then be merged with our base movies dataset.
#Takes in the dataframe and column name to be trnasformed
json_to_df <- function(df, column){
column_1 <- df[apply(df[,column],1,nchar)>2,] #Checks if there is an entry
list_1 <- lapply(column_1[[column]], fromJSON) #Converts the JSON to a list
values <- data.frame(unlist(lapply(list_1, function(x) paste(x$name,collapse = ",")))) #Collapsing all the values of the list as a key value pair
final_df <- cbind(column_1$id, column_1$title, values) #new data frame with the key and values a s columns
names(final_df) <- c("id", "title", column)
return(final_df)
}
genres_df <- json_to_df(movies, "genres")
keywords_df <- json_to_df(movies, "keywords")
prod_cntry_df <- json_to_df(movies, "production_countries")
prod_cmpny_df <- json_to_df(movies, "production_companies")
spoken_lang_df <- json_to_df(movies, "spoken_languages")
Now that we have created them as separate dataframes, we want to combine all these dataframes with the movies dataframe to get the final dataset which we will be used for the analysis going forward
For that, we first remove the JSON columns present in the movies dataset and then combine the new columns we have created for all the JSON columns
#Subset the movies dataframe by removing the JSON columns
movies_1 <- subset(movies, select = -c(genres,keywords,production_companies, production_countries,spoken_languages))
#Join the columns from all the generated dataframes from previous step
movies_new <- movies_1 %>%
full_join(genres_df, by = c("id", "title")) %>%
full_join(keywords_df, by = c("id", "title")) %>%
full_join(prod_cntry_df, by = c("id", "title")) %>%
full_join(prod_cmpny_df, by = c("id", "title")) %>%
full_join(spoken_lang_df, by = c("id", "title"))
#Have a look at the final dataset
glimpse(movies_new)
## Observations: 4,800
## Variables: 20
## $ budget <int> 237000000, 300000000, 245000000, 25000000...
## $ homepage <chr> "http://www.avatarmovie.com/", "http://di...
## $ id <int> 19995, 285, 206647, 49026, 49529, 559, 38...
## $ original_language <chr> "en", "en", "en", "en", "en", "en", "en",...
## $ original_title <chr> "Avatar", "Pirates of the Caribbean: At W...
## $ overview <chr> "In the 22nd century, a paraplegic Marine...
## $ popularity <dbl> 150.43758, 139.08262, 107.37679, 112.3129...
## $ release_date <date> 2009-12-10, 2007-05-19, 2015-10-26, 2012...
## $ revenue <dbl> 2787965087, 961000000, 880674609, 1084939...
## $ runtime <int> 162, 169, 148, 165, 132, 139, 100, 141, 1...
## $ status <chr> "Released", "Released", "Released", "Rele...
## $ tagline <chr> "Enter the World of Pandora.", "At the en...
## $ title <chr> "Avatar", "Pirates of the Caribbean: At W...
## $ vote_average <dbl> 7.2, 6.9, 6.3, 7.6, 6.1, 5.9, 7.4, 7.3, 7...
## $ vote_count <int> 11800, 4500, 4466, 9106, 2124, 3576, 3330...
## $ genres <fct> Action,Adventure,Fantasy,Science Fiction,...
## $ keywords <fct> culture clash,future,space war,space colo...
## $ production_countries <fct> United States of America,United Kingdom, ...
## $ production_companies <fct> Ingenious Film Partners,Twentieth Century...
## $ spoken_languages <fct> English,Español, English, Français,Englis...
size <- dim(movies_new)
We find that there are 4800 observations and 20 columns.
We wanted to check there were how many rows in the data set with complete values for all the columns.
complete_data <- sum(complete.cases(movies_new))
We find that there are 4102 rows with no missing data in the dataset. We did not remove any of the missing values for now. We are planning to look at each column separately and see if we can perform any imputations (if required) while performing the analysis.
The table below is the preview of the final dataset. We have printed the first 100 rows of the dataset.
Each row corresponds to a movie and each column is a feature corresponding to the movie.
movies_new <- select(movies_new, title, everything())
datatable(head(movies_new,100))
The final dataset after performing data cleaning has the following columns. The class of each of the column is also presented below.
col <- data.frame(sapply(movies_new, class)) #Gettting the class of each column
Row_names <- rownames(col) #Getting the row names which ae the columns of our dataset
class <- col[,1]
Data_types <- cbind(Column = Row_names, Class = as.character(class))
Data_types <- Data_types[2:nrow(Data_types),] #Removing the first entry alone because it is for the row number
kable(Data_types) #Displays the table as shown below
| Column | Class |
|---|---|
| budget | integer |
| homepage | character |
| id | integer |
| original_language | character |
| original_title | character |
| overview | character |
| popularity | numeric |
| release_date | Date |
| revenue | numeric |
| runtime | integer |
| status | character |
| tagline | character |
| vote_average | numeric |
| vote_count | integer |
| genres | factor |
| keywords | factor |
| production_countries | factor |
| production_companies | factor |
| spoken_languages | factor |
Given the dataset, we can perform a plethora of analysis is what we believe. Following are some of the questions we are trying to answer with our exploratory data analysis.