Analysis of TMDB Movies Data

Introduction

  1. Problem statement: To analyze TMDB 5000 Movie data set obatined from Kaggle in order to provide insights into such as
  • Difference in movies produced by different countries and production companies in terms of number produced, duration, genres, languages, budget, revenue, etc.
  • Analyze trend over the years in terms of number produced, genres, duration, etc.
  1. Proposed Approach:
  • Preliminary analysis of data to see number of missing and abnormal values
  • Cleaning of data
  • Data Manipulation - the data set has few columns in JSON format from which data has to be extracted. The resulting data frames will be joined to existing data set.
  • Data Visualization - Using ggplots to visualize data
  1. How is it useful to consumers: The analysis will help consumers understand what languages, genres, duration, etc generates more revenue.

Packages Required

Packages used:

  • Dplyr - For data manipulation
  • Jsonlite - To extract JSON data
  • Tidyr - To clean data
  • DT - To create data table so as to view data
library(dplyr)
library(jsonlite)
library(tidyr)
library(DT)

I have included the packages that I have used so far.

Data Preparation

Data Source

Source of data is Kaggle - TMDB 5000 Movie Data set

The data set was created to answer different questions about movies such as what can be said about success of a movie, have certain production companies found a formula for success of a movie, why does a film with very high budget fail at box office and many more such questions.

It contains 20 variables and 4803 observations.

Data Import code:

movies <- read.csv("tmdb_5000_movies.csv", header = TRUE, stringsAsFactors = FALSE)

class(movies)
## [1] "data.frame"
dim(movies)
## [1] 4803   20
colnames(movies)
##  [1] "budget"               "genres"               "homepage"            
##  [4] "id"                   "keywords"             "original_language"   
##  [7] "original_title"       "overview"             "popularity"          
## [10] "production_companies" "production_countries" "release_date"        
## [13] "revenue"              "runtime"              "spoken_languages"    
## [16] "status"               "tagline"              "title"               
## [19] "vote_average"         "vote_count"
colnames(movies)[4] <- "movie_id"    #Renaming id column as movie_id

Summary of Variables

The original data set has 20 variables and their data types are mentioned in brackets. They are as follows:

  • Budget (dbl) - Budget of a movie
  • Genres (chr)- Genre of a movie such as comedy, action, etc
  • Homepage (chr) - url of a movie
  • id (dbl) - ID of a movie
  • Keywords (dbl) - Major keywords describing a movie
  • Original_language (chr) - Original language in which a movie was made
  • Original_title (chr) - Title of a movie
  • Overview (chr) - Brief description of a movie
  • Popularity (dbl)
  • Production_companies (chr) - Names of the companies producing a movie
  • Production_countries (chr) - Names of the countries producing a movie
  • Release_date (dbl) - Release date of a movie
  • Revenue (dbl) - Revenue eanred by a movie
  • Runtime (dbl) - Duration of a movie
  • Spoken_languages (chr) - Languages in which movie is available
  • Status (chr) - Status of a movie, if it’s released, in production or rumoured to be produced
  • Tagline (chr) - Tagline of a movie
  • Title (chr) - Actual title of a movie
  • Vote_average (dbl) - Average vote received for a movie in IMDB
  • Vote_count(dbl) - Number of votes received by a movie

Original data set

Preview of first 100 rows of original data:

datatable(head(movies, n = 100))

Data Cleaning

Missing Values:

There are several missing values in the data set. I have assigned NA to observations having blank, 0 and [] values. The number of complete cases are as follows:

#Assign NA to blank values
movies[movies == ""] <- NA
movies[movies == "[]"] <- NA
movies[movies == 0] <- NA

sum(complete.cases(movies))
## [1] 1225

However, I am retaining all the observations now. I will delete them later during data manipulation

Duplicate Values:

Checking for duplicate movie titles using title column and removing the duplicate observations.

#Checking for duplicate movie title and removing duplicate values
movies <- movies[!duplicated(movies$title), ]
dim(movies)
## [1] 4800   20

There were three duplicate rows and they were removed.

There are five columns in the original data set which have JSON values. They have been extracted and multiple values are stored in one column separated by comma.

#This code has been inspired from kernel 'Tidydata Movie Data set exploration' from Kaggle
#Creating a tibble, Keywords, which stores keywords
keywords <- movies %>%    
  filter(nchar(keywords) > 2) %>%                 # fiter out blank keywords field
  mutate(                                         # create a new field 
    js = lapply(keywords, fromJSON)               # containing a LIST of keyword and value pairs
  ) %>%                                           # called id and name
  unnest(js) %>%                                  # turn each keyword/value pairs in the LIST into a row
  select(movie_id, title, keywords = name)

#Combining the keywords of a movie in a single column
keywords <- aggregate(keywords ~.,data = keywords, paste, collapse = ",")

#Creating a tibble, genres, which stores genres
genres <- movies %>%    
  filter(nchar(genres) > 2) %>%                   
  mutate(                                          
    js = lapply(genres, fromJSON)                 
  ) %>%                                           
  unnest(js) %>%                                  
  select(movie_id, title, genres = name) 

#Combining genres of a movie in a single column
genres <- aggregate(genres ~.,data = genres, paste, collapse = ",")

#Creating a tibble, production_companies, which stores production companies
production_companies <- movies %>%    
  filter(nchar(production_companies) > 2) %>%     
  mutate(                                         
    js = lapply(production_companies, fromJSON)   
  ) %>%                                           
  unnest(js) %>%                                  
  select(movie_id, title, production_companies = name)

#Combining production_companies of a movie in a single column
production_companies <- aggregate(production_companies ~.,data = production_companies, paste, collapse = ",")

#Creating a tibble, production_countries, which stores production countries
production_countries <- movies %>%    
  filter(nchar(production_countries) > 2) %>%     
  mutate(                                         
    js = lapply(production_countries, fromJSON)   
  ) %>%                                          
  unnest(js) %>%                                  
  select(movie_id, title, production_countries = name) 

#Combining production_countries of a movie in a single column
production_countries <- aggregate(production_countries ~.,data = production_countries, paste, collapse = ",")


#Creating a tibble, spoken_languages, which stores languages of the movies
spoken_languages <- movies %>%    
  filter(nchar(spoken_languages) > 2) %>%        
  mutate(                                         
    js = lapply(spoken_languages, fromJSON)      
  ) %>%                                          
  unnest(js) %>%                                 
  select(movie_id, title, spoken_languages = iso_639_1) 

#Combining spoken_languages of a movie in a single column
spoken_languages <- aggregate(spoken_languages ~.,data = spoken_languages, paste, collapse = ",")

#Dropping existing columns - keywords, genres, production_companies, production_countries, spoken_languages
movies <- movies %>%
  select(budget, homepage, movie_id, original_language, original_title, overview, popularity, release_date,
         revenue, runtime, status, tagline, title, vote_average, vote_count)

#Attaching columns - keywords, genres, production_companies, production_countries, spoken_languages using full_join in order to retain all observations.
movies <- movies %>%
            full_join(keywords, by = c("movie_id", "title")) %>%
            full_join(genres, by = c("movie_id", "title")) %>%
            full_join(production_companies, by = c("movie_id", "title")) %>%
            full_join(production_countries, by = c("movie_id", "title")) %>%
            full_join(spoken_languages, by = c("movie_id", "title"))

Data Preview

Preview of first 100 rows of cleaned data:

datatable(head(movies, n = 100))

Proposed Exploratory Data Analysis

Some of the analysis I plan to do include:

  • Basic analysis such as popular genres and languages
  • Analyzing duration, budget and revenue of movies produced by different countries and production companies
  • Trend analysis over the years in terms of number of movies produced, genres, duration and languages of the movies
  • To see which variables influence revenue of a movie using linear regression models

Plots and tables which will help in analysis are:

  • Bar charts
  • Scatter plots
  • Ggplots

What I need to learn:

  • Ggplots