Data Wrangling Project

Analysis of TMDB Movies Data

Introduction

Problem statement: To analyze TMDB 5000 Movie data set obatined from Kaggle in order to provide insights into such as

Difference in movies produced by different countries and production companies in terms of number produced, duration, genres, languages, budget, revenue, etc.
Analyze trend over the years in terms of number produced, genres, duration, etc.

Proposed Approach:

Preliminary analysis of data to see number of missing and abnormal values
Cleaning of data
Data Manipulation - the data set has few columns in JSON format from which data has to be extracted. The resulting data frames will be joined to existing data set.
Data Visualization - Using ggplots to visualize data

How is it useful to consumers: The analysis will help consumers understand what languages, genres, duration, etc generates more revenue.

Packages Required

Packages used:

Dplyr - For data manipulation
Jsonlite - To extract JSON data
Tidyr - To clean data
DT - To create data table so as to view data

library(dplyr)
library(jsonlite)
library(tidyr)
library(DT)

I have included the packages that I have used so far.

Data Preparation

Data Source

Source of data is Kaggle - TMDB 5000 Movie Data set

The data set was created to answer different questions about movies such as what can be said about success of a movie, have certain production companies found a formula for success of a movie, why does a film with very high budget fail at box office and many more such questions.

It contains 20 variables and 4803 observations.

Data Import code:

movies <- read.csv("tmdb_5000_movies.csv", header = TRUE, stringsAsFactors = FALSE)

class(movies)

## [1] "data.frame"

dim(movies)

## [1] 4803   20

colnames(movies)

##  [1] "budget"               "genres"               "homepage"            
##  [4] "id"                   "keywords"             "original_language"   
##  [7] "original_title"       "overview"             "popularity"          
## [10] "production_companies" "production_countries" "release_date"        
## [13] "revenue"              "runtime"              "spoken_languages"    
## [16] "status"               "tagline"              "title"               
## [19] "vote_average"         "vote_count"

colnames(movies)[4] <- "movie_id"    #Renaming id column as movie_id

Summary of Variables

The original data set has 20 variables and their data types are mentioned in brackets. They are as follows:

Budget (dbl) - Budget of a movie
Genres (chr)- Genre of a movie such as comedy, action, etc
Homepage (chr) - url of a movie
id (dbl) - ID of a movie
Keywords (dbl) - Major keywords describing a movie
Original_language (chr) - Original language in which a movie was made
Original_title (chr) - Title of a movie
Overview (chr) - Brief description of a movie
Popularity (dbl)
Production_companies (chr) - Names of the companies producing a movie
Production_countries (chr) - Names of the countries producing a movie
Release_date (dbl) - Release date of a movie
Revenue (dbl) - Revenue eanred by a movie
Runtime (dbl) - Duration of a movie
Spoken_languages (chr) - Languages in which movie is available
Status (chr) - Status of a movie, if it’s released, in production or rumoured to be produced
Tagline (chr) - Tagline of a movie
Title (chr) - Actual title of a movie
Vote_average (dbl) - Average vote received for a movie in IMDB
Vote_count(dbl) - Number of votes received by a movie

Original data set

Preview of first 100 rows of original data:

datatable(head(movies, n = 100))

Data Cleaning

Missing Values:

There are several missing values in the data set. I have assigned NA to observations having blank, 0 and [] values. The number of complete cases are as follows:

#Assign NA to blank values
movies[movies == ""] <- NA
movies[movies == "[]"] <- NA
movies[movies == 0] <- NA

sum(complete.cases(movies))

## [1] 1225

However, I am retaining all the observations now. I will delete them later during data manipulation

Duplicate Values:

Checking for duplicate movie titles using title column and removing the duplicate observations.

#Checking for duplicate movie title and removing duplicate values
movies <- movies[!duplicated(movies$title), ]
dim(movies)

## [1] 4800   20

There were three duplicate rows and they were removed.

There are five columns in the original data set which have JSON values. They have been extracted and multiple values are stored in one column separated by comma.

#This code has been inspired from kernel 'Tidydata Movie Data set exploration' from Kaggle
#Creating a tibble, Keywords, which stores keywords
keywords <- movies %>%    
  filter(nchar(keywords) > 2) %>%                 # fiter out blank keywords field
  mutate(                                         # create a new field 
    js = lapply(keywords, fromJSON)               # containing a LIST of keyword and value pairs
  ) %>%                                           # called id and name
  unnest(js) %>%                                  # turn each keyword/value pairs in the LIST into a row
  select(movie_id, title, keywords = name)

#Combining the keywords of a movie in a single column
keywords <- aggregate(keywords ~.,data = keywords, paste, collapse = ",")

#Creating a tibble, genres, which stores genres
genres <- movies %>%    
  filter(nchar(genres) > 2) %>%                   
  mutate(                                          
    js = lapply(genres, fromJSON)                 
  ) %>%                                           
  unnest(js) %>%                                  
  select(movie_id, title, genres = name) 

#Combining genres of a movie in a single column
genres <- aggregate(genres ~.,data = genres, paste, collapse = ",")

#Creating a tibble, production_companies, which stores production companies
production_companies <- movies %>%    
  filter(nchar(production_companies) > 2) %>%     
  mutate(                                         
    js = lapply(production_companies, fromJSON)   
  ) %>%                                           
  unnest(js) %>%                                  
  select(movie_id, title, production_companies = name)

#Combining production_companies of a movie in a single column
production_companies <- aggregate(production_companies ~.,data = production_companies, paste, collapse = ",")

#Creating a tibble, production_countries, which stores production countries
production_countries <- movies %>%    
  filter(nchar(production_countries) > 2) %>%     
  mutate(                                         
    js = lapply(production_countries, fromJSON)   
  ) %>%                                          
  unnest(js) %>%                                  
  select(movie_id, title, production_countries = name) 

#Combining production_countries of a movie in a single column
production_countries <- aggregate(production_countries ~.,data = production_countries, paste, collapse = ",")


#Creating a tibble, spoken_languages, which stores languages of the movies
spoken_languages <- movies %>%    
  filter(nchar(spoken_languages) > 2) %>%        
  mutate(                                         
    js = lapply(spoken_languages, fromJSON)      
  ) %>%                                          
  unnest(js) %>%                                 
  select(movie_id, title, spoken_languages = iso_639_1) 

#Combining spoken_languages of a movie in a single column
spoken_languages <- aggregate(spoken_languages ~.,data = spoken_languages, paste, collapse = ",")

#Dropping existing columns - keywords, genres, production_companies, production_countries, spoken_languages
movies <- movies %>%
  select(budget, homepage, movie_id, original_language, original_title, overview, popularity, release_date,
         revenue, runtime, status, tagline, title, vote_average, vote_count)

#Attaching columns - keywords, genres, production_companies, production_countries, spoken_languages using full_join in order to retain all observations.
movies <- movies %>%
            full_join(keywords, by = c("movie_id", "title")) %>%
            full_join(genres, by = c("movie_id", "title")) %>%
            full_join(production_companies, by = c("movie_id", "title")) %>%
            full_join(production_countries, by = c("movie_id", "title")) %>%
            full_join(spoken_languages, by = c("movie_id", "title"))

Data Preview

Preview of first 100 rows of cleaned data:

datatable(head(movies, n = 100))

Proposed Exploratory Data Analysis

Some of the analysis I plan to do include:

Basic analysis such as popular genres and languages
Analyzing duration, budget and revenue of movies produced by different countries and production companies
Trend analysis over the years in terms of number of movies produced, genres, duration and languages of the movies
To see which variables influence revenue of a movie using linear regression models

Plots and tables which will help in analysis are:

Bar charts
Scatter plots
Ggplots

What I need to learn:

Ggplots