Packages used:
library(dplyr)
library(jsonlite)
library(tidyr)
library(DT)
I have included the packages that I have used so far.
Source of data is Kaggle - TMDB 5000 Movie Data set
The data set was created to answer different questions about movies such as what can be said about success of a movie, have certain production companies found a formula for success of a movie, why does a film with very high budget fail at box office and many more such questions.
It contains 20 variables and 4803 observations.
Data Import code:
movies <- read.csv("tmdb_5000_movies.csv", header = TRUE, stringsAsFactors = FALSE)
class(movies)
## [1] "data.frame"
dim(movies)
## [1] 4803 20
colnames(movies)
## [1] "budget" "genres" "homepage"
## [4] "id" "keywords" "original_language"
## [7] "original_title" "overview" "popularity"
## [10] "production_companies" "production_countries" "release_date"
## [13] "revenue" "runtime" "spoken_languages"
## [16] "status" "tagline" "title"
## [19] "vote_average" "vote_count"
colnames(movies)[4] <- "movie_id" #Renaming id column as movie_id
The original data set has 20 variables and their data types are mentioned in brackets. They are as follows:
Preview of first 100 rows of original data:
datatable(head(movies, n = 100))
Missing Values:
There are several missing values in the data set. I have assigned NA to observations having blank, 0 and [] values. The number of complete cases are as follows:
#Assign NA to blank values
movies[movies == ""] <- NA
movies[movies == "[]"] <- NA
movies[movies == 0] <- NA
sum(complete.cases(movies))
## [1] 1225
However, I am retaining all the observations now. I will delete them later during data manipulation
Duplicate Values:
Checking for duplicate movie titles using title column and removing the duplicate observations.
#Checking for duplicate movie title and removing duplicate values
movies <- movies[!duplicated(movies$title), ]
dim(movies)
## [1] 4800 20
There were three duplicate rows and they were removed.
There are five columns in the original data set which have JSON values. They have been extracted and multiple values are stored in one column separated by comma.
#This code has been inspired from kernel 'Tidydata Movie Data set exploration' from Kaggle
#Creating a tibble, Keywords, which stores keywords
keywords <- movies %>%
filter(nchar(keywords) > 2) %>% # fiter out blank keywords field
mutate( # create a new field
js = lapply(keywords, fromJSON) # containing a LIST of keyword and value pairs
) %>% # called id and name
unnest(js) %>% # turn each keyword/value pairs in the LIST into a row
select(movie_id, title, keywords = name)
#Combining the keywords of a movie in a single column
keywords <- aggregate(keywords ~.,data = keywords, paste, collapse = ",")
#Creating a tibble, genres, which stores genres
genres <- movies %>%
filter(nchar(genres) > 2) %>%
mutate(
js = lapply(genres, fromJSON)
) %>%
unnest(js) %>%
select(movie_id, title, genres = name)
#Combining genres of a movie in a single column
genres <- aggregate(genres ~.,data = genres, paste, collapse = ",")
#Creating a tibble, production_companies, which stores production companies
production_companies <- movies %>%
filter(nchar(production_companies) > 2) %>%
mutate(
js = lapply(production_companies, fromJSON)
) %>%
unnest(js) %>%
select(movie_id, title, production_companies = name)
#Combining production_companies of a movie in a single column
production_companies <- aggregate(production_companies ~.,data = production_companies, paste, collapse = ",")
#Creating a tibble, production_countries, which stores production countries
production_countries <- movies %>%
filter(nchar(production_countries) > 2) %>%
mutate(
js = lapply(production_countries, fromJSON)
) %>%
unnest(js) %>%
select(movie_id, title, production_countries = name)
#Combining production_countries of a movie in a single column
production_countries <- aggregate(production_countries ~.,data = production_countries, paste, collapse = ",")
#Creating a tibble, spoken_languages, which stores languages of the movies
spoken_languages <- movies %>%
filter(nchar(spoken_languages) > 2) %>%
mutate(
js = lapply(spoken_languages, fromJSON)
) %>%
unnest(js) %>%
select(movie_id, title, spoken_languages = iso_639_1)
#Combining spoken_languages of a movie in a single column
spoken_languages <- aggregate(spoken_languages ~.,data = spoken_languages, paste, collapse = ",")
#Dropping existing columns - keywords, genres, production_companies, production_countries, spoken_languages
movies <- movies %>%
select(budget, homepage, movie_id, original_language, original_title, overview, popularity, release_date,
revenue, runtime, status, tagline, title, vote_average, vote_count)
#Attaching columns - keywords, genres, production_companies, production_countries, spoken_languages using full_join in order to retain all observations.
movies <- movies %>%
full_join(keywords, by = c("movie_id", "title")) %>%
full_join(genres, by = c("movie_id", "title")) %>%
full_join(production_companies, by = c("movie_id", "title")) %>%
full_join(production_countries, by = c("movie_id", "title")) %>%
full_join(spoken_languages, by = c("movie_id", "title"))
Preview of first 100 rows of cleaned data:
datatable(head(movies, n = 100))
Some of the analysis I plan to do include:
Plots and tables which will help in analysis are:
What I need to learn: