Required packages

Let’s install the necessary packages to proceed with this project.


library(readr)
library(tidyr)
library(dplyr)
library(Hmisc)
library(outliers)
library(stringr)
library(lubridate)
library(editrules)
library(jsonlite)

Executive Summary

The purpose of this project is to apply various data pre-processing techniques to process and tidy up our datasets before conducting the analysis. The two datasets are collected from Kaggle: the first set named IMDB 5000 Movie Dataset contains data of around 5000 movies of 2017 or earlier from IMDb and the second set named TMDb 5000 Movie Dataset contains data of around 5000 movies of 2017 or earlier from TMDb.

Since both the datasets have different data which are required for the further analysis, we’ll start by merging these two datasets into a single dataset and keeping just the required variables for the purpose of our further analysis. In all these datasets, each row represents a single movie. In order to keep the dataset simple for achieving our main aim of applying data pre-processing techniques, we’ll just select 5 variables from the first dataset and 9 variables from the second, thus, 13 variables in the merged one.

Then, we’ll try to understand the data by checking its structure and more. Then, we’ll tidy up and manipulate the data to ensure our final dataset follows the rules of tidy data. Then, we’ll scan the data for missing values, errors, and outliers. We’ll try to get info on outliers and solve them or remove them per the dataset. Finally, we’ll apply some data transformation techniques on the dataset as well.

Data

1. IMDB 5000 Movie Dataset

The dataset IMDB 5000 Movie Dataset contains the movie data from IMDb (Internet Movie Database). It contains 4998 observations and 28 variables (columns), which are described below. I’m going to use some of them as detailed later.

Data Source: IMDB 5000 Movie Dataset from Kaggle.


# importing the first dataset
movies_data1 <- read_csv("imdb_5000_movies.csv")

# remove duplicate rows
movies_data1 <- distinct(movies_data1)

# check the dimensions
dim(movies_data1)
[1] 4998   28
# preview the dataset
head(movies_data1)
NA

2. TMDB 5000 Movie Dataset

The dataset TMDB 5000 Movie Dataset contains the movie data from TMDb (The Movie Database). It contains 4803 observations and 20 variables (columns), which are described below. I’m going to use some of them as detailed later.

Data Source: TMDB 5000 Movie Dataset from Kaggle.


# importing the second dataset
movies_data2 <- read_csv("tmdb_5000_movies.csv")

# remove duplicate rows
movies_data2 <- distinct(movies_data2)

# check the dimensions
dim(movies_data2)
[1] 4803   20
# preview the dataset
head(movies_data2)
NA

3. Merged Dataset: movies

First of all, let’s extract and rename the necessary columns in both the datasets before merging them in the final dataset. Then finally, let’s merge both the datasets to create the movies data frame, whose columns (13 in count) are described below.

# Variable Description
1 title Title of the movie
2 country Country of the movie
3 genres Genres of the movie
4 imdb_score IMDb score for the movie
5 fb_likes FB likes of the movie
6 tagline Tagline of the movie
7 release_date Release date of the movie
8 language Language of the movie
9 keywords Keywords for the movie
10 popularity Popularity of the movie
11 kmdb_vote Average vote for the movie
12 budget Budget of the movie (in USD)
13 revenue Gross revenue of the movie (in USD)

# select and re-order the variables

movies_data1 <- movies_data1[, c(12, 21, 10, 26, 28)]

movies_data2 <- movies_data2[, c(18, 17, 12, 6, 5, 9, 19, 1, 13)]

# rename the selected column names

movies_data1 <- rename(movies_data1, title = movie_title, fb_likes = movie_facebook_likes)

movies_data2 <- rename(movies_data2, language = original_language, tmdb_score = vote_average)

# join both of the above datasets
movies <- inner_join(movies_data2, movies_data1, by = "title")

# check the dimensions
dim(movies)
[1]  0 13
# preview the dataset
head(movies)
NA


Why there are 0 rows? Maybe, data is untidy. It requires a fix!

The merged dataset has 0 rows or observations, but the other two datasets have thousands of rows.

So, what’s the reason of this error? The first data frame movies_data from the dataset imdb_5000_movies.csv has data errors. Its title column has some unwanted characters, which was very difficult to find (debug) since R shows it as a blank space, unfortunately!

I found and solved the problem using the below steps:

  1. Print the first title from both the datasets.
  2. movies_data1$title shows a space at last. It shouldn’t be since read_csv() trims spaces.
  3. Nevertheless, replace the space with nothing using str_replace_all on movies_data1$title.

  4. It didn’t work as well - no rows in the merged dataset yet.

  5. So, let’s print the first title as UTF-8 using utf8ToInt().
  6. It shows that the space-looking character has an int value = 160.
  7. However, the space (" ") has an int value = 32, so it’s not a space.
  8. Finally, replace that character with nothing using str_replace_all on movies_data1$title.

  9. And at last, it worked! After merging the two datasets now, the merged dataset shows rows.


# let's print title from both datasets

movies_data1$title[1]
[1] "Avatar "
movies_data2$title[1]
[1] "Avatar"
# movies_data1$title has a space at last, so let's fix it
movies_data1$title <- str_replace_all(movies_data1$title, " $", "")

# join both of the above datasets
movies <- inner_join(movies_data2, movies_data1, by = "title")

# check the dimensions
dim(movies)
[1]  0 13
# preview the dataset
head(movies)

# it doesn't work - still, so is it not a space?

# let's print it in UTF-8 to understand the issue
utf8ToInt(movies_data1$title[1])
[1]  65 118  97 116  97 114 160
# it's not a space, since space != 160
utf8ToInt(" ")
[1] 32
# so, let's fix by replacing it
movies_data1$title <- str_replace_all(movies_data1$title, intToUtf8(160), "")

# join both of the above datasets
movies <- inner_join(movies_data2, movies_data1, by = "title")

# remove duplicate entries
movies <- distinct(movies)

# check the dimensions
dim(movies)
[1] 4602   13
# preview the dataset
head(movies)
NA

Understand

Summarise the types of variables and data structures, check the attributes in the data and apply proper data type conversions. In addition to the R codes and outputs, explain briefly the steps that you have taken. In this section, show that you have fulfilled minimum requirements 2-4.

Let’s understand the data - the data frame movies in this section.

  1. We’ll start with dim() to understand its dimensions.
  2. Then, we’ll check its structure using str().
  3. Lastly, we’ll check the labels for two variables - language and country.

As visualized below, movies has 4602 observations (rows) and 13 variables (columns). Also, it has different types of data including character (chr), numeric (num), and date (Date), as it’s shown below in the output of the str() function. Then, we’ll do these:

  1. We’ll add factors (with labels) to language since its data is two-letter country code.
  2. We’ll add factors (without labels) to country since its data is already labelled.
  3. Finally, we’ll check the levels and the structure of the movies to confirm the changes.

# check the dimensions
dim(movies)
[1] 4602   13
# check the structure
str(movies)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   4602 obs. of  13 variables:
 $ title       : chr  "Avatar" "Pirates of the Caribbean: At World's End" "Spectre" "The Dark Knight Rises" ...
 $ tagline     : chr  "Enter the World of Pandora." "At the end of the world, the adventure begins." "A Plan No One Escapes" "The Legend Ends" ...
 $ release_date: Date, format: "2009-12-10" "2007-05-19" "2015-10-26" ...
 $ language    : chr  "en" "en" "en" "en" ...
 $ keywords    : chr  "[{\"id\": 1463, \"name\": \"culture clash\"}, {\"id\": 2964, \"name\": \"future\"}, {\"id\": 3386, \"name\": \""| __truncated__ "[{\"id\": 270, \"name\": \"ocean\"}, {\"id\": 726, \"name\": \"drug abuse\"}, {\"id\": 911, \"name\": \"exotic "| __truncated__ "[{\"id\": 470, \"name\": \"spy\"}, {\"id\": 818, \"name\": \"based on novel\"}, {\"id\": 4289, \"name\": \"secr"| __truncated__ "[{\"id\": 849, \"name\": \"dc comics\"}, {\"id\": 853, \"name\": \"crime fighter\"}, {\"id\": 949, \"name\": \""| __truncated__ ...
 $ popularity  : num  150.4 139.1 107.4 112.3 43.9 ...
 $ tmdb_score  : num  7.2 6.9 6.3 7.6 6.1 5.9 7.4 7.3 7.4 5.7 ...
 $ budget      : num  2.37e+08 3.00e+08 2.45e+08 2.50e+08 2.60e+08 2.58e+08 2.60e+08 2.80e+08 2.50e+08 2.50e+08 ...
 $ revenue     : num  2.79e+09 9.61e+08 8.81e+08 1.08e+09 2.84e+08 ...
 $ country     : chr  "USA" "USA" "UK" "USA" ...
 $ genres      : chr  "Action|Adventure|Fantasy|Sci-Fi" "Action|Adventure|Fantasy" "Action|Adventure|Thriller" "Action|Thriller" ...
 $ imdb_score  : num  7.9 7.1 6.8 8.5 6.6 6.2 7.8 7.5 7.5 6.9 ...
 $ fb_likes    : num  33000 0 85000 164000 24000 0 29000 118000 10000 197000 ...
# check the levels

levels(movies$language)
NULL
levels(movies$country)
NULL
# add factors to movies$language
movies <- mutate(movies, language = factor(language, 
                                           levels = c("af", "ar", "cn", "cs", "da", "de", "el", "en", "es", "fa", "fr", "he", "hi", "hu", 
                                                      "id", "is", "it", "ja", "ko", "ky", "nb", "nl", "no", "pl", "ps", "pt", "ro", "ru", 
                                                      "sl", "sv", "ta", "te", "th", "tr", "vi", "zh"), 
                                           labels = c("Afrikaans", "Arabic", "Chinese", "Czech", "Danish", "German", "Greek, Modern", 
                                                      "English", "Spanish", "Persian", "French", "Hebrew", "Hindi", "Hungarian", 
                                                      "Indonesian", "Icelandic", "Italian", "Japanese", "Korean", "Kyrgyz", "Norwegian", 
                                                      "Dutch", "Norwegian", "Polish", "Pashto", "Portuguese", "Romanian", "Russian", 
                                                      "Slovenian", "Swedish", "Tamil", "Telugu", "Thai", "Turkish", "Vietnamese", 
                                                      "Chinese")
                                           ))

movies <- mutate(movies, country = factor(country))

# check the structure
str(movies)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   4602 obs. of  13 variables:
 $ title       : chr  "Avatar" "Pirates of the Caribbean: At World's End" "Spectre" "The Dark Knight Rises" ...
 $ tagline     : chr  "Enter the World of Pandora." "At the end of the world, the adventure begins." "A Plan No One Escapes" "The Legend Ends" ...
 $ release_date: Date, format: "2009-12-10" "2007-05-19" "2015-10-26" ...
 $ language    : Factor w/ 34 levels "Afrikaans","Arabic",..: 8 8 8 8 8 8 8 8 8 8 ...
 $ keywords    : chr  "[{\"id\": 1463, \"name\": \"culture clash\"}, {\"id\": 2964, \"name\": \"future\"}, {\"id\": 3386, \"name\": \""| __truncated__ "[{\"id\": 270, \"name\": \"ocean\"}, {\"id\": 726, \"name\": \"drug abuse\"}, {\"id\": 911, \"name\": \"exotic "| __truncated__ "[{\"id\": 470, \"name\": \"spy\"}, {\"id\": 818, \"name\": \"based on novel\"}, {\"id\": 4289, \"name\": \"secr"| __truncated__ "[{\"id\": 849, \"name\": \"dc comics\"}, {\"id\": 853, \"name\": \"crime fighter\"}, {\"id\": 949, \"name\": \""| __truncated__ ...
 $ popularity  : num  150.4 139.1 107.4 112.3 43.9 ...
 $ tmdb_score  : num  7.2 6.9 6.3 7.6 6.1 5.9 7.4 7.3 7.4 5.7 ...
 $ budget      : num  2.37e+08 3.00e+08 2.45e+08 2.50e+08 2.60e+08 2.58e+08 2.60e+08 2.80e+08 2.50e+08 2.50e+08 ...
 $ revenue     : num  2.79e+09 9.61e+08 8.81e+08 1.08e+09 2.84e+08 ...
 $ country     : Factor w/ 61 levels "Afghanistan",..: 60 60 59 60 60 60 60 60 59 60 ...
 $ genres      : chr  "Action|Adventure|Fantasy|Sci-Fi" "Action|Adventure|Fantasy" "Action|Adventure|Thriller" "Action|Thriller" ...
 $ imdb_score  : num  7.9 7.1 6.8 8.5 6.6 6.2 7.8 7.5 7.5 6.9 ...
 $ fb_likes    : num  33000 0 85000 164000 24000 0 29000 118000 10000 197000 ...
# check the levels

levels(movies$language)
 [1] "Afrikaans"     "Arabic"        "Chinese"       "Czech"         "Danish"        "German"       
 [7] "Greek, Modern" "English"       "Spanish"       "Persian"       "French"        "Hebrew"       
[13] "Hindi"         "Hungarian"     "Indonesian"    "Icelandic"     "Italian"       "Japanese"     
[19] "Korean"        "Kyrgyz"        "Norwegian"     "Dutch"         "Polish"        "Pashto"       
[25] "Portuguese"    "Romanian"      "Russian"       "Slovenian"     "Swedish"       "Tamil"        
[31] "Telugu"        "Thai"          "Turkish"       "Vietnamese"   
levels(movies$country)
 [1] "Afghanistan"    "Argentina"      "Aruba"          "Australia"      "Bahamas"       
 [6] "Belgium"        "Brazil"         "Bulgaria"       "Cambodia"       "Cameroon"      
[11] "Canada"         "Chile"          "China"          "Colombia"       "Czech Republic"
[16] "Denmark"        "Egypt"          "Finland"        "France"         "Georgia"       
[21] "Germany"        "Greece"         "Hong Kong"      "Hungary"        "Iceland"       
[26] "India"          "Iran"           "Ireland"        "Israel"         "Italy"         
[31] "Japan"          "Kenya"          "Kyrgyzstan"     "Libya"          "Mexico"        
[36] "Netherlands"    "New Line"       "New Zealand"    "Nigeria"        "Norway"        
[41] "Official site"  "Pakistan"       "Panama"         "Peru"           "Philippines"   
[46] "Poland"         "Romania"        "Russia"         "Slovakia"       "South Africa"  
[51] "South Korea"    "Soviet Union"   "Spain"          "Sweden"         "Switzerland"   
[56] "Taiwan"         "Thailand"       "Turkey"         "UK"             "USA"           
[61] "West Germany"  

Tidy & Manipulate Data I

According to The Comprehensive R Archive Network, "tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table."

That said, the data frame movies is not tidy since the two columns - keywords and genres are not tidy.

Hence, we need to tidy up these two columns. I followed the below steps to tidy up these columns:

keywords

  1. keywords is a column with JSON strings, for example, [{“id”: 1463, “name”: “culture clash”}, {“id”: 2964, “name”: “future”}].
  2. We’ll first convert this JSON into a delimited string, then separate the data into multiple observations (rows) using unnest().
  3. We’ll use fromJSON() to ease the process of parsing or reading JSON objects or strings, else it will become very cumbersome.
  4. The alternative approach could be to use regular expression to extract the name from the above JSON to form a delimited string.
  5. Then, we’ll use unnest() along with strsplit() on the delimited string to separate the delimited values over multiple rows.

genres

  1. genres is a column with multiple delimited values (genres of the movie), for example, Action|Adventure|Fantasy|Sci-Fi.
  2. We’ll use mutate() along with case_when() and grepl() to form a two case formula as given below:
    • If grepl() matches, then TRUE, i.e., genre is found.
    • Else FALSE, i.e., genre is not found.
  3. Using the above approach, we’ll separate the delimited values over multiple columns - one column for each genre.

# first convert JSON to delimited string

simplifyToString <- function(x) {
  fromJSON(x["keywords"])$name %>% paste(collapse = '|')
}

movies <- mutate(movies, keywords = apply(movies, 1, simplifyToString))

# now unnest the `keywords` column into rows
movies <- movies %>% unnest(keywords = strsplit(keywords, "[|]"))

# now unnest the `genres` column into columns

movies <- movies %>% mutate(action = case_when(grepl("Action", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(adventure = case_when(grepl("Adventure", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(animation = case_when(grepl("Animation", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(biography = case_when(grepl("Biography", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(comedy = case_when(grepl("Comedy", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(crime = case_when(grepl("Crime", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(documentary = case_when(grepl("Documentary", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(drama = case_when(grepl("Drama", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(family = case_when(grepl("Family", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(fantasy = case_when(grepl("Fantasy", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(filmnoir = case_when(grepl("Film-Noir", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(history = case_when(grepl("History", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(horror = case_when(grepl("Horror", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(musical = case_when(grepl("Musical", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(mystery = case_when(grepl("Mystery", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(news = case_when(grepl("News", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(romance = case_when(grepl("Romance", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(scifi = case_when(grepl("Sci-Fi", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(short = case_when(grepl("Short", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(sport = case_when(grepl("Sport", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(thriller = case_when(grepl("Thriller", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(war = case_when(grepl("War", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(western = case_when(grepl("Western", genres) ~ TRUE, TRUE ~ FALSE))

# finally delete the `genres` column
movies <- subset(movies, select = -c(genres))

# preview the dataset
head(movies)
NA

Tidy & Manipulate Data II

Now, let’s create one variable from the existing variables. We have budget and revenue variables in our dataset. We can calculate the gross profit of each movie using the difference of these two columns. That said, we can create a profit variable by substracting budget from revenue. We’ll use the mutate() function to create this new variable (column).


# find profit for each movie
movies <- movies %>% mutate(profit = revenue - budget)

# preview the dataset
head(movies)
NA

Scan I

Let’s check the data for missin values (NAs), inconsistencies, and obvious errors (like ratings are more than 10 or less than 0). As we’ve visualized below, there are two columns having NAs. Their names and how we’ll deal with them is below:

tagline

  1. tagline has just 3379 NAs, which is not a small number seeing that movies has 35065 observations = 0.096%.
  2. Since its percentage is less than 5% and it’s a character value, so it won’t cause any issues in calculations.
  3. Also, it’s tagline of a movie, which may be unavailable for some movies per its production companies.
  4. That said, we’ll leave the tagline with NAs as it is, since nothing can be done about it.

language

  1. language has just 5 NAs, which is very small number seeing that movies has 35065 observations = 0.00014%.
  2. Since its percentage is less than 5%, we can leave it as it’s a character value, so no issues in calculations.
  3. Or, we can replace the NAs in this column with “English” assuming it’s suggested by the analyst.
  4. Of course, “English” is the most popular language for the movies, as it’s calcualted below.

# scan for infinites
sum(is.infinite(as.matrix(movies)))
[1] 0
# scan for NANs
sum(is.nan(as.matrix(movies)))
[1] 0
# scan for NAs
sum(is.na(movies))
[1] 3384
# scan for NAs per variable
colSums(is.na(movies))
       title      tagline release_date     language     keywords   popularity   tmdb_score 
           0         3379            0            5            0            0            0 
      budget      revenue      country   imdb_score     fb_likes       action    adventure 
           0            0            0            0            0            0            0 
   animation    biography       comedy        crime  documentary        drama       family 
           0            0            0            0            0            0            0 
     fantasy     filmnoir      history       horror      musical      mystery         news 
           0            0            0            0            0            0            0 
     romance        scifi        short        sport     thriller          war      western 
           0            0            0            0            0            0            0 
      profit 
           0 
# check the incomplete cases
head(movies[!complete.cases(movies), ])

# check for obvious errors

rules <- editfile("rules.txt", type = "all")
violated <- violatedEdits(rules, movies)
summary(violated)
No violations detected, 0 checks evaluated to NA
NULL
# count the movies per `language`
movies %>% group_by(language) %>% summarise(no_rows = length(language)) %>% 
  arrange(desc(no_rows)) %>% head()
Factor `language` contains implicit NA, consider using `forcats::fct_explicit_na`
# replace `language` with NA to English
movies$language[is.na(movies$language)] <- "English"

Scan II

We’ll produce boxplots for checking outliers in out dataset. As it’s visualized below, our dataset has probable outliers. So, we calculated the z-scores of the variables and remove the outliered observations (rows) based on the absolute value of more than three.

Of course, it was not necessary to remove the outliers since the outliers are less than 5%. For budget , the outliers count is 698, which is just 0.0199% (which is almost negligible). Also, for revenue , the outliers count is 881, which is just 0.025% (it’s mostly negligible).

Then, after removing the outliers from those variables, we’ll produce the boxplots again. However, we visualized that the outliers are still present. That said, based on our observations, these outliers are not the same case because our dataset is not normally distributed, hence the outliers will be present. The reason being the variables we’re dealing here are budget and revenue, which are not normally distributed in the world. Some movies, like Avengers franchise, are produced with very high budgets, and they earn superb revenues as well. Then, some movies are very low budget movies.

Note: We’re going to do all these scannings on just two variables to keep the document length under 20 pages.


# draw boxplots

movies$budget %>%  boxplot(main="Box Plot of Movie's Budget", 
                               ylab="Budget ($)", col = "grey")


movies$revenue %>%  boxplot(main="Box Plot of Movie's Revenue", 
                               ylab="Revenue ($)", col = "grey")


# calculate z scores

zscores.budget <- movies$budget %>%  scores(type = "z")
zscores.budget %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.8026 -0.6945 -0.3703  0.0000  0.2782  7.4119 
zscores.revenue <- movies$revenue %>%  scores(type = "z")
zscores.revenue %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-0.5906 -0.5661 -0.3771  0.0000  0.1185 13.2535 
# number of outliers per z scores

length( which(abs(zscores.budget) >3 ))
[1] 698
length( which(abs(zscores.revenue) >3 ))
[1] 881
# exclude outliers

movies.budget <- movies[- which(abs(zscores.budget) >3 ),]

movies.revenue <- movies[- which(abs(zscores.revenue) >3 ),]

# boxplot excluding outliers

movies.budget$budget %>%  boxplot(main="Box Plot of Movie's Budget (without outliers)", 
                               ylab="Budget ($)", col = "grey")


movies.revenue$revenue %>%  boxplot(main="Box Plot of Movie's Revenue (without outliers)", 
                               ylab="Revenue ($)", col = "grey")

Transform

Let’s produce the histogram of the gross profit (profit variable), which shows that the data is right skewed. So, let’s transpose the profit variable using mutate() to perform a reciprocal transporation for every row (observation). The main idea behind this scheme is to reduce the right skewedness of our distribution and make it a little more normal, which is visible after the transformation.


# transpose the `profit` variable

movies.transpose <- 
  movies %>% mutate(profit = movies$profit^(-1))

# draw histograms for `profit` and transposed `profit`

hist(movies$profit,
     main = "Histogram of Movie's Gross Profit", xlab = "Gross Profit ($)")


hist(movies.transpose$profit,
     main = "Histogram of Transposed Gross Profit", xlab = "Gross Profit Reciprocal")



---
title: "MATH2349 Semester 2, 2019"
author: "Ashutosh Kumar Singh (s3797767)"
subtitle: Assignment 3
output:
  html_notebook: default
  html_document:
    df_print: paged
---



## Required packages 

Let's install the necessary packages to proceed with this project.

```{r message=FALSE}

library(readr)
library(tidyr)
library(dplyr)
library(Hmisc)
library(outliers)
library(stringr)
library(lubridate)
library(editrules)
library(jsonlite)

```



## Executive Summary

The purpose of this project is to apply various data pre-processing techniques to process and tidy up our datasets before conducting the analysis. The two datasets are collected from Kaggle: the first set named *IMDB 5000 Movie Dataset* contains data of around 5000 movies of 2017 or earlier from IMDb and the second set named *TMDb 5000 Movie Dataset* contains data of around 5000 movies of 2017 or earlier from TMDb.

Since both the datasets have different data which are required for the further analysis, we'll start by merging these two datasets into a single dataset and keeping just the required variables for the purpose of our further analysis. In all these datasets, each row represents a single movie. In order to keep the dataset simple for achieving our main aim of applying data pre-processing techniques, we'll just select 5 variables from the first dataset and 9 variables from the second, thus, 13 variables in the merged one.

Then, we'll try to understand the data by checking its structure and more. Then, we'll tidy up and manipulate the data to ensure our final dataset follows the rules of tidy data. Then, we'll scan the data for missing values, errors, and outliers. We'll try to get info on outliers and solve them or remove them per the dataset. Finally, we'll apply some data transformation techniques on the dataset as well.


## Data


### 1. IMDB 5000 Movie Dataset

The dataset *IMDB 5000 Movie Dataset* contains the movie data from IMDb (Internet Movie Database). It contains 4998 observations and 28 variables (columns), which are described below. I'm going to use some of them as detailed later.

**Data Source:** [IMDB 5000 Movie Dataset](https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset) from Kaggle.

```{r message=FALSE}

# importing the first dataset
movies_data1 <- read_csv("imdb_5000_movies.csv")

# remove duplicate rows
movies_data1 <- distinct(movies_data1)

# check the dimensions
dim(movies_data1)

# preview the dataset
head(movies_data1)

```


### 2. TMDB 5000 Movie Dataset

The dataset *TMDB 5000 Movie Dataset* contains the movie data from TMDb (The Movie Database). It contains 4803 observations and 20 variables (columns), which are described below. I'm going to use some of them as detailed later.

**Data Source:** [TMDB 5000 Movie Dataset](https://www.kaggle.com/tmdb/tmdb-movie-metadata/data) from Kaggle.

```{r message=FALSE}

# importing the second dataset
movies_data2 <- read_csv("tmdb_5000_movies.csv")

# remove duplicate rows
movies_data2 <- distinct(movies_data2)

# check the dimensions
dim(movies_data2)

# preview the dataset
head(movies_data2)

```

### 3. Merged Dataset: `movies`

First of all, let's extract and rename the necessary columns in both the datasets before merging them in the final dataset. Then finally, let's merge both the datasets to create the `movies` data frame, whose columns (13 in count) are described below.

| # | Variable | Description |
|:--|:---------|:------------|
|1|title|Title of the movie|
|2|country|Country of the movie|
|3|genres|Genres of the movie|
|4|imdb_score|IMDb score for the movie|
|5|fb_likes|FB likes of the movie|
|6|tagline|Tagline of the movie|
|7|release_date|Release date of the movie|
|8|language|Language of the movie|
|9|keywords|Keywords for the movie|
|10|popularity|Popularity of the movie|
|11|kmdb_vote|Average vote for the movie|
|12|budget|Budget of the movie (in USD)|
|13|revenue|Gross revenue of the movie (in USD)|

```{r message=FALSE}

# select and re-order the variables

movies_data1 <- movies_data1[, c(12, 21, 10, 26, 28)]

movies_data2 <- movies_data2[, c(18, 17, 12, 6, 5, 9, 19, 1, 13)]

# rename the selected column names

movies_data1 <- rename(movies_data1, title = movie_title, fb_likes = movie_facebook_likes)

movies_data2 <- rename(movies_data2, language = original_language, tmdb_score = vote_average)

# join both of the above datasets
movies <- inner_join(movies_data2, movies_data1, by = "title")

# check the dimensions
dim(movies)

# preview the dataset
head(movies)

```

<br>

### Why there are *0 rows*? Maybe, **data is untidy. It requires a fix!**

The merged dataset has 0 rows or observations, but the other two datasets have thousands of rows.

So, what's the reason of this error? The first data frame `movies_data` from the dataset *imdb_5000_movies.csv* has data errors. Its `title` column has some unwanted characters, which was very difficult to find (debug) since R shows it as a blank space, unfortunately!

I found and solved the problem using the below steps:

1. Print the first `title` from both the datasets.
2. `movies_data1$title` shows a space at last. It shouldn't be since `read_csv()` trims spaces.
3. Nevertheless, replace the space with nothing using `str_replace_all` on `movies_data1$title`.

4. It didn't work as well - no rows in the merged dataset yet.

5. So, let's print the first `title` as UTF-8 using `utf8ToInt()`.
6. It shows that the space-looking character has an int value = 160.
7. However, the space (" ") has an int value = 32, so it's not a space.
8. Finally, replace that character with nothing using `str_replace_all` on `movies_data1$title`.

9. And at last, it worked! After merging the two datasets now, the merged dataset shows rows.

```{r message=FALSE}

# let's print title from both datasets

movies_data1$title[1]
movies_data2$title[1]

# movies_data1$title has a space at last, so let's fix it
movies_data1$title <- str_replace_all(movies_data1$title, " $", "")

# join both of the above datasets
movies <- inner_join(movies_data2, movies_data1, by = "title")

# check the dimensions
dim(movies)

# preview the dataset
head(movies)

# it doesn't work - still, so is it not a space?

# let's print it in UTF-8 to understand the issue
utf8ToInt(movies_data1$title[1])

# it's not a space, since space != 160
utf8ToInt(" ")

# so, let's fix by replacing it
movies_data1$title <- str_replace_all(movies_data1$title, intToUtf8(160), "")

# join both of the above datasets
movies <- inner_join(movies_data2, movies_data1, by = "title")

# remove duplicate entries
movies <- distinct(movies)

# check the dimensions
dim(movies)

# preview the dataset
head(movies)

```



## Understand 

Summarise the types of variables and data structures, check the attributes in the data and apply proper data type conversions. In addition to the R codes and outputs, explain briefly the steps that you have taken. In this section, show that you have fulfilled minimum requirements 2-4.

Let's understand the data - the data frame `movies` in this section.

1. We'll start with `dim()` to understand its dimensions.
2. Then, we'll check its structure using `str()`.
3. Lastly, we'll check the labels for two variables - `language` and `country`.

As visualized below, `movies` has 4602 observations (rows) and 13 variables (columns). Also, it has different types of data including character (chr), numeric (num), and date (Date), as it's shown below in the output of the `str()` function. Then, we'll do these:

4. We'll add factors (with labels) to `language` since its data is two-letter country code.
5. We'll add factors (without labels) to `country` since its data is already labelled.
6. Finally, we'll check the levels and the structure of the `movies` to confirm the changes.

```{r}

# check the dimensions
dim(movies)

# check the structure
str(movies)

# check the levels

levels(movies$language)

levels(movies$country)

# add factors to movies$language
movies <- mutate(movies, language = factor(language, 
                                           levels = c("af", "ar", "cn", "cs", "da", "de", "el", "en", "es", "fa", "fr", "he", "hi", "hu", 
                                                      "id", "is", "it", "ja", "ko", "ky", "nb", "nl", "no", "pl", "ps", "pt", "ro", "ru", 
                                                      "sl", "sv", "ta", "te", "th", "tr", "vi", "zh"), 
                                           labels = c("Afrikaans", "Arabic", "Chinese", "Czech", "Danish", "German", "Greek, Modern", 
                                                      "English", "Spanish", "Persian", "French", "Hebrew", "Hindi", "Hungarian", 
                                                      "Indonesian", "Icelandic", "Italian", "Japanese", "Korean", "Kyrgyz", "Norwegian", 
                                                      "Dutch", "Norwegian", "Polish", "Pashto", "Portuguese", "Romanian", "Russian", 
                                                      "Slovenian", "Swedish", "Tamil", "Telugu", "Thai", "Turkish", "Vietnamese", 
                                                      "Chinese")
                                           ))

movies <- mutate(movies, country = factor(country))

# check the structure
str(movies)

# check the levels

levels(movies$language)

levels(movies$country)

```



##	Tidy & Manipulate Data I 

According to [The Comprehensive R Archive Network](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html), "*tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:*

1. *Each variable forms a column.*
2. *Each observation forms a row.*
3. *Each type of observational unit forms a table.*"

That said, the data frame **`movies` is not tidy** since the two columns - `keywords` and `genres` are not tidy.

* `keywords`: It contains data in the JSON format, and again, multiple values in a single variable.
* `genres`: It contains multiple data values in a single variable separated by '|' character.

Hence, we need to tidy up these two columns. I followed the below steps to tidy up these columns:

#### `keywords`

1. `keywords` is a column with JSON strings, for example, *[{"id": 1463, "name": "culture clash"}, {"id": 2964, "name": "future"}]*.
2. We'll first convert this JSON into a delimited string, then separate the data into multiple observations (rows) using `unnest()`.
3. We'll use `fromJSON()` to ease the process of parsing or reading JSON objects or strings, else it will become very cumbersome.
4. The alternative approach could be to use regular expression to extract the `name` from the above JSON to form a delimited string.
5. Then, we'll use `unnest()` along with `strsplit()` on the delimited string to separate the delimited values over multiple rows.
    
#### `genres`

1. `genres` is a column with multiple delimited values (genres of the movie), for example, *Action|Adventure|Fantasy|Sci-Fi*.
2. We'll use `mutate()` along with `case_when()` and `grepl()` to form a two case formula as given below:
    * If `grepl()` matches, then `TRUE`, i.e., genre is found.
    * Else `FALSE`, i.e., genre is not found.
3. Using the above approach, we'll separate the delimited values over multiple columns - one column for each genre.

```{r warning=FALSE}

# first convert JSON to delimited string

simplifyToString <- function(x) {
  fromJSON(x["keywords"])$name %>% paste(collapse = '|')
}

movies <- mutate(movies, keywords = apply(movies, 1, simplifyToString))

# now unnest the `keywords` column into rows
movies <- movies %>% unnest(keywords = strsplit(keywords, "[|]"))

# now unnest the `genres` column into columns

movies <- movies %>% mutate(action = case_when(grepl("Action", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(adventure = case_when(grepl("Adventure", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(animation = case_when(grepl("Animation", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(biography = case_when(grepl("Biography", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(comedy = case_when(grepl("Comedy", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(crime = case_when(grepl("Crime", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(documentary = case_when(grepl("Documentary", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(drama = case_when(grepl("Drama", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(family = case_when(grepl("Family", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(fantasy = case_when(grepl("Fantasy", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(filmnoir = case_when(grepl("Film-Noir", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(history = case_when(grepl("History", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(horror = case_when(grepl("Horror", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(musical = case_when(grepl("Musical", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(mystery = case_when(grepl("Mystery", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(news = case_when(grepl("News", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(romance = case_when(grepl("Romance", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(scifi = case_when(grepl("Sci-Fi", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(short = case_when(grepl("Short", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(sport = case_when(grepl("Sport", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(thriller = case_when(grepl("Thriller", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(war = case_when(grepl("War", genres) ~ TRUE, TRUE ~ FALSE))
movies <- movies %>% mutate(western = case_when(grepl("Western", genres) ~ TRUE, TRUE ~ FALSE))

# finally delete the `genres` column
movies <- subset(movies, select = -c(genres))

# preview the dataset
head(movies)

```



##	Tidy & Manipulate Data II 

Now, let's create one variable from the existing variables. We have `budget` and `revenue` variables in our dataset. We can calculate the gross profit of each movie using the difference of these two columns. That said, we can create a `profit` variable by substracting `budget` from `revenue`. We'll use the `mutate()` function to create this new variable (column).

```{r}

# find profit for each movie
movies <- movies %>% mutate(profit = revenue - budget)

# preview the dataset
head(movies)

```



##	Scan I

Let's check the data for missin values (NAs), inconsistencies, and obvious errors (like ratings are more than 10 or less than 0). As we've visualized below, there are two columns having `NA`s. Their names and how we'll deal with them is below:

#### `tagline`

1. `tagline` has just 3379 NAs, which is not a small number seeing that movies has 35065 observations = 0.096%.
2. Since its percentage is less than 5% and it's a character value, so it won't cause any issues in calculations.
3. Also, it's tagline of a movie, which may be unavailable for some movies per its production companies.
4. That said, we'll leave the tagline with NAs as it is, since nothing can be done about it.

#### `language`

1. `language` has just 5 NAs, which is very small number seeing that movies has 35065 observations = 0.00014%.
2. Since its percentage is less than 5%, we can leave it as it's a character value, so no issues in calculations.
3. Or, we can replace the NAs in this column with "English" assuming it's suggested by the analyst.
4. Of course, "English" is the most popular language for the movies, as it's calcualted below.

```{r}

# scan for infinites
sum(is.infinite(as.matrix(movies)))

# scan for NANs
sum(is.nan(as.matrix(movies)))

# scan for NAs
sum(is.na(movies))

# scan for NAs per variable
colSums(is.na(movies))

# check the incomplete cases
head(movies[!complete.cases(movies), ])

# check for obvious errors

rules <- editfile("rules.txt", type = "all")
violated <- violatedEdits(rules, movies)
summary(violated)

# count the movies per `language`
movies %>% group_by(language) %>% summarise(no_rows = length(language)) %>% 
  arrange(desc(no_rows)) %>% head()

# replace `language` with NA to English
movies$language[is.na(movies$language)] <- "English"

```



##	Scan II

We’ll produce boxplots for checking outliers in out dataset. As it’s visualized below, our dataset has probable outliers. So, we calculated the z-scores of the variables and remove the outliered observations (rows) based on the absolute value of more than three.

Of course, it was not necessary to remove the outliers since the outliers are less than 5%. For budget , the
outliers count is 698, which is just 0.0199% (which is almost negligible). Also, for revenue , the outliers count is 881, which is just 0.025% (it's mostly negligible).

Then, after removing the outliers from those variables, we’ll produce the boxplots again. However, we
visualized that the outliers are still present. That said, based on our observations, these outliers are not the same case because our dataset is not normally distributed, hence the outliers will be present. The reason being the variables we're dealing here are budget and revenue, which are not normally distributed in the world. Some movies, like Avengers franchise, are produced with very high budgets, and they earn superb
revenues as well. Then, some movies are very low budget movies.

Note: We’re going to do all these scannings on just two variables to keep the document length under 20
pages.

```{r}

# draw boxplots

movies$budget %>%  boxplot(main="Box Plot of Movie's Budget", 
                               ylab="Budget ($)", col = "grey")

movies$revenue %>%  boxplot(main="Box Plot of Movie's Revenue", 
                               ylab="Revenue ($)", col = "grey")

# calculate z scores

zscores.budget <- movies$budget %>%  scores(type = "z")
zscores.budget %>% summary()

zscores.revenue <- movies$revenue %>%  scores(type = "z")
zscores.revenue %>% summary()

# number of outliers per z scores

length( which(abs(zscores.budget) >3 ))

length( which(abs(zscores.revenue) >3 ))

# exclude outliers

movies.budget <- movies[- which(abs(zscores.budget) >3 ),]

movies.revenue <- movies[- which(abs(zscores.revenue) >3 ),]

# boxplot excluding outliers

movies.budget$budget %>%  boxplot(main="Box Plot of Movie's Budget (without outliers)", 
                               ylab="Budget ($)", col = "grey")

movies.revenue$revenue %>%  boxplot(main="Box Plot of Movie's Revenue (without outliers)", 
                               ylab="Revenue ($)", col = "grey")

```



##	Transform 

Let's produce the histogram of the gross profit (`profit` variable), which shows that the data is right skewed. So, let's transpose the `profit` variable using `mutate()` to perform a reciprocal transporation for every row (observation). The main idea behind this scheme is to reduce the right skewedness of our distribution and make it a little more normal, which is visible after the transformation.

```{r}

# transpose the `profit` variable

movies.transpose <- 
  movies %>% mutate(profit = movies$profit^(-1))

# draw histograms for `profit` and transposed `profit`

hist(movies$profit,
     main = "Histogram of Movie's Gross Profit", xlab = "Gross Profit ($)")

hist(movies.transpose$profit,
     main = "Histogram of Transposed Gross Profit", xlab = "Gross Profit Reciprocal")

```

<br>
<br>
