library(readr)
library(lubridate)
library(tidyr)
library(dplyr)
library(outliers)
library(forecast)
library(stringr)
This report is a data preprocessing report, which details and explains all necessary steps taken to preprocess the data. The 5 major task of data preprocessing framework was carried out, namely: Get, Understand, Tidy & Manipulate, Scan, Transform.
In this report, two open, publicly available datasets were merged into one dataset, and all variables in the data were studied. Variable type conversion was carried out to convert all data to its appropriate data type. Next, Hadley Wickham’s Tidy Data Principles was followed while tidying and manipulating the variables in the data. The tidied data was then scanned for missing values, special values, obvious consistencies and outliers. Necessary steps were taken to handle these missing/incomplete information. Lastly, data transformation was carried out to transform the variable from a non-normal distribution into a normal distribution, which is important as most statistical tools require normally distributed variables and homogenous variances.
Netflix data and IMDb data were both imported using readr function, read_csv(). Variables which are self-explanatory will not be explained in detail.
Netflix dataset - Variable description
IMDb dataset - Variable description
Sources:
netflix <- read_csv("netflix_titles.csv")
Parsed with column specification:
cols(
show_id = [32mcol_double()[39m,
type = [31mcol_character()[39m,
title = [31mcol_character()[39m,
director = [31mcol_character()[39m,
cast = [31mcol_character()[39m,
country = [31mcol_character()[39m,
date_added = [31mcol_character()[39m,
release_year = [32mcol_double()[39m,
rating = [31mcol_character()[39m,
duration = [31mcol_character()[39m,
listed_in = [31mcol_character()[39m,
description = [31mcol_character()[39m
)
str(netflix)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 6234 obs. of 12 variables:
$ show_id : num 81145628 80117401 70234439 80058654 80125979 ...
$ type : chr "Movie" "Movie" "TV Show" "TV Show" ...
$ title : chr "Norm of the North: King Sized Adventure" "Jandino: Whatever it Takes" "Transformers Prime" "Transformers: Robots in Disguise" ...
$ director : chr "Richard Finn, Tim Maltby" NA NA NA ...
$ cast : chr "Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Duru"| __truncated__ "Jandino Asporaat" "Peter Cullen, Sumalee Montano, Frank Welker, Jeffrey Combs, Kevin Michael Richardson, Tania Gunadi, Josh Keaton"| __truncated__ "Will Friedle, Darren Criss, Constance Zimmer, Khary Payton, Mitchell Whitfield, Stuart Allan, Ted McGinley, Peter Cullen" ...
$ country : chr "United States, India, South Korea, China" "United Kingdom" "United States" "United States" ...
$ date_added : chr "September 9, 2019" "September 9, 2016" "September 8, 2018" "September 8, 2018" ...
$ release_year: num 2019 2016 2013 2016 2017 ...
$ rating : chr "TV-PG" "TV-MA" "TV-Y7-FV" "TV-Y7" ...
$ duration : chr "90 min" "94 min" "1 Season" "1 Season" ...
$ listed_in : chr "Children & Family Movies, Comedies" "Stand-Up Comedy" "Kids' TV" "Kids' TV" ...
$ description : chr "Before planning an awesome wedding for his grandfather, a polar bear king must take back a stolen artifact from"| __truncated__ "Jandino Asporaat riffs on the challenges of raising kids and serenades the audience with a rousing rendition of"| __truncated__ "With the help of three human allies, the Autobots once again protect Earth from the onslaught of the Decepticon"| __truncated__ "When a prison ship crash unleashes hundreds of Decepticons on Earth, Bumblebee leads a new Autobot force to protect humankind." ...
- attr(*, "spec")=
.. cols(
.. show_id = [32mcol_double()[39m,
.. type = [31mcol_character()[39m,
.. title = [31mcol_character()[39m,
.. director = [31mcol_character()[39m,
.. cast = [31mcol_character()[39m,
.. country = [31mcol_character()[39m,
.. date_added = [31mcol_character()[39m,
.. release_year = [32mcol_double()[39m,
.. rating = [31mcol_character()[39m,
.. duration = [31mcol_character()[39m,
.. listed_in = [31mcol_character()[39m,
.. description = [31mcol_character()[39m
.. )
head(netflix)
imdb <- read_csv("IMDB-Movie-Data.csv")
Parsed with column specification:
cols(
Rank = [32mcol_double()[39m,
Title = [31mcol_character()[39m,
Genre = [31mcol_character()[39m,
Description = [31mcol_character()[39m,
Director = [31mcol_character()[39m,
Actors = [31mcol_character()[39m,
Year = [32mcol_double()[39m,
`Runtime (Minutes)` = [32mcol_double()[39m,
Rating = [32mcol_double()[39m,
Votes = [32mcol_double()[39m,
`Revenue (Millions)` = [32mcol_double()[39m,
Metascore = [32mcol_double()[39m
)
str(imdb)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 1000 obs. of 12 variables:
$ Rank : num 1 2 3 4 5 6 7 8 9 10 ...
$ Title : chr "Guardians of the Galaxy" "Prometheus" "Split" "Sing" ...
$ Genre : chr "Action,Adventure,Sci-Fi" "Adventure,Mystery,Sci-Fi" "Horror,Thriller" "Animation,Comedy,Family" ...
$ Description : chr "A group of intergalactic criminals are forced to work together to stop a fanatical warrior from taking control "| __truncated__ "Following clues to the origin of mankind, a team finds a structure on a distant moon, but they soon realize the"| __truncated__ "Three girls are kidnapped by a man with a diagnosed 23 distinct personalities. They must try to escape before t"| __truncated__ "In a city of humanoid animals, a hustling theater impresario's attempt to save his theater with a singing compe"| __truncated__ ...
$ Director : chr "James Gunn" "Ridley Scott" "M. Night Shyamalan" "Christophe Lourdelet" ...
$ Actors : chr "Chris Pratt, Vin Diesel, Bradley Cooper, Zoe Saldana" "Noomi Rapace, Logan Marshall-Green, Michael Fassbender, Charlize Theron" "James McAvoy, Anya Taylor-Joy, Haley Lu Richardson, Jessica Sula" "Matthew McConaughey,Reese Witherspoon, Seth MacFarlane, Scarlett Johansson" ...
$ Year : num 2014 2012 2016 2016 2016 ...
$ Runtime (Minutes) : num 121 124 117 108 123 103 128 89 141 116 ...
$ Rating : num 8.1 7 7.3 7.2 6.2 6.1 8.3 6.4 7.1 7 ...
$ Votes : num 757074 485820 157606 60545 393727 ...
$ Revenue (Millions): num 333 126 138 270 325 ...
$ Metascore : num 76 65 62 59 40 42 93 71 78 41 ...
- attr(*, "spec")=
.. cols(
.. Rank = [32mcol_double()[39m,
.. Title = [31mcol_character()[39m,
.. Genre = [31mcol_character()[39m,
.. Description = [31mcol_character()[39m,
.. Director = [31mcol_character()[39m,
.. Actors = [31mcol_character()[39m,
.. Year = [32mcol_double()[39m,
.. `Runtime (Minutes)` = [32mcol_double()[39m,
.. Rating = [32mcol_double()[39m,
.. Votes = [32mcol_double()[39m,
.. `Revenue (Millions)` = [32mcol_double()[39m,
.. Metascore = [32mcol_double()[39m
.. )
head(imdb)
For both netflix and imdb datasets, type conversion of variables were carried out to convert the data to its suitable data type, and column names of variables are re-labelled so that the 2 datasets can be joined easily through mutating joins.
For variables in netflix:
type
and rating
is converted from character to factor, using factor()date_added
is converted from character to an appropriate POSIXct formatcountry
is NOT converted into a factor although it is a categorical variable because the data has not been tidied. It has multiple values in one column – refer to Tidy & Manipulate Data IFor variables in imdb:
Metascore
is converted to int using as.integer()Genre
and Actors
are also NOT converted into factor because the data has not been tidied – refer to Tidy & Manipulate Data I# Netflix data
# Changing variable type for netflix
netflix$type <- factor(netflix$type, label = c('Movie', 'TV Show'))
netflix$rating <- factor(netflix$rating)
netflix$date_added <- parse_date_time(netflix$date_added, orders = "mdy")
# Changing variable name for netflix
names(netflix)[names(netflix) == 'title'] <- 'Title'
names(netflix)[names(netflix) == 'date_added'] <- 'Date Added to Netflix'
names(netflix)[names(netflix) == 'rating'] <- 'Movie Rating Guides'
names(netflix)[names(netflix) == 'release_year'] <- 'Released_Year'
names(netflix)[names(netflix) == 'country'] <- 'Country_of_Production'
# IMDb data
# Changing variable type for imdb
imdb$Metascore <- as.integer(imdb$Metascore)
# Changing variable name for imdb
names(imdb)[names(imdb) == 'Rating'] <- 'IMDb Ratings'
names(imdb)[names(imdb) == 'Year'] <- 'Released_Year'
According to Hadley Wickham, Tidy Data Principles require that each variable have its own column, each observation have its own row, and each value have its own cells. Both netlflix and IMDb dataset are untidy as they both violates the 3rd rule of Tidy Data Principles, where there are multiple values are in one cell for variable Country_of_Production
in the netflix dataset, and Actors
, Genre
in the IMDb dataset.
Since all of these variables are separated by comma in a string. Stringr manipulation is used to split the values by using the function str_split(). The splitted values are subsequently added (by using mutate() function) into the join
dataset. join
dataset joins both netflix dataset and IMDb dataset together through inner_join() function, where both datasets are joined through common variables, Title
and Released_Year
. Since there are many movies with the same movie title, joining the datasets through Title
and Released_Year
are essential to ensure that the movies joined are released in the same year, as it is unlikely for 2 movies of the same movie title to be released in the same year.
# Multiple values in one cell for 'Country_of_Production'
head(netflix$`Country_of_Production`)
[1] "United States, India, South Korea, China" "United Kingdom" "United States"
[4] "United States" "United States" "Spain"
country <- as.data.frame(str_split(netflix$`Country_of_Production`, pattern = ",", simplify = TRUE), stringsAsFactors = FALSE)
# Tidying netflix data before joining
netflix <- netflix %>% mutate(Country_of_Production = country[,1]) %>%
select(Title, type, Country_of_Production, `Date Added to Netflix`:`Movie Rating Guides`) %>%
filter(Released_Year <= 2016) %>%
arrange(desc(Released_Year))
# Multiple values in one cell for 'Actor'
head(imdb$Actors)
[1] "Chris Pratt, Vin Diesel, Bradley Cooper, Zoe Saldana"
[2] "Noomi Rapace, Logan Marshall-Green, Michael Fassbender, Charlize Theron"
[3] "James McAvoy, Anya Taylor-Joy, Haley Lu Richardson, Jessica Sula"
[4] "Matthew McConaughey,Reese Witherspoon, Seth MacFarlane, Scarlett Johansson"
[5] "Will Smith, Jared Leto, Margot Robbie, Viola Davis"
[6] "Matt Damon, Tian Jing, Willem Dafoe, Andy Lau"
actorsplit <- as.data.frame(str_split(imdb$Actors, pattern = ",", simplify = TRUE), stringsAsFactors = FALSE)
colnames(actorsplit) <- c("Actor 1", "Actor 2", "Actor 3", "Actor 4")
# Multiple values in one cell for 'Genre'
head(imdb$Genre)
[1] "Action,Adventure,Sci-Fi" "Adventure,Mystery,Sci-Fi" "Horror,Thriller" "Animation,Comedy,Family"
[5] "Action,Adventure,Fantasy" "Action,Adventure,Fantasy"
genresplit <- as.data.frame(str_split(imdb$Genre, pattern = ",", simplify = TRUE), stringsAsFactors = FALSE)
colnames(genresplit) <- c("Genre 1", "Genre 2", "Genre 3")
# Tidying imdb data before joining - since most movies have only 2 genres, only Genre1 and Genre2 will be included in the table
imdb <- imdb %>% mutate(Actor1 = actorsplit[,1], Actor2 = actorsplit[,2],
Actor3 = actorsplit[,3], Actor4 = actorsplit[,4],
Genre1 = genresplit[,1], Genre2 = genresplit[,2]) %>%
select(Title, Released_Year:`IMDb Ratings`, `Revenue (Millions)`, Metascore, Director,
Actor1, Actor2, Actor3, Actor4, Genre1, Genre2, Description)
# Joining 2 datasets through common variables - 'Title' and 'Released_Year'
join <- inner_join(imdb, netflix)
Joining, by = c("Title", "Released_Year")
str(join)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 157 obs. of 18 variables:
$ Title : chr "Moonlight" "The Last Face" "The Autopsy of Jane Doe" "Wakefield" ...
$ Released_Year : num 2016 2016 2016 2016 2009 ...
$ Runtime (Minutes) : num 111 130 86 106 153 148 187 105 151 132 ...
$ IMDb Ratings : num 7.5 3.7 6.8 7.5 8.3 8.8 7.8 6.3 8.5 7 ...
$ Revenue (Millions) : num 27.85 NA NA 0.01 120.52 ...
$ Metascore : int 99 16 65 61 69 74 68 77 85 55 ...
$ Director : chr "Barry Jenkins" "Sean Penn" "André Øvredal" "Robin Swicord" ...
$ Actor1 : chr "Mahershala Ali" "Charlize Theron" "Brian Cox" "Bryan Cranston" ...
$ Actor2 : chr " Shariff Earp" " Javier Bardem" " Emile Hirsch" " Jennifer Garner" ...
$ Actor3 : chr " Duan Sanderson" " Adèle Exarchopoulos" " Ophelia Lovibond" " Beverly D'Angelo" ...
$ Actor4 : chr " Alex R. Hibbert" "Jared Harris" " Michael McElhatton" "Jason O'Mara" ...
$ Genre1 : chr "Drama" "Drama" "Horror" "Drama" ...
$ Genre2 : chr "" "" "Mystery" "" ...
$ Description : chr "A chronicle of the childhood, adolescence and burgeoning adulthood of a young, African-American, gay man growin"| __truncated__ "A director (Charlize Theron) of an international aid agency in Africa meets a relief aid doctor (Javier Bardem)"| __truncated__ "A father and son, both coroners, are pulled into a complex mystery while attempting to identify the body of a y"| __truncated__ "A man's nervous breakdown causes him to leave his wife and live in his attic for several months." ...
$ type : Factor w/ 2 levels "Movie","TV Show": 1 1 1 1 1 1 1 1 1 1 ...
$ Country_of_Production: chr "United States" "United States" "United Kingdom" "United States" ...
$ Date Added to Netflix: POSIXct, format: "2019-05-21" "2020-01-13" "2018-12-30" "2019-03-02" ...
$ Movie Rating Guides : Factor w/ 14 levels "G","NC-17","NR",..: 6 6 6 6 6 5 6 6 6 6 ...
# Converting variable types
join$Country_of_Production <- factor(join$Country_of_Production)
join$`Revenue (Millions)` <- round(join$`Revenue (Millions)`, 2)
In the join
dataset, it was found that for some movies, the year that the movie is added into Netflix differs from its released years. Hence, a variable Delay
is created to calculate the delay in years that the movie was added into Netflix from its released years.
# Tidying joined datasets & adding a new variable, `Delay`
join <- join %>%
select(Title, `Date Added to Netflix`, Released_Year:Genre2, Country_of_Production, `Movie Rating Guides`, type) %>%
mutate(Delay = year(`Date Added to Netflix`) - Released_Year) %>%
arrange(desc(Released_Year))
All numeric variables in the dataset are scanned for missing values, special values, and obvious inconsistencies. It was found that 2 variables, Revenue and Metascore contains missing value. Since it’d be a waste to omit all information just because one value is missing, these missing values were replaced by the mean value of its respective variable.
# Scan for Missing Values for each numeric variables
sapply(join[,c(4:7, 18)], function(x) sum(is.na(x)))
Runtime (Minutes) IMDb Ratings Revenue (Millions) Metascore Delay
0 0 33 11 0
# Checking each numerical column for special values
is.special <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
sapply(join[,c(4:7, 18)], function(x) sum(is.special(x)))
Runtime (Minutes) IMDb Ratings Revenue (Millions) Metascore Delay
0 0 0 0 0
# Replacing NA values in Revenue with mean(Revenue)
which(is.na(join$`Revenue (Millions)`))
[1] 2 3 7 11 13 17 18 20 22 23 24 25 28 30 32 34 38 39 40 41 43 44 45 46 47 48 52 60 61 69 70 75 111
join$`Revenue (Millions)`[is.na(join$`Revenue (Millions)`)] <- mean(join$`Revenue (Millions)`, na.rm = TRUE)
# Replacing NA values in Metascore with mean(Metascore)
which(is.na(join$Metascore))
[1] 8 17 18 20 32 75 98 104 111 130 143
join$Metascore[is.na(join$Metascore)] <- mean(join$Metascore, na.rm = TRUE)
All numeric variables in the dataset are also scanned for outliers. Outliers are detected using z-score if the underlying data of the variables are assumed to be normally distributed. Hence, the variable IMDb Ratings
, Revenue (Millions)
, Metascore
and Runtime (Minutes)
are checked for outliers using z-score. Values with abs(z-score) > 3 are considered as outliers, and these outliers were dealt with by using capping() method to reduce the error variance.
Whereas for the variable Delay
,it is checked for outlier using “Tukey’s Method of Outlier Detection” since an underlying normal distribution is not assumed. This can be done easily by using boxplot().
# Tukey's method of outlier detection - for `Delay`
boxplot(join$Delay, ylab = "Delay (years)", main = "Boxplot of Delay (years)") # No outliers
# We first check that each of the 4 variables have an approximately normal distribution using hist()
par(mfrow = c(2,2))
hist(join$`IMDb Ratings`, xlab = "IMDb Ratings", main = "Histogram of IMDb Ratings")
hist(join$`Revenue (Millions)`, xlab = 'Revenue (Millions)', main = "Histogram of Revenue") # very right skewed
hist(join$Metascore, xlab = "Metascore", main = "Histogram of Metascore")
hist(join$`Runtime (Minutes)`, xlab = "Runtime(mins)", main = "Histogram of Runtime(Minutes)")
# Checking for outliers using z-score
z_imdbratings <- join$`IMDb Ratings` %>% scores(type = "z")
z_imdbratings %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.0792 -0.6146 0.1437 0.0000 0.7125 1.9447
length(which( abs(z_imdbratings) >3 )) # 1 outlier
[1] 1
z_revenue <- join$`Revenue (Millions)` %>% scores(type = "z")
z_revenue %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.8935 -0.7440 -0.1672 0.0000 0.1397 4.8891
length(which( abs(z_revenue) > 3)) # 3 outliers
[1] 3
z_metascore <- join$Metascore %>% scores(type = "z")
z_metascore %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.5474 -0.6563 0.0314 0.0000 0.7191 2.2091
length(which( abs(z_metascore) >3 )) # No outliers
[1] 0
z_runtime <- join$`Runtime (Minutes)` %>% scores(type = "z")
z_runtime %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.5953 -0.7532 -0.1587 0.0000 0.4357 3.6557
length(which(abs(z_runtime) > 3)) # 1 outlier
[1] 1
# Using capping() method to deal with outliers
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
# Since `Metascore` has no outliers, it does not need to be capped
join$`IMDb Ratings`<- join$`IMDb Ratings` %>% cap()
join$`Revenue (Millions)` <- join$`Revenue (Millions)` %>% cap()
join$`Runtime (Minutes)` <- join$`Runtime (Minutes)` %>% cap()
# z-score of each of the capped variables are calculated again to ensure that there are no outliers
z_imdbratings2 <- join$`IMDb Ratings` %>% scores(type = "z") %>% summary()
z_revenue2 <- join$`Revenue (Millions)` %>% scores(type = "z") %>% summary()
z_runtime2 <- join$`Runtime (Minutes)` %>% scores(type = "z") %>% summary()
length(which( abs(z_imdbratings2) >3 )) # No outliers anymore
[1] 0
length(which( abs(z_revenue2) > 3)) # No outliers anymore
[1] 0
length(which(abs(z_runtime2) > 3)) # No outleirs anymore
[1] 0
Two variables, IMDb Ratings
and Revenue (Millions)
contains insightful information which could potentially be used for further analysis. Hence, this 2 variables are chosen for data transformation. From the histogram, we can see that IMDb Ratings
are left-skewed, whereas Revenue (Millions)
are right-skewed. To reduce left-skewness of IMDb Ratings
, squares, cubes and higher power transformation were tried out to select for the best transformation, whereas reciprocal, logarithms, and roots transformation which reduces right-skewness were tried out for Revenue (Millions)
to select for the best transformation.
# Check the distribution of both IMDb Ratings and Revenue
par(mfrow = c(1,2))
hist(join$`IMDb Ratings`, xlab = "IMDb Ratings", xlim = c(0,10), main = "Histogram of IMDb Ratings")
hist(join$`Revenue (Millions)`, xlab = 'Revenue (Millions)', main = "Histogram of Revenue") #right-skewed
# Look for a suitable transformation for IMDb Ratings
par(mfrow = c(2,2))
hist(join$`IMDb Ratings`, xlab = "IMDb Ratings", xlim = c(0,10), main = "Histogram of IMDb Ratings")
hist(join$`IMDb Ratings`^(2), xlab = "IMDb Ratings^2", main = "Histogram of (IMDb Ratings^2)")
hist(join$`IMDb Ratings`^(3), xlab = "IMDb Ratings^3", main = "Histogram of (IMDb Ratings^3)")
hist(join$`IMDb Ratings`^(4), xlab = "IMDb Ratings^4", main = "Histogram of (IMDb Ratings^4)")
# Apply cube transformation for IMDb Ratings
par(mfrow = c(1,1))
join$`IMDb Ratings`<- (join$`IMDb Ratings`^(2))
hist(join$`IMDb Ratings`, xlab = "IMDb Ratings^2", main = "Histogram of IMDb Ratings")
# Look for a suitable transformation for Revenue
par(mfrow = c(2,2))
hist(join$`Revenue (Millions)`, xlab = "1/(Revenue^2)", main = "Histogram of Revenue")
hist((join$`Revenue (Millions)`)^(1/3), xlab = 'Revenue^(1/3)', main = "Histogram of Revenue^(1/3)")
hist((join$`Revenue (Millions)`)^(1/2), xlab = 'sqrt(Revenue)', main = "Histogram of sqrt(Revenue)")
hist(log10(join$`Revenue (Millions)`), xlab = 'log10(Revenue)', main = "Histogram of log10(Revenue)")
# Apply sqrt trasnformation on Revenue
par(mfrow = c(1,1))
join$`Revenue (Millions)` <- (join$`Revenue (Millions)`^(1/2))
hist((join$`Revenue (Millions)`)^(1/2), xlab = 'sqrt(Revenue (Millions))', main = "Histogram of sqrt(Revenue)")
From the histograms, we can see that IMDb ratings are most suitable for a cube transformation, and Revenue are most suitable for a sqrt() transformation. Both IMDb Ratings
and Revenue
can now be used for further analysis since they both have an underlying normal distribution.