This assignment is aimed to understand all the steps involved in Data Pre-processing (Data Wrangling):
Get Understand Tidy & Manipulate Scan Transform
# Packages Used for the report:
library(readr)
library(gdata)
library(rvest)
library(forecast)
library(Hmisc)
library(outliers)
library(editrules)
library(magrittr)
library(dplyr)
library(tidyr)
The report/assignment covers important functions and techniques of data pre-processing. We have used two data sets for this purpose. Upon using basic pre-processing techniques like observing the structure and dimensions of all the variables, we will check if the data fulfills all the requirements asked as part of our report.Further we will join the data sets according to the common variables. Moving to the understand and tidy processes we will make sure the data follows all the tidy principles like correct format. We will achieve this by changing variables to their correct data type, factorizing the variables. As part of the scanning section, we will check for all the NA/Missing and special values. Missing values for the categorical variables will be replaced with the mode and missing values for numerical variables will be replaced with the mean. Apart from this, we will find outliers using the z score method for all numeric variables. The outliers will be dealt with using the capping process because the data is sensitive, and it may have extreme values which should not be removed. In the final section, we will use an appropriate transformation technique on numerical variables to remove the skewness to make the data symmetric.
This dataset combines information about data sources from Rotten Tomatoes, IMDB, posters, box office information, trailers on YouTube, and more by using various types of APIs. There are multiple data that give information for the past research based on their Language and series by their runtime. Since the observations of this dataset are based on different languages based on different kinds of Genres provided by the title. Hence for the second data, provides information about the rating score. There’s one unique metric called “Hidden Gem Store”, which gives calculation for low review count and high review count. Wherein Lower the review count and higher the user rating, the higher the hidden gem score. The below data set helps us to provide a better understanding of the field of Entertainment, movies and TV shows, search engines, and classification.
Source of the dataset is captured from – URL: https://www.kaggle.com/static.s3/netflix/kk0_Media Further Reference : Mohan Gupta,2021, Arts and Entertainment
The data set contains 15480 observation and 4 variables.
Title : chr datatype Languages : chr datatype Series or Movies : chr datatype Runtime : chr datatype
The data set contains 15480 observation and 9 variables.
Title : char datatype IMDb score : num datatype Hidden Gem Score : num datatype View Rating : chr datatype Rotten Tomatoes Score : int datatype Metacritic Score : int datatype Boxoffice : chr datatype Release Date : chr datatype IMDb votes : int datatype
## Setting Directory
getwd()
## [1] "C:/Users/CLOUD/Downloads"
setwd("C:/Users/CLOUD/Downloads")
getwd()
## [1] "C:/Users/CLOUD/Downloads"
## Importing Dataset 1 from the Directory
Media <- read.csv("Media_Details.csv")
head(Media)
## Importing Dataset 2 from the Directory
Digital_Media <- read.csv("Digital_Media_Ratings.csv")
head(Digital_Media)
## Merging the 2 data Sets
Media_Digital_Media <- inner_join(Media,Digital_Media,by=c("Title"))
Media_Digital_Media
##Summarize the types of variables and data structures
summary(Media_Digital_Media)
## Title Languages Type Runtime
## Length:16386 Length:16386 Length:16386 Length:16386
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## IMDb.Score Hidden.Gem.Score View.Rating Rotten.Tomatoes.Score
## Min. :1.000 Min. :0.60 Length:16386 Min. : 0.00
## 1st Qu.:5.800 1st Qu.:3.80 Class :character 1st Qu.: 38.00
## Median :6.600 Median :6.70 Mode :character Median : 63.00
## Mean :6.502 Mean :5.89 Mean : 59.41
## 3rd Qu.:7.300 3rd Qu.:7.90 3rd Qu.: 82.00
## Max. :9.700 Max. :9.80 Max. :100.00
## NA's :2143 NA's :2145 NA's :9529
## Metacritic.Score Boxoffice Release.Date IMDb.Votes
## Min. : 5.00 Min. : 72 Length:16386 Min. : 5
## 1st Qu.: 44.00 1st Qu.: 858576 Class :character 1st Qu.: 430
## Median : 57.00 Median : 20304578 Mode :character Median : 2521
## Mean : 56.79 Mean : 46113563 Mean : 44011
## 3rd Qu.: 69.00 3rd Qu.: 60472742 3rd Qu.: 23186
## Max. :100.00 Max. :659363944 Max. :2354197
## NA's :11674 NA's :12032 NA's :2145
## Attributes for Data
attributes(Media_Digital_Media)$class
## [1] "data.frame"
attributes(Media_Digital_Media)$names
## [1] "Title" "Languages" "Type"
## [4] "Runtime" "IMDb.Score" "Hidden.Gem.Score"
## [7] "View.Rating" "Rotten.Tomatoes.Score" "Metacritic.Score"
## [10] "Boxoffice" "Release.Date" "IMDb.Votes"
## Converting Char to factor variable
Media_Digital_Media$Type <- factor(Media_Digital_Media$Type,levels = c('Movie','Series'),labels = c('MMovie','SSeries'))
## Checking the status of Levels in the merged data sets
levels(Media_Digital_Media$Type)
## [1] "MMovie" "SSeries"
## Checking the Structure of data types
str(Media_Digital_Media)
## 'data.frame': 16386 obs. of 12 variables:
## $ Title : chr "Lets Fight Ghost" "HOW TO BUILD A GIRL" "Centigrade" "ANNE+" ...
## $ Languages : chr "Swedish, Spanish" "English" "English" "Turkish" ...
## $ Type : Factor w/ 2 levels "MMovie","SSeries": 2 1 1 2 1 1 2 2 1 1 ...
## $ Runtime : chr "< 30 minutes" "1-2 hour" "1-2 hour" "< 30 minutes" ...
## $ IMDb.Score : num 7.9 5.8 4.3 6.5 6.3 7.4 7.5 3.9 6.7 6.6 ...
## $ Hidden.Gem.Score : num 4.3 7 6.4 7.7 8.1 8.6 8.7 6.9 8.3 5.3 ...
## $ View.Rating : chr "R" "R" "Unrated" "" ...
## $ Rotten.Tomatoes.Score: int 98 79 NA NA NA NA NA NA NA NA ...
## $ Metacritic.Score : int 82 69 46 NA NA NA NA NA NA NA ...
## $ Boxoffice : int 2122065 70632 16263 NA NA NA NA NA NA NA ...
## $ Release.Date : chr "12-Dec-08" "08-May-20" "28-Aug-20" "01-Oct-16" ...
## $ IMDb.Votes : int 205926 2838 1720 1147 63 131 47 107 88 5926 ...
After merging the data sets (Media and Digital Media), we analyzed that the data structure and data variables give the descriptive statistics for each variable with the help of a summary (). Columns’ names/attributes will go with the class which is merged with the attribute function. Since the merged data contains the categorical variable, we convert the following data type as a character to a factor variable. For getting the structure we frame the structure by using the str() function for getting the final structure.
#Reshaping the data into Tidy format
# Separating records for Genre
New_Media <- Media_Digital_Media %>% separate(Languages,into=c("PrimaryLang","SecondaryLang"),sep=",")
## Warning: Expected 2 pieces. Additional pieces discarded in 1502 rows [10, 30,
## 67, 98, 107, 127, 142, 148, 167, 168, 183, 202, 232, 254, 255, 256, 258, 270,
## 307, 316, ...].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 12599 rows [2, 3,
## 4, 5, 6, 7, 8, 9, 11, 12, 15, 16, 17, 18, 20, 21, 22, 23, 24, 25, ...].
#Separating release records
New_Media2 <- New_Media %>% separate(Release.Date,into=c("Release Date","Release Month", "Release Year"),sep= "-")
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2143 rows [21,
## 23, 42, 59, 62, 77, 88, 124, 188, 241, 246, 265, 276, 288, 372, 380, 387, 410,
## 418, 433, ...].
# Changing Wide to Long
New_Media3 <- pivot_longer(New_Media2,names_to="Speech",values_to = "Language",cols = 2:3)
The Main Principles of a Tidy data are :
Each variable must have its own column. Each observation must have its own row. Each value must have its own cell.
For Untidy merged data:
The data frame Media_Digital_Media has variables e.g (Language, Release date) which contains more than one character separated by “,”. With help of a separate()function, we evaluated the function which separates the comma-separated values into multiple columns. Wherein for Release information we had three variables separated by “-” which was tidied by using separate() function as resultant into Release_Day, Release_Month, Release_Year.
# This is the R chunk for the Tidy & Manipulate Data II
## Mutating the Data
Media_Digital_Media_Muta <- New_Media3 %>% mutate(Boxoffice_Million=Boxoffice/1000000)
## Average for Hidden gem score by Runtime
Media_Digital_Media_Muta %>% group_by(Runtime) %>% summarize(Avg_Rating=mean(Hidden.Gem.Score, na.rm=TRUE))
## Arranging the Data
Media_Digital_Media_Arra <- Media_Digital_Media_Muta %>% arrange(desc(Rotten.Tomatoes.Score))
Tidyr package is used for the following functions. Mutated the Boxoffice(In millions) with Dataframe to add another column in the Dataframe by using mutate() function. Taking average function for Hidden gem score by Runtime using groupby() and after that summarizing the resultant output we use summarize() function as output. Arranged the data from highest to lowest scores by using arrange() function.
## Checking for NA values
head(is.na(Media_Digital_Media))
## Title Languages Type Runtime IMDb.Score Hidden.Gem.Score View.Rating
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## Rotten.Tomatoes.Score Metacritic.Score Boxoffice Release.Date IMDb.Votes
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE
## [3,] TRUE FALSE FALSE FALSE FALSE
## [4,] TRUE TRUE TRUE FALSE FALSE
## [5,] TRUE TRUE TRUE FALSE FALSE
## [6,] TRUE TRUE TRUE FALSE FALSE
## Total NA values
sum(is.na(Media_Digital_Media))
## [1] 39668
colSums(is.na(Media_Digital_Media))
## Title Languages Type
## 0 0 0
## Runtime IMDb.Score Hidden.Gem.Score
## 0 2143 2145
## View.Rating Rotten.Tomatoes.Score Metacritic.Score
## 0 9529 11674
## Boxoffice Release.Date IMDb.Votes
## 12032 0 2145
head(rowSums(is.na(Media_Digital_Media)))
## [1] 0 0 1 3 3 3
#location for NA values
head(which(is.na(Media_Digital_Media_Arra)))
## [1] 99215 99216 99453 99454 100607 100608
#Removing rows for NA values
Media_Digital_Media_NA <- na.omit(Media_Digital_Media_Arra)
## Scanning for Special Values
is.special <- function(x){ if (is.numeric(x))(is.infinite(x) | is.nan(x))
}
sapply(sapply(Media_Digital_Media,is.special),sum)
## Title Languages Type
## 0 0 0
## Runtime IMDb.Score Hidden.Gem.Score
## 0 0 0
## View.Rating Rotten.Tomatoes.Score Metacritic.Score
## 0 0 0
## Boxoffice Release.Date IMDb.Votes
## 0 0 0
Data is scanned after reshaping and manipulating by using is.na() function. Found the total NA values in the whole data frame for both column-wise and row-wise. Followed by the missing/NA values are removed by using the complete.cases() function. For scanning the special values, a function (is.special) is created which provides us the information on whether the data has infinite or NaN values. sapply() function is employed to check for any special values in the whole dataset.
# This is the R chunk for the Scan II
#BoxPlot(With Outliers)- Hidden.Gem.Score
boxplot(Media_Digital_Media_NA$Hidden.Gem.Score~Media_Digital_Media_NA$Runtime, main = "Hidden.Gem.Score", ylab = "Hidden.Gem.Score", xlab= "Media_Digital_Media_NA")
#BoxPlot(With Outliers)- Rotten.Tomatoes.Score
boxplot(Media_Digital_Media_NA$Rotten.Tomatoes.Score~Media_Digital_Media_NA$Runtime, main = "Rotten.Tomatoes.Score", ylab = "Rotten.Tomatoes.Score", xlab= "Media_Digital_Media_NA")
#Outliers (Hidden.Gem.Score)
Z.Hidden.Gem.Score <- Media_Digital_Media_NA$Hidden.Gem.Score %>% scores(type = "z")
Z.Hidden.Gem.Score %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.05549 -0.66944 -0.08584 0.00000 0.42481 4.21820
head(which( abs(Z.Hidden.Gem.Score) >3 ))
## [1] 3 5 6 7 8 11
length (which( abs(Z.Hidden.Gem.Score) >3 ))
## [1] 133
#Capping Outliers
#Cap Function
cap <- function(x){
quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
x
}
# capping outliers for population
Media_Digital_Media_NA$Hidden.Gem.Score <- Media_Digital_Media_NA$Hidden.Gem.Score %>% cap()
The data having no missing or NA values, for which we need to check for outliers in the numerical variables, and for that firstly, we need to visualize using the boxplot() function. After that z score method is applied, the outliers are detected by providing information of location and number of outliers with the help of which() and length() function. The outliers are handled using Capping which is also known as the Winsorising method, this method is used as we are deal with the data which can have n numbers of extreme values. Capping the values should be used as it takes the closest value so that we can have logical data/statistical approach for implementing the cap() function.
## Histogram for Hidden.Gem.Score
hist(Media_Digital_Media_NA$Hidden.Gem.Score)
## Transform
Hidden.Gem.Score.Transform <- log10(Media_Digital_Media_NA$Hidden.Gem.Score)
hist(Hidden.Gem.Score.Transform)
## Scaling
Media_Digital_Media_scale <- scale(Media_Digital_Media_NA$Hidden.Gem.Score, center =FALSE, scale =TRUE)
head(Media_Digital_Media_scale,5)
## [,1]
## [1,] 1.177079
## [2,] 1.177079
## [3,] 1.793645
## [4,] 1.205105
## [5,] 1.793645
hist(Media_Digital_Media_scale)
# reviewing the normalization curves
qqnorm(Hidden.Gem.Score.Transform)
qqnorm(Media_Digital_Media_scale)
## Explanation
For checking skewness of the data for certain variables we apply appropriate transforming techniques, where histograms is use to visualize the variables and data distribution of Hidden.Gem.Score is been found to be right and then the qq plots have been shown to review the variation. We change the scale of a certain variables or standardize the values of a variable for better understanding.