Data Wrangling (Data Preprocessing)

Introduction

This assignment is aimed to understand all the steps involved in Data Pre-processing (Data Wrangling):

Get
Understand
Tidy & Manipulate
Scan
Transform

Required packages

# Packages Used for the report:

library(readr)
library(gdata) 
library(rvest)
library(forecast)
library(Hmisc)
library(outliers)
library(editrules)
library(magrittr)
library(dplyr)
library(tidyr)

Executive Summary

The report/assignment covers important functions and techniques of data pre-processing. We have used two data sets for this purpose. Upon using basic pre-processing techniques like observing the structure and dimensions of all the variables, we will check if the data fulfills all the requirements asked as part of our report.Further we will join the data sets according to the common variables.
Moving to the understand and tidy processes we will make sure the data follows all the tidy principles like correct format. We will achieve this by changing variables to their correct data type, factorizing the variables.
As part of the scanning section, we will check for all the NA/Missing and special values. Missing values for the categorical variables will be replaced with the mode and missing values for numerical variables will be replaced with the mean. Apart from this, we will find outliers using the z score method for all numeric variables. The outliers will be dealt with using the capping process because the data is sensitive, and it may have extreme values which should not be removed.
In the final section, we will use an appropriate transformation technique on numerical variables to remove the skewness to make the data symmetric.

Data

This dataset combines information about data sources from Rotten Tomatoes, IMDB, posters, box office information, trailers on YouTube, and more by using various types of APIs. There are multiple data that give information for the past research based on their Language and series by their runtime.
Since the observations of this dataset are based on different languages based on different kinds of Genres provided by the title.
Hence for the second data, provides information about the rating score. There’s one unique metric called “Hidden Gem Store”, which gives calculation for low review count and high review count. Wherein Lower the review count and higher the user rating, the higher the hidden gem score.
The below data set helps us to provide a better understanding of the field of Entertainment, movies and TV shows, search engines, and classification.

Data Source

Source of the dataset is captured from – URL: https://www.kaggle.com/static.s3/netflix/kk0_Media Further Reference : Mohan Gupta,2021, Arts and Entertainment

Variable Description

Media Details DataSet 1

The data set contains 15480 observation and 4 variables.

Title : chr datatype
Languages : chr datatype
Series or Movies : chr datatype
Runtime : chr datatype

Digital Media Ratings DataSet 2

The data set contains 15480 observation and 9 variables.

Title : char datatype
IMDb score : num datatype
Hidden Gem Score : num datatype
View Rating : chr datatype
Rotten Tomatoes Score : int datatype
Metacritic Score : int datatype
Boxoffice : chr datatype
Release Date : chr datatype
IMDb votes : int datatype

## Setting Directory
getwd()

## [1] "C:/Users/CLOUD/Downloads"

setwd("C:/Users/CLOUD/Downloads")
getwd()

## [1] "C:/Users/CLOUD/Downloads"

## Importing Dataset 1 from the Directory
Media <- read.csv("Media_Details.csv")
head(Media)

## Importing Dataset 2 from the Directory
Digital_Media <- read.csv("Digital_Media_Ratings.csv")
head(Digital_Media)

## Merging the 2 data Sets
Media_Digital_Media <- inner_join(Media,Digital_Media,by=c("Title"))
Media_Digital_Media

Understand

##Summarize the types of variables and data structures
summary(Media_Digital_Media)

##     Title            Languages             Type             Runtime         
##  Length:16386       Length:16386       Length:16386       Length:16386      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    IMDb.Score    Hidden.Gem.Score View.Rating        Rotten.Tomatoes.Score
##  Min.   :1.000   Min.   :0.60     Length:16386       Min.   :  0.00       
##  1st Qu.:5.800   1st Qu.:3.80     Class :character   1st Qu.: 38.00       
##  Median :6.600   Median :6.70     Mode  :character   Median : 63.00       
##  Mean   :6.502   Mean   :5.89                        Mean   : 59.41       
##  3rd Qu.:7.300   3rd Qu.:7.90                        3rd Qu.: 82.00       
##  Max.   :9.700   Max.   :9.80                        Max.   :100.00       
##  NA's   :2143    NA's   :2145                        NA's   :9529         
##  Metacritic.Score   Boxoffice         Release.Date         IMDb.Votes     
##  Min.   :  5.00   Min.   :       72   Length:16386       Min.   :      5  
##  1st Qu.: 44.00   1st Qu.:   858576   Class :character   1st Qu.:    430  
##  Median : 57.00   Median : 20304578   Mode  :character   Median :   2521  
##  Mean   : 56.79   Mean   : 46113563                      Mean   :  44011  
##  3rd Qu.: 69.00   3rd Qu.: 60472742                      3rd Qu.:  23186  
##  Max.   :100.00   Max.   :659363944                      Max.   :2354197  
##  NA's   :11674    NA's   :12032                          NA's   :2145

## Attributes for Data
attributes(Media_Digital_Media)$class

## [1] "data.frame"

attributes(Media_Digital_Media)$names

##  [1] "Title"                 "Languages"             "Type"                 
##  [4] "Runtime"               "IMDb.Score"            "Hidden.Gem.Score"     
##  [7] "View.Rating"           "Rotten.Tomatoes.Score" "Metacritic.Score"     
## [10] "Boxoffice"             "Release.Date"          "IMDb.Votes"

## Converting Char to factor variable
Media_Digital_Media$Type <- factor(Media_Digital_Media$Type,levels = c('Movie','Series'),labels = c('MMovie','SSeries'))

## Checking the status of Levels in the merged data sets
levels(Media_Digital_Media$Type)

## [1] "MMovie"  "SSeries"

## Checking the Structure of data types
str(Media_Digital_Media)

## 'data.frame':    16386 obs. of  12 variables:
##  $ Title                : chr  "Lets Fight Ghost" "HOW TO BUILD A GIRL" "Centigrade" "ANNE+" ...
##  $ Languages            : chr  "Swedish, Spanish" "English" "English" "Turkish" ...
##  $ Type                 : Factor w/ 2 levels "MMovie","SSeries": 2 1 1 2 1 1 2 2 1 1 ...
##  $ Runtime              : chr  "< 30 minutes" "1-2 hour" "1-2 hour" "< 30 minutes" ...
##  $ IMDb.Score           : num  7.9 5.8 4.3 6.5 6.3 7.4 7.5 3.9 6.7 6.6 ...
##  $ Hidden.Gem.Score     : num  4.3 7 6.4 7.7 8.1 8.6 8.7 6.9 8.3 5.3 ...
##  $ View.Rating          : chr  "R" "R" "Unrated" "" ...
##  $ Rotten.Tomatoes.Score: int  98 79 NA NA NA NA NA NA NA NA ...
##  $ Metacritic.Score     : int  82 69 46 NA NA NA NA NA NA NA ...
##  $ Boxoffice            : int  2122065 70632 16263 NA NA NA NA NA NA NA ...
##  $ Release.Date         : chr  "12-Dec-08" "08-May-20" "28-Aug-20" "01-Oct-16" ...
##  $ IMDb.Votes           : int  205926 2838 1720 1147 63 131 47 107 88 5926 ...

Explanation

After merging the data sets (Media and Digital Media), we analyzed that the data structure and data variables give the descriptive statistics for each variable with the help of a summary ().
Columns’ names/attributes will go with the class which is merged with the attribute function. Since the merged data contains the categorical variable, we convert the following data type as a character to a factor variable.
For getting the structure we frame the structure by using the str() function for getting the final structure.

Tidy & Manipulate Data I

#Reshaping the data into Tidy format

# Separating records for Genre
New_Media <- Media_Digital_Media %>% separate(Languages,into=c("PrimaryLang","SecondaryLang"),sep=",")

## Warning: Expected 2 pieces. Additional pieces discarded in 1502 rows [10, 30,
## 67, 98, 107, 127, 142, 148, 167, 168, 183, 202, 232, 254, 255, 256, 258, 270,
## 307, 316, ...].

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 12599 rows [2, 3,
## 4, 5, 6, 7, 8, 9, 11, 12, 15, 16, 17, 18, 20, 21, 22, 23, 24, 25, ...].

#Separating release records
New_Media2 <- New_Media %>% separate(Release.Date,into=c("Release Date","Release Month", "Release Year"),sep= "-")

## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2143 rows [21,
## 23, 42, 59, 62, 77, 88, 124, 188, 241, 246, 265, 276, 288, 372, 380, 387, 410,
## 418, 433, ...].

# Changing Wide to Long
New_Media3 <- pivot_longer(New_Media2,names_to="Speech",values_to = "Language",cols = 2:3)

Explaination

The Main Principles of a Tidy data are :

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

For Untidy merged data:

The data frame Media_Digital_Media has variables e.g (Language, Release date) which contains more than one character separated by “,”.
With help of a separate()function, we evaluated the function which separates the comma-separated values into multiple columns.
Wherein for Release information we had three variables separated by “-” which was tidied by using separate() function as resultant into Release_Day, Release_Month, Release_Year.

Tidy & Manipulate Data II

# This is the R chunk for the Tidy & Manipulate Data II 
## Mutating the Data
Media_Digital_Media_Muta <- New_Media3 %>% mutate(Boxoffice_Million=Boxoffice/1000000)

## Average for Hidden gem score by Runtime
Media_Digital_Media_Muta %>% group_by(Runtime) %>% summarize(Avg_Rating=mean(Hidden.Gem.Score, na.rm=TRUE))

## Arranging the Data
Media_Digital_Media_Arra <- Media_Digital_Media_Muta %>% arrange(desc(Rotten.Tomatoes.Score))

Explanation

Tidyr package is used for the following functions. Mutated the Boxoffice(In millions) with Dataframe to add another column in the Dataframe by using mutate() function.
Taking average function for Hidden gem score by Runtime using groupby() and after that summarizing the resultant output we use summarize() function as output.
Arranged the data from highest to lowest scores by using arrange() function.

Scan I

## Checking for NA values
head(is.na(Media_Digital_Media))

##      Title Languages  Type Runtime IMDb.Score Hidden.Gem.Score View.Rating
## [1,] FALSE     FALSE FALSE   FALSE      FALSE            FALSE       FALSE
## [2,] FALSE     FALSE FALSE   FALSE      FALSE            FALSE       FALSE
## [3,] FALSE     FALSE FALSE   FALSE      FALSE            FALSE       FALSE
## [4,] FALSE     FALSE FALSE   FALSE      FALSE            FALSE       FALSE
## [5,] FALSE     FALSE FALSE   FALSE      FALSE            FALSE       FALSE
## [6,] FALSE     FALSE FALSE   FALSE      FALSE            FALSE       FALSE
##      Rotten.Tomatoes.Score Metacritic.Score Boxoffice Release.Date IMDb.Votes
## [1,]                 FALSE            FALSE     FALSE        FALSE      FALSE
## [2,]                 FALSE            FALSE     FALSE        FALSE      FALSE
## [3,]                  TRUE            FALSE     FALSE        FALSE      FALSE
## [4,]                  TRUE             TRUE      TRUE        FALSE      FALSE
## [5,]                  TRUE             TRUE      TRUE        FALSE      FALSE
## [6,]                  TRUE             TRUE      TRUE        FALSE      FALSE

## Total NA values
sum(is.na(Media_Digital_Media))

## [1] 39668

colSums(is.na(Media_Digital_Media))

##                 Title             Languages                  Type 
##                     0                     0                     0 
##               Runtime            IMDb.Score      Hidden.Gem.Score 
##                     0                  2143                  2145 
##           View.Rating Rotten.Tomatoes.Score      Metacritic.Score 
##                     0                  9529                 11674 
##             Boxoffice          Release.Date            IMDb.Votes 
##                 12032                     0                  2145

head(rowSums(is.na(Media_Digital_Media)))

## [1] 0 0 1 3 3 3

#location for NA values
head(which(is.na(Media_Digital_Media_Arra)))

## [1]  99215  99216  99453  99454 100607 100608

#Removing rows for NA values
Media_Digital_Media_NA <- na.omit(Media_Digital_Media_Arra)

## Scanning for Special Values
is.special <- function(x){ if (is.numeric(x))(is.infinite(x) | is.nan(x))
}
sapply(sapply(Media_Digital_Media,is.special),sum)

##                 Title             Languages                  Type 
##                     0                     0                     0 
##               Runtime            IMDb.Score      Hidden.Gem.Score 
##                     0                     0                     0 
##           View.Rating Rotten.Tomatoes.Score      Metacritic.Score 
##                     0                     0                     0 
##             Boxoffice          Release.Date            IMDb.Votes 
##                     0                     0                     0

Explanation

Data is scanned after reshaping and manipulating by using is.na() function.
Found the total NA values in the whole data frame for both column-wise and row-wise.
Followed by the missing/NA values are removed by using the complete.cases() function.
For scanning the special values, a function (is.special) is created which provides us the information on whether the data has infinite or NaN values.
sapply() function is employed to check for any special values in the whole dataset.

Scan II

# This is the R chunk for the Scan II

#BoxPlot(With Outliers)- Hidden.Gem.Score
boxplot(Media_Digital_Media_NA$Hidden.Gem.Score~Media_Digital_Media_NA$Runtime, main = "Hidden.Gem.Score", ylab = "Hidden.Gem.Score", xlab= "Media_Digital_Media_NA")

#BoxPlot(With Outliers)- Rotten.Tomatoes.Score
boxplot(Media_Digital_Media_NA$Rotten.Tomatoes.Score~Media_Digital_Media_NA$Runtime, main = "Rotten.Tomatoes.Score", ylab = "Rotten.Tomatoes.Score", xlab= "Media_Digital_Media_NA")

#Outliers (Hidden.Gem.Score)

Z.Hidden.Gem.Score <- Media_Digital_Media_NA$Hidden.Gem.Score %>% scores(type = "z") 
Z.Hidden.Gem.Score %>% summary()

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.05549 -0.66944 -0.08584  0.00000  0.42481  4.21820

head(which( abs(Z.Hidden.Gem.Score) >3 ))

## [1]  3  5  6  7  8 11

length (which( abs(Z.Hidden.Gem.Score) >3 ))

## [1] 133

#Capping Outliers 

#Cap Function

cap <- function(x){
    quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
    x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
    x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
    x
}

# capping outliers for population

Media_Digital_Media_NA$Hidden.Gem.Score <- Media_Digital_Media_NA$Hidden.Gem.Score %>% cap()

Explanation

The data having no missing or NA values, for which we need to check for outliers in the numerical variables, and for that firstly, we need to visualize using the boxplot() function.
After that z score method is applied, the outliers are detected by providing information of location and number of outliers with the help of which() and length() function.
The outliers are handled using Capping which is also known as the Winsorising method, this method is used as we are deal with the data which can have n numbers of extreme values.
Capping the values should be used as it takes the closest value so that we can have logical data/statistical approach for implementing the cap() function.

Transform

## Histogram for Hidden.Gem.Score
hist(Media_Digital_Media_NA$Hidden.Gem.Score)

## Transform

Hidden.Gem.Score.Transform <- log10(Media_Digital_Media_NA$Hidden.Gem.Score)

hist(Hidden.Gem.Score.Transform)

## Scaling

Media_Digital_Media_scale <- scale(Media_Digital_Media_NA$Hidden.Gem.Score, center =FALSE, scale =TRUE)
head(Media_Digital_Media_scale,5)

##          [,1]
## [1,] 1.177079
## [2,] 1.177079
## [3,] 1.793645
## [4,] 1.205105
## [5,] 1.793645

hist(Media_Digital_Media_scale)

# reviewing the normalization curves

qqnorm(Hidden.Gem.Score.Transform)

qqnorm(Media_Digital_Media_scale)

## Explanation

For checking skewness of the data for certain variables we apply appropriate transforming techniques, where histograms is use to visualize the variables and data distribution of Hidden.Gem.Score is been found to be right and then the qq plots have been shown to review the variation.
We change the scale of a certain variables or standardize the values of a variable for better understanding.

Data Wrangling (Data Preprocessing)

Practical Assessment 2

Akash Singh s3871025

Introduction

Required packages

Executive Summary

Data

Data Source

Variable Description

Media Details DataSet 1

Digital Media Ratings DataSet 2

Understand

Explanation

Tidy & Manipulate Data I

Explaination

Tidy & Manipulate Data II

Explanation

Scan I

Explanation

Scan II

Explanation

Transform