Required packages

The following packages were loaded for use in the report:

library(readr)
library(foreign)
library(gdata)
library(rvest)
library(dplyr)
library(tidyr)
library(knitr)
library(deductive)
library(validate)
library(Hmisc)
library(stringr)
library(outliers)
library(MVN)
library(forecast)
library(infotheo)
library(lubridate)

Executive Summary

The first step of preprocessing was to import the three data sets into R using the read.csv() function. The imported data sets were then combined using rbind() to produce the main data set used throughout the report. The data types of the variables were checked using glimpse(). Of the twelve variables in the data set, six were converted to factors using as.factor() (five from character and Is_NF_original from logical), with Is_NF_original then relabelled using factor(). Two variables were converted from character to numeric using as.numeric(). The separate() function was then used to split one column into multiple columns to tidy the data set. The data set was scanned for missing values using sum(is.na()) and colSums(is.na()) to identify the total number of missing values and the columns containing them, respectively. The missing values in the RT_rating(%) column were imputed using the mean of the variable. An is.special function was created to check all numerical variables for special values (infinite and NaN) and was applied to the data using sapply(). Boxplots were then used to scan all numerical variables for outliers. The RT_rating(%) variable was shown to have outliers, which were dealt with by capping using the sapply() function. The distribution of the Average_rating variable was checked using histogram() and transformed using the BoxCox() function. After these preprocessing steps were completed, the data was ready for analysis.

Data

The data sets were found on the Kaggle website and can be accessed at the following URL: https://www.kaggle.com/intandea/netflix2020. The three data sets chosen record the weekly number one Netflix movie and TV show in Australia (AUS_NF), Great Britain (GBR_NF) and the United States of America (USA_NF). Each data set includes 12 variables: week, show_type, title, ori_country, genre, release_date, is_NF_Ori, imdb_rating, rt_rating, country_chart, show_link and Continent.

The three data sets were imported into R using the read.csv() function as shown below. The argument stringsAsFactors = FALSE was used to tell R not to convert character columns into factors. The head of each data set was then printed.

AUS_NF <- read.csv("D:/RMIT/Semester 2 2020/MATH2349/Assignment 2/AUS_NF.csv", stringsAsFactors = FALSE)
print.data.frame(head(AUS_NF))
  week show_type               title    ori_country           genre release_date is_NF_Ori imdb_rating
1   37     Movie    Love, Guaranteed            USA          Comedy    3/09/2020      TRUE         55%
2   37   TV Show                Away            USA Science Fiction    4/09/2020      TRUE         71%
3   36     Movie Mary Queen of Scots United Kingdom           Drama   21/12/2018     FALSE         65%
4   36   TV Show           Cobra Kai            USA          Action    2/05/2018     FALSE         88%
5   35     Movie       The Sleepover            USA          Comedy   21/08/2020      TRUE         55%
6   35   TV Show             Lucifer            USA       Superhero   25/01/2016      TRUE         83%
  rt_rating country_chart                                             show_link Continent
1       50%           AUS          https://flixpatrol.com/title/love-guaranteed       OCE
2       73%           AUS                https://flixpatrol.com/title/away-2020       OCE
3       62%           AUS https://flixpatrol.com/title/mary-queen-of-scots-2018       OCE
4       94%           AUS                https://flixpatrol.com/title/cobra-kai       OCE
5       80%           AUS            https://flixpatrol.com/title/the-sleepover       OCE
6       87%           AUS                  https://flixpatrol.com/title/lucifer       OCE
GBR_NF <- read.csv("D:/RMIT/Semester 2 2020/MATH2349/Assignment 2/GBR_NF.csv", stringsAsFactors = FALSE)
print.data.frame(head(GBR_NF))
  week show_type                             title ori_country           genre release_date is_NF_Ori
1   37     Movie Charlie and the Chocolate Factory         USA         Fantasy    9/07/2005     FALSE
2   37   TV Show                              Away         USA Science Fiction    4/09/2020      TRUE
3   36     Movie                             Venom         USA       Superhero    3/10/2018     FALSE
4   36   TV Show                         Cobra Kai         USA          Action    2/05/2018     FALSE
5   35     Movie                         Bee Movie         USA        Animated   28/10/2007     FALSE
6   35   TV Show                          The Fall     Ireland           Crime   12/05/2013     FALSE
  imdb_rating rt_rating country_chart                                                      show_link
1         67%       83%           GBR https://flixpatrol.com/title/charlie-and-the-chocolate-factory
2         71%       73%           GBR                         https://flixpatrol.com/title/away-2020
3         71%       29%           GBR                        https://flixpatrol.com/title/venom-2018
4         88%       94%           GBR                         https://flixpatrol.com/title/cobra-kai
5         62%       50%           GBR                         https://flixpatrol.com/title/bee-movie
6         82%                     GBR                     https://flixpatrol.com/title/the-fall-2013
  Continent
1       EUR
2       EUR
3       EUR
4       EUR
5       EUR
6       EUR
USA_NF <- read.csv("D:/RMIT/Semester 2 2020/MATH2349/Assignment 2/USA_NF.csv", stringsAsFactors = FALSE)
print.data.frame(head(USA_NF))
  week show_type             title ori_country           genre release_date is_NF_Ori imdb_rating rt_rating
1   37     Movie  Love, Guaranteed         USA          Comedy    3/09/2020      TRUE         55%       50%
2   37   TV Show              Away         USA Science Fiction    4/09/2020      TRUE         71%       73%
3   36     Movie The Frozen Ground         USA                   23/08/2013     FALSE         64%          
4   36   TV Show         Cobra Kai         USA          Action    2/05/2018     FALSE         88%       94%
5   35     Movie     Project Power         USA          Action   14/08/2020      TRUE         61%       63%
6   35   TV Show           Lucifer         USA       Superhero   25/01/2016      TRUE         83%       87%
  country_chart                                      show_link Continent
1           USA   https://flixpatrol.com/title/love-guaranteed       AME
2           USA         https://flixpatrol.com/title/away-2020       AME
3           USA https://flixpatrol.com/title/the-frozen-ground       AME
4           USA         https://flixpatrol.com/title/cobra-kai       AME
5           USA     https://flixpatrol.com/title/project-power       AME
6           USA           https://flixpatrol.com/title/lucifer       AME
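
The effect of the stringsAsFactors argument can be sketched on a small inline CSV. The text below is made-up sample input in the same shape as the first three columns of the data sets, not one of the Kaggle files, and the object names are illustrative only.

```r
# Made-up CSV text with the same first three columns as the data sets
csv_text <- paste(
  'week,show_type,title',
  '37,Movie,"Love, Guaranteed"',
  '37,TV Show,Away',
  sep = "\n"
)

# Read the same text twice, changing only stringsAsFactors
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$title)  # character
class(as_fct$title)  # factor
```

Since R 4.0.0 the default for stringsAsFactors is FALSE, so stating it explicitly mainly documents the intent and keeps the behaviour the same on older versions of R.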

The three data sets were merged to create the main data set for the report using the rbind() function. The options(max.print = 10000) call was used to increase the max.print limit so that rows were not omitted when printing the combined data set. The arrange(desc()) functions were used to sort the combined data in descending order of week, so that all results for week 37, week 36 and so on appear together.

options(max.print = 10000)
combined <- GBR_NF %>% rbind(USA_NF, AUS_NF)
NF_combined <- combined %>% arrange(desc(week))
print.data.frame(head(NF_combined))
  week show_type                             title ori_country           genre release_date is_NF_Ori
1   37     Movie Charlie and the Chocolate Factory         USA         Fantasy    9/07/2005     FALSE
2   37   TV Show                              Away         USA Science Fiction    4/09/2020      TRUE
3   37     Movie                  Love, Guaranteed         USA          Comedy    3/09/2020      TRUE
4   37   TV Show                              Away         USA Science Fiction    4/09/2020      TRUE
5   37     Movie                  Love, Guaranteed         USA          Comedy    3/09/2020      TRUE
6   37   TV Show                              Away         USA Science Fiction    4/09/2020      TRUE
  imdb_rating rt_rating country_chart                                                      show_link
1         67%       83%           GBR https://flixpatrol.com/title/charlie-and-the-chocolate-factory
2         71%       73%           GBR                         https://flixpatrol.com/title/away-2020
3         55%       50%           USA                   https://flixpatrol.com/title/love-guaranteed
4         71%       73%           USA                         https://flixpatrol.com/title/away-2020
5         55%       50%           AUS                   https://flixpatrol.com/title/love-guaranteed
6         71%       73%           AUS                         https://flixpatrol.com/title/away-2020
  Continent
1       EUR
2       EUR
3       AME
4       AME
5       OCE
6       OCE
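
The stack-then-sort step can be sketched in base R on two tiny made-up tables (the names gbr, usa and ordered are illustrative and the values are not taken from the report). rbind() requires the data frames to share the same column names, and order() on the negated week gives the same descending sort as arrange(desc(week)).

```r
# Two made-up charts with identical column names
gbr <- data.frame(week = c(37, 36), title = c("A", "B"),
                  country_chart = "GBR", stringsAsFactors = FALSE)
usa <- data.frame(week = c(37, 36), title = c("C", "D"),
                  country_chart = "USA", stringsAsFactors = FALSE)

# Stack the rows, then sort by week in descending order
combined <- rbind(gbr, usa)
ordered  <- combined[order(-combined$week), ]

ordered$week           # 37 37 36 36
ordered$country_chart  # "GBR" "USA" "GBR" "USA"
```

Rows with the same week keep the order in which the data frames were stacked, which is why the GBR rows come first within each week in this sketch.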

The colnames() function was used to change the column names in the main data set.

colnames(NF_combined) <- c("Week", "Show_type", "Title", "Origin_country", 
                           "Genre", "Release_date", "Is_NF_original", 
                           "IMDB_rating(%)", "RT_rating(%)", "Country_chart",
                           "Show_link", "Continent")
print.data.frame(head(NF_combined))
  Week Show_type                             Title Origin_country           Genre Release_date
1   37     Movie Charlie and the Chocolate Factory            USA         Fantasy    9/07/2005
2   37   TV Show                              Away            USA Science Fiction    4/09/2020
3   37     Movie                  Love, Guaranteed            USA          Comedy    3/09/2020
4   37   TV Show                              Away            USA Science Fiction    4/09/2020
5   37     Movie                  Love, Guaranteed            USA          Comedy    3/09/2020
6   37   TV Show                              Away            USA Science Fiction    4/09/2020
  Is_NF_original IMDB_rating(%) RT_rating(%) Country_chart
1          FALSE            67%          83%           GBR
2           TRUE            71%          73%           GBR
3           TRUE            55%          50%           USA
4           TRUE            71%          73%           USA
5           TRUE            55%          50%           AUS
6           TRUE            71%          73%           AUS
                                                       Show_link Continent
1 https://flixpatrol.com/title/charlie-and-the-chocolate-factory       EUR
2                         https://flixpatrol.com/title/away-2020       EUR
3                   https://flixpatrol.com/title/love-guaranteed       AME
4                         https://flixpatrol.com/title/away-2020       AME
5                   https://flixpatrol.com/title/love-guaranteed       OCE
6                         https://flixpatrol.com/title/away-2020       OCE

Understand

The dimensions and structure of the data set were checked using the glimpse() function, which showed 154 rows and 12 columns. The same output was used to check the data types of the variables. It showed that column one is an integer variable, column seven is a logical variable and the other ten columns are all character variables. Several of these variables require data type conversion.

glimpse(NF_combined)
Rows: 154
Columns: 12
$ Week             <int> 37, 37, 37, 37, 37, 37, 36, 36, 36, 36, 36, 36, 35, 35, 35, 35, 35, 35, 34, 34,...
$ Show_type        <chr> "Movie", "TV Show", "Movie", "TV Show", "Movie", "TV Show", "Movie", "TV Show",...
$ Title            <chr> "Charlie and the Chocolate Factory", "Away", "Love, Guaranteed", "Away", "Love,...
$ Origin_country   <chr> "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "United K...
$ Genre            <chr> "Fantasy", "Science Fiction", "Comedy", "Science Fiction", "Comedy", "Science F...
$ Release_date     <chr> "9/07/2005", "4/09/2020", "3/09/2020", "4/09/2020", "3/09/2020", "4/09/2020", "...
$ Is_NF_original   <lgl> FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
$ `IMDB_rating(%)` <chr> "67%", "71%", "55%", "71%", "55%", "71%", "71%", "88%", "64%", "88%", "65%", "8...
$ `RT_rating(%)`   <chr> "83%", "73%", "50%", "73%", "50%", "73%", "29%", "94%", "", "94%", "62%", "94%"...
$ Country_chart    <chr> "GBR", "GBR", "USA", "USA", "AUS", "AUS", "GBR", "GBR", "USA", "USA", "AUS", "A...
$ Show_link        <chr> "https://flixpatrol.com/title/charlie-and-the-chocolate-factory", "https://flix...
$ Continent        <chr> "EUR", "EUR", "AME", "AME", "OCE", "OCE", "EUR", "EUR", "AME", "AME", "OCE", "O...

The following variables were converted from character to factor using the as.factor() function. The levels() function was used to check the levels of the factor variable and see if any of the variables needed to be labelled and/or ordered.

NF_combined$Show_type <- as.factor(NF_combined$Show_type)
levels(NF_combined$Show_type)
[1] "Documentary"    "Documentary TV" "Movie"          "TV Show"       
NF_combined$Origin_country <- as.factor(NF_combined$Origin_country)
levels(NF_combined$Origin_country)
[1] "Canada"         "France"         "Ireland"        "Poland"         "Spain"          "Switzerland"   
[7] "United Kingdom" "USA"           
NF_combined$Genre <- as.factor(NF_combined$Genre)
levels(NF_combined$Genre)
 [1] ""                "Action"          "Adventure"       "Animated"        "Comedy"         
 [6] "Crime"           "Documentary"     "Drama"           "Fantasy"         "Horror"         
[11] "Mystery"         "Reality-Show"    "Romance"         "Science Fiction" "Superhero"      
[16] "Thriller"       
NF_combined$Country_chart <- as.factor(NF_combined$Country_chart)
levels(NF_combined$Country_chart)
[1] "AUS" "GBR" "USA"
NF_combined$Continent <- as.factor(NF_combined$Continent)
levels(NF_combined$Continent)
[1] "AME" "EUR" "OCE"

The variable Is_NF_original was converted from a logical data type to a factor using the as.factor() function. The levels() function was used to check the levels of the factor variable. The factor() function was then used to change the labels assigned to the levels of the variable. Lastly, the levels() function was used again to check the new levels.

NF_combined$Is_NF_original <- as.factor(NF_combined$Is_NF_original)
levels(NF_combined$Is_NF_original)
[1] "FALSE" "TRUE" 
NF_combined$Is_NF_original <- NF_combined$Is_NF_original %>% 
  factor(levels = c("FALSE", "TRUE"), 
         labels = c("Not Netflix Original", "Netflix Original"))
levels(NF_combined$Is_NF_original)
[1] "Not Netflix Original" "Netflix Original"    
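
The relabelling can also be done in a single factor() call on the logical values. This is a minimal sketch on a made-up logical vector (is_original and labelled are illustrative names), not the report's data.

```r
# Made-up logical flags
is_original <- c(FALSE, TRUE, TRUE, FALSE)

# levels gives the order of the underlying values, labels gives the display names
labelled <- factor(is_original,
                   levels = c(FALSE, TRUE),
                   labels = c("Not Netflix Original", "Netflix Original"))

levels(labelled)
table(labelled)
```

Stating both levels and labels makes the mapping explicit, so a TRUE value can never silently end up under the "Not Netflix Original" label.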

The following two variables were converted from character to numeric using the as.numeric() function. The sub() function was used to remove the “%” from each value so that the variable could be changed to a numeric data type. The “%” was not needed in the values, as the column name already states that the numeric value is a percentage.

NF_combined$`IMDB_rating(%)` <- 
  as.numeric(sub("%", "", NF_combined$`IMDB_rating(%)`))
NF_combined$`RT_rating(%)` <- 
  as.numeric(sub("%", "", NF_combined$`RT_rating(%)`))
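
A minimal sketch of this conversion on made-up strings shows how blank ratings are handled. The vector raw is illustrative and not taken from the data set.

```r
# Made-up rating strings, including one blank entry
raw <- c("83%", "73%", "", "94%")

# Remove the literal percent sign, then convert to numeric
parsed <- as.numeric(sub("%", "", raw, fixed = TRUE))

parsed          # 83 73 NA 94
is.na(parsed)   # FALSE FALSE TRUE FALSE
```

The blank string becomes NA after conversion, which is how empty rating entries in the character column turn into missing values in the numeric column.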

Once all the data type conversions were completed, the glimpse() function was used to check that all variables had been changed into the correct data type. This is shown below.

glimpse(NF_combined)
Rows: 154
Columns: 12
$ Week             <int> 37, 37, 37, 37, 37, 37, 36, 36, 36, 36, 36, 36, 35, 35, 35, 35, 35, 35, 34, 34,...
$ Show_type        <fct> Movie, TV Show, Movie, TV Show, Movie, TV Show, Movie, TV Show, Movie, TV Show,...
$ Title            <chr> "Charlie and the Chocolate Factory", "Away", "Love, Guaranteed", "Away", "Love,...
$ Origin_country   <fct> USA, USA, USA, USA, USA, USA, USA, USA, USA, USA, United Kingdom, USA, USA, Ire...
$ Genre            <fct> Fantasy, Science Fiction, Comedy, Science Fiction, Comedy, Science Fiction, Sup...
$ Release_date     <chr> "9/07/2005", "4/09/2020", "3/09/2020", "4/09/2020", "3/09/2020", "4/09/2020", "...
$ Is_NF_original   <fct> Not Netflix Original, Netflix Original, Netflix Original, Netflix Original, Net...
$ `IMDB_rating(%)` <dbl> 67, 71, 55, 71, 55, 71, 71, 88, 64, 88, 65, 88, 62, 82, 61, 83, 55, 83, 61, 82,...
$ `RT_rating(%)`   <dbl> 83, 73, 50, 73, 50, 73, 29, 94, NA, 94, 62, 94, 50, NA, 63, 87, 80, 87, 63, NA,...
$ Country_chart    <fct> GBR, GBR, USA, USA, AUS, AUS, GBR, GBR, USA, USA, AUS, AUS, GBR, GBR, USA, USA,...
$ Show_link        <chr> "https://flixpatrol.com/title/charlie-and-the-chocolate-factory", "https://flix...
$ Continent        <fct> EUR, EUR, AME, AME, OCE, OCE, EUR, EUR, AME, AME, OCE, OCE, EUR, EUR, AME, AME,...

Tidy & Manipulate Data I

One of the common problems with messy data sets is that multiple variables are stored in one column, which does not conform to tidy data principles. The NF_combined data set has this problem, as the Release_date column stores the day, month and year together. To separate these three variables into their own columns, the separate() function from the tidyr package was used. The sep argument tells R which character to split on. The resulting data set was named Tidy_NF. The head() function was used to show the new columns of the tidy data set.

Tidy_NF <- NF_combined %>% 
  separate(Release_date,into = c("Release_day","Release_month","Release_year"), 
           sep = "/")
print.data.frame(head(Tidy_NF))
  Week Show_type                             Title Origin_country           Genre Release_day Release_month
1   37     Movie Charlie and the Chocolate Factory            USA         Fantasy           9            07
2   37   TV Show                              Away            USA Science Fiction           4            09
3   37     Movie                  Love, Guaranteed            USA          Comedy           3            09
4   37   TV Show                              Away            USA Science Fiction           4            09
5   37     Movie                  Love, Guaranteed            USA          Comedy           3            09
6   37   TV Show                              Away            USA Science Fiction           4            09
  Release_year       Is_NF_original IMDB_rating(%) RT_rating(%) Country_chart
1         2005 Not Netflix Original             67           83           GBR
2         2020     Netflix Original             71           73           GBR
3         2020     Netflix Original             55           50           USA
4         2020     Netflix Original             71           73           USA
5         2020     Netflix Original             55           50           AUS
6         2020     Netflix Original             71           73           AUS
                                                       Show_link Continent
1 https://flixpatrol.com/title/charlie-and-the-chocolate-factory       EUR
2                         https://flixpatrol.com/title/away-2020       EUR
3                   https://flixpatrol.com/title/love-guaranteed       AME
4                         https://flixpatrol.com/title/away-2020       AME
5                   https://flixpatrol.com/title/love-guaranteed       OCE
6                         https://flixpatrol.com/title/away-2020       OCE
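
As an alternative sketch in base R (an assumption for illustration, not the method used in the report), the same day/month/year strings can be parsed as dates and the components extracted as integers. The vector release holds three values copied from the output above.

```r
# Three release dates in day/month/year form
release <- c("9/07/2005", "4/09/2020", "21/12/2018")

# Parse as Date objects using the day/month/year format
parsed <- as.Date(release, format = "%d/%m/%Y")

# Extract each component as an integer column
parts <- data.frame(
  Release_day   = as.integer(format(parsed, "%d")),
  Release_month = as.integer(format(parsed, "%m")),
  Release_year  = as.integer(format(parsed, "%Y"))
)
parts
```

With separate(), the new columns are character by default; its convert = TRUE argument is one way to obtain integer columns directly.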

Scan I

The Tidy_NF data set was first scanned for missing values using sum(is.na()) to show the total count of missing values in the data. The colSums(is.na()) functions were then used to see which columns contained the missing values. Together they showed a total of 40 missing values, all in the RT_rating(%) column. This approach was chosen because identifying the affected columns makes it possible to choose an appropriate treatment for the missing values based on the data type of each variable.

sum(is.na(Tidy_NF))
[1] 40
colSums(is.na(Tidy_NF))
          Week      Show_type          Title Origin_country          Genre    Release_day  Release_month 
             0              0              0              0              0              0              0 
  Release_year Is_NF_original IMDB_rating(%)   RT_rating(%)  Country_chart      Show_link      Continent 
             0              0              0             40              0              0              0 
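
Note that is.na() only counts values stored as NA. Blank strings, such as the "" level that appears among the Genre levels earlier, are not counted. A minimal sketch on a made-up data frame (toy and blank_as_na are illustrative names) shows the difference.

```r
# Made-up data with one blank string and one NA
toy <- data.frame(Genre = c("Comedy", "", "Drama"),
                  RT    = c(50, NA, 62),
                  stringsAsFactors = FALSE)

colSums(is.na(toy))          # the blank Genre is not counted

# Recode blank strings as NA before counting
blank_as_na <- toy
blank_as_na$Genre[blank_as_na$Genre == ""] <- NA

colSums(is.na(blank_as_na))  # the blank Genre is now counted
```

Whether blanks should be treated as missing is a judgment call for the analysis; the sketch only shows how they could be made visible to the scan.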

The missing values were handled using the impute() function from the Hmisc package, replacing each missing value in the RT_rating(%) column with the mean of the variable. Mean imputation was chosen because it is a simple and common way of replacing missing values in numeric variables. The round() function was then used to round the column to whole numbers, consistent with the other ratings.

Tidy_NF$`RT_rating(%)` <- impute(Tidy_NF$`RT_rating(%)`, fun = mean)
Tidy_NF$`RT_rating(%)` <- round(Tidy_NF$`RT_rating(%)`)

The RT_rating(%) variable was converted back to a numeric data type after imputation, because impute() returns an object of class impute. This simplifies the outlier scan in later preprocessing steps.

Tidy_NF$`RT_rating(%)` <- as.numeric(Tidy_NF$`RT_rating(%)`)
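
The same mean imputation can be sketched in base R without Hmisc. The vector rt is made up for illustration and the approach is an equivalent sketch, not the call used in the report.

```r
# Made-up ratings with two missing values
rt <- c(83, 73, NA, 94, NA, 50)

# Replace each NA with the mean of the observed values, then round
rt_imputed <- rt
rt_imputed[is.na(rt_imputed)] <- mean(rt, na.rm = TRUE)
rt_imputed <- round(rt_imputed)

rt_imputed  # 83 73 75 94 75 50
```

Every imputed value equals the same number, so a column with many missing values gains a spike at the mean. This is worth keeping in mind when reading the later boxplots and histograms.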

Special values were scanned for by first defining an is.special function, which checks whether each value in a numerical column is infinite or NaN. The sapply() function, from the apply family of functions, was then used to apply it to every column of the data. This is shown below.

is.special <- function(x){
  if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
sapply(Tidy_NF, is.special)
$Week
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [18] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [35] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [52] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [69] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [86] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[103] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[120] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[137] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[154] FALSE

$Show_type
NULL

$Title
NULL

$Origin_country
NULL

$Genre
NULL

$Release_day
NULL

$Release_month
NULL

$Release_year
NULL

$Is_NF_original
NULL

$`IMDB_rating(%)`
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [18] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [35] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [52] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [69] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [86] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[103] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[120] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[137] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[154] FALSE

$`RT_rating(%)`
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [18] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [35] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [52] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [69] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [86] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[103] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[120] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[137] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[154] FALSE

$Country_chart
NULL

$Show_link
NULL

$Continent
NULL
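
A more compact scan can report one count per column instead of one logical value per row. The sketch below uses a hypothetical helper, count_special, and a made-up data frame; it is a variant for illustration, not the function used in the report.

```r
# Count infinite and NaN values in numeric columns; NA marks non-numeric columns
count_special <- function(x) {
  if (is.numeric(x)) sum(is.infinite(x) | is.nan(x)) else NA_integer_
}

# Made-up data with one Inf and one NaN in column a
toy <- data.frame(a = c(1, Inf, NaN),
                  b = c("x", "y", "z"),
                  c = c(1, 2, 3),
                  stringsAsFactors = FALSE)

special_counts <- sapply(toy, count_special)
special_counts
```

A count of zero for every numeric column corresponds to the all-FALSE output shown above.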

Tidy & Manipulate Data II

A new variable was created and added to the data set from two existing variables. The new variable, Average_rating, was created from the IMDB_rating(%) and RT_rating(%) variables using the mutate() function. The two existing variables were added together and then divided by 2 to give an average rating for each movie/show in the data set. The head() function was used to show the new variable in the data set.

Tidy_NF <- 
  mutate(Tidy_NF, 
         Average_rating = (Tidy_NF$`IMDB_rating(%)` + Tidy_NF$`RT_rating(%)`)/2)
print.data.frame(head(Tidy_NF))
  Week Show_type                             Title Origin_country           Genre Release_day Release_month
1   37     Movie Charlie and the Chocolate Factory            USA         Fantasy           9            07
2   37   TV Show                              Away            USA Science Fiction           4            09
3   37     Movie                  Love, Guaranteed            USA          Comedy           3            09
4   37   TV Show                              Away            USA Science Fiction           4            09
5   37     Movie                  Love, Guaranteed            USA          Comedy           3            09
6   37   TV Show                              Away            USA Science Fiction           4            09
  Release_year       Is_NF_original IMDB_rating(%) RT_rating(%) Country_chart
1         2005 Not Netflix Original             67           83           GBR
2         2020     Netflix Original             71           73           GBR
3         2020     Netflix Original             55           50           USA
4         2020     Netflix Original             71           73           USA
5         2020     Netflix Original             55           50           AUS
6         2020     Netflix Original             71           73           AUS
                                                       Show_link Continent Average_rating
1 https://flixpatrol.com/title/charlie-and-the-chocolate-factory       EUR           75.0
2                         https://flixpatrol.com/title/away-2020       EUR           72.0
3                   https://flixpatrol.com/title/love-guaranteed       AME           52.5
4                         https://flixpatrol.com/title/away-2020       AME           72.0
5                   https://flixpatrol.com/title/love-guaranteed       OCE           52.5
6                         https://flixpatrol.com/title/away-2020       OCE           72.0

The Average_rating variable was converted from the impute class to a numeric data type to allow scanning for outliers in later preprocessing steps. This was done using the as.numeric() function as shown below.

Tidy_NF$Average_rating <- as.numeric(Tidy_NF$Average_rating)

Scan II

To scan for outliers in the numerical columns, Tukey’s method of outlier detection, as implemented in the boxplot, was used. Boxplots were chosen because each variable was examined on its own (univariately). Outliers are values that fall outside the “outlier fences”, which lie 1.5 times the interquartile range below the first quartile and above the third quartile, and are drawn as an “o” symbol on the boxplot. Four numerical variables in the data set were scanned for outliers using boxplots as shown below.

Tidy_NF$Week %>% boxplot(main = "Box Plot of Week", 
                         ylab = "Week", col = "grey")

Tidy_NF$`IMDB_rating(%)` %>% boxplot(main = "Box Plot of IMDB Rating", 
                                     ylab = "IMDB Rating", col = "grey")

Tidy_NF$`RT_rating(%)` %>% boxplot(main = "Box Plot of RT Rating", 
                                   ylab = "RT Rating", col = "grey")

Tidy_NF$Average_rating %>% boxplot(main = "Box Plot of Average Rating", 
                                   ylab = "Average Rating", col = "grey")
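
The fences behind the boxplots can also be computed numerically. This is a minimal sketch on made-up ratings (the vector x is not the report's data) using quantile() for the quartiles.

```r
# Made-up ratings with one unusually low value
x <- c(19, 50, 62, 63, 67, 67, 73, 80, 87, 94)

# Quartiles and interquartile range
q   <- unname(quantile(x, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Tukey's outlier fences
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Values outside the fences
outliers <- x[x < lower_fence | x > upper_fence]
outliers               # 19

# Values that boxplot() would draw as points
boxplot.stats(x)$out   # 19
```

boxplot() uses hinges rather than quantile() quartiles, so the two approaches can differ slightly near the fences, although they agree for this sample.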

As shown above, the only numerical variable containing outliers is RT_rating(%). This may be due to the imputation of the mean for missing values earlier in the preprocessing steps, which concentrates values around the mean and narrows the interquartile range. The chosen method of dealing with the outliers was capping. This was done by defining the cap function and then calling it through the sapply() function as shown below. The summary() function was then used to show the summary statistics of the RT_rating(%) variable.

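cap <- function(x){
    # 5th, 25th, 75th and 95th percentiles of x
    quantiles <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
    # Values beyond the Tukey fences are replaced by the 5th or 95th percentile
    x[x < quantiles[2] - 1.5 * IQR(x)] <- quantiles[1]
    x[x > quantiles[3] + 1.5 * IQR(x)] <- quantiles[4]
    x
}
# Note: sapply() passes one value at a time to cap(), so each value is compared
# with its own quantiles and nothing is changed; cap(Tidy_NF$`RT_rating(%)`)
# would apply the fences to the whole column.
Tidy_NF$`RT_rating(%)` <- sapply(Tidy_NF$`RT_rating(%)`, FUN = cap)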
summary(Tidy_NF$`RT_rating(%)`)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  19.00   63.00   67.00   66.86   80.00  100.00 
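
The difference between calling a capping function on each value and on the whole column can be sketched on made-up ratings. The vector x is illustrative and not the report's data; the cap function is the same as the one defined above.

```r
cap <- function(x) {
  quantiles <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  x[x < quantiles[2] - 1.5 * IQR(x)] <- quantiles[1]
  x[x > quantiles[3] + 1.5 * IQR(x)] <- quantiles[4]
  x
}

# Made-up ratings with one low outlier
x <- c(19, 50, 62, 63, 67, 67, 73, 80, 87, 94)

# Element-wise: each call sees a single value, whose IQR is zero
elementwise <- sapply(x, cap)

# Whole vector: fences and percentiles come from all values together
whole <- cap(x)

identical(elementwise, x)  # TRUE, nothing was capped
whole[1]                   # the low value is raised to the 5th percentile
```

In this sketch only the whole-vector call changes the low value. A minimum that is unchanged after capping, as with the value of 19 in the summary above, is consistent with the element-wise behaviour.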

Transform

The variable that was chosen to be transformed was Average_rating. This variable was chosen as it has a slight left-skewness as shown below by using the histogram() function.

histogram(Tidy_NF$Average_rating)

The transformation chosen for the Average_rating variable was the Box-Cox transformation. It was chosen because the data are slightly non-normal and the aim was to bring the distribution closer to normal. The BoxCox() function was used along with the histogram() function to show the transformation of the Average_rating variable as shown below. The distribution changed from slightly left-skewed to one that appears closer to bimodal. The transformed distribution helps in understanding how average ratings vary across the top movies and TV shows in the three countries.

boxcox_NF <- BoxCox(Tidy_NF$Average_rating, lambda = "auto")
histogram(boxcox_NF)
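
For positive values, the Box-Cox transformation with parameter lambda is (x^lambda - 1) / lambda, or log(x) when lambda is zero. The helper box_cox below is a hypothetical base R sketch of that formula on made-up values, not the forecast implementation.

```r
# Box-Cox transform for positive x; lambda = 0 is the log case
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

x <- c(1, 2, 4)

box_cox(x, 0)    # same as log(x)
box_cox(x, 1)    # 0 1 3, a shift that leaves the shape unchanged
box_cox(4, 0.5)  # 2
```

With lambda = "auto", BoxCox() selects the parameter automatically through BoxCox.lambda(), so the strength of the transformation depends on the data rather than on a value chosen by the analyst.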

