The following packages were installed and loaded to be used in the report:
library(readr)
library(foreign)
library(gdata)
library(rvest)
library(dplyr)
library(tidyr)
library(knitr)
library(deductive)
library(validate)
library(Hmisc)
library(stringr)
library(outliers)
library(MVN)
library(forecast)
library(infotheo)
library(lubridate)
The first step of preprocessing was to import the three data sets into R using the read.csv() function. They were then combined using rbind() to produce the main data set used throughout the report. The data types of the variables were checked using glimpse(). Of the twelve variables in the data set, six were converted to factors using as.factor() (five from character and Is_NF_original from logical), with Is_NF_original then relabelled using factor(). Two variables were converted from character to numeric using as.numeric(). The separate() function was then used to split one column into multiple columns to tidy the data set. The data set was scanned for missing values using sum(is.na()) and colSums(is.na()) to identify the total number of missing values and the columns containing them, respectively. The missing values in the RT_rating(%) column were imputed using the mean of the variable. An is.special function was created to check all numerical variables for special values (infinite and NaN) and was applied to the data using sapply(). A new variable, Average_rating, was created using mutate(). Boxplots were then used to scan all numerical variables for outliers. The RT_rating(%) variable was found to contain outliers, which were dealt with by capping using the sapply() function. The distribution of the Average_rating variable was checked using histogram() and transformed using the BoxCox() function. After these preprocessing steps, the data is ready for analysis.
The data sets were found on the Kaggle website and can be accessed at the following URL: https://www.kaggle.com/intandea/netflix2020. The three data sets chosen cover the weekly number one Netflix movie and TV show in Australia (AUS_NF), Great Britain (GBR_NF) and the United States of America (USA_NF). The data sets include 12 variables, which are listed below:
The three data sets chosen were imported into R using the read.csv() function as shown below. The argument stringsAsFactors = FALSE was used to tell R not to convert character columns into factors. The head of each data set was then printed.
AUS_NF <- read.csv("D:/RMIT/Semester 2 2020/MATH2349/Assignment 2/AUS_NF.csv", stringsAsFactors = FALSE)
print.data.frame(head(AUS_NF))
week show_type title ori_country genre release_date is_NF_Ori imdb_rating
1 37 Movie Love, Guaranteed USA Comedy 3/09/2020 TRUE 55%
2 37 TV Show Away USA Science Fiction 4/09/2020 TRUE 71%
3 36 Movie Mary Queen of Scots United Kingdom Drama 21/12/2018 FALSE 65%
4 36 TV Show Cobra Kai USA Action 2/05/2018 FALSE 88%
5 35 Movie The Sleepover USA Comedy 21/08/2020 TRUE 55%
6 35 TV Show Lucifer USA Superhero 25/01/2016 TRUE 83%
rt_rating country_chart show_link Continent
1 50% AUS https://flixpatrol.com/title/love-guaranteed OCE
2 73% AUS https://flixpatrol.com/title/away-2020 OCE
3 62% AUS https://flixpatrol.com/title/mary-queen-of-scots-2018 OCE
4 94% AUS https://flixpatrol.com/title/cobra-kai OCE
5 80% AUS https://flixpatrol.com/title/the-sleepover OCE
6 87% AUS https://flixpatrol.com/title/lucifer OCE
GBR_NF <- read.csv("D:/RMIT/Semester 2 2020/MATH2349/Assignment 2/GBR_NF.csv", stringsAsFactors = FALSE)
print.data.frame(head(GBR_NF))
week show_type title ori_country genre release_date is_NF_Ori
1 37 Movie Charlie and the Chocolate Factory USA Fantasy 9/07/2005 FALSE
2 37 TV Show Away USA Science Fiction 4/09/2020 TRUE
3 36 Movie Venom USA Superhero 3/10/2018 FALSE
4 36 TV Show Cobra Kai USA Action 2/05/2018 FALSE
5 35 Movie Bee Movie USA Animated 28/10/2007 FALSE
6 35 TV Show The Fall Ireland Crime 12/05/2013 FALSE
imdb_rating rt_rating country_chart show_link
1 67% 83% GBR https://flixpatrol.com/title/charlie-and-the-chocolate-factory
2 71% 73% GBR https://flixpatrol.com/title/away-2020
3 71% 29% GBR https://flixpatrol.com/title/venom-2018
4 88% 94% GBR https://flixpatrol.com/title/cobra-kai
5 62% 50% GBR https://flixpatrol.com/title/bee-movie
6 82% GBR https://flixpatrol.com/title/the-fall-2013
Continent
1 EUR
2 EUR
3 EUR
4 EUR
5 EUR
6 EUR
USA_NF <- read.csv("D:/RMIT/Semester 2 2020/MATH2349/Assignment 2/USA_NF.csv", stringsAsFactors = FALSE)
print.data.frame(head(USA_NF))
week show_type title ori_country genre release_date is_NF_Ori imdb_rating rt_rating
1 37 Movie Love, Guaranteed USA Comedy 3/09/2020 TRUE 55% 50%
2 37 TV Show Away USA Science Fiction 4/09/2020 TRUE 71% 73%
3 36 Movie The Frozen Ground USA 23/08/2013 FALSE 64%
4 36 TV Show Cobra Kai USA Action 2/05/2018 FALSE 88% 94%
5 35 Movie Project Power USA Action 14/08/2020 TRUE 61% 63%
6 35 TV Show Lucifer USA Superhero 25/01/2016 TRUE 83% 87%
country_chart show_link Continent
1 USA https://flixpatrol.com/title/love-guaranteed AME
2 USA https://flixpatrol.com/title/away-2020 AME
3 USA https://flixpatrol.com/title/the-frozen-ground AME
4 USA https://flixpatrol.com/title/cobra-kai AME
5 USA https://flixpatrol.com/title/project-power AME
6 USA https://flixpatrol.com/title/lucifer AME
The three data sets were merged to create the main data set for the report using the rbind() function. The options(max.print = 10000) call was used to raise the max.print limit so that rows were not omitted when printing the combined data set. The arrange(desc()) functions were used to sort the combined data in descending order of week, so that all results for week 37, then week 36, and so on, appear together.
options(max.print = 10000)
combined <- GBR_NF %>% rbind(USA_NF, AUS_NF)
NF_combined <- combined %>% arrange(desc(week))
print.data.frame(head(NF_combined))
week show_type title ori_country genre release_date is_NF_Ori
1 37 Movie Charlie and the Chocolate Factory USA Fantasy 9/07/2005 FALSE
2 37 TV Show Away USA Science Fiction 4/09/2020 TRUE
3 37 Movie Love, Guaranteed USA Comedy 3/09/2020 TRUE
4 37 TV Show Away USA Science Fiction 4/09/2020 TRUE
5 37 Movie Love, Guaranteed USA Comedy 3/09/2020 TRUE
6 37 TV Show Away USA Science Fiction 4/09/2020 TRUE
imdb_rating rt_rating country_chart show_link
1 67% 83% GBR https://flixpatrol.com/title/charlie-and-the-chocolate-factory
2 71% 73% GBR https://flixpatrol.com/title/away-2020
3 55% 50% USA https://flixpatrol.com/title/love-guaranteed
4 71% 73% USA https://flixpatrol.com/title/away-2020
5 55% 50% AUS https://flixpatrol.com/title/love-guaranteed
6 71% 73% AUS https://flixpatrol.com/title/away-2020
Continent
1 EUR
2 EUR
3 AME
4 AME
5 OCE
6 OCE
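The stacking and sorting step can be illustrated on a small scale. The following is a minimal base R sketch on made-up toy data (the three data frames and their values are hypothetical, not taken from the Kaggle files); it uses order() where the report uses arrange(desc()).

```r
# Toy stand-ins for the three country charts (hypothetical values).
gbr <- data.frame(week = c(37, 36), title = c("A", "B"), country_chart = "GBR")
usa <- data.frame(week = c(37, 36), title = c("C", "D"), country_chart = "USA")
aus <- data.frame(week = c(37, 36), title = c("C", "E"), country_chart = "AUS")

# rbind() stacks the rows; every input must have the same column names.
combined <- rbind(gbr, usa, aus)

# Sort by week, highest first. order() leaves ties in their original
# order, so within a week the rows stay GBR, USA, AUS.
combined <- combined[order(-combined$week), ]
rownames(combined) <- NULL

# Count fully duplicated rows as a quick sanity check after stacking.
n_duplicates <- sum(duplicated(combined))
print(combined)
```

Because each row carries its own country_chart value, the same title charting in two countries in the same week is not a duplicate row.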
The colnames() function was used to change the column names in the main data set.
colnames(NF_combined) <- c("Week", "Show_type", "Title", "Origin_country",
"Genre", "Release_date", "Is_NF_original",
"IMDB_rating(%)", "RT_rating(%)", "Country_chart",
"Show_link", "Continent")
print.data.frame(head(NF_combined))
Week Show_type Title Origin_country Genre Release_date
1 37 Movie Charlie and the Chocolate Factory USA Fantasy 9/07/2005
2 37 TV Show Away USA Science Fiction 4/09/2020
3 37 Movie Love, Guaranteed USA Comedy 3/09/2020
4 37 TV Show Away USA Science Fiction 4/09/2020
5 37 Movie Love, Guaranteed USA Comedy 3/09/2020
6 37 TV Show Away USA Science Fiction 4/09/2020
Is_NF_original IMDB_rating(%) RT_rating(%) Country_chart
1 FALSE 67% 83% GBR
2 TRUE 71% 73% GBR
3 TRUE 55% 50% USA
4 TRUE 71% 73% USA
5 TRUE 55% 50% AUS
6 TRUE 71% 73% AUS
Show_link Continent
1 https://flixpatrol.com/title/charlie-and-the-chocolate-factory EUR
2 https://flixpatrol.com/title/away-2020 EUR
3 https://flixpatrol.com/title/love-guaranteed AME
4 https://flixpatrol.com/title/away-2020 AME
5 https://flixpatrol.com/title/love-guaranteed OCE
6 https://flixpatrol.com/title/away-2020 OCE
The dimensions and structure of the data set were checked using the glimpse() function, which showed the dimensions to be 154 rows and 12 columns. The same output shows the data type of each variable: column one is an integer variable, column seven is a logical variable and the other ten columns are character variables. Several of these variables require data type conversion.
glimpse(NF_combined)
Rows: 154
Columns: 12
$ Week <int> 37, 37, 37, 37, 37, 37, 36, 36, 36, 36, 36, 36, 35, 35, 35, 35, 35, 35, 34, 34,...
$ Show_type <chr> "Movie", "TV Show", "Movie", "TV Show", "Movie", "TV Show", "Movie", "TV Show",...
$ Title <chr> "Charlie and the Chocolate Factory", "Away", "Love, Guaranteed", "Away", "Love,...
$ Origin_country <chr> "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "United K...
$ Genre <chr> "Fantasy", "Science Fiction", "Comedy", "Science Fiction", "Comedy", "Science F...
$ Release_date <chr> "9/07/2005", "4/09/2020", "3/09/2020", "4/09/2020", "3/09/2020", "4/09/2020", "...
$ Is_NF_original <lgl> FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
$ `IMDB_rating(%)` <chr> "67%", "71%", "55%", "71%", "55%", "71%", "71%", "88%", "64%", "88%", "65%", "8...
$ `RT_rating(%)` <chr> "83%", "73%", "50%", "73%", "50%", "73%", "29%", "94%", "", "94%", "62%", "94%"...
$ Country_chart <chr> "GBR", "GBR", "USA", "USA", "AUS", "AUS", "GBR", "GBR", "USA", "USA", "AUS", "A...
$ Show_link <chr> "https://flixpatrol.com/title/charlie-and-the-chocolate-factory", "https://flix...
$ Continent <chr> "EUR", "EUR", "AME", "AME", "OCE", "OCE", "EUR", "EUR", "AME", "AME", "OCE", "O...
The following variables were converted from character to factor using the as.factor() function. The levels() function was used to check the levels of the factor variable and see if any of the variables needed to be labelled and/or ordered.
NF_combined$Show_type <- as.factor(NF_combined$Show_type)
levels(NF_combined$Show_type)
[1] "Documentary" "Documentary TV" "Movie" "TV Show"
NF_combined$Origin_country <- as.factor(NF_combined$Origin_country)
levels(NF_combined$Origin_country)
[1] "Canada" "France" "Ireland" "Poland" "Spain" "Switzerland"
[7] "United Kingdom" "USA"
NF_combined$Genre <- as.factor(NF_combined$Genre)
levels(NF_combined$Genre)
[1] "" "Action" "Adventure" "Animated" "Comedy"
[6] "Crime" "Documentary" "Drama" "Fantasy" "Horror"
[11] "Mystery" "Reality-Show" "Romance" "Science Fiction" "Superhero"
[16] "Thriller"
NF_combined$Country_chart <- as.factor(NF_combined$Country_chart)
levels(NF_combined$Country_chart)
[1] "AUS" "GBR" "USA"
NF_combined$Continent <- as.factor(NF_combined$Continent)
levels(NF_combined$Continent)
[1] "AME" "EUR" "OCE"
The variable Is_NF_original was converted from a logical data type to a factor data type using the as.factor() function. The levels() function was used to check the levels of the factor variable. The factor() function was then used to change the labels assigned to the levels of the variable. Lastly, the levels() function was used again to check the new labels.
NF_combined$Is_NF_original <- as.factor(NF_combined$Is_NF_original)
levels(NF_combined$Is_NF_original)
[1] "FALSE" "TRUE"
NF_combined$Is_NF_original <- NF_combined$Is_NF_original %>%
factor(levels = c("FALSE", "TRUE"),
labels = c("Not Netflix Original", "Netflix Original"))
levels(NF_combined$Is_NF_original)
[1] "Not Netflix Original" "Netflix Original"
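The two-step conversion above can also be written as a single factor() call, since factor() accepts a logical vector directly. This is a minimal sketch on a hypothetical toy vector (is_original is an illustrative name, not a column from the data set):

```r
# Hypothetical logical vector standing in for Is_NF_original.
is_original <- c(FALSE, TRUE, TRUE, FALSE)

# levels gives the values to look for; labels gives the names to show,
# matched to the levels by position.
is_original_f <- factor(is_original,
                        levels = c(FALSE, TRUE),
                        labels = c("Not Netflix Original", "Netflix Original"))

levels(is_original_f)
table(is_original_f)
```

Listing the levels explicitly fixes their order, so the first label always corresponds to FALSE regardless of which value appears first in the data.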
The following two variables were converted from character to numeric using the as.numeric() function. The sub() function was used to remove the “%” from each value so that the variable could be changed to a numeric data type. The “%” was not needed, as the column name already indicates that the numeric value is a percentage.
NF_combined$`IMDB_rating(%)` <-
as.numeric(sub("%", "", NF_combined$`IMDB_rating(%)`))
NF_combined$`RT_rating(%)` <-
as.numeric(sub("%", "", NF_combined$`RT_rating(%)`))
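The effect of this conversion on individual values, including empty strings, can be seen in a small sketch. The vector below is hypothetical toy input; it mirrors the format of the rating columns, where some RT ratings are blank.

```r
# Hypothetical ratings in the same "NN%" format, with one blank entry.
ratings_chr <- c("67%", "71%", "", "94%")

# fixed = TRUE treats "%" as a literal character rather than a regex.
ratings_num <- as.numeric(sub("%", "", ratings_chr, fixed = TRUE))

ratings_num               # the blank entry becomes NA
sum(is.na(ratings_num))   # number of values that could not be converted
```

This is how blank entries in the character column end up as NA after the conversion, which is why missing values only become visible to is.na() at this stage.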
Once all the data type conversions were completed, the glimpse() function was used to check that all variables had been changed into the correct data type. This is shown below.
glimpse(NF_combined)
Rows: 154
Columns: 12
$ Week <int> 37, 37, 37, 37, 37, 37, 36, 36, 36, 36, 36, 36, 35, 35, 35, 35, 35, 35, 34, 34,...
$ Show_type <fct> Movie, TV Show, Movie, TV Show, Movie, TV Show, Movie, TV Show, Movie, TV Show,...
$ Title <chr> "Charlie and the Chocolate Factory", "Away", "Love, Guaranteed", "Away", "Love,...
$ Origin_country <fct> USA, USA, USA, USA, USA, USA, USA, USA, USA, USA, United Kingdom, USA, USA, Ire...
$ Genre <fct> Fantasy, Science Fiction, Comedy, Science Fiction, Comedy, Science Fiction, Sup...
$ Release_date <chr> "9/07/2005", "4/09/2020", "3/09/2020", "4/09/2020", "3/09/2020", "4/09/2020", "...
$ Is_NF_original <fct> Not Netflix Original, Netflix Original, Netflix Original, Netflix Original, Net...
$ `IMDB_rating(%)` <dbl> 67, 71, 55, 71, 55, 71, 71, 88, 64, 88, 65, 88, 62, 82, 61, 83, 55, 83, 61, 82,...
$ `RT_rating(%)` <dbl> 83, 73, 50, 73, 50, 73, 29, 94, NA, 94, 62, 94, 50, NA, 63, 87, 80, 87, 63, NA,...
$ Country_chart <fct> GBR, GBR, USA, USA, AUS, AUS, GBR, GBR, USA, USA, AUS, AUS, GBR, GBR, USA, USA,...
$ Show_link <chr> "https://flixpatrol.com/title/charlie-and-the-chocolate-factory", "https://flix...
$ Continent <fct> EUR, EUR, AME, AME, OCE, OCE, EUR, EUR, AME, AME, OCE, OCE, EUR, EUR, AME, AME,...
One of the common problems with messy data sets is that multiple variables are stored in one column, which does not conform to the tidy data principles. The NF_combined data set has this problem, as the Release_date column stores the day, month and year together. To split these three variables into separate columns, the separate() function from the tidyr package was used. The sep argument tells R which character to split the values on. The resulting data set was named Tidy_NF. The head() function was used to show the new columns of the tidy data set.
Tidy_NF <- NF_combined %>%
  separate(Release_date,
           into = c("Release_day", "Release_month", "Release_year"),
           sep = "/")
print.data.frame(head(Tidy_NF))
Week Show_type Title Origin_country Genre Release_day Release_month
1 37 Movie Charlie and the Chocolate Factory USA Fantasy 9 07
2 37 TV Show Away USA Science Fiction 4 09
3 37 Movie Love, Guaranteed USA Comedy 3 09
4 37 TV Show Away USA Science Fiction 4 09
5 37 Movie Love, Guaranteed USA Comedy 3 09
6 37 TV Show Away USA Science Fiction 4 09
Release_year Is_NF_original IMDB_rating(%) RT_rating(%) Country_chart
1 2005 Not Netflix Original 67 83 GBR
2 2020 Netflix Original 71 73 GBR
3 2020 Netflix Original 55 50 USA
4 2020 Netflix Original 71 73 USA
5 2020 Netflix Original 55 50 AUS
6 2020 Netflix Original 71 73 AUS
Show_link Continent
1 https://flixpatrol.com/title/charlie-and-the-chocolate-factory EUR
2 https://flixpatrol.com/title/away-2020 EUR
3 https://flixpatrol.com/title/love-guaranteed AME
4 https://flixpatrol.com/title/away-2020 AME
5 https://flixpatrol.com/title/love-guaranteed OCE
6 https://flixpatrol.com/title/away-2020 OCE
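Note that separate() leaves the three new columns as character unless convert = TRUE is supplied. As an alternative sketch, not the method used in this report, the same day/month/year parts can be obtained as integers in base R by parsing the string as a date first. The variable names below are illustrative only.

```r
# A few release dates in the same day/month/year format as the data.
release_chr <- c("9/07/2005", "4/09/2020", "21/12/2018")

# Parse as Date objects; %d and %m accept one- or two-digit values.
release_date <- as.Date(release_chr, format = "%d/%m/%Y")

# Extract each component as an integer.
release_day   <- as.integer(format(release_date, "%d"))
release_month <- as.integer(format(release_date, "%m"))
release_year  <- as.integer(format(release_date, "%Y"))

data.frame(release_date, release_day, release_month, release_year)
```

Keeping a proper Date column alongside the parts also makes it possible to sort by release date or compute time differences later.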
The Tidy_NF data set was scanned for missing values, first using sum(is.na()) to show the total count of missing values in the data. The colSums(is.na()) functions were then used to see which columns contained the missing values. Together they showed a total of 40 missing values, all in the RT_rating(%) column. This approach was chosen because knowing which columns contain missing values, and the data type of those columns, determines the appropriate way to deal with them.
sum(is.na(Tidy_NF))
[1] 40
colSums(is.na(Tidy_NF))
Week Show_type Title Origin_country Genre Release_day Release_month
0 0 0 0 0 0 0
Release_year Is_NF_original IMDB_rating(%) RT_rating(%) Country_chart Show_link Continent
0 0 0 40 0 0 0
The missing values were dealt with using the impute() function from the Hmisc package. The decision was to impute the mean for the missing values in the RT_rating(%) column, as the mean is a common and simple replacement for missing values in a numeric variable. The round() function was then used to round the column to whole numbers, matching the format of the observed ratings.
Tidy_NF$`RT_rating(%)` <- impute(Tidy_NF$`RT_rating(%)`, fun = mean)
Tidy_NF$`RT_rating(%)` <- round(Tidy_NF$`RT_rating(%)`)
The RT_rating(%) variable was re-converted to a numeric data type after imputing of missing values was completed. This was to help with the scanning of outliers in the further steps of preprocessing.
Tidy_NF$`RT_rating(%)` <- as.numeric(Tidy_NF$`RT_rating(%)`)
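For comparison, mean imputation can be sketched in base R without Hmisc. The vector below is hypothetical toy data; the result is an ordinary numeric vector, so no re-conversion with as.numeric() is needed in this variant.

```r
# Hypothetical ratings with two missing values.
rt <- c(83, 73, NA, 94, 50, NA)

# Mean of the observed values only: (83 + 73 + 94 + 50) / 4 = 75.
rt_mean <- mean(rt, na.rm = TRUE)

# Replace each NA with the mean, then round to whole percentages.
rt_imputed <- round(ifelse(is.na(rt), rt_mean, rt))

rt_imputed
```

One consequence worth keeping in mind is that every imputed value is identical, so mean imputation concentrates values at the centre of the distribution and reduces its spread.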
Special values were scanned for by first defining an is.special function, which checks whether a numerical column contains infinite or NaN values. The sapply() function, from the apply family of functions, was then used to apply it to every column of the data. Non-numeric columns return NULL, and a numeric column with no special values returns FALSE for every row. This is shown below.
is.special <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))
}
sapply(Tidy_NF, is.special)
$Week
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[35] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[52] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[69] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[86] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[103] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[120] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[137] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[154] FALSE
$Show_type
NULL
$Title
NULL
$Origin_country
NULL
$Genre
NULL
$Release_day
NULL
$Release_month
NULL
$Release_year
NULL
$Is_NF_original
NULL
$`IMDB_rating(%)`
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[35] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[52] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[69] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[86] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[103] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[120] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[137] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[154] FALSE
$`RT_rating(%)`
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[18] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[35] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[52] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[69] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[86] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[103] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[120] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[137] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[154] FALSE
$Country_chart
NULL
$Show_link
NULL
$Continent
NULL
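The output above contains no TRUE values, so none of the numeric columns hold infinite or NaN values. A more compact check, sketched below on hypothetical toy data, counts the special values per column instead of printing one logical value per row. The helper name count_special is illustrative and not part of any package.

```r
# Toy data frame with one Inf and two NaN values (hypothetical).
toy <- data.frame(a = c(1, Inf, 3),
                  b = c("x", "y", "z"),
                  c = c(NaN, 2, NaN))

# Count infinite or NaN entries in numeric columns; NA for other types.
count_special <- function(x) {
  if (is.numeric(x)) sum(is.infinite(x) | is.nan(x)) else NA_integer_
}

special_counts <- sapply(toy, count_special)
special_counts
```

A count of 0 for every numeric column would correspond to the all-FALSE output shown in the report.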
A new variable was created and added to the data set from two existing variables. The new variable, Average_rating, was created from the IMDB_rating(%) and RT_rating(%) variables using the mutate() function. The two existing variables were added together and then divided by 2 to give an average rating for each movie/show in the data set. The head() function was used to show the new variable in the data set.
Tidy_NF <-
  mutate(Tidy_NF,
         Average_rating = (`IMDB_rating(%)` + `RT_rating(%)`) / 2)
print.data.frame(head(Tidy_NF))
Week Show_type Title Origin_country Genre Release_day Release_month
1 37 Movie Charlie and the Chocolate Factory USA Fantasy 9 07
2 37 TV Show Away USA Science Fiction 4 09
3 37 Movie Love, Guaranteed USA Comedy 3 09
4 37 TV Show Away USA Science Fiction 4 09
5 37 Movie Love, Guaranteed USA Comedy 3 09
6 37 TV Show Away USA Science Fiction 4 09
Release_year Is_NF_original IMDB_rating(%) RT_rating(%) Country_chart
1 2005 Not Netflix Original 67 83 GBR
2 2020 Netflix Original 71 73 GBR
3 2020 Netflix Original 55 50 USA
4 2020 Netflix Original 71 73 USA
5 2020 Netflix Original 55 50 AUS
6 2020 Netflix Original 71 73 AUS
Show_link Continent Average_rating
1 https://flixpatrol.com/title/charlie-and-the-chocolate-factory EUR 75.0
2 https://flixpatrol.com/title/away-2020 EUR 72.0
3 https://flixpatrol.com/title/love-guaranteed AME 52.5
4 https://flixpatrol.com/title/away-2020 AME 72.0
5 https://flixpatrol.com/title/love-guaranteed OCE 52.5
6 https://flixpatrol.com/title/away-2020 OCE 72.0
The Average_rating variable was converted from the impute class to a numeric data type to allow for the scanning of outliers in the later preprocessing steps. This was done using the as.numeric() function as shown below.
Tidy_NF$Average_rating <- as.numeric(Tidy_NF$Average_rating)
To scan for outliers in the numerical variable columns, the “Tukey’s method of outlier detection” in the boxplot was used. Boxplots were chosen as the methodology as the variables were univariate. Outliers are those values that are outside the “outlier fences” and are depicted as an “o” symbol on the boxplot. Four numerical variables in the data set were scanned for outliers using boxplots as shown below.
Tidy_NF$Week %>% boxplot(main = "Box Plot of Week",
ylab = "Week", col = "grey")
Tidy_NF$`IMDB_rating(%)` %>% boxplot(main = "Box Plot of IMDB Rating",
ylab = "IMDB Rating", col = "grey")
Tidy_NF$`RT_rating(%)` %>% boxplot(main = "Box Plot of RT Rating",
ylab = "RT Rating", col = "grey")
Tidy_NF$Average_rating %>% boxplot(main = "Box Plot of Average Rating",
ylab = "Average Rating", col = "grey")
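The fences that the boxplots draw can also be computed numerically. The sketch below applies Tukey's rule to a hypothetical toy vector: values more than 1.5 times the interquartile range below the first quartile or above the third quartile are flagged. Note that boxplot() computes its hinges slightly differently from quantile(), so borderline values may be classified differently.

```r
# Hypothetical values with one clearly extreme observation.
x <- c(10, 12, 13, 14, 15, 16, 18, 60)

q1  <- quantile(x, 0.25, names = FALSE)   # first quartile
q3  <- quantile(x, 0.75, names = FALSE)   # third quartile
iqr <- q3 - q1                            # interquartile range

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are the ones a boxplot marks as points.
outliers <- x[x < lower_fence | x > upper_fence]
outliers
```

The boxplots apply the same rule to each of the four numerical variables.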
As shown above, the only numerical variable containing outliers is the RT_rating. This may be due to the imputation of the mean for missing values earlier in the preprocessing steps. The chosen method of dealing with the outliers was capping. This was done by establishing the cap function and then using the sapply() function as shown below. The summary() function was then used to show the summary statistics of the RT_rating(%) variable.
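cap <- function(x) {
  # 5th, 25th, 75th and 95th percentiles of the variable
  quantiles <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  # values below the lower fence are replaced by the 5th percentile
  x[x < quantiles[2] - 1.5 * IQR(x)] <- quantiles[1]
  # values above the upper fence are replaced by the 95th percentile
  x[x > quantiles[3] + 1.5 * IQR(x)] <- quantiles[4]
  x
}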
Tidy_NF$`RT_rating(%)` <- sapply(Tidy_NF$`RT_rating(%)`, FUN = cap)
summary(Tidy_NF$`RT_rating(%)`)
Min. 1st Qu. Median Mean 3rd Qu. Max.
19.00 63.00 67.00 66.86 80.00 100.00
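One point to check when capping a single column: sapply() over a numeric vector calls cap on each value separately. For a single value every quantile equals that value and the IQR is 0, so nothing falls outside the fences and nothing is changed. The sketch below, on hypothetical toy data, compares that call with applying cap to the whole vector at once. When cap is applied across the columns of a data frame, sapply() passes whole columns and this issue does not arise.

```r
# Same capping rule as in the report.
cap <- function(x) {
  quantiles <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  x[x < quantiles[2] - 1.5 * IQR(x)] <- quantiles[1]
  x[x > quantiles[3] + 1.5 * IQR(x)] <- quantiles[4]
  x
}

# Hypothetical values with one extreme observation.
x <- c(10, 12, 13, 14, 15, 16, 18, 60)

# Element-wise: cap sees one value at a time, so the data are unchanged.
elementwise <- sapply(x, FUN = cap)

# Whole vector: quantiles come from all values, so 60 is capped
# at the 95th percentile of x.
whole_vector <- cap(x)

rbind(elementwise, whole_vector)
```

Comparing summary() of the column before and after capping is a quick way to confirm whether any values were actually replaced.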
The variable that was chosen to be transformed was Average_rating. This variable was chosen as it has a slight left-skewness as shown below by using the histogram() function.
histogram(Tidy_NF$Average_rating)
The transformation chosen for the Average_rating variable was the Box-Cox transformation. It was chosen because the data show a slightly non-normal distribution and the aim was to bring it closer to a normal distribution. The BoxCox() function was used along with the histogram() function to show the transformed Average_rating variable, as shown below. The distribution changed from slightly left-skewed to one that appears roughly bimodal. This new distribution will help provide understanding of the average rating across the top movies/TV shows in the three countries.
boxcox_NF <- BoxCox(Tidy_NF$Average_rating, lambda = "auto")
histogram(boxcox_NF)
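For reference, the Box-Cox transformation for a fixed parameter λ and positive data is (x^λ - 1)/λ when λ is not 0, and log(x) when λ is 0. With lambda = "auto", the forecast package chooses λ itself. The sketch below implements the formula by hand for a given λ; box_cox_manual is a hypothetical helper name and the input values are toy data.

```r
# Box-Cox transformation for a fixed lambda (positive data only).
box_cox_manual <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) {
    log(x)                      # limiting case as lambda approaches 0
  } else {
    (x^lambda - 1) / lambda
  }
}

# Hypothetical average ratings.
x <- c(52.5, 72, 75, 91)

bc_log  <- box_cox_manual(x, 0)     # same as log(x)
bc_half <- box_cox_manual(x, 0.5)   # square-root-type transformation
bc_one  <- box_cox_manual(x, 1)     # x - 1: a shift, shape unchanged

cbind(x, bc_log, bc_half, bc_one)
```

Values of λ below 1 compress the upper end of the scale and values above 1 stretch it, so the λ that was selected indicates which direction of skew the transformation is correcting.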