Install and load the necessary packages to reproduce the report here:
library(readr) # Useful for importing data
## Warning: package 'readr' was built under R version 4.1.1
library(foreign) # Useful for importing SPSS, SAS, STATA etc. data files
library(rvest) # Useful for scraping HTML data
##
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
##
## guess_encoding
library(knitr) # Useful for creating nice tables
library(magrittr)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v dplyr 1.0.7
## v tibble 3.1.2 v stringr 1.4.0
## v tidyr 1.1.3 v forcats 0.5.1
## v purrr 0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x tidyr::extract() masks magrittr::extract()
## x dplyr::filter() masks stats::filter()
## x rvest::guess_encoding() masks readr::guess_encoding()
## x dplyr::lag() masks stats::lag()
## x purrr::set_names() masks magrittr::set_names()
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
The data used in this assessment was sourced from https://www.kaggle.com
# Importing the data.
netflixOriginals <- read_csv("E:/RMIT/NetflixOriginals.csv")
## Rows: 584 Columns: 6
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (4): Title, Genre, Premiere, Language
## dbl (2): Runtime, IMDB Score
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# viewing the data set
glimpse(netflixOriginals)
## Rows: 584
## Columns: 6
## $ Title <chr> "Enter the Anime", "Dark Forces", "The App", "The Open Ho~
## $ Genre <chr> "Documentary", "Thriller", "Science fiction/Drama", "Horr~
## $ Premiere <chr> "August 5, 2019", "August 21, 2020", "December 26, 2019",~
## $ Runtime <dbl> 58, 81, 79, 94, 90, 147, 112, 149, 73, 139, 58, 112, 97, ~
## $ `IMDB Score` <dbl> 2.5, 2.6, 2.6, 3.2, 3.4, 3.5, 3.7, 3.7, 3.9, 4.1, 4.1, 4.~
## $ Language <chr> "English/Japanese", "Spanish", "Italian", "English", "Hin~
head(netflixOriginals)
In this step, the data set was imported into RStudio with the read_csv() function and saved as a data frame assigned to the variable “netflixOriginals”.
Once the data was imported, the glimpse() and head() functions were used to inspect the size of the data set and the data types of the variables.
Overall the data is tidy and needs only a few changes to get it ready for analysis.
The data was sourced online from Kaggle. Available at: https://www.kaggle.com/luiscorter/netflix-original-films-imdb-scores?select=NetflixOriginals.csv
The data set contains 6 variables:
Title: Names of the Netflix original films
Genre: Genre of the films
Premiere: Date the film aired
Runtime: Length of the film in minutes
IMDB Score: The score is on a scale of 1-10, 1 being poor and 10 being excellent
Language: The language spoken in the film (some films have multiple languages spoken in them)
Explaining in order step by step:
Check dimensions: To check the dimensions I used two different functions. dim() provides a basic description, whereas glimpse() is more detailed.
Inspect/convert data types: After inspecting the data set I decided to make “Language” and “Genre” factor variables. Using the functions mutate(), as.factor() and case_when() achieved the desired result. When changing the “Premiere” variable, the functions mutate() and as.Date() were used.
Check levels: To inspect the levels of the factor variables I used the levels() function. No rearrangement or renaming was needed.
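Check column names: colnames() showed that the converted date had been added as a seventh column with an auto-generated name, so I renamed it “Premiere” and dropped the original character column.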
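As a minimal sketch of the recoding described under “Inspect/convert data types”, the example below applies the same case_when() pattern to a small hand-typed vector rather than the full data set. The object names `language` and `language_group` are hypothetical, and dplyr is assumed to be installed. Any value not matched by an earlier condition falls through to the final `TRUE ~ "Other"` branch.

```r
library(dplyr)

# Hand-typed labels for illustration only (not read from the data set).
language <- c("English", "English/Hindi", "Spanish/Basque", "Thai")

# Conditions are checked top to bottom; the first match wins.
language_group <- as.factor(case_when(
  language %in% c("English", "English/Hindi") ~ "English",
  language %in% c("Spanish", "Spanish/Basque") ~ "Spanish",
  TRUE ~ "Other"  # catch-all for every label not listed above
))

levels(language_group)
table(language_group)
```

Because as.factor() sorts levels alphabetically, “Other” ends up between the named groups rather than at the end.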
#check the dimensions of the data frame(two options)
dim(netflixOriginals)
## [1] 584 6
##or
glimpse(netflixOriginals)
## Rows: 584
## Columns: 6
## $ Title <chr> "Enter the Anime", "Dark Forces", "The App", "The Open Ho~
## $ Genre <chr> "Documentary", "Thriller", "Science fiction/Drama", "Horr~
## $ Premiere <chr> "August 5, 2019", "August 21, 2020", "December 26, 2019",~
## $ Runtime <dbl> 58, 81, 79, 94, 90, 147, 112, 149, 73, 139, 58, 112, 97, ~
## $ `IMDB Score` <dbl> 2.5, 2.6, 2.6, 3.2, 3.4, 3.5, 3.7, 3.7, 3.9, 4.1, 4.1, 4.~
## $ Language <chr> "English/Japanese", "Spanish", "Italian", "English", "Hin~
#check the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the
#data set. If variables are not in the correct data type, apply proper type conversions.
##make Language variable a factor with levels
main_data <- netflixOriginals %>%
mutate(Language = as.factor(case_when(
Language %in% c('Bengali' ) ~ 'Bengali',
Language %in% c('Dutch') ~ 'Dutch',
Language %in% c('English','English/Akan','English/Arabic','English/Hindi','English/Japanese',
'English/Korean',
'English/Mandarin', 'English/Russian', 'English/Spanish', 'English/Taiwanese/Mandarin','English/Ukranian/Russian')~ 'English',
Language %in% c('Filipino') ~ 'Filipino',
Language %in% c('French', 'Khmer/English/French') ~ 'French',
Language %in% c('German') ~ 'German',
Language %in% c('Hindi') ~ 'Hindi',
Language %in% c('Indonesian') ~ 'Indonesian',
Language %in% c('Italian') ~ 'Italian',
Language %in% c('Japanese') ~ 'Japanese',
Language %in% c('Korean') ~ 'Korean',
Language %in% c('Portuguese') ~ 'Portuguese',
Language %in% c('Spanish', 'Spanish/Basque', 'Spanish/Catalan', 'Spanish/English') ~ 'Spanish',
TRUE ~ 'Other' )))
## Make Genre variable a factor with levels
main_data%<>%
mutate(Genre = as.factor(case_when(
Genre %in% c('Action', 'Action-adventure', 'Action-thriller', 'Action comedy', 'Action thriller', 'Action/Comedy', 'Action/Science fiction') ~ 'Action',
Genre %in% c('Adventure', 'Adventure-romance', 'Adventure/Comedy') ~'Adventure',
Genre %in% c('Aftershow / Interview') ~ 'Interview',
Genre %in% c('Animated musical comedy', 'Animation', 'Animation / Comedy', 'Animation / Musicial',
'Animation / Science Fiction', 'Animation / Short','Animation/Christmas/Comedy/Adventure', 'Animation/Comedy/Adventure', 'Animation/Musical/Adventure', 'Animation/Superhero') ~ 'Animation',
Genre %in% c('Anime / Short', 'Anime/Fantasy', 'Anime/Science fiction') ~ 'Anime',
Genre %in% c('Biopic') ~ 'Biopic',
Genre %in% c('Black comedy', 'Christmas comedy', 'Comedy', 'Comedy-drama', 'Comedy / Musical', 'Comedy horror', 'Comedy mystery', 'Comedy/Fantasy/Family',
'Comedy/Horror', 'Coming-of-age comedy-drama', 'Dance comedy', 'Dark comedy','Anthology/Dark comedy') ~ 'Comedy',
Genre %in% c('Crime drama', 'Crime thriller') ~ 'Crime',
Genre %in% c('Documentary') ~ 'Documentary',
Genre %in% c('Drama', 'Drama-Comedy', 'Drama / Short','Drama/Horror', 'Romantic drama') ~ 'Drama',
Genre %in% c('Horror', 'Horror-thriller','Horror anthology','Horror comedy','Horror thriller','Horror/Crime drama') ~ 'Horror',
Genre %in% c('Science fiction', 'Science fiction adventure', 'Science fiction thriller', 'Science fiction/Action', 'Science fiction/Drama', 'Science fiction/Mystery',
'Science fiction/Thriller') ~ 'Science fiction',
Genre %in% c('Thriller') ~ 'Thriller',
TRUE ~ 'Other')))
## Changing "Premiere" to date type variable
tidy_data<-main_data%>%mutate (as.Date(main_data$Premiere, format = "%B %d, %Y" ))
# check the levels of factor variables, rename/rearrange them if required.
## The factor variables have names that make sense and do not require rearrangement as they
#are qualitative variables.
levels(tidy_data$Genre)
## [1] "Action" "Adventure" "Animation" "Anime"
## [5] "Biopic" "Comedy" "Crime" "Documentary"
## [9] "Drama" "Horror" "Interview" "Other"
## [13] "Science fiction" "Thriller"
levels(tidy_data$Language)
## [1] "Bengali" "Dutch" "English" "Filipino" "French"
## [6] "German" "Hindi" "Indonesian" "Italian" "Japanese"
## [11] "Korean" "Other" "Portuguese" "Spanish"
#check the column names in the data frame, rename them if required.
colnames(tidy_data)
## [1] "Title"
## [2] "Genre"
## [3] "Premiere"
## [4] "Runtime"
## [5] "IMDB Score"
## [6] "Language"
## [7] "as.Date(main_data$Premiere, format = \"%B %d, %Y\")"
##rename column 7
colnames(tidy_data)[7] <- "Premiere"
## Drop column 3
tidy_data<- subset(tidy_data, select = -c(3) )
colnames(tidy_data)
## [1] "Title" "Genre" "Runtime" "IMDB Score" "Language"
## [6] "Premiere"
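The rename and drop steps above are needed because the converted date was added as a new, unnamed column. As a hedged alternative sketch, naming the result inside mutate() overwrites the character column in place. The data frame `toy` and its contents are made up for illustration, and dplyr is assumed to be installed.

```r
library(dplyr)

# Made-up stand-in for main_data; titles and dates are illustrative only.
toy <- data.frame(
  Title = c("Film A", "Film B"),
  Premiere = c("August 5, 2019", "December 26, 2019"),
  stringsAsFactors = FALSE
)

# Assigning to the existing name replaces the column, so the column count
# stays the same and no rename or drop step is required.
# Note: %B matches full month names in the current locale (English assumed here).
toy_dates <- toy %>%
  mutate(Premiere = as.Date(Premiere, format = "%B %d, %Y"))

str(toy_dates)
```

With this approach the column order is also preserved, whereas the rename-and-drop route moves “Premiere” to the last position.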
The data set I have chosen conforms to the tidy data principles: each variable has its own column and each observation has its own row.
After cleaning the data, which involved changing the types of three variables, the data is ready for statistical analysis.
In this section summary statistics are provided for the “Runtime” of each “Genre”.
## Creates a data frame grouped by Genre that holds several named summary statistics from one summarise() call.
sumstat <- tidy_data %>% group_by(Genre) %>%
  summarise(mean = mean(Runtime), median = median(Runtime), min = min(Runtime), max = max(Runtime), sd = sd(Runtime))
sumstat
In this section a one-column data frame (“my_list”) was created and an ID number was assigned to each “Genre”.
## Creates a Genre ID for each response
my_list<- tidy_data %>% select(Genre)
my_list<- my_list %>% mutate("Genre ID" = as.numeric(Genre))
my_list
The “Genre ID” from the previous step has been added to a copy of the full data set (“tidy_data1”) so that each observation carries its “Genre ID”.
tidy_data1 <- tidy_data
my_join<- tidy_data1 %>% mutate("Genre ID" = as.numeric(Genre))
my_join
str(my_join)
## tibble [584 x 7] (S3: tbl_df/tbl/data.frame)
## $ Title : chr [1:584] "Enter the Anime" "Dark Forces" "The App" "The Open House" ...
## $ Genre : Factor w/ 14 levels "Action","Adventure",..: 8 14 13 10 12 1 6 12 12 6 ...
## $ Runtime : num [1:584] 58 81 79 94 90 147 112 149 73 139 ...
## $ IMDB Score: num [1:584] 2.5 2.6 2.6 3.2 3.4 3.5 3.7 3.7 3.9 4.1 ...
## $ Language : Factor w/ 14 levels "Bengali","Dutch",..: 3 14 9 3 7 7 12 3 3 7 ...
## $ Premiere : Date[1:584], format: "2019-08-05" "2020-08-21" ...
## $ Genre ID : num [1:584] 8 14 13 10 12 1 6 12 12 6 ...
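The “Genre ID” values come from as.numeric() applied to a factor, which returns the position of each value in the factor's levels rather than anything stored in the data. A minimal base R sketch on a made-up vector (the name `genre` is hypothetical) shows the mapping:

```r
# Made-up genre labels for illustration.
genre <- factor(c("Documentary", "Action", "Thriller", "Action"))

# Levels are sorted alphabetically by default.
levels(genre)

# Each value is replaced by the index of its level.
genre_id <- as.numeric(genre)

data.frame(genre, genre_id)
```

This is why “Documentary” is 8 and “Thriller” is 14 in the str() output above: they are the 8th and 14th levels of “Genre”. The IDs would change if the levels were reordered.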
After subsetting the first 10 observations and converting that data frame into a matrix, all the variables were coerced to the character data type. A matrix can hold only one data type, and character is the only type that can represent the text, numeric, factor and date columns together.
## creates a data frame with the first 10 observations
top_10<-head(tidy_data, n=10)
## converts data frame into a character matrix.
my_matrix<- as.matrix(top_10)
str(my_matrix)
## chr [1:10, 1:6] "Enter the Anime" "Dark Forces" "The App" "The Open House" ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:6] "Title" "Genre" "Runtime" "IMDB Score" ...
head(my_matrix)
## Title Genre Runtime IMDB Score Language
## [1,] "Enter the Anime" "Documentary" " 58" "2.5" "English"
## [2,] "Dark Forces" "Thriller" " 81" "2.6" "Spanish"
## [3,] "The App" "Science fiction" " 79" "2.6" "Italian"
## [4,] "The Open House" "Horror" " 94" "3.2" "English"
## [5,] "Kaali Khuhi" "Other" " 90" "3.4" "Hindi"
## [6,] "Drive" "Action" "147" "3.5" "Hindi"
## Premiere
## [1,] "2019-08-05"
## [2,] "2020-08-21"
## [3,] "2019-12-26"
## [4,] "2018-01-19"
## [5,] "2020-10-30"
## [6,] "2019-11-01"
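The coercion seen above can be reproduced on a much smaller case. The sketch below uses a made-up two-column data frame (`mixed` is a hypothetical name) to show that a mixed-type data frame becomes a character matrix, while a numeric-only selection stays numeric.

```r
# Made-up data frame with one character and one numeric column.
mixed <- data.frame(
  title = c("A", "B"),
  runtime = c(58, 81),
  stringsAsFactors = FALSE
)

# A matrix holds a single type, so the numeric column is converted to text.
mixed_type <- typeof(as.matrix(mixed))

# Keeping only the numeric column avoids the coercion.
numeric_type <- typeof(as.matrix(mixed["runtime"]))

mixed_type
numeric_type
```

If the numeric values are needed for later calculations, converting only the numeric columns to a matrix is one way to keep them as numbers.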
In this section the first and last observations of the data set were subset and saved as an .RData file.
## Creates a data frame with the first and last observation
my_subsetII <- tidy_data[c(1,584), c(1:6)]
## saves the data frame as an .RData file
save(my_subsetII, file = "my_subsetII.RData")
my_subsetII
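To check that a saved .RData file can be read back, a round trip with load() is one option. The sketch below uses a made-up data frame (`subset_demo`) and a temporary file path, both hypothetical, so it does not touch the file written above.

```r
# Made-up two-row data frame standing in for my_subsetII.
subset_demo <- data.frame(row = c(1, 584), label = c("first", "last"))

# Write to a temporary directory so the working directory is left untouched.
demo_path <- file.path(tempdir(), "subset_demo.RData")
save(subset_demo, file = demo_path)

# Keep a copy for comparison, remove the original, then restore it from disk.
original <- subset_demo
rm(subset_demo)
restored_names <- load(demo_path)  # load() returns the names of the restored objects

restored_names
identical(subset_demo, original)
```

Note that load() restores objects under the names they were saved with, so it can silently overwrite an existing object of the same name.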
Create a data frame with 2 variables. Your data frame has to contain one integer variable and one ordinal variable.
The ordinal variable has to be a factor and ordered properly. Make sure you name your variables.
Show the structure of your variables and the levels of the ordinal variable.
Create another numeric vector and use cbind() to add this vector to your data frame.
After this step you should have 3 variables in the data frame.
Check the attributes and the dimension of your new data frame.
Provide the R codes with outputs and explain everything that you do in this step.
## Age variable
respondent_age <- c(18,23,27,35)
##factor variable High school(HS), Bachelor Degree(BD), masters degree(MD) Doctoral degree (PHD)
educational_level <-as.factor(c("High School", "Bachelor Degree", "Masters Degree", "Doctoral Degree"))
## creates new data frame
new_df <- data.frame(respondent_age, educational_level)
## respondent id variable
respondent_id <-c(1,2,3,4)
##binds the respondent_id vector to the data frame as a third column
new_df_1 <- cbind(new_df,respondent_id)
str(new_df)
## 'data.frame': 4 obs. of 2 variables:
## $ respondent_age : num 18 23 27 35
## $ educational_level: Factor w/ 4 levels "Bachelor Degree",..: 3 1 4 2
dim(new_df)
## [1] 4 2
head(new_df)
str(new_df_1)
## 'data.frame': 4 obs. of 3 variables:
## $ respondent_age : num 18 23 27 35
## $ educational_level: Factor w/ 4 levels "Bachelor Degree",..: 3 1 4 2
## $ respondent_id : num 1 2 3 4
dim(new_df_1)
## [1] 4 3
head(new_df_1)
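In the output above, respondent_age is stored as numeric (“num”) and educational_level is an unordered factor whose levels are alphabetical. As a hedged sketch of how an integer variable and a properly ordered factor could be built instead, the example below uses base R only. The names `respondent_age_int`, `education_ordered` and `ordered_df` are hypothetical and are not used elsewhere in this report.

```r
# The L suffix stores the ages as integers rather than doubles.
respondent_age_int <- c(18L, 23L, 27L, 35L)

# Listing the levels explicitly, lowest to highest, with ordered = TRUE
# makes the factor ordinal instead of alphabetical.
education_levels <- c("High School", "Bachelor Degree",
                      "Masters Degree", "Doctoral Degree")
education_ordered <- factor(education_levels,
                            levels = education_levels,
                            ordered = TRUE)

ordered_df <- data.frame(respondent_age_int, education_ordered)

str(ordered_df)
levels(ordered_df$education_ordered)
attributes(ordered_df)

# Comparisons are only meaningful for ordered factors.
ordered_df$education_ordered[1] < ordered_df$education_ordered[4]
```

The final comparison asks whether “High School” ranks below “Doctoral Degree”, which only works because the factor is ordered.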
In this section I created a “sex” variable and used the function cbind() to bind it to the data frame from the previous step.
## creates sex variable
sex <- as.factor(c("Male", "Female", "Female", "Male"))
## Binds the sex variable to the data frame from the previous step
step12_df <- cbind(new_df_1, sex)
glimpse(step12_df)
## Rows: 4
## Columns: 4
## $ respondent_age <dbl> 18, 23, 27, 35
## $ educational_level <fct> High School, Bachelor Degree, Masters Degree, Doctor~
## $ respondent_id <dbl> 1, 2, 3, 4
## $ sex <fct> Male, Female, Female, Male
head(step12_df)