MATH2405: Data Wrangling

Setup

Install and load the necessary packages to reproduce the report here:

library(readr) # Useful for importing data

## Warning: package 'readr' was built under R version 4.1.1

library(foreign) # Useful for importing SPSS, SAS, STATA etc. data files
library(rvest) # Useful for scraping HTML data

## 
## Attaching package: 'rvest'

## The following object is masked from 'package:readr':
## 
##     guess_encoding

library(knitr) # Useful for creating nice tables
library(magrittr)
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v dplyr   1.0.7
## v tibble  3.1.2     v stringr 1.4.0
## v tidyr   1.1.3     v forcats 0.5.1
## v purrr   0.3.4

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x tidyr::extract()        masks magrittr::extract()
## x dplyr::filter()         masks stats::filter()
## x rvest::guess_encoding() masks readr::guess_encoding()
## x dplyr::lag()            masks stats::lag()
## x purrr::set_names()      masks magrittr::set_names()

library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

## The following object is masked from 'package:purrr':
## 
##     transpose

Locate Data

The data used in this assessment was sourced from https://www.kaggle.com

Read/Import Data

# Importing the data.

 netflixOriginals <- read_csv("E:/RMIT/NetflixOriginals.csv")

## Rows: 584 Columns: 6

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (4): Title, Genre, Premiere, Language
## dbl (2): Runtime, IMDB Score

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

# viewing the data set
 glimpse(netflixOriginals)

## Rows: 584
## Columns: 6
## $ Title        <chr> "Enter the Anime", "Dark Forces", "The App", "The Open Ho~
## $ Genre        <chr> "Documentary", "Thriller", "Science fiction/Drama", "Horr~
## $ Premiere     <chr> "August 5, 2019", "August 21, 2020", "December 26, 2019",~
## $ Runtime      <dbl> 58, 81, 79, 94, 90, 147, 112, 149, 73, 139, 58, 112, 97, ~
## $ `IMDB Score` <dbl> 2.5, 2.6, 2.6, 3.2, 3.4, 3.5, 3.7, 3.7, 3.9, 4.1, 4.1, 4.~
## $ Language     <chr> "English/Japanese", "Spanish", "Italian", "English", "Hin~

 head(netflixOriginals)

In this step, the data set was imported into RStudio using the read_csv() function, which returns a data frame (tibble), and assigned to the variable “netflixOriginals”.

Once the data was imported, the glimpse() and head() functions were used to inspect the size of the data set and the data types of the variables.

Overall the data is tidy and needs only a few changes to get it ready for analysis:

“Genre” and “Language” will become factor variables instead of character variables
“Premiere” will become a date variable instead of a character variable
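
As a rough sketch, the “Premiere” conversion could also be declared at import time through the col_types argument of read_csv(). The two rows below are copied from the glimpse() output above and passed as inline text, so the snippet runs without the CSV file. The names sample_csv and sample_films are illustrative only, and this is an alternative rather than the approach used in this report.

```r
library(readr)

# Two sample rows taken from the glimpse() output; I() marks the string as literal data
sample_csv <- I('Title,Genre,Premiere,Runtime,IMDB Score,Language
Enter the Anime,Documentary,"August 5, 2019",58,2.5,English/Japanese
Dark Forces,Thriller,"August 21, 2020",81,2.6,Spanish
')

# Declare every column type up front; Premiere is parsed straight into a Date
sample_films <- read_csv(
  sample_csv,
  col_types = cols(
    Title        = col_character(),
    Genre        = col_character(),
    Premiere     = col_date(format = "%B %d, %Y"),
    Runtime      = col_double(),
    `IMDB Score` = col_double(),
    Language     = col_character()
  )
)

class(sample_films$Premiere)
```

Declaring the types at import keeps the date parsing in one place, although the factor grouping for “Genre” and “Language” would still need a separate step.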

Data description

The data was sourced online from Kaggle. Available at: https://www.kaggle.com/luiscorter/netflix-original-films-imdb-scores?select=NetflixOriginals.csv

The data set contains 6 variables:

Title: Names of the Netflix original films
Genre: Genre of the films
Premiere: Date the film aired
Runtime: Length of the film in minutes
IMDB Score: The film's IMDB score on a scale of 1-10, 1 being poor and 10 being excellent
Language: The language spoken in the film (some films have multiple languages spoken in them)

Inspect dataset and variables

The steps, in order:

Check dimensions: I checked the dimensions with two functions. dim() returns only the number of rows and columns, whereas glimpse() also shows each variable's type and its first few values.
Inspect/convert data types: After inspecting the data set I converted “Language” and “Genre” to factor variables using mutate(), as.factor() and case_when(), which also groups the multi-language and multi-genre entries under a single level. “Premiere” was converted to a date using mutate() and as.Date().
Check levels: I inspected the levels of the factor variables with levels(). No rearrangement or renaming was needed.
Check column names: colnames() showed that the new date column had been named after the conversion expression, so I renamed it “Premiere” and dropped the original character column.

#check the dimensions of the data frame(two options)
dim(netflixOriginals)

## [1] 584   6

##or
glimpse(netflixOriginals)

## Rows: 584
## Columns: 6
## $ Title        <chr> "Enter the Anime", "Dark Forces", "The App", "The Open Ho~
## $ Genre        <chr> "Documentary", "Thriller", "Science fiction/Drama", "Horr~
## $ Premiere     <chr> "August 5, 2019", "August 21, 2020", "December 26, 2019",~
## $ Runtime      <dbl> 58, 81, 79, 94, 90, 147, 112, 149, 73, 139, 58, 112, 97, ~
## $ `IMDB Score` <dbl> 2.5, 2.6, 2.6, 3.2, 3.4, 3.5, 3.7, 3.7, 3.9, 4.1, 4.1, 4.~
## $ Language     <chr> "English/Japanese", "Spanish", "Italian", "English", "Hin~

#check the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the 
#data set. If variables are not in the correct data type, apply proper type conversions.

##make Language variable a factor with levels

main_data <- netflixOriginals %>% 
  mutate(Language = as.factor(case_when(
    
    Language %in% c('Bengali' ) ~ 'Bengali',
    
    Language %in% c('Dutch') ~ 'Dutch',
    
    Language %in% c('English', 'English/Akan', 'English/Arabic', 'English/Hindi',
                    'English/Japanese', 'English/Korean', 'English/Mandarin',
                    'English/Russian', 'English/Spanish', 'English/Taiwanese/Mandarin',
                    'English/Ukranian/Russian') ~ 'English',

    Language %in% c('Filipino') ~ 'Filipino',
    
    Language %in% c('French', 'Khmer/English/French') ~ 'French',
    
    Language %in% c('German') ~ 'German',
     
    Language %in% c('Hindi') ~ 'Hindi',
    
    Language %in% c('Indonesian') ~ 'Indonesian',
    
    Language %in% c('Italian') ~ 'Italian',
    
    Language %in% c('Japanese') ~ 'Japanese',
    
    Language %in% c('Korean') ~ 'Korean', 
    
    Language %in% c('Portuguese') ~ 'Portuguese',
    
    Language %in% c('Spanish', 'Spanish/Basque', 'Spanish/Catalan', 'Spanish/English') ~ 'Spanish',
    
    TRUE ~ 'Other' )))

## Make Genre variable a factor with levels 

 main_data%<>% 
  mutate(Genre = as.factor(case_when(
    Genre %in% c('Action', 'Action-adventure', 'Action-thriller', 'Action comedy', 'Action thriller', 'Action/Comedy', 'Action/Science fiction') ~ 'Action',
    
    Genre %in% c('Adventure', 'Adventure-romance', 'Adventure/Comedy') ~'Adventure',
    
    Genre %in% c('Aftershow / Interview') ~ 'Interview',
    
    Genre %in% c('Animated musical comedy', 'Animation', 'Animation / Comedy', 'Animation / Musicial',
'Animation / Science Fiction', 'Animation / Short','Animation/Christmas/Comedy/Adventure', 'Animation/Comedy/Adventure', 'Animation/Musical/Adventure', 'Animation/Superhero') ~ 'Animation',
    
    Genre %in% c('Anime / Short', 'Anime/Fantasy', 'Anime/Science fiction') ~ 'Anime',
    
    Genre %in% c('Biopic') ~ 'Biopic',
    
    Genre %in% c('Black comedy', 'Christmas comedy', 'Comedy', 'Comedy-drama', 'Comedy / Musical', 'Comedy horror', 'Comedy mystery', 'Comedy/Fantasy/Family',
                 'Comedy/Horror', 'Coming-of-age comedy-drama', 'Dance comedy', 'Dark comedy','Anthology/Dark comedy') ~ 'Comedy',
    
   Genre %in% c('Crime drama', 'Crime thriller') ~ 'Crime',
    
    Genre %in% c('Documentary') ~ 'Documentary',
    
    Genre %in% c('Drama', 'Drama-Comedy', 'Drama / Short','Drama/Horror', 'Romantic drama') ~ 'Drama',  
    
    
    Genre %in% c('Horror', 'Horror-thriller','Horror anthology','Horror comedy','Horror thriller','Horror/Crime drama') ~ 'Horror',
    
    
    Genre %in% c('Science fiction', 'Science fiction adventure', 'Science fiction thriller', 'Science fiction/Action', 'Science fiction/Drama', 'Science fiction/Mystery',
                 
                 'Science fiction/Thriller') ~ 'Science fiction',
    
   Genre %in% c('Thriller') ~ 'Thriller',
    
    TRUE ~ 'Other')))
 
## Convert "Premiere" to a Date. The unnamed mutate() expression is added as a
## seventh column, which is renamed further down.

tidy_data <- main_data %>% mutate(as.Date(main_data$Premiere, format = "%B %d, %Y"))

 
# check the levels of factor variables, rename/rearrange them if required.

## The factor levels have names that make sense and do not need rearranging because
## they are nominal (unordered) variables.
levels(tidy_data$Genre)

##  [1] "Action"          "Adventure"       "Animation"       "Anime"          
##  [5] "Biopic"          "Comedy"          "Crime"           "Documentary"    
##  [9] "Drama"           "Horror"          "Interview"       "Other"          
## [13] "Science fiction" "Thriller"

levels(tidy_data$Language)

##  [1] "Bengali"    "Dutch"      "English"    "Filipino"   "French"    
##  [6] "German"     "Hindi"      "Indonesian" "Italian"    "Japanese"  
## [11] "Korean"     "Other"      "Portuguese" "Spanish"
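
The grouping done with case_when() above could also be sketched with fct_collapse() from forcats, which merges named sets of factor levels. This is a hedged alternative on a small sample vector, not the method used in this report. The values come from the data set, while the names language and language_group are hypothetical.

```r
library(forcats)

# A few Language values from the data set, as a factor
language <- factor(c("English/Japanese", "Spanish", "Italian",
                     "English", "Hindi", "Spanish/Basque"))

# Merge the listed levels into one group each; unlisted levels become "Other"
language_group <- fct_collapse(
  language,
  English = c("English", "English/Japanese"),
  Spanish = c("Spanish", "Spanish/Basque"),
  other_level = "Other"
)

table(language_group)
```

Unlike case_when(), fct_collapse() works on an existing factor and warns about level names that do not occur in the data, which helps catch spelling mismatches.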

#check the column names in the data frame, rename them if required.


colnames(tidy_data)

## [1] "Title"                                              
## [2] "Genre"                                              
## [3] "Premiere"                                           
## [4] "Runtime"                                            
## [5] "IMDB Score"                                         
## [6] "Language"                                           
## [7] "as.Date(main_data$Premiere, format = \"%B %d, %Y\")"

##rename column 7

colnames(tidy_data)[7] <- "Premiere"

## Drop column 3
tidy_data<- subset(tidy_data, select = -c(3) )

colnames(tidy_data)

## [1] "Title"      "Genre"      "Runtime"    "IMDB Score" "Language"  
## [6] "Premiere"
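
The rename-and-drop steps above can be avoided by naming the column inside mutate(). The following is a minimal sketch on sample rows, assuming dplyr 1.0.0 or later for relocate(). The objects sample_films and converted are illustrative names only.

```r
library(dplyr)

# Sample rows with Premiere still stored as character
sample_films <- tibble(
  Title    = c("Enter the Anime", "Dark Forces"),
  Premiere = c("August 5, 2019", "August 21, 2020"),
  Runtime  = c(58, 81)
)

converted <- sample_films %>%
  # Overwrite Premiere in place; %B relies on English month names in the session locale
  mutate(Premiere = as.Date(Premiere, format = "%B %d, %Y")) %>%
  # Optional: move Premiere to the end to mirror the column order used in this report
  relocate(Premiere, .after = last_col())

str(converted)
```

Because the new values are assigned to the existing name, no extra column is created and nothing needs to be renamed or dropped afterwards.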

Tidy data

The data set conforms to the tidy data principles: each variable has its own column and each observation has its own row.

After cleaning the data, which involved changing the types of three variables, the data is ready for statistical analysis.

Summary statistics

In this section a summary of statistics has been provided for the “Runtime” of each “Genre”.

## Groups the data by Genre and computes several named summary statistics of Runtime inside one summarise() call.

sumstat <- tidy_data %>%
  group_by(Genre) %>%
  summarise(mean_runtime = mean(Runtime),
            median_runtime = median(Runtime),
            min_runtime = min(Runtime),
            max_runtime = max(Runtime),
            sd_runtime = sd(Runtime))

sumstat

Create a list

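In this section a one-column data frame (stored as “my_list”) was created, and a numeric ID was assigned to each level of “Genre”. The ID is the integer code that R stores for each factor level.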

## Creates a Genre ID for each response 
my_list<- tidy_data %>% select(Genre)

my_list<- my_list %>% mutate("Genre ID" = as.numeric(Genre))


my_list

Join the list

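The “Genre ID” from the previous step has been added to “tidy_data1”, a copy of the tidy data, so that every observation carries a “Genre ID”. The ID is recomputed with mutate() rather than merged with a join function.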

tidy_data1 <- tidy_data

my_join<- tidy_data1 %>% mutate("Genre ID" = as.numeric(Genre))

my_join

str(my_join)

## tibble [584 x 7] (S3: tbl_df/tbl/data.frame)
##  $ Title     : chr [1:584] "Enter the Anime" "Dark Forces" "The App" "The Open House" ...
##  $ Genre     : Factor w/ 14 levels "Action","Adventure",..: 8 14 13 10 12 1 6 12 12 6 ...
##  $ Runtime   : num [1:584] 58 81 79 94 90 147 112 149 73 139 ...
##  $ IMDB Score: num [1:584] 2.5 2.6 2.6 3.2 3.4 3.5 3.7 3.7 3.9 4.1 ...
##  $ Language  : Factor w/ 14 levels "Bengali","Dutch",..: 3 14 9 3 7 7 12 3 3 7 ...
##  $ Premiere  : Date[1:584], format: "2019-08-05" "2020-08-21" ...
##  $ Genre ID  : num [1:584] 8 14 13 10 12 1 6 12 12 6 ...
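
For comparison, the same result could be reached with an explicit join. The sketch below builds a small lookup table of genre levels and attaches it with left_join(). The sample rows and the names films, genre_lookup and joined are hypothetical, and the IDs apply to this sample only, so they differ from the IDs in the full data set.

```r
library(dplyr)

# Sample rows; Genre is a factor, as in tidy_data
films <- tibble(
  Title = c("Enter the Anime", "Dark Forces", "The App"),
  Genre = factor(c("Documentary", "Thriller", "Science fiction"))
)

# Lookup table: one row per factor level, numbered in level order
genre_lookup <- tibble(
  Genre      = factor(levels(films$Genre), levels = levels(films$Genre)),
  `Genre ID` = seq_along(levels(films$Genre))
)

# Attach the ID to every film by matching on Genre
joined <- left_join(films, genre_lookup, by = "Genre")

joined
```

The join gives the same IDs as as.numeric(Genre) but keeps the mapping in a separate table that can be inspected or reused.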

Subsetting I

After subsetting the first 10 observations and converting that data frame to a matrix, all the variables were coerced to the character data type. A matrix can hold only one data type, and because the data frame contains character columns, the numeric, factor and date values were converted to character as well.

## creates a data frame with the first 10 observations 
top_10<-head(tidy_data, n=10)


## converts data frame into a character matrix.
my_matrix<- as.matrix(top_10)

str(my_matrix)

##  chr [1:10, 1:6] "Enter the Anime" "Dark Forces" "The App" "The Open House" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:6] "Title" "Genre" "Runtime" "IMDB Score" ...

head(my_matrix)

##      Title             Genre             Runtime IMDB Score Language 
## [1,] "Enter the Anime" "Documentary"     " 58"   "2.5"      "English"
## [2,] "Dark Forces"     "Thriller"        " 81"   "2.6"      "Spanish"
## [3,] "The App"         "Science fiction" " 79"   "2.6"      "Italian"
## [4,] "The Open House"  "Horror"          " 94"   "3.2"      "English"
## [5,] "Kaali Khuhi"     "Other"           " 90"   "3.4"      "Hindi"  
## [6,] "Drive"           "Action"          "147"   "3.5"      "Hindi"  
##      Premiere    
## [1,] "2019-08-05"
## [2,] "2020-08-21"
## [3,] "2019-12-26"
## [4,] "2018-01-19"
## [5,] "2020-10-30"
## [6,] "2019-11-01"
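
The coercion can be seen on a smaller example. In this sketch the column name Score is a simplified stand-in for “IMDB Score”, and top_sample is a hypothetical data frame. Selecting only the numeric columns before the conversion keeps the matrix numeric.

```r
# Sample rows with one character column and two numeric columns
top_sample <- data.frame(
  Title   = c("Enter the Anime", "Dark Forces"),
  Runtime = c(58, 81),
  Score   = c(2.5, 2.6)
)

# Mixed column types: every value is coerced to character
char_matrix <- as.matrix(top_sample)
typeof(char_matrix)

# Numeric columns only: the matrix stays numeric
num_matrix <- as.matrix(top_sample[, c("Runtime", "Score")])
typeof(num_matrix)
```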

Subsetting II

In this section the first and last observations of the data set were subset and saved as an .RData file.

## Creates a data frame with the first and last observation 
my_subsetII <- tidy_data[c(1,584), c(1:6)]

## saves the data frame as a .RData file

save(my_subsetII, file = "my_subsetII.RData")

my_subsetII
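
As a quick check that the file can be read back, the sketch below saves a subset to a temporary file and reloads it with load(). The data frame films is a hypothetical sample, and nrow() is used instead of a hard-coded row number so the last row is picked whatever the size of the data.

```r
# Hypothetical sample data
films <- data.frame(
  Title   = c("Enter the Anime", "Dark Forces", "The App"),
  Runtime = c(58, 81, 79)
)

# First and last observation, all columns
first_last <- films[c(1, nrow(films)), ]

# Save to a temporary .RData file
rdata_path <- tempfile(fileext = ".RData")
save(first_last, file = rdata_path)

# load() restores objects under their saved names; a separate environment avoids overwriting
restored <- new.env()
load(rdata_path, envir = restored)

identical(restored$first_last, first_last)
```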

Create a new Data Frame

Create a data frame with 2 variables. Your data frame has to contain one integer variable and one ordinal variable.

The ordinal variable has to be a factor and ordered properly. Make sure you name your variables.
Show the structure of your variables and the levels of the ordinal variable.
Create another numeric vector and use cbind() to add this vector to your data frame.
After this step you should have 3 variables in the data frame.
Check the attributes and the dimension of your new data frame.
Provide the R codes with outputs and explain everything that you do in this step.

## Age variable 
respondent_age <- c(18,23,27,35)

## Factor variable for education level; as.factor() orders the levels alphabetically
educational_level <-as.factor(c("High School", "Bachelor Degree", "Masters Degree", "Doctoral Degree"))

## creates new data frame
new_df <- data.frame(respondent_age, educational_level)


## respondent id variable
respondent_id <-c(1,2,3,4) 


## binds the respondent_id vector to the data frame as a third column
new_df_1 <- cbind(new_df,respondent_id)

str(new_df)

## 'data.frame':    4 obs. of  2 variables:
##  $ respondent_age   : num  18 23 27 35
##  $ educational_level: Factor w/ 4 levels "Bachelor Degree",..: 3 1 4 2

dim(new_df)

## [1] 4 2

head(new_df)

str(new_df_1)

## 'data.frame':    4 obs. of  3 variables:
##  $ respondent_age   : num  18 23 27 35
##  $ educational_level: Factor w/ 4 levels "Bachelor Degree",..: 3 1 4 2
##  $ respondent_id    : num  1 2 3 4

dim(new_df_1)

## [1] 4 3

head(new_df_1)
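
The output above shows respondent_age stored as num and the education levels in alphabetical order. A minimal sketch of one way to meet the integer and ordered-factor requirements follows. The names education_order and ordered_df are hypothetical, and the level order is assumed to run from High School up to Doctoral Degree.

```r
# Integer variable: the L suffix stores the values as integers rather than doubles
respondent_age <- c(18L, 23L, 27L, 35L)

# Ordinal variable: give the levels explicitly, lowest to highest, and mark them ordered
education_order <- c("High School", "Bachelor Degree",
                     "Masters Degree", "Doctoral Degree")
educational_level <- factor(education_order, levels = education_order, ordered = TRUE)

ordered_df <- data.frame(respondent_age, educational_level)

# Add a third, numeric variable with cbind()
respondent_id <- c(1, 2, 3, 4)
ordered_df <- cbind(ordered_df, respondent_id)

# Structure, levels, attributes and dimensions
str(ordered_df)
levels(ordered_df$educational_level)
attributes(ordered_df)
dim(ordered_df)
```

With an ordered factor, comparisons such as educational_level[1] < educational_level[4] are meaningful, which is not the case for a plain factor.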

Create another Data Frame

In this section I created a “sex” variable and used the function cbind() to bind it to the data frame from the previous step.

## creates sex variable 
sex <- as.factor(c("Male", "Female", "Female", "Male"))

## Binds the sex variable to the data frame from the previous step
step12_df <- cbind(new_df_1, sex)

glimpse(step12_df)

## Rows: 4
## Columns: 4
## $ respondent_age    <dbl> 18, 23, 27, 35
## $ educational_level <fct> High School, Bachelor Degree, Masters Degree, Doctor~
## $ respondent_id     <dbl> 1, 2, 3, 4
## $ sex               <fct> Male, Female, Female, Male

head(step12_df)