Required packages

Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.

# This is the R chunk for the required packages
library(tidyr)

library(dplyr)

library(editrules)

library(deducorrect)

library(outliers)

## Executive Summary
#This report provides the preprocessing of 2 datasets, which contains methods to merge datasets, convert variable type, label factor variable, reshape data into a tidy format, mutate new variable, scan outliers for numeric variables, transform data, scan errors (missing values, special values, obvious errors, and inconsistency) and use the technique to deal with errors (missing values, special values, obvious errors, and inconsistency) by only using packages of tidyr, dplyr, editrules, deducorrect, and outliers. This report aims to provide a sample of how to preprocess datasets, the source of datasets is all come from the government of VIC https://www.vic.gov.au/, they are all open and have a Creative Commons Licence. 
# This is the R chunk for the Data Section

## Data

#Import dataset from website, this dataset is from the government of victoria and it's source website is: https://discover.data.vic.gov.au/dataset/all-schools-fte-enrolments-feb-2015. This dataset contains information collected from the February 2014 school census of Victorian schools until February 2015. This dataset contains 54 variables but we are going to subset to 9 variables otherwise it will be too large and when we merge variables together or merge two datasets together, R will reports Error: vector memory exhausted (limit reached?). After we subset the data set, there will be 9 variables left: 

#"Education_Sector":It refers to educational sector.
#"Entity_Type":It refers to type of school district.       
#"School_No":It refers to the code that represent the school.
#"School_Name":It refers to the name of school.         
#"School_Type":It refers to the type of school.         
#"School_Status":It refers to the status of school.
#"Prep.Males.Total..":It refers to the total number of preparatory male student.
#"Prep.Females.Total..":It refers to the total number of preparatory female student.
#"Prep.Total..":It refers to the total number of preparatory student. 

#Import dataset from website.
url <- "http://www.education.vic.gov.au/Documents/about/research/datavic/dv171-Feb2015FTEenrolments.csv"
All_Schools_FTE_enrolments_Feb_2015 <- read.csv(url, stringsAsFactors = FALSE)

#We need to subset data because this dataset is too big, if we don't subset dataset, then when we merge variables together or merge two datasets together, R will reports Error: vector memory exhausted (limit reached?).
All_Schools_FTE_enrolments_Feb_2015 <- subset(All_Schools_FTE_enrolments_Feb_2015, select = Education_Sector:Prep.Total..)

head(All_Schools_FTE_enrolments_Feb_2015)



#Import dataset from another website, this dataset is from the government of victoria and it's source website is: https://discover.data.vic.gov.au/dataset/all-schools-fte-enrolments-feb-2016. This dataset contains information collected from the February 2015 school census of Victorian schools until February 2016. This dataset also contains 54 variables but we are going to subset to 9 variables due to the problem of large dataset. And these 9 variables are the same as the previous dataset, it means that they have the same variable and also those variables have the same meaning.

#Import dataset from website.
url <- "http://www.education.vic.gov.au/Documents/about/research/datavic/dv214-allschoolsfetenrolfeb2016.csv"
All_Schools_FTE_enrolments_Feb_2016 <- read.csv(url, stringsAsFactors = FALSE)

#We use anti_join() function to obtain the different data from student enrolments in 2015 (New data), because we woudl like merge additional data from student enrolments in 2016 with student enrolments in 2016 in next step.
New_All_Schools_FTE_enrolments_Feb_2016 <- anti_join(All_Schools_FTE_enrolments_Feb_2016, All_Schools_FTE_enrolments_Feb_2015, by = "School_Name")

#The same situation in this one, we have to minimize data otherwise, R will generate error when merge 2 dataset because they are too large.
New_All_Schools_FTE_enrolments_Feb_2016 <- subset(New_All_Schools_FTE_enrolments_Feb_2016, select = Education_Sector:Prep.Total..)

head(New_All_Schools_FTE_enrolments_Feb_2016)

#We use bind_rows() function to merge two datasets because they have the same variables.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students <- bind_rows(All_Schools_FTE_enrolments_Feb_2015, New_All_Schools_FTE_enrolments_Feb_2016)

head(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students)

Tidy & Manipulate Data I

# This is the R chunk for the Tidy & Manipulate Data I 

#We put this part (Tidy & Manipulate Data I) prior to the (Understand) part is because there will be some variables created after we tidy this dataset and it is better to convert variable type, rename variable, rearrange variables after we set down the dataset. 



#The problem of this dataset is that the column names are not names of variables, but values of a variable. In this case, the column names "Prep.Males.Total..", and "Prep.Females.Total.." represent values of the gender variable, and each row represents two observations, not one. 
head(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students)

#Therefore, we use gather() function to merge two variables into a factor variable and also create a new variable which is called "Number_of_students".
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2 <- All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students %>%
gather(`Prep.Males.Total..`, `Prep.Females.Total..`, key = "Gender", value = "Number_of_students")

head(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2)
NA
NA

Understand

# This is the R chunk for the Understand Section

#Rearrange and rename and rearrange variables
#We first check the order of variables. 
str(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2)
'data.frame':   4502 obs. of  9 variables:
 $ Education_Sector  : chr  "Catholic" "Catholic" "Catholic" "Catholic" ...
 $ Entity_Type       : int  2 2 2 2 2 2 2 2 2 2 ...
 $ School_No         : int  20 25 26 28 29 30 33 35 60 77 ...
 $ School_Name       : chr  "Parade College" "Simonds Catholic College" "Christian Brothers' College St Kilda" "St Patrick's College Ballarat" ...
 $ School_Type       : chr  "Secondary" "Secondary" "Secondary" "Secondary" ...
 $ School_Status     : chr  "O" "O" "O" "O" ...
 $ Prep.Total..      : num  0 0 0 0 34 44 0 0 38 39 ...
 $ Gender            : chr  "Prep.Males.Total.." "Prep.Males.Total.." "Prep.Males.Total.." "Prep.Males.Total.." ...
 $ Number_of_students: num  0 0 0 0 15 19 0 0 22 15 ...
#We use select() function to rearrange the order of variables.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2 <- All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2 %>% select(School_Name, School_No, School_Type, School_Status, Education_Sector, Entity_Type, Gender, Number_of_students, Prep.Total..)

#Given that we gather two variables together to build up a factor variable "Gender". We should rename the levels of "Gender".
#Then, use levels() function to check the level name of this variable.
levels(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Gender)
NULL
#We use label() function to re-label the name of levels in a factor variable.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Gender <- factor(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Gender,levels = c("Prep.Females.Total..","Prep.Males.Total.."),
                                                                                     labels = c("Female","Male"))

#We also notice that the variable name of "Prep.Total.." doesn't look good (because there are some ".." in the name) and "Number_of_students" should actually be "Number_of_pre.student".
#We use rename() function to rename variables.                              
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2 <- All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2 %>% 
  rename(
    Total_number_of_pre.student_for_school = Prep.Total..
    , Number_of_pre.student = Number_of_students
  )

#Check the variables and order again, this also prove that this dataset contains multiple data types (numerics, characters, factors).
str(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2)
'data.frame':   4502 obs. of  9 variables:
 $ School_Name                           : chr  "Parade College" "Simonds Catholic College" "Christian Brothers' College St Kilda" "St Patrick's College Ballarat" ...
 $ School_No                             : int  20 25 26 28 29 30 33 35 60 77 ...
 $ School_Type                           : chr  "Secondary" "Secondary" "Secondary" "Secondary" ...
 $ School_Status                         : chr  "O" "O" "O" "O" ...
 $ Education_Sector                      : chr  "Catholic" "Catholic" "Catholic" "Catholic" ...
 $ Entity_Type                           : int  2 2 2 2 2 2 2 2 2 2 ...
 $ Gender                                : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
 $ Number_of_pre.student                 : num  0 0 0 0 15 19 0 0 22 15 ...
 $ Total_number_of_pre.student_for_school: num  0 0 0 0 34 44 0 0 38 39 ...
#Convert variable type
#Notice that "Education_Sector" has some values are the same, so try to covert it into factor variable.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Education_Sector <- as.factor(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Education_Sector)

is.factor(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Education_Sector)
[1] TRUE
#Check the levels of this variable, and the result is that there are only 3 levels. This confirms my guess and this variable should actually be a factor variable.
levels(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Education_Sector)
[1] "Catholic"    "Government"  "Independent"
#Entity_Type is obviously a factor variable but R indicate that as int. So, we need to convert it into factor variable.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Entity_Type <- as.factor(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Entity_Type)

is.factor(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Entity_Type)
[1] TRUE
#To ensure that this variable is a factor variable, we check the levels for this variable and the result is that there are only 2 levels. This confirms my guess and this variable should actually be a factor variable.
levels(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Entity_Type)
[1] "1" "2"
#School_Type is obviously a factor variable but R indicate that as chr. So, we need to convert it into factor variable.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$School_Type <- as.factor(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$School_Type)

is.factor(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$School_Type)
[1] TRUE
#To ensure that this variable is a factor variable, we check the levels for this variable and the result is that there are only 5 levels. This confirms my guess and this variable should actually be a factor variable.
levels(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$School_Type)
[1] "Language"  "Pri/Sec"   "Primary"   "Secondary" "Special"  
#Notice that "School_Status" has some values are the same, so try to covert it into factor variable.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$School_Status <- as.factor(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$School_Status)

is.factor(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$School_Status)
[1] TRUE
#Check the levels of this variable, and the result is that there are only 2 levels. This confirms my guess and this variable should actually be a factor variable.
levels(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$School_Status)
[1] "O" "U"
#This variable has to be integer because it can't be half student.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Number_of_pre.student <- as.integer(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Number_of_pre.student)
is.integer(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Number_of_pre.student)
[1] TRUE
#This variable has to be integer because it can't be half student.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Total_number_of_pre.student_for_school <- as.integer(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Total_number_of_pre.student_for_school)
is.integer(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Total_number_of_pre.student_for_school)
[1] TRUE
#Check the varibale type again to make sure everything is good. 
str(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2)
'data.frame':   4502 obs. of  9 variables:
 $ School_Name                           : chr  "Parade College" "Simonds Catholic College" "Christian Brothers' College St Kilda" "St Patrick's College Ballarat" ...
 $ School_No                             : int  20 25 26 28 29 30 33 35 60 77 ...
 $ School_Type                           : Factor w/ 5 levels "Language","Pri/Sec",..: 4 4 4 4 3 3 4 4 3 3 ...
 $ School_Status                         : Factor w/ 2 levels "O","U": 1 1 1 1 1 1 1 1 1 1 ...
 $ Education_Sector                      : Factor w/ 3 levels "Catholic","Government",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Entity_Type                           : Factor w/ 2 levels "1","2": 2 2 2 2 2 2 2 2 2 2 ...
 $ Gender                                : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
 $ Number_of_pre.student                 : int  0 0 0 0 15 19 0 0 22 15 ...
 $ Total_number_of_pre.student_for_school: int  0 0 0 0 34 44 0 0 38 39 ...

Tidy & Manipulate Data II

# This is the R chunk for the Tidy & Manipulate Data II 

#Mutate new variable
#Use mutate() function to create a new variable. This new variable is the sex proportion in terms of male and female for each school. And it's formula is: Sex_proportion_for_the_pre.student_in_school = Number_of_pre.student / Total_number_of_pre.student_for_school.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2 <- mutate(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2,
Sex_proportion_for_the_pre.student_in_school = Number_of_pre.student / Total_number_of_pre.student_for_school
)

#Check if we create "Sex_proportion_for_the_pre.student_in_school" successfully.
str(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2)
'data.frame':   4502 obs. of  10 variables:
 $ School_Name                                 : chr  "Parade College" "Simonds Catholic College" "Christian Brothers' College St Kilda" "St Patrick's College Ballarat" ...
 $ School_No                                   : int  20 25 26 28 29 30 33 35 60 77 ...
 $ School_Type                                 : Factor w/ 5 levels "Language","Pri/Sec",..: 4 4 4 4 3 3 4 4 3 3 ...
 $ School_Status                               : Factor w/ 2 levels "O","U": 1 1 1 1 1 1 1 1 1 1 ...
 $ Education_Sector                            : Factor w/ 3 levels "Catholic","Government",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Entity_Type                                 : Factor w/ 2 levels "1","2": 2 2 2 2 2 2 2 2 2 2 ...
 $ Gender                                      : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
 $ Number_of_pre.student                       : int  0 0 0 0 15 19 0 0 22 15 ...
 $ Total_number_of_pre.student_for_school      : int  0 0 0 0 34 44 0 0 38 39 ...
 $ Sex_proportion_for_the_pre.student_in_school: num  NaN NaN NaN NaN 0.441 ...

Scan I

# This is the R chunk for the Scan I

#Sometimes dataset can have obious errors. For example, sex proportion cannot be negative or beyond 1. For checking these inconsistency, we would like to apply rules to check whether this dataset violates the rule by using editfile() function. And also, we could also apply rules to correct errors by using correctionRules() function. When the number of rules increase, it is better to manage the rules separate from the data. The function of editrules() can check which rules are obeyed or not and allows one to find the minimal set of variables to adapt so that all rules can be obeyed. In addition,correctionRules() function force variables to follow it's constraints so that, it is very helpful when correcting data. In this part, we are going to using editrules() and correctionRules() to deal with inconsistency.

#The first thing is to check the consistency of "Total_number_of_pre.student_for_school" = "Number_of_pre.student" for male + "Number_of_pre.student" for female.
#It is more convenient to check the consistency of "Total_number_of_pre.student_for_school" = "Number_of_pre.student" for male + "Number_of_pre.student" for female by spreading the "Gender" out.
#We find out that there is a variable prevent us to spread "Gender" out in "All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2", which is called "Sex_proportion_for_the_pre.student_in_school". 
#That is because "Sex_proportion_for_the_pre.student_in_school" = "Number_of_pre.student"/"Total_number_of_pre.student_for_school". And when we spread "Gender" out, it will produce many NaN values.
#And this is why we should exclude "Sex_proportion_for_the_pre.student_in_school" first and then spread "Gender" out.
Data_for_checking_inconsistency1 = subset(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2, select = -Sex_proportion_for_the_pre.student_in_school )

#And then we use spread() function to spread it out.
Data_for_checking_inconsistency2 <- spread(Data_for_checking_inconsistency1, key = Gender, value = Number_of_pre.student)

#Import rules into R
Rules1 <- editfile("/Users/macbook/Documents/RMIT-Master of Analytics/Data wrangling/editrules2.txt", type = "all")

Rules1

Edit set:
num1 : Female + Male == Total_number_of_pre.student_for_school 
#Use summary() function to avoid the problem of large data size.
Violated1 <- violatedEdits(Rules1, Data_for_checking_inconsistency2)

summary(Violated1)
Edit violations, 2251 observations, 0 completely missing (0%):

Edit violations per record:
#The result shows that this data does have some errors.



#To correct those errors, we import rules for correction.
Rules2 <- correctionRules("/Users/macbook/Documents/RMIT-Master of Analytics/Data wrangling/editrules3.txt")

Rules2
Object of class 'correctionRules'
##  1-------
  if (Male + Female != Total_number_of_pre.student_for_school) {
      Total_number_of_pre.student_for_school <- Male + Female
  } 
#Apply rules into the data that contains errors.
cor <- correctWithRules(Rules2, Data_for_checking_inconsistency2)

#And then, check it again whether we successfully correct errors.
#Use summary() function to avoid the problem of large data size.
Violated2 <- violatedEdits(Rules1, cor$corrected)

summary(Violated2)
No violations detected, 0 checks evaluated to NA
NULL
#And then, after correcting the data, we put "Sex_proportion_for_the_pre.student_in_school" back into the dataset.
#The first thing is to reshape the dataset of "Data_for_checking_inconsistency2".
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3 <- Data_for_checking_inconsistency2 %>%
  gather(`Male`, `Female`, key = "Gender", value = "Number_of_pre.student")

#And then, put "Sex_proportion_for_the_pre.student_in_school" back into the dataset.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3 <- mutate(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3,
                                                                              Sex_proportion_for_the_pre.student_in_school = Number_of_pre.student / Total_number_of_pre.student_for_school
)

#And then, rearrange the sequence of variables.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3 <- All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2 %>% select(School_Name, School_No, School_Type, School_Status, Education_Sector, Entity_Type, Gender, Number_of_pre.student, Total_number_of_pre.student_for_school, Sex_proportion_for_the_pre.student_in_school)



#The second thing is to check inifite value, special value, and missing value. For convenience, we first build up a function to better deal with missing values, infinite values, and NaN values. We acutally would like to use sapply() function to look for errors but this function could not deal with NaN values. Thus, we should write up a function to deal with NaN values.
is.specialorNA <- function(x){
  if (is.numeric(x)) (is.infinite(x) | is.nan(x) | is.na(x))
}

#There are 932 na value for column "Sex_proportion_for_the_pre.student_in_school"
sapply(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3, function(x) sum( is.na(x) ))
                                 School_Name 
                                           0 
                                   School_No 
                                           0 
                                 School_Type 
                                           0 
                               School_Status 
                                           0 
                            Education_Sector 
                                           0 
                                 Entity_Type 
                                           0 
                                      Gender 
                                           0 
                       Number_of_pre.student 
                                           0 
      Total_number_of_pre.student_for_school 
                                           0 
Sex_proportion_for_the_pre.student_in_school 
                                         932 
#There are 0 infinite value for this dataset
sapply(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3, function(x) sum( is.infinite(x) ))
                                 School_Name 
                                           0 
                                   School_No 
                                           0 
                                 School_Type 
                                           0 
                               School_Status 
                                           0 
                            Education_Sector 
                                           0 
                                 Entity_Type 
                                           0 
                                      Gender 
                                           0 
                       Number_of_pre.student 
                                           0 
      Total_number_of_pre.student_for_school 
                                           0 
Sex_proportion_for_the_pre.student_in_school 
                                           0 
#There are 932 NaN value for column "Sex_proportion_for_the_pre.student_in_school"
sapply(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3, function(x) sum( is.nan(x) ))
                                 School_Name 
                                           0 
                                   School_No 
                                           0 
                                 School_Type 
                                           0 
                               School_Status 
                                           0 
                            Education_Sector 
                                           0 
                                 Entity_Type 
                                           0 
                                      Gender 
                                           0 
                       Number_of_pre.student 
                                           0 
      Total_number_of_pre.student_for_school 
                                           0 
Sex_proportion_for_the_pre.student_in_school 
                                         932 
#NaN means "not a number". When R generate NaN, it means that the value just makes a little sense. In this case, we find out a lot of NaNs in the variable of “Sex_proportion_for_the_pre.student_in_school”, which is calculated by the formula of:
#Sex_proportion_for_the_pre.student_in_school = Number_of_pre.student / Total_number_of_pre.student_for_school. 
#Number_of_pre.student can be 0 and Total_number_of_pre.student_for_school can be 0 either. The is the reason of NaN because when 0 divided by 0, it is meaningless and that’s why R generate NaN as the result. To fix this special value, we can just covert NaN to 0 directly. 
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Sex_proportion_for_the_pre.student_in_school[All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Sex_proportion_for_the_pre.student_in_school == "NaN" ] <- 0

#And then, check it again whether we eliminate all NaN values and Na values.
sapply(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3, function(x) sum( is.na(x) ))
                                 School_Name 
                                           0 
                                   School_No 
                                           0 
                                 School_Type 
                                           0 
                               School_Status 
                                           0 
                            Education_Sector 
                                           0 
                                 Entity_Type 
                                           0 
                                      Gender 
                                           0 
                       Number_of_pre.student 
                                           0 
      Total_number_of_pre.student_for_school 
                                           0 
Sex_proportion_for_the_pre.student_in_school 
                                           0 
sapply(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3, function(x) sum( is.nan(x) ))
                                 School_Name 
                                           0 
                                   School_No 
                                           0 
                                 School_Type 
                                           0 
                               School_Status 
                                           0 
                            Education_Sector 
                                           0 
                                 Entity_Type 
                                           0 
                                      Gender 
                                           0 
                       Number_of_pre.student 
                                           0 
      Total_number_of_pre.student_for_school 
                                           0 
Sex_proportion_for_the_pre.student_in_school 
                                           0 
#We successfully remove all NaN values and Na values.



#The third thing is to check incoensistency, and we do the same thing as what we done previously.
#Firts, we import rules into R.
Rules3 <- editfile("/Users/macbook/Documents/RMIT-Master of Analytics/Data wrangling/editrules.txt", type = "all")

Rules3

Data model:
dat1 : Education_Sector %in% c('Catholic', 'Government', 'Independent')
dat2 : Entity_Type %in% c('1', '2')
dat3 : Gender %in% c('Female', 'Male')
dat4 : School_Status %in% c('O', 'U')
dat5 : School_Type %in% c('Language', 'Pri/Sec', 'Primary', 'Secondary', 'Special') 

Edit set:
num1 : Number_of_pre.student <= Total_number_of_pre.student_for_school
num2 : Sex_proportion_for_the_pre.student_in_school <= 1
num3 : 0 <= Sex_proportion_for_the_pre.student_in_school 
#Use summary() function to avoid the problem of large data size.

Violated3 <- violatedEdits(Rules3, All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3)

summary(Violated3)
No violations detected, 0 checks evaluated to NA
NULL
#Fortunately, we don't have any inconsistencies. 

Transform

# This is the R chunk for the Transform Section


#7 Data transformation
#The reaosn to put "data transformation" prior to the "Scan II" is that we have data who has right-skewed distribution. And we decide to use z-scores method in order to locate where are the outliers. 
#Therefore, we have to transform data from right-skewed distribution to approximately normal distribution first. 
hist(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Sex_proportion_for_the_pre.student_in_school)


#Based on the variable "Sex_proportion_for_the_pre.student_in_school", it contains 0 values and it is right-skewed.
#We first try to use square root transformation to deal with this data because this data has 0 values and it is reight-skewed.
sqrt_salary <- sqrt(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Sex_proportion_for_the_pre.student_in_school)
hist(sqrt_salary)


#As result, it is still right skewed and still need to be improved.

#We note that this data contrains 0 values or 1 values, and it is because there are some all-male school or all-female school.
#When investigating sex porporion, it is meaningless to investigate what percentage of male student in a all-female school or what percentage of female student in all-male school.
#Therefore, we should remove 0 value and 1 value. 
subset_no_0_1 <- All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3 %>% filter(Sex_proportion_for_the_pre.student_in_school != "0" & Sex_proportion_for_the_pre.student_in_school !="1")

#Now, let's produce a histogram again. 
hist(subset_no_0_1$Sex_proportion_for_the_pre.student_in_school)


#As result, it is approximately normal distributed now.

Scan II

# This is the R chunk for the Scan II

#We first check out which variable is numeric. 
str(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3)
'data.frame':   4502 obs. of  10 variables:
 $ School_Name                                 : chr  "Parade College" "Simonds Catholic College" "Christian Brothers' College St Kilda" "St Patrick's College Ballarat" ...
 $ School_No                                   : int  20 25 26 28 29 30 33 35 60 77 ...
 $ School_Type                                 : Factor w/ 5 levels "Language","Pri/Sec",..: 4 4 4 4 3 3 4 4 3 3 ...
 $ School_Status                               : Factor w/ 2 levels "O","U": 1 1 1 1 1 1 1 1 1 1 ...
 $ Education_Sector                            : Factor w/ 3 levels "Catholic","Government",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Entity_Type                                 : Factor w/ 2 levels "1","2": 2 2 2 2 2 2 2 2 2 2 ...
 $ Gender                                      : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
 $ Number_of_pre.student                       : int  0 0 0 0 15 19 0 0 22 15 ...
 $ Total_number_of_pre.student_for_school      : int  0 0 0 0 34 44 0 0 38 39 ...
 $ Sex_proportion_for_the_pre.student_in_school: num  0 0 0 0 0.441 ...
#We could see that there are 4 numeric variables: 
#School_No, Number_of_pre.student, Total_number_of_pre.student_for_school, and Sex_proportion_for_the_pre.student_in_school
#However, School_No is not necessary to be scanned because it is just a code to represent a school and not a numeric number technically. 



#First, we are going to investigate outliers of "Sex_proportion_for_the_pre.student_in_school".
hist(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Sex_proportion_for_the_pre.student_in_school)


#And also, they are non-symmetric/ non-normal data distributions, it is right-skewed distribution.
#As we known, there are all-male school and all-female school, so we expect that there are some outliers is 0 or 1 in terms of sex proportion.
#So, we should use z-scores method to look for outliers. The reason is that we could know which values are outliers, and knowing how many outliers are there, and if there are just a samll acount of outliers we may choose to delete them.
#Given that we already normalize the right-skewed distribution of "Sex_proportion_for_the_pre.student_in_school"
#Now, we could directly apply z-scores method to look for outliers.

#We now calculate z scores:
z.scores <- subset_no_0_1$Sex_proportion_for_the_pre.student_in_school %>%  scores(type = "z")
z.scores %>% summary()
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-3.571547 -0.558508 -0.000538  0.000000  0.557432  3.570471 
#we can see that the maximum is 12.04. And we can find the locations of the z-scores whose absolute value is greater than 3 by using which().
which( abs(z.scores) >3 )
 [1]   24  120  154  641  701  708  774 1171 1287 1361 1374 1451 1489
[14] 1491 1708 1804 1838 2325 2385 2392 2458 2855 2971 3045 3058 3135
[27] 3173 3175 3349
#By using length() function, we can see that there are 29 outliers.
length (which( abs(z.scores) >3 ))
[1] 29
#It is still possible that those outliers are not the result of data entry errors, measurement errors, experimental errors, intentional errors, data processing errors or due to the sampling (i.e., sampling error) because very low sex proportion or very high sex proportion does happened in some school.
#But it is also possible that those outliers are actually errors.
#In this case, we should use capping technique to deal with outliers because we are not sure about whether there are errors. 
#Thus, we should use the safe way to deal with this data: Replacing the outliers with the nearest neighbours that are not outliers.

#First, we need to create our own function: cap. This means that we need to first find the value that are less than (quantiles 0.25)-1.5*IQR and then repalce it as (quantiles 0.05). Second, find the value that are bigger than (quantiles 0.75)+1.5*IQR and then repalce it as (quantiles 0.95).
cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}

#And then, cap the outliers.
Sex_proportion_capped <- subset_no_0_1$Sex_proportion_for_the_pre.student_in_school %>% cap()

#Let's check outliers again.
z.scores <- Sex_proportion_capped %>%  scores(type = "z")
z.scores %>% summary()
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-2.5208407 -0.6305411 -0.0004412  0.0000000  0.6296587  2.5199584 
#we can see that the maximum is 2.5199584. And we can find whcih value whose absolute Z-value is greater than 3 by using which().
which( abs(z.scores) >3 )
integer(0)
#By using length() function, we can see that there are 0 outliers.
length (which( abs(z.scores) >3 ))
[1] 0
#Second, we are going to ivestigate outliers of "Number_of_pre.student" in terms of "gender". 
#Given that they are one factor variable and one numeric variable. And also, they are non-symmetric/ non-normal data distributions.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3_male <-All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3 %>% filter(Gender == "Male")

hist(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3_male$Number_of_pre.student)


All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3_female <-All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3 %>% filter(Gender == "Female")

hist(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3_female$Number_of_pre.student)


#Thus, in this case, using the technique of "Bivariate box plot" is highly encouraged.

boxplot(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Number_of_pre.student ~ All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Gender, main="Number of pre.student by gender", ylab = "Number of pre.student", xlab = "Gender")


#Using the bivariate box plot approach, we can easily detect outliers in "Number_of_pre.student" for a given "Gender".
#However, these outliers are valuable becasue this variable (Number_of_pre.student) is used for collect information of the number of stduent for different gender in different school. 
#And, it is okay that some school have a lot of student (note that outliers are all beyond the upper fence for both male and female) because schools can be like all-male school or all-female school. 
#Moreover, school type is divided by "Language", "Pri/Sec", "Primary", "Secondary", and "Special", it is expected that there should be many pre-student for language school or special school, and less pre-student for Pri/Sec, Primary, and Secondary school.
#And, we could see that the Q1 is just around 0 and Q3 is around 30 for both male and female. Obviously, there are a lot of primary and secondary school who has 0 pre-student or only very samll number of pre-student in this datset.
#Therefore, it is possible that some "Language" and "Special" school have a lot of students and are defined as outliers in this plot. So, these outliers are explainable. 
#However, it is still possible that some outliers are errors such as measurement error or sampleing error.
#Thus, we should use the safe way to deal with this data: Replacing the outliers with the nearest neighbours that are not outliers.
#Given that we already have our own function: cap and we could just directly apply cap function into dataset.



#We first deal with the outliers for male.
Student_male_capped <- All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3_male$Number_of_pre.student %>% cap()

#Let's check outliers again.
z.scores2 <- Student_male_capped %>%  scores(type = "z")
z.scores2 %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-1.0015 -0.9418 -0.2859  0.0000  0.6086  2.9342 
#we can see that the maximum is 2.9342. And we can find whcih value whose absolute Z-value is greater than 3 by using which().
which( abs(z.scores2) >3 )
integer(0)
#By using length() function, we can see that there are 0 outliers.
length (which( abs(z.scores2) >3 ))
[1] 0
#And then, we deal with the outliers for female.
Student_female_capped <- All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3_female$Number_of_pre.student %>% cap()

#Let's check outliers again.
z.scores3 <- Student_female_capped %>%  scores(type = "z")
z.scores3 %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-1.0039 -0.9410 -0.3118  0.0000  0.6320  2.9602 
#we can see that the maximum is 2.9602. And we can find whcih value whose absolute Z-value is greater than 3 by using which().
which( abs(z.scores3) >3 )
integer(0)
#By using length() function, we can see that there are 0 outliers.
length (which( abs(z.scores3) >3 ))
[1] 0
#Third, we are going to investigate outliers of "Total_number_of_pre.student_for_school".
hist(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Total_number_of_pre.student_for_school)


#And also, they are non-symmetric/ non-normal data distributions.
#Thus, in this case, using the technique of "Univariate box plot" is highly encouraged.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Total_number_of_pre.student_for_school %>%  boxplot(main="Box Plot of total number of pre.student for school", ylab="Total number of pre.student", col = "grey")


#This case is very similar to the previous case, outliers are all beyond the upper fence. As we can see in the plot, outliers are all in around 140 students to 250 student. It seems okay because some big school can have 250 pre-student. 
#And, we could see that the Q1 is just around 0 and Q3 is around 50. Obviously, this plot has the same situation with the previous one, there are a lot of primary and secondary school who has 0 pre-student only very samll number of pre-student in this datset.
#Therefore, it is possible that some "Language" and "Special" school have a lot of students and are defined as outliers in this plot. So, these outliers are explainable. 
#Outliers are valuable and explainable.
#However, it is still possible that some outliers are errors such as measurement error or sampleing error.
#Thus, we should use the safe way to deal with this data: Replacing the outliers with the nearest neighbours that are not outliers.
#Given that we already have our own function: cap and we could just directly apply cap function into dataset.



Total_tudent_capped <- All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Total_number_of_pre.student_for_school %>% cap()

#Let's check outliers again.
z.scores4 <- Total_tudent_capped %>%  scores(type = "z")
z.scores4 %>% summary()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-1.0269 -0.9327 -0.2739  0.0000  0.6046  2.8948 
#we can see that the maximum is 2.8948. And we can find whcih value whose absolute Z-value is greater than 3 by using which().
which( abs(z.scores4) >3 )
integer(0)
#By using length() function, we can see that there are 0 outliers.
length (which( abs(z.scores4) >3 ))
[1] 0
---
title: "MATH2349 Data Wrangling"
author: "Wenxuan Chen   S3613636"
subtitle: Assignment 2
output:
  html_notebook: default
  pdf_document: default
---
## Required packages 


Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.

```{r}
# This is the R chunk for the required packages
library(tidyr)

library(dplyr)

library(editrules)

library(deducorrect)

library(outliers)
```

```{r}

## Executive Summary
#This report provides the preprocessing of 2 datasets, which contains methods to merge datasets, convert variable type, label factor variable, reshape data into a tidy format, mutate new variable, scan outliers for numeric variables, transform data, scan errors (missing values, special values, obvious errors, and inconsistency) and use the technique to deal with errors (missing values, special values, obvious errors, and inconsistency) by only using packages of tidyr, dplyr, editrules, deducorrect, and outliers. This report aims to provide a sample of how to preprocess datasets, the source of datasets is all come from the government of VIC https://www.vic.gov.au/, they are all open and have a Creative Commons Licence. 
```

```{r}
# This is the R chunk for the Data Section

## Data

#Import dataset from website, this dataset is from the government of victoria and it's source website is: https://discover.data.vic.gov.au/dataset/all-schools-fte-enrolments-feb-2015. This dataset contains information collected from the February 2014 school census of Victorian schools until February 2015. This dataset contains 54 variables but we are going to subset to 9 variables otherwise it will be too large and when we merge variables together or merge two datasets together, R will reports Error: vector memory exhausted (limit reached?). After we subset the data set, there will be 9 variables left: 

#"Education_Sector":It refers to educational sector.
#"Entity_Type":It refers to type of school district.       
#"School_No":It refers to the code that represent the school.
#"School_Name":It refers to the name of school.         
#"School_Type":It refers to the type of school.         
#"School_Status":It refers to the status of school.
#"Prep.Males.Total..":It refers to the total number of preparatory male student.
#"Prep.Females.Total..":It refers to the total number of preparatory female student.
#"Prep.Total..":It refers to the total number of preparatory student. 

#Import dataset from website.
url <- "http://www.education.vic.gov.au/Documents/about/research/datavic/dv171-Feb2015FTEenrolments.csv"
All_Schools_FTE_enrolments_Feb_2015 <- read.csv(url, stringsAsFactors = FALSE)

#We need to subset data because this dataset is too big, if we don't subset dataset, then when we merge variables together or merge two datasets together, R will reports Error: vector memory exhausted (limit reached?).
All_Schools_FTE_enrolments_Feb_2015 <- subset(All_Schools_FTE_enrolments_Feb_2015, select = Education_Sector:Prep.Total..)

head(All_Schools_FTE_enrolments_Feb_2015)



#Import dataset from another website, this dataset is from the government of victoria and it's source website is: https://discover.data.vic.gov.au/dataset/all-schools-fte-enrolments-feb-2016. This dataset contains information collected from the February 2015 school census of Victorian schools until February 2016. This dataset also contains 54 variables but we are going to subset to 9 variables due to the problem of large dataset. And these 9 variables are the same as the previous dataset, it means that they have the same variable and also those variables have the same meaning.

#Import dataset from website.
url <- "http://www.education.vic.gov.au/Documents/about/research/datavic/dv214-allschoolsfetenrolfeb2016.csv"
All_Schools_FTE_enrolments_Feb_2016 <- read.csv(url, stringsAsFactors = FALSE)

#We use anti_join() function to obtain the different data from student enrolments in 2015 (New data), because we woudl like merge additional data from student enrolments in 2016 with student enrolments in 2016 in next step.
New_All_Schools_FTE_enrolments_Feb_2016 <- anti_join(All_Schools_FTE_enrolments_Feb_2016, All_Schools_FTE_enrolments_Feb_2015, by = "School_Name")

#The same situation in this one, we have to minimize data otherwise, R will generate error when merge 2 dataset because they are too large.
New_All_Schools_FTE_enrolments_Feb_2016 <- subset(New_All_Schools_FTE_enrolments_Feb_2016, select = Education_Sector:Prep.Total..)

head(New_All_Schools_FTE_enrolments_Feb_2016)

#We use bind_rows() function to merge two datasets because they have the same variables.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students <- bind_rows(All_Schools_FTE_enrolments_Feb_2015, New_All_Schools_FTE_enrolments_Feb_2016)

head(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students)
```

##	Tidy & Manipulate Data I 

```{r}
# This is the R chunk for the Tidy & Manipulate Data I 

#We put this part (Tidy & Manipulate Data I) prior to the (Understand) part is because there will be some variables created after we tidy this dataset and it is better to convert variable type, rename variable, rearrange variables after we set down the dataset. 



#The problem of this dataset is that the column names are not names of variables, but values of a variable. In this case, the column names "Prep.Males.Total..", and "Prep.Females.Total.." represent values of the gender variable, and each row represents two observations, not one. 
head(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students)

#Therefore, we use gather() function to merge two variables into a factor variable and also create a new variable which is called "Number_of_students".
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2 <- All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students %>%
gather(`Prep.Males.Total..`, `Prep.Females.Total..`, key = "Gender", value = "Number_of_students")

head(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2)


```

## Understand 

```{r}
# This is the R chunk for the Understand Section

#Rearrange and rename and rearrange variables
#We first check the order of variables. 
str(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2)

#We use select() function to rearrange the order of variables.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2 <- All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2 %>% select(School_Name, School_No, School_Type, School_Status, Education_Sector, Entity_Type, Gender, Number_of_students, Prep.Total..)

#Given that we gather two variables together to build up a factor variable "Gender". We should rename the levels of "Gender".
#Then, use levels() function to check the level name of this variable.
levels(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Gender)

#We use label() function to re-label the name of levels in a factor variable.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Gender <- factor(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Gender,levels = c("Prep.Females.Total..","Prep.Males.Total.."),
                                                                                     labels = c("Female","Male"))

#We also notice that the variable name of "Prep.Total.." doesn't look good (because there are some ".." in the name) and "Number_of_students" should actually be "Number_of_pre.student".
#We use rename() function to rename variables.                              
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2 <- All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2 %>% 
  rename(
    Total_number_of_pre.student_for_school = Prep.Total..
    , Number_of_pre.student = Number_of_students
  )

#Check the variables and order again, this also prove that this dataset contains multiple data types (numerics, characters, factors).
str(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2)



#Convert variable type
#Notice that "Education_Sector" has some values are the same, so try to covert it into factor variable.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Education_Sector <- as.factor(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Education_Sector)

is.factor(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Education_Sector)

#Check the levels of this variable, and the result is that there are only 3 levels. This confirms my guess and this variable should actually be a factor variable.
levels(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Education_Sector)


#Entity_Type is obviously a factor variable but R indicate that as int. So, we need to convert it into factor variable.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Entity_Type <- as.factor(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Entity_Type)

is.factor(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Entity_Type)

#To ensure that this variable is a factor variable, we check the levels for this variable and the result is that there are only 2 levels. This confirms my guess and this variable should actually be a factor variable.
levels(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Entity_Type)



#School_Type is obviously a factor variable but R indicate that as chr. So, we need to convert it into factor variable.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$School_Type <- as.factor(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$School_Type)

is.factor(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$School_Type)

#To ensure that this variable is a factor variable, we check the levels for this variable and the result is that there are only 5 levels. This confirms my guess and this variable should actually be a factor variable.
levels(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$School_Type)



#Notice that "School_Status" has some values are the same, so try to covert it into factor variable.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$School_Status <- as.factor(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$School_Status)

is.factor(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$School_Status)

#Check the levels of this variable, and the result is that there are only 2 levels. This confirms my guess and this variable should actually be a factor variable.
levels(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$School_Status)



#This variable has to be integer because it can't be half student.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Number_of_pre.student <- as.integer(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Number_of_pre.student)
is.integer(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Number_of_pre.student)



#This variable has to be integer because it can't be half student.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Total_number_of_pre.student_for_school <- as.integer(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Total_number_of_pre.student_for_school)
is.integer(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2$Total_number_of_pre.student_for_school)

#Check the varibale type again to make sure everything is good. 
str(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2)

```

##	Tidy & Manipulate Data II 

```{r}
# This is the R chunk for the Tidy & Manipulate Data II 

#Mutate new variable
#Use mutate() function to create a new variable. This new variable is the sex proportion in terms of male and female for each school. And it's formula is: Sex_proportion_for_the_pre.student_in_school = Number_of_pre.student / Total_number_of_pre.student_for_school.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2 <- mutate(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2,
Sex_proportion_for_the_pre.student_in_school = Number_of_pre.student / Total_number_of_pre.student_for_school
)

#Check if we create "Sex_proportion_for_the_pre.student_in_school" successfully.
str(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2)
```

##	Scan I 

```{r}
# This is the R chunk for the Scan I

#Sometimes dataset can have obious errors. For example, sex proportion cannot be negative or beyond 1. For checking these inconsistency, we would like to apply rules to check whether this dataset violates the rule by using editfile() function. And also, we could also apply rules to correct errors by using correctionRules() function. When the number of rules increase, it is better to manage the rules separate from the data. The function of editrules() can check which rules are obeyed or not and allows one to find the minimal set of variables to adapt so that all rules can be obeyed. In addition,correctionRules() function force variables to follow it's constraints so that, it is very helpful when correcting data. In this part, we are going to using editrules() and correctionRules() to deal with inconsistency.

#The first thing is to check the consistency of "Total_number_of_pre.student_for_school" = "Number_of_pre.student" for male + "Number_of_pre.student" for female.
#It is more convenient to check the consistency of "Total_number_of_pre.student_for_school" = "Number_of_pre.student" for male + "Number_of_pre.student" for female by spreading the "Gender" out.
#We find out that there is a variable prevent us to spread "Gender" out in "All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2", which is called "Sex_proportion_for_the_pre.student_in_school". 
#That is because "Sex_proportion_for_the_pre.student_in_school" = "Number_of_pre.student"/"Total_number_of_pre.student_for_school". And when we spread "Gender" out, it will produce many NaN values.
#And this is why we should exclude "Sex_proportion_for_the_pre.student_in_school" first and then spread "Gender" out.
Data_for_checking_inconsistency1 = subset(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2, select = -Sex_proportion_for_the_pre.student_in_school )

#And then we use spread() function to spread it out.
Data_for_checking_inconsistency2 <- spread(Data_for_checking_inconsistency1, key = Gender, value = Number_of_pre.student)

#Import rules into R
Rules1 <- editfile("/Users/macbook/Documents/RMIT-Master of Analytics/Data wrangling/editrules2.txt", type = "all")

Rules1

#Use summary() function to avoid the problem of large data size.
Violated1 <- violatedEdits(Rules1, Data_for_checking_inconsistency2)

summary(Violated1)

#The result shows that this data does have some errors.



#To correct those errors, we import rules for correction.
Rules2 <- correctionRules("/Users/macbook/Documents/RMIT-Master of Analytics/Data wrangling/editrules3.txt")

Rules2

#Apply rules into the data that contains errors.
cor <- correctWithRules(Rules2, Data_for_checking_inconsistency2)

#And then, check it again whether we successfully correct errors.
#Use summary() function to avoid the problem of large data size.
Violated2 <- violatedEdits(Rules1, cor$corrected)

summary(Violated2)



#And then, after correcting the data, we put "Sex_proportion_for_the_pre.student_in_school" back into the dataset.
#The first thing is to reshape the dataset of "Data_for_checking_inconsistency2".
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3 <- Data_for_checking_inconsistency2 %>%
  gather(`Male`, `Female`, key = "Gender", value = "Number_of_pre.student")

#And then, put "Sex_proportion_for_the_pre.student_in_school" back into the dataset.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3 <- mutate(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3,
                                                                              Sex_proportion_for_the_pre.student_in_school = Number_of_pre.student / Total_number_of_pre.student_for_school
)

#And then, rearrange the sequence of variables.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3 <- All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_2 %>% select(School_Name, School_No, School_Type, School_Status, Education_Sector, Entity_Type, Gender, Number_of_pre.student, Total_number_of_pre.student_for_school, Sex_proportion_for_the_pre.student_in_school)



#The second thing is to check inifite value, special value, and missing value. For convenience, we first build up a function to better deal with missing values, infinite values, and NaN values. We acutally would like to use sapply() function to look for errors but this function could not deal with NaN values. Thus, we should write up a function to deal with NaN values.
is.specialorNA <- function(x){
  if (is.numeric(x)) (is.infinite(x) | is.nan(x) | is.na(x))
}

#There are 932 na value for column "Sex_proportion_for_the_pre.student_in_school"
sapply(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3, function(x) sum( is.na(x) ))

#There are 0 infinite value for this dataset
sapply(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3, function(x) sum( is.infinite(x) ))

#There are 932 NaN value for column "Sex_proportion_for_the_pre.student_in_school"
sapply(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3, function(x) sum( is.nan(x) ))

#NaN means "not a number". When R generate NaN, it means that the value just makes a little sense. In this case, we find out a lot of NaNs in the variable of “Sex_proportion_for_the_pre.student_in_school”, which is calculated by the formula of:
#Sex_proportion_for_the_pre.student_in_school = Number_of_pre.student / Total_number_of_pre.student_for_school. 
#Number_of_pre.student can be 0 and Total_number_of_pre.student_for_school can be 0 either. The is the reason of NaN because when 0 divided by 0, it is meaningless and that’s why R generate NaN as the result. To fix this special value, we can just covert NaN to 0 directly. 
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Sex_proportion_for_the_pre.student_in_school[All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Sex_proportion_for_the_pre.student_in_school == "NaN" ] <- 0

#And then, check it again whether we eliminate all NaN values and Na values.
sapply(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3, function(x) sum( is.na(x) ))

sapply(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3, function(x) sum( is.nan(x) ))

#We successfully remove all NaN values and Na values.



#The third thing is to check incoensistency, and we do the same thing as what we done previously.
#Firts, we import rules into R.
Rules3 <- editfile("/Users/macbook/Documents/RMIT-Master of Analytics/Data wrangling/editrules.txt", type = "all")

Rules3

#Use summary() function to avoid the problem of large data size.

Violated3 <- violatedEdits(Rules3, All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3)

summary(Violated3)

#Fortunately, we don't have any inconsistencies. 
```

##	Transform 

```{r}
# This is the R chunk for the Transform Section


#7 Data transformation
#The reaosn to put "data transformation" prior to the "Scan II" is that we have data who has right-skewed distribution. And we decide to use z-scores method in order to locate where are the outliers. 
#Therefore, we have to transform data from right-skewed distribution to approximately normal distribution first. 
hist(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Sex_proportion_for_the_pre.student_in_school)

#Based on the variable "Sex_proportion_for_the_pre.student_in_school", it contains 0 values and it is right-skewed.
#We first try to use square root transformation to deal with this data because this data has 0 values and it is reight-skewed.
sqrt_salary <- sqrt(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Sex_proportion_for_the_pre.student_in_school)
hist(sqrt_salary)

#As result, it is still right skewed and still need to be improved.

#We note that this data contrains 0 values or 1 values, and it is because there are some all-male school or all-female school.
#When investigating sex porporion, it is meaningless to investigate what percentage of male student in a all-female school or what percentage of female student in all-male school.
#Therefore, we should remove 0 value and 1 value. 
subset_no_0_1 <- All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3 %>% filter(Sex_proportion_for_the_pre.student_in_school != "0" & Sex_proportion_for_the_pre.student_in_school !="1")

#Now, let's produce a histogram again. 
hist(subset_no_0_1$Sex_proportion_for_the_pre.student_in_school)

#As result, it is approximately normal distributed now.
```

##	Scan II

```{r}
# This is the R chunk for the Scan II

#We first check out which variable is numeric. 
str(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3)

#We could see that there are 4 numeric variables: 
#School_No, Number_of_pre.student, Total_number_of_pre.student_for_school, and Sex_proportion_for_the_pre.student_in_school
#However, School_No is not necessary to be scanned because it is just a code to represent a school and not a numeric number technically. 



#First, we are going to investigate outliers of "Sex_proportion_for_the_pre.student_in_school".
hist(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Sex_proportion_for_the_pre.student_in_school)

#And also, they are non-symmetric/ non-normal data distributions, it is right-skewed distribution.
#As we known, there are all-male school and all-female school, so we expect that there are some outliers is 0 or 1 in terms of sex proportion.
#So, we should use z-scores method to look for outliers. The reason is that we could know which values are outliers, and knowing how many outliers are there, and if there are just a samll acount of outliers we may choose to delete them.
#Given that we already normalize the right-skewed distribution of "Sex_proportion_for_the_pre.student_in_school"
#Now, we could directly apply z-scores method to look for outliers.

#We now calculate z scores:
z.scores <- subset_no_0_1$Sex_proportion_for_the_pre.student_in_school %>%  scores(type = "z")
z.scores %>% summary()

#we can see that the maximum is 12.04. And we can find the locations of the z-scores whose absolute value is greater than 3 by using which().
which( abs(z.scores) >3 )

#By using length() function, we can see that there are 29 outliers.
length (which( abs(z.scores) >3 ))

#It is still possible that those outliers are not the result of data entry errors, measurement errors, experimental errors, intentional errors, data processing errors or due to the sampling (i.e., sampling error) because very low sex proportion or very high sex proportion does happened in some school.
#But it is also possible that those outliers are actually errors.
#In this case, we should use capping technique to deal with outliers because we are not sure about whether there are errors. 
#Thus, we should use the safe way to deal with this data: Replacing the outliers with the nearest neighbours that are not outliers.

#First, we need to create our own function: cap. This means that we need to first find the value that are less than (quantiles 0.25)-1.5*IQR and then repalce it as (quantiles 0.05). Second, find the value that are bigger than (quantiles 0.75)+1.5*IQR and then repalce it as (quantiles 0.95).
cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}

#And then, cap the outliers.
Sex_proportion_capped <- subset_no_0_1$Sex_proportion_for_the_pre.student_in_school %>% cap()

#Let's check outliers again.
z.scores <- Sex_proportion_capped %>%  scores(type = "z")
z.scores %>% summary()

#we can see that the maximum is 2.5199584. And we can find whcih value whose absolute Z-value is greater than 3 by using which().
which( abs(z.scores) >3 )

#By using length() function, we can see that there are 0 outliers.
length (which( abs(z.scores) >3 ))



#Second, we are going to ivestigate outliers of "Number_of_pre.student" in terms of "gender". 
#Given that they are one factor variable and one numeric variable. And also, they are non-symmetric/ non-normal data distributions.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3_male <-All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3 %>% filter(Gender == "Male")

hist(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3_male$Number_of_pre.student)

All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3_female <-All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3 %>% filter(Gender == "Female")

hist(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3_female$Number_of_pre.student)

#Thus, in this case, using the technique of "Bivariate box plot" is highly encouraged.

boxplot(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Number_of_pre.student ~ All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Gender, main="Number of pre.student by gender", ylab = "Number of pre.student", xlab = "Gender")

#Using the bivariate box plot approach, we can easily detect outliers in "Number_of_pre.student" for a given "Gender".
#However, these outliers are valuable becasue this variable (Number_of_pre.student) is used for collect information of the number of stduent for different gender in different school. 
#And, it is okay that some school have a lot of student (note that outliers are all beyond the upper fence for both male and female) because schools can be like all-male school or all-female school. 
#Moreover, school type is divided by "Language", "Pri/Sec", "Primary", "Secondary", and "Special", it is expected that there should be many pre-student for language school or special school, and less pre-student for Pri/Sec, Primary, and Secondary school.
#And, we could see that the Q1 is just around 0 and Q3 is around 30 for both male and female. Obviously, there are a lot of primary and secondary school who has 0 pre-student or only very samll number of pre-student in this datset.
#Therefore, it is possible that some "Language" and "Special" school have a lot of students and are defined as outliers in this plot. So, these outliers are explainable. 
#However, it is still possible that some outliers are errors such as measurement error or sampleing error.
#Thus, we should use the safe way to deal with this data: Replacing the outliers with the nearest neighbours that are not outliers.
#Given that we already have our own function: cap and we could just directly apply cap function into dataset.



#We first deal with the outliers for male.
Student_male_capped <- All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3_male$Number_of_pre.student %>% cap()

#Let's check outliers again.
z.scores2 <- Student_male_capped %>%  scores(type = "z")
z.scores2 %>% summary()

#we can see that the maximum is 2.9342. And we can find whcih value whose absolute Z-value is greater than 3 by using which().
which( abs(z.scores2) >3 )

#By using length() function, we can see that there are 0 outliers.
length (which( abs(z.scores2) >3 ))



#And then, we deal with the outliers for female.
Student_female_capped <- All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3_female$Number_of_pre.student %>% cap()

#Let's check outliers again.
z.scores3 <- Student_female_capped %>%  scores(type = "z")
z.scores3 %>% summary()

#we can see that the maximum is 2.9602. And we can find whcih value whose absolute Z-value is greater than 3 by using which().
which( abs(z.scores3) >3 )

#By using length() function, we can see that there are 0 outliers.
length (which( abs(z.scores3) >3 ))



#Third, we are going to investigate outliers of "Total_number_of_pre.student_for_school".
hist(All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Total_number_of_pre.student_for_school)

#And also, they are non-symmetric/ non-normal data distributions.
#Thus, in this case, using the technique of "Univariate box plot" is highly encouraged.
All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Total_number_of_pre.student_for_school %>%  boxplot(main="Box Plot of total number of pre.student for school", ylab="Total number of pre.student", col = "grey")

#This case is very similar to the previous case, outliers are all beyond the upper fence. As we can see in the plot, outliers are all in around 140 students to 250 student. It seems okay because some big school can have 250 pre-student. 
#And, we could see that the Q1 is just around 0 and Q3 is around 50. Obviously, this plot has the same situation with the previous one, there are a lot of primary and secondary school who has 0 pre-student only very samll number of pre-student in this datset.
#Therefore, it is possible that some "Language" and "Special" school have a lot of students and are defined as outliers in this plot. So, these outliers are explainable. 
#Outliers are valuable and explainable.
#However, it is still possible that some outliers are errors such as measurement error or sampleing error.
#Thus, we should use the safe way to deal with this data: Replacing the outliers with the nearest neighbours that are not outliers.
#Given that we already have our own function: cap and we could just directly apply cap function into dataset.



Total_tudent_capped <- All_Schools_FTE_enrolments_Feb_2015_Combine_2016_for_pre.students_3$Total_number_of_pre.student_for_school %>% cap()

#Let's check outliers again.
z.scores4 <- Total_tudent_capped %>%  scores(type = "z")
z.scores4 %>% summary()

#we can see that the maximum is 2.8948. And we can find whcih value whose absolute Z-value is greater than 3 by using which().
which( abs(z.scores4) >3 )

#By using length() function, we can see that there are 0 outliers.
length (which( abs(z.scores4) >3 ))

```
