Data Wrangling (Data Preprocessing)

Required packages

Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.0.3

## -- Attaching packages -------------------------------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.3     v stringr 1.4.0
## v tidyr   1.1.1     v forcats 0.5.0

## Warning: package 'ggplot2' was built under R version 4.0.4

## -- Conflicts ----------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

#install.packages("deductive")
#install.packages("validate")
library(deductive)

## Warning: package 'deductive' was built under R version 4.0.5

library(validate)

## Warning: package 'validate' was built under R version 4.0.5

## 
## Attaching package: 'validate'

## The following object is masked from 'package:ggplot2':
## 
##     expr

## The following object is masked from 'package:dplyr':
## 
##     expr

#install.packages("editrules")
library(editrules)

## Warning: package 'editrules' was built under R version 4.0.5

## Loading required package: igraph

## Warning: package 'igraph' was built under R version 4.0.5

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:validate':
## 
##     compare, hierarchy

## The following objects are masked from 'package:purrr':
## 
##     compose, simplify

## The following object is masked from 'package:tidyr':
## 
##     crossing

## The following object is masked from 'package:tibble':
## 
##     as_data_frame

## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

## 
## Attaching package: 'editrules'

## The following objects are masked from 'package:igraph':
## 
##     blocks, normalize

## The following object is masked from 'package:purrr':
## 
##     reduce

## The following objects are masked from 'package:tidyr':
## 
##     contains, separate

## The following object is masked from 'package:dplyr':
## 
##     contains

library(ggplot2)
#install.packages("forecast")
library(forecast)

## Warning: package 'forecast' was built under R version 4.0.5

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(outliers)

## Warning: package 'outliers' was built under R version 4.0.3

Executive Summary

The first data set and the second data set are all based on the website: https://discover.data.vic.gov.au/. Both of them are all record about the schools details in 2020. I found these two data set is because they almost have the same school names and number.

To finish this assignment, I need to understand the data set first, so I do some changes about the data set, that I did not use all the variables and observations in this assignment, it is easier for me to analyze the data set, and also, it will be more clear to see these schools important detail.

In the following steps, I will check and delete each data set for missing values and exclude rows. Merge two data sets and convert the variables to the appropriate data types. Then, I will check the data whether is tidy or untidy, and also check the missing and special value for this data set.

At the last, the data set will be scanned by making boxplot and deal with the outlier, also, I need to translate the data to normal distribution, which is necessary for future analysis.

Data

The data is from the victoria governemnt website, the data 1 is “School Locations - 2020” from the url: https://discover.data.vic.gov.au/dataset/school-locations-2020, and the data 2 is also from the same website, called “All School FTE Enrolments - Feb 2020” from the url: https://discover.data.vic.gov.au/dataset/all-schools-fte-enrolments-feb-2020/resource/5d7ba739-f01f-4589-9b00-1aa5859cdd72. data1(School Locations - 2020) has 720 observations and 21 variables, data2(All School FTE Enrolments - Feb 2020) has 721 observations with 54 variables.

The first step is to import two data sets and merge them together, and delete the school number for data1 because of duplicated. ie. There are different st patrick’s school number in data1, and we will keep these numbers.

# This is the R chunk for the Data Section
data1<- read_csv("schoollocations2020.csv")

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   Entity_Type = col_double(),
##   SCHOOL_NO = col_double(),
##   Address_Postcode = col_double(),
##   Postal_Postcode = col_double(),
##   LGA_ID = col_double(),
##   X = col_double(),
##   Y = col_double()
## )

## See spec(...) for full column specifications.

head(data1)

data2<- read_csv("allschoolsFTEenrolmentsFeb2020.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Education_Sector = col_character(),
##   School_Name = col_character(),
##   School_Type = col_character(),
##   School_Status = col_character(),
##   CENSUS_TYPE = col_logical()
## )
## See spec(...) for full column specifications.

head(data2)

main_data<- left_join(data2, data1, by = "School_Name")# merge two data sets

main_data<-main_data[,-3]# delete the school number for data1 because of duplicated. ie. There are different st patrick's school number in data1, and we will keep these numbers.

head(main_data)

After merge two datasets as “main_data”, There are 73 variables. For easier to manipulate and analysis the data, i will select columns to store in another data frame.

# This is the R chunk for the Data Section
data<-main_data%>%select(School_Name, SCHOOL_NO, School_Type.y, `Male total`,`Female total`, Year)
head(data)

str(data)

## tibble [3,929 x 6] (S3: tbl_df/tbl/data.frame)
##  $ School_Name  : chr [1:3929] "Parade College" "Simonds Catholic College" "Christian Brothers' College St Kilda" "St Patrick's College Ballarat" ...
##  $ SCHOOL_NO    : num [1:3929] 20 25 26 28 29 104 187 227 386 445 ...
##  $ School_Type.y: chr [1:3929] "Secondary" "Secondary" "Secondary" "Secondary" ...
##  $ Male total   : num [1:3929] 1953 421 477 1340 94 ...
##  $ Female total : num [1:3929] 0 0 0 0 119 119 119 119 119 119 ...
##  $ Year         : num [1:3929] 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...

Dataset Description I choose 6 variables with 3929 observation in data from main_data, which is easier to analysis. Here are 6 variables description:

School_Name: the name of schools, and with the type of character.
School_No: the number of schools, this will identify different school with the same name, and the data type is numeric
School_Type.y: this variable includes 4 types of schools, Primary, Secondary, Special and Primary/Secondary, the data type for this variable is character
Male total: the variable is total enrollment students number for male, the data type is numeric
Female total: the variable is total enrollment students number for Female, and the data type is numeric
Year: the year of recorded original data, and this should be always 2020.(Because the data is from 2020)

Understand

In this section, I will first look at the variables type, and then decided what data type of variables I wanna change to numeric, character or factor.

#have a look about the type of variables
str(data)

## tibble [3,929 x 6] (S3: tbl_df/tbl/data.frame)
##  $ School_Name  : chr [1:3929] "Parade College" "Simonds Catholic College" "Christian Brothers' College St Kilda" "St Patrick's College Ballarat" ...
##  $ SCHOOL_NO    : num [1:3929] 20 25 26 28 29 104 187 227 386 445 ...
##  $ School_Type.y: chr [1:3929] "Secondary" "Secondary" "Secondary" "Secondary" ...
##  $ Male total   : num [1:3929] 1953 421 477 1340 94 ...
##  $ Female total : num [1:3929] 0 0 0 0 119 119 119 119 119 119 ...
##  $ Year         : num [1:3929] 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...

From the data type above, I have character and numeric data in the table. However, there is no factor variable, in this case, School——Type.y and SCHOOL_NO is suitable to change to factor variables. Year is only include the year, no month and day values, so is cannot to change to date data type, but can be integer variable.

as.data.frame(data)

#set school_no as factor.
data$SCHOOL_NO <- as.factor(data$SCHOOL_NO)
#set Year as integer data type 
data$Year<-as.integer(data$Year)
#set school_type as factor and character data type
data$School_Type.y <- data$School_Type.y %>% factor(
  levels = c('Pri/Sec', 'Primary', 'Secondary', 'Special'),
  labels = c("Primary or Secondary","Primary", "Secondary", "Special"), ordered = TRUE)
#set school_name character data type (just wanna make sure)
data$School_Name<-as.character(data$School_Name)
str(data)

## tibble [3,929 x 6] (S3: tbl_df/tbl/data.frame)
##  $ School_Name  : chr [1:3929] "Parade College" "Simonds Catholic College" "Christian Brothers' College St Kilda" "St Patrick's College Ballarat" ...
##  $ SCHOOL_NO    : Factor w/ 719 levels "1","5","20","25",..: 3 4 5 6 7 17 39 47 71 79 ...
##  $ School_Type.y: Ord.factor w/ 4 levels "Primary or Secondary"<..: 3 3 3 3 2 2 2 2 2 2 ...
##  $ Male total   : num [1:3929] 1953 421 477 1340 94 ...
##  $ Female total : num [1:3929] 0 0 0 0 119 119 119 119 119 119 ...
##  $ Year         : int [1:3929] 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...

From the data type above, this data set has 4 type of data, which are character, factor, numeric and integer.

Tidy & Manipulate Data II

In this section, I will add another column to sum the male total and female total to calculate the total enrollment.

I have swap the order of “Tidy & Manipulate Data II” and “Tidy & Manipulate Data I”, because this is the best way for me to mutate the new column, otherwise, the data set will increased to more than 2000000 observations.

#change the column name which is easier to manipulate data
data<-data %>% 
  rename(
    Male= `Male total`,
    Female= `Female total`
    ) 
#mutate the Total_Enrollment column
data<-data%>%mutate(Total_Enrolment=Male+Female)
data

Tidy & Manipulate Data I

In the data dataset, it includes the Male total and Female total variables, which from data2. This type of variable is untidy. Followed by the tidy data priciple, each variable must have its own column, each observation must have its own row, each value must have its own cell. Obviously, “Male total” and “Female total” are values, not variable name, these two variable name can be classified to “Sex” variable.

#change male and female as values
data_tidy1<- data %>% pivot_longer(
   cols = Male:Female,
   names_to = c("Enrollment Sex"),
   values_to = "Enrollment count by sex"
 )

#distinct data_tidy1 table
data_tidy1<-distinct(data_tidy1)
#remane variable
data_tidy1<-data_tidy1 %>% 
  rename(
    School_Type=School_Type.y
    )
data_tidy1

Scan I

In this section, I will scan all variables for missing values, special values and obvious errors (i.e. inconsistencies).

Firstly, scan missing values and show which column has missing values.

colSums(is.na(data_tidy1))

##             School_Name               SCHOOL_NO             School_Type 
##                       2                       6                       6 
##                    Year         Total_Enrolment          Enrollment Sex 
##                       2                       0                       0 
## Enrollment count by sex 
##                       0

There are missing values in Name, SCHOOL_NO and School_Type, we can find where the missing value.

inCompleteCases <- data_tidy1[!complete.cases(data_tidy1), ]
inCompleteCases

From incompleteCases, we could find that 2 rows have no name, that because when we join together, that might have some empty rows in the table, and there are 6 values missing in School_No and School_Type, this should be caused by joining two data sets at the #1, and data1 have no these schools in data2, in this case, we should be better to remove these rows.

#before remove, there are 7858 rows
nrow(data_tidy1)

## [1] 7858

#before remove
nrow(data_tidy1)

## [1] 7858

#remove the missing values
data_tidy2<-na.omit(data_tidy1)

#after remove
nrow(data_tidy2)

## [1] 7852

After remove, there are 7852 rows, and 6 rows removed, which is the same as we scan above

After dealing with missing values, I will scan special values

#check whether data_tidy2 has finite value, because the data type of school name and enrollment_sex is character, so it shows false when check finite value
data_tidy2 %>%sapply(is.finite)%>%head()

##      School_Name SCHOOL_NO School_Type Year Total_Enrolment Enrollment Sex
## [1,]       FALSE      TRUE        TRUE TRUE            TRUE          FALSE
## [2,]       FALSE      TRUE        TRUE TRUE            TRUE          FALSE
## [3,]       FALSE      TRUE        TRUE TRUE            TRUE          FALSE
## [4,]       FALSE      TRUE        TRUE TRUE            TRUE          FALSE
## [5,]       FALSE      TRUE        TRUE TRUE            TRUE          FALSE
## [6,]       FALSE      TRUE        TRUE TRUE            TRUE          FALSE
##      Enrollment count by sex
## [1,]                    TRUE
## [2,]                    TRUE
## [3,]                    TRUE
## [4,]                    TRUE
## [5,]                    TRUE
## [6,]                    TRUE

#check whether data_tidy2 has infinite value
data_tidy2%>%sapply(is.infinite)%>%head()

##      School_Name SCHOOL_NO School_Type  Year Total_Enrolment Enrollment Sex
## [1,]       FALSE     FALSE       FALSE FALSE           FALSE          FALSE
## [2,]       FALSE     FALSE       FALSE FALSE           FALSE          FALSE
## [3,]       FALSE     FALSE       FALSE FALSE           FALSE          FALSE
## [4,]       FALSE     FALSE       FALSE FALSE           FALSE          FALSE
## [5,]       FALSE     FALSE       FALSE FALSE           FALSE          FALSE
## [6,]       FALSE     FALSE       FALSE FALSE           FALSE          FALSE
##      Enrollment count by sex
## [1,]                   FALSE
## [2,]                   FALSE
## [3,]                   FALSE
## [4,]                   FALSE
## [5,]                   FALSE
## [6,]                   FALSE

Check numerical column in data_tidy2 whether they have infinite or NaN values is.special by function called is.specialorNA

is.specialorNA <- function(data_tidy2){
if (is.numeric(data_tidy2)) (is.infinite(data_tidy2) | is.nan(data_tidy2) | is.na(data_tidy2))
}

data_tidy2%>%head()%>%sapply(is.specialorNA)

## $School_Name
## NULL
## 
## $SCHOOL_NO
## NULL
## 
## $School_Type
## NULL
## 
## $Year
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
## 
## $Total_Enrolment
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
## 
## $`Enrollment Sex`
## NULL
## 
## $`Enrollment count by sex`
## [1] FALSE FALSE FALSE FALSE FALSE FALSE

#at the last, we can check if all value has no missing and special values
sapply(data_tidy2, function(data_tidy2) sum( is.na(data_tidy2) ))

##             School_Name               SCHOOL_NO             School_Type 
##                       0                       0                       0 
##                    Year         Total_Enrolment          Enrollment Sex 
##                       0                       0                       0 
## Enrollment count by sex 
##                       0

Check inconsistencies value in data_tidy2, which should not included any inconsistencies data in numeriacal columns

#rename
data_tidy2<-data_tidy2 %>% 
  rename(
    Enrollment_Sex_NO= `Enrollment count by sex`
    )
#set rules
(Rules <- editset(c("Year == 2020", "Total_Enrolment >=0", "Enrollment_Sex_NO >=0")))

## 
## Edit set:
## num1 : Year == 2020
## num2 : 0 <= Total_Enrolment
## num3 : 0 <= Enrollment_Sex_NO

#check whether include inconsistencies data in year, total_enrolment and enrollment_sex_no
summary(violatedEdits(Rules, data_tidy2))

## No violations detected, 0 checks evaluated to NA

## NULL

Scan II

Scan the all numeric data for outliers. In this step, I will separate data to male and female data, which is easy for me to analyze and compare the school type with different sex enrollment detail.

#boxplots of the situation for male and female in different type of school
Male_enrolment<-data_tidy2%>%filter(`Enrollment Sex`=="Male")

Female_enrolment<-data_tidy2%>%filter(`Enrollment Sex`=="Female")

#School type analysis for male
ggplot(Male_enrolment, aes(x=School_Type, y=Enrollment_Sex_NO, color=School_Type))+
  stat_boxplot(geom = "errorbar",width=0.15)+
  geom_boxplot(outlier.colour="red",outlier.size=2)+
  labs(title="The enrollment of different type of school for Male ",x="School Type", y = "Enrollment number")

From the table, “Primary/Secondary” and “Primary” has most outliers

#School type analysis for male
ggplot(Female_enrolment, aes(x=School_Type, y=Enrollment_Sex_NO, color=School_Type)) + 
  stat_boxplot(geom = "errorbar",width=0.15)+
  geom_boxplot(outlier.colour="red", outlier.size=2)+
  labs(title="The enrollment of different type of school for Female ",x="School Type", y = "Enrollment number")

In the Female data set, the situation about the outliers looks as the same as Male data set.

Female and Male data set shows the similar situation, in this case, I can have a look about the whole data set

ggplot(data_tidy2, aes(x=School_Type, y=Total_Enrolment, color=School_Type)) + 
  stat_boxplot(geom = "errorbar",width=0.15)+
  geom_boxplot(outlier.colour="red", outlier.size=2)+
  labs(title="The total enrollment of different type of school",x="School Type", y = "Total Enrollment")

For all the three boxplot, Male and Female data shows the similar distribution as the whole data set, so I just need to find out which school type has more ourliers (Primary or Secondary, and Primary)

outliers1 <- data_tidy2 %>% filter(School_Type == "Primary or Secondary")
z.scores_ps <- outliers1$Total_Enrolment %>%  scores(type = "z")
z.scores_ps %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.1533 -0.8007 -0.2167  0.0000  0.4650  3.5588

outliers2 <- data_tidy2 %>% filter(School_Type == "Primary")
z.scores_p <- outliers2$Total_Enrolment %>% scores(type = "z")
z.scores_p %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.4369 -0.7778 -0.1524  0.0000  0.5870  8.8690

#check how many outliers for Primary or Secondary
length (which(abs(z.scores_ps) >3 ))

## [1] 4

#check hwo many outliers for Primary 
length (which(abs(z.scores_p) >3 ))

## [1] 32

Since there are 32 outliers in data_tidy2 with school_type Primary, I will use capping outliers method to deal with them, because, there are 3929 observation, and only 32 outliers, so it caused by data processing errors or outliers observed, even the number is small, but still need to fit the matrix.

# Define a function to cap the values outside the limits
cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}

#use the function in Primary type school data set
outliers2$Total_Enrolment <-  outliers2$Total_Enrolment %>% cap()

#check the outlier
ggplot(outliers2, aes(x=School_Type, y=Total_Enrolment)) + 
  stat_boxplot(geom = "errorbar",width=0.15)+
  geom_boxplot(outlier.colour="red", outlier.size=2)+
  labs(title="The total enrollment of Primary type of school",x="School Type", y = "Total Enrollment")

After capping method, the box plot shows that there is no outliers for Primary data set.

#use the function in Primary/Secondary type school data set
outliers1$Total_Enrolment <-  outliers1$Total_Enrolment %>% cap()

#check the outlier
ggplot(outliers1, aes(x=School_Type, y=Total_Enrolment)) + 
  stat_boxplot(geom = "errorbar",width=0.15)+
  geom_boxplot(outlier.colour="red", outlier.size=2)+
  labs(title="The total enrollment of Primary type of school",x="School Type", y = "Total Enrollment")

After capping method, the box plot shows that there is no outliers for Primary/Secondary data set.

Transform

In this section, I will use Male data to analysis the “total enrolment” students number. From the histogram, there is a right-skewed distribution for enrollment number.

#  the original one
hist(Male_enrolment$Total_Enrolment)

#ln transformation
ln_Enrolment <- log(Male_enrolment$Total_Enrolment)
hist(ln_Enrolment)

From the graphs above, I have tried ln and log10 transformation method, based on the graph, the log10 method performs the best.

NOTE: Note that sometimes the order of the tasks may be different than the order given here. For example, you may need to tidy the data sets first to be able to create the common key to merge. Therefore, for such cases you may have a different ordering of the sections.

Any further or optional pre-processing tasks can be added to the template using an additional section in the R Markdown file. Make sure your code is visible (within the margin of the page). Do not use View() to show your data, instead give headers (using head() )