MATH2349 Semester 2, 2019

Required packages

library(readr)
library(dplyr)
library(knitr)
library(stringr)
library(tidyr)
library(readxl)
library(MVN)

Executive Summary

In our analysis, we have three datasets at hand regarding schools in the state of Victoria which represent the number of enrolments in February 2013, locations in 2019 and building floor areas of the schools in 2013 respectively. Initially we had a lot of variables of each dataset, so we filtered out the ones required for our analysis and then joined the three datasets on the basis of common attributes. After having a final dataset to work on, we factorised the variables that were necessary and divided those variables on the basis of categories. After that we tidied the dataset, uniting the address of schools which was split across four columns under one variable. We then formed a new variable as Density so that we could have an idea of the population density (people per square metre) of each school.

Next we checked for missing values per each column and removed them. After that, we checked for possible outliers with mvn function and excluded those outliers. The last step we have taken is to transform the Density attribute in our dataset. We have done this transformation via the log transformation and z-score technique and we have used it to eliminate the skewness of Density.

Data

We have three datasets at hand which we have gathered from the Victorian Government website. The first dataset tells us about the location of schools in the state of Victoria and the second dataset tells us about the total number of enrolments across schools in Victoria in the month of February, 2013. The third dataset tells us about the building floor area of each of the state (Government) schools. We have created a new dataframe where we are joining all three datasets on the basis of common attribute (School_ID/School_No). We had a lot of variables in both datasets which we did not need, so we have filtered out only those that were necessary for our analysis.

In the locations dataset, we also have School_No and Entity_type to uniquely identify each school. The rest of the variables make up the complete Address of the schools. In the enrolments dataset, we have the names and types of schools along with the Education sector they fall in. We had the enrolments distributed according to classes which we have removed, since we only wanted the Grand Total of enrolments across the schools. From the Area dataset, we use group_by and summarise funtions to get exact total area of each school, as one school may have multiple campuses. Only included the Floor Area variable which shows us the area of schools per square metre was included to our dataframe.

Because we only have government school area dataset, we join it with the other two dataset with Entity_type = 1 (means government school) to form our dataframe. Finally, we use inner_join function to ensure that our dataframe can retrieve the enrolment number, address and area of one school.

Source:
School Locations

School Enrolments

School Building Floor Area

school_location <- read_csv("dv279-schoollocations2019.csv")
head(school_location)

Variables	Description	Data Type
Education_sector	Education Sector	Char (Catholic, Government, Independent)
Entity_Type	Entity Type	Double (1, 2) 1 for government, 2 for non-government
School_No	School Number	Double
School_Name	School Name	Char
School_Type	School Type	Char (Language, Primary, Secondary, Pri/Sec, Special)
School_Status	School Status	Char (C, O) C for close, O for open
Address_Line_1	Address Line 1	Char
Address_Line_2	Address Line 2	Char
Address_Town	Address Town	Char
Address_State	Address_State	Char (only one value: VIC)
Address_Postcode	Address Postcode	Double
Postal_Address_Line_1	Postal Address Line_1	Char
Postal_Address_Line_2	Postal Address Line_2	Char
Postal_Town	Postal Town	Char
Postal_State	Postal	Char (only one value: VIC)
Postal_Postcode	Postal Postcode	Double
Full_Phone_No	Full Phone Number	Char
LGA_ID	Local Government Areas ID	Double
LGA_Name	Local Government Areas Name	Char
X	Longitude	Double
Y	Latitude	Double

school_area <- read_excel("DV3-StateSchoolsBuildingFloorArea2013.xlsx", skip = 5, n_max = 1916)
head(school_area)

Variables	Description	Data Type
School ID	School ID	Double
School_Name	School Name	Char
Campus ID	Campus ID	Double
School/Campus Type	School/Campus Type	Char (LANGUAGE, PRIMARY, SECONDARY, TECHNICAL, OTHER, SPECIAL DEVELOPMENTAL SCHOOL, RECREATION CAMP, PRIMARY/SECONDARY, )
Total Floor Area (sqm)2	Total Floor Area (\(m^2\))	Double

school_enroll <- read_csv("37-dv2013fte-enrolments.csv")
head(school_enroll)

Variables	Description	Data Type
Education_sector	Education Sector	Char (Catholic, Government, Independent)
Entity_Type	Entity Type	Double (1, 2) 1 for government, 2 for non-government
School_No	School Number	Double
School_Name	School Name	Char
School_Type	School Type	Char (Language, Primary, Secondary, Pri/Sec, Special)
School_Status	School Status	Char (C, O) C for close, O for open
Prep Males Total, Prep Females Total, Prep Total	Number of Male/Female/All Students Enrolling in Preparatory Level	Double
Primary Ungraded Males Total, Primary Ungraded Females Total, Primary Ungraded Total	Number of Male/Female/All Students Enrolling in Primary Ungraded Level	Double
Postprimary Ungraded Males, Postprimary Ungraded Females, Postprimary Ungraded. otal	Number of Male/Female/All Students Enrolling in Postprimary Ungraded Level	Double
Secondary Ungraded Males, Secondary Ungraded Females, Secondary Ungraded. otal	Number of Male/Female/All Students Enrolling in Ungraded Secondary Level	Double
Secondary Total	Number of All Students Enrolling in Secondary Level	Double
Grand Total	Number of All Students Enrolling in the School	Double
Year 1 Males Total ⋮ Year 12 Total	Number of Male Students in Year 1 ⋮ Number of All Students in Year 12	Double ⋮ Double

school_area <- school_area %>% group_by(`School ID`) %>% summarise(Total_area = sum(`Total Floor Area (sqm)2`))
school_enroll <- school_enroll %>% filter(Entity_Type == 1) %>% select(c(3, 54))
school_location <- school_location %>% filter(Entity_Type == 1) %>% select(c(3:5, 7:11))
df <- school_area %>% inner_join(school_location, by = c("School ID" = "School_No")) %>%
      inner_join(school_enroll, by = c("School ID" = "School_No"))
head(df)

Variables	Description	Data Type
School ID	School ID	Double
School_Name	School Name	Char
Total_area	Total Floor Area (\(m^2\))	Double
School_Type	School Type	Char (Language, Primary, Secondary, Pri/Sec, Special)
Address_Line_1	Address Line 1	Char
Address_Line_2	Address Line 2	Char
Address_Town	Address Town	Char
Address_State	Address_State	Char (only one value: VIC)
Address_Postcode	Address Postcode	Double
Address_Postcode	Address Postcode	Double
Grand Total	Number of All Students Enrolling in the School	Double

Understand

First we have calculated the dimensions of the new dataframe after joining both datasets, we have 1494 observations across 10 variables. Next we have the structure of the dataframe which tells us the class of all our variables. Based on this, we factorized School_Type accordingly to show its categories of the variables. We have renamed variables to proper names. In the end, we used the structure command again to have a look at the changes we’ve made and the new classes of the variables.

dim(df)

## [1] 1493   10

str(df)

## Classes 'tbl_df', 'tbl' and 'data.frame':    1493 obs. of  10 variables:
##  $ School ID       : num  1 3 4 8 12 26 28 33 37 40 ...
##  $ Total_area      : num  1379 2551 1565 958 2888 ...
##  $ School_Name     : chr  "Alberton Primary School" "Allansford and District Primary School" "Avoca Primary School" "Avenel Primary School" ...
##  $ School_Type     : chr  "Primary" "Primary" "Primary" "Primary" ...
##  $ Address_Line_1  : chr  "21 Thomson Street" "Frank Street" "118 Barnett Street" "40 Anderson Street" ...
##  $ Address_Line_2  : chr  NA NA NA NA ...
##  $ Address_Town    : chr  "Alberton" "Allansford" "Avoca" "Avenel" ...
##  $ Address_State   : chr  "VIC" "VIC" "VIC" "VIC" ...
##  $ Address_Postcode: num  3971 3277 3467 3664 3113 ...
##  $ Grand Total     : num  66 157 95 107 298 ...

df <- df %>% mutate(School_Type = factor(School_Type, levels = c("Language", "Primary", "Secondary", "Pri/Sec", "Special")))
levels(df$School_Type)

## [1] "Language"  "Primary"   "Secondary" "Pri/Sec"   "Special"

names(df)[c(1, 10)] <- c("School_ID", "Total_enrol")
str(df)

## Classes 'tbl_df', 'tbl' and 'data.frame':    1493 obs. of  10 variables:
##  $ School_ID       : num  1 3 4 8 12 26 28 33 37 40 ...
##  $ Total_area      : num  1379 2551 1565 958 2888 ...
##  $ School_Name     : chr  "Alberton Primary School" "Allansford and District Primary School" "Avoca Primary School" "Avenel Primary School" ...
##  $ School_Type     : Factor w/ 5 levels "Language","Primary",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Address_Line_1  : chr  "21 Thomson Street" "Frank Street" "118 Barnett Street" "40 Anderson Street" ...
##  $ Address_Line_2  : chr  NA NA NA NA ...
##  $ Address_Town    : chr  "Alberton" "Allansford" "Avoca" "Avenel" ...
##  $ Address_State   : chr  "VIC" "VIC" "VIC" "VIC" ...
##  $ Address_Postcode: num  3971 3277 3467 3664 3113 ...
##  $ Total_enrol     : num  66 157 95 107 298 ...

Tidy & Manipulate Data I

We had the address for schools distributed across five different variables, so in this step we have united all the those variables into a single column called “Address”. After that, we tidied our data to not show the string “NA” in the Address column.

df_tidy <- df %>% unite(Address, Address_Line_1, Address_Line_2, Address_Town,Address_Postcode, Address_State, sep = ",") %>% 
             mutate(Address = str_replace_all(Address, 'NA,?', ''))
head(df_tidy)

Tidy & Manipulate Data II

In this step, we have created a new variable called Density using the mutate function. Density represents the total number of enrolments of a school divided by the school’s floor area.

df_new <- df_tidy %>% mutate(Density = Total_enrol/Total_area)
head(df_new)

Scan I

We used colSums and is.na to have a look at the number of missing values in each attribute. After looking at which columns had missing values, we removed the missing values for the column Total Area because there is only one in our dataframe (<5%). In the end we used the colSums function again to show that we have dealt with the missing values and there are none of them in our dataframe. The dimension of the new dataframe (df_rmNA) is also showed.

colSums(is.na(df_new))

##   School_ID  Total_area School_Name School_Type     Address Total_enrol 
##           0           1           0           0           0           0 
##     Density 
##           1

df_rmNA <- na.omit(df_new)
colSums(is.na(df_rmNA))

##   School_ID  Total_area School_Name School_Type     Address Total_enrol 
##           0           0           0           0           0           0 
##     Density 
##           0

dim(df_rmNA)

## [1] 1492    7

Scan II

We have used mvn function to scan our data for possible outliers as there is a relation (Density) between Total_area and Total_enrol. At the beginning, we found the relation is similar to logarithm, so we used mvn with log function to find the possible outliers. After figuring out outliers in our dataset, we deleted the outliers because there only 90 outliers in the dataframe (90/1493 ~= 5%).

area_enrol <- df_rmNA %>% select(Total_area, Total_enrol)
results1 <- mvn(data = area_enrol, multivariateOutlierMethod = "quan", showOutliers = TRUE)

results2 <- mvn(data = log(area_enrol), multivariateOutlierMethod = "quan", showOutliers = TRUE, showNewData = TRUE)

df2 <- exp(results2$newData)
df2 <- df2 %>% mutate(Density = Total_enrol/Total_area)

Transform

The variable we have used for transformation is Density. The transformation we have used in this step is the log transformation and we have used it to eliminate the skewness of Density. In the end we have normalised the histogram for better representation.

hist(df2$Density)

density_log <- log(df2$Density)
hist(density_log)

z_density_log <- scale(density_log, center = TRUE, scale = TRUE)
hist(z_density_log)