Required packages

library(readr)
library(dplyr)
library(knitr)
library(stringr)
library(tidyr)
library(readxl)
library(MVN)

Executive Summary

In our analysis, we have three datasets at hand regarding schools in the state of Victoria which represent the number of enrolments in February 2013, locations in 2019 and building floor areas of the schools in 2013 respectively. Initially we had a lot of variables of each dataset, so we filtered out the ones required for our analysis and then joined the three datasets on the basis of common attributes. After having a final dataset to work on, we factorised the variables that were necessary and divided those variables on the basis of categories. After that we tidied the dataset, uniting the address of schools which was split across four columns under one variable. We then formed a new variable as Density so that we could have an idea of the population density (people per square metre) of each school.

Next we checked for missing values per each column and removed them. After that, we checked for possible outliers with mvn function and excluded those outliers. The last step we have taken is to transform the Density attribute in our dataset. We have done this transformation via the log transformation and z-score technique and we have used it to eliminate the skewness of Density.

Data

We have three datasets at hand which we have gathered from the Victorian Government website. The first dataset tells us about the location of schools in the state of Victoria and the second dataset tells us about the total number of enrolments across schools in Victoria in the month of February, 2013. The third dataset tells us about the building floor area of each of the state (Government) schools. We have created a new dataframe where we are joining all three datasets on the basis of common attribute (School_ID/School_No). We had a lot of variables in both datasets which we did not need, so we have filtered out only those that were necessary for our analysis.

In the locations dataset, we also have School_No and Entity_type to uniquely identify each school. The rest of the variables make up the complete Address of the schools. In the enrolments dataset, we have the names and types of schools along with the Education sector they fall in. We had the enrolments distributed according to classes which we have removed, since we only wanted the Grand Total of enrolments across the schools. From the Area dataset, we use group_by and summarise funtions to get exact total area of each school, as one school may have multiple campuses. Only included the Floor Area variable which shows us the area of schools per square metre was included to our dataframe.

Because we only have government school area dataset, we join it with the other two dataset with Entity_type = 1 (means government school) to form our dataframe. Finally, we use inner_join function to ensure that our dataframe can retrieve the enrolment number, address and area of one school.

Source:
School Locations

School Enrolments

School Building Floor Area

school_location <- read_csv("dv279-schoollocations2019.csv")
head(school_location)
Variables Description Data Type
Education_sector Education Sector Char (Catholic, Government, Independent)
Entity_Type Entity Type Double (1, 2) 1 for government, 2 for non-government
School_No School Number Double
School_Name School Name Char
School_Type School Type Char (Language, Primary, Secondary, Pri/Sec, Special)
School_Status School Status Char (C, O) C for close, O for open
Address_Line_1 Address Line 1 Char
Address_Line_2 Address Line 2 Char
Address_Town Address Town Char
Address_State Address_State Char (only one value: VIC)
Address_Postcode Address Postcode Double
Postal_Address_Line_1 Postal Address Line_1 Char
Postal_Address_Line_2 Postal Address Line_2 Char
Postal_Town Postal Town Char
Postal_State Postal Char (only one value: VIC)
Postal_Postcode Postal Postcode Double
Full_Phone_No Full Phone Number Char
LGA_ID Local Government Areas ID Double
LGA_Name Local Government Areas Name Char
X Longitude Double
Y Latitude Double
school_area <- read_excel("DV3-StateSchoolsBuildingFloorArea2013.xlsx", skip = 5, n_max = 1916)
head(school_area)
Variables Description Data Type
School ID School ID Double
School_Name School Name Char
Campus ID Campus ID Double
School/Campus Type School/Campus Type Char (LANGUAGE, PRIMARY, SECONDARY, TECHNICAL, OTHER, SPECIAL DEVELOPMENTAL SCHOOL, RECREATION CAMP, PRIMARY/SECONDARY, )
Total Floor Area (sqm)2 Total Floor Area (\(m^2\)) Double
school_enroll <- read_csv("37-dv2013fte-enrolments.csv")
head(school_enroll)
Variables Description Data Type
Education_sector Education Sector Char (Catholic, Government, Independent)
Entity_Type Entity Type Double (1, 2) 1 for government, 2 for non-government
School_No School Number Double
School_Name School Name Char
School_Type School Type Char (Language, Primary, Secondary, Pri/Sec, Special)
School_Status School Status Char (C, O) C for close, O for open
Prep Males Total, Prep Females Total, Prep Total Number of Male/Female/All Students Enrolling in Preparatory Level Double
Primary Ungraded Males Total, Primary Ungraded Females Total, Primary Ungraded Total Number of Male/Female/All Students Enrolling in Primary Ungraded Level Double
Postprimary Ungraded Males, Postprimary Ungraded Females, Postprimary Ungraded. otal Number of Male/Female/All Students Enrolling in Postprimary Ungraded Level Double
Secondary Ungraded Males, Secondary Ungraded Females, Secondary Ungraded. otal Number of Male/Female/All Students Enrolling in Ungraded Secondary Level Double
Secondary Total Number of All Students Enrolling in Secondary Level Double
Grand Total Number of All Students Enrolling in the School Double
Year 1 Males Total

Year 12 Total
Number of Male Students in Year 1

Number of All Students in Year 12
Double

Double
school_area <- school_area %>% group_by(`School ID`) %>% summarise(Total_area = sum(`Total Floor Area (sqm)2`))
school_enroll <- school_enroll %>% filter(Entity_Type == 1) %>% select(c(3, 54))
school_location <- school_location %>% filter(Entity_Type == 1) %>% select(c(3:5, 7:11))
df <- school_area %>% inner_join(school_location, by = c("School ID" = "School_No")) %>%
      inner_join(school_enroll, by = c("School ID" = "School_No"))
head(df)
Variables Description Data Type
School ID School ID Double
School_Name School Name Char
Total_area Total Floor Area (\(m^2\)) Double
School_Type School Type Char (Language, Primary, Secondary, Pri/Sec, Special)
Address_Line_1 Address Line 1 Char
Address_Line_2 Address Line 2 Char
Address_Town Address Town Char
Address_State Address_State Char (only one value: VIC)
Address_Postcode Address Postcode Double
Address_Postcode Address Postcode Double
Grand Total Number of All Students Enrolling in the School Double

Understand

First we have calculated the dimensions of the new dataframe after joining both datasets, we have 1494 observations across 10 variables. Next we have the structure of the dataframe which tells us the class of all our variables. Based on this, we factorized School_Type accordingly to show its categories of the variables. We have renamed variables to proper names. In the end, we used the structure command again to have a look at the changes we’ve made and the new classes of the variables.

dim(df)
## [1] 1493   10
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1493 obs. of  10 variables:
##  $ School ID       : num  1 3 4 8 12 26 28 33 37 40 ...
##  $ Total_area      : num  1379 2551 1565 958 2888 ...
##  $ School_Name     : chr  "Alberton Primary School" "Allansford and District Primary School" "Avoca Primary School" "Avenel Primary School" ...
##  $ School_Type     : chr  "Primary" "Primary" "Primary" "Primary" ...
##  $ Address_Line_1  : chr  "21 Thomson Street" "Frank Street" "118 Barnett Street" "40 Anderson Street" ...
##  $ Address_Line_2  : chr  NA NA NA NA ...
##  $ Address_Town    : chr  "Alberton" "Allansford" "Avoca" "Avenel" ...
##  $ Address_State   : chr  "VIC" "VIC" "VIC" "VIC" ...
##  $ Address_Postcode: num  3971 3277 3467 3664 3113 ...
##  $ Grand Total     : num  66 157 95 107 298 ...
df <- df %>% mutate(School_Type = factor(School_Type, levels = c("Language", "Primary", "Secondary", "Pri/Sec", "Special")))
levels(df$School_Type)
## [1] "Language"  "Primary"   "Secondary" "Pri/Sec"   "Special"
names(df)[c(1, 10)] <- c("School_ID", "Total_enrol")
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1493 obs. of  10 variables:
##  $ School_ID       : num  1 3 4 8 12 26 28 33 37 40 ...
##  $ Total_area      : num  1379 2551 1565 958 2888 ...
##  $ School_Name     : chr  "Alberton Primary School" "Allansford and District Primary School" "Avoca Primary School" "Avenel Primary School" ...
##  $ School_Type     : Factor w/ 5 levels "Language","Primary",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Address_Line_1  : chr  "21 Thomson Street" "Frank Street" "118 Barnett Street" "40 Anderson Street" ...
##  $ Address_Line_2  : chr  NA NA NA NA ...
##  $ Address_Town    : chr  "Alberton" "Allansford" "Avoca" "Avenel" ...
##  $ Address_State   : chr  "VIC" "VIC" "VIC" "VIC" ...
##  $ Address_Postcode: num  3971 3277 3467 3664 3113 ...
##  $ Total_enrol     : num  66 157 95 107 298 ...

Tidy & Manipulate Data I

We had the address for schools distributed across five different variables, so in this step we have united all the those variables into a single column called “Address”. After that, we tidied our data to not show the string “NA” in the Address column.

df_tidy <- df %>% unite(Address, Address_Line_1, Address_Line_2, Address_Town,Address_Postcode, Address_State, sep = ",") %>% 
             mutate(Address = str_replace_all(Address, 'NA,?', ''))
head(df_tidy)

Tidy & Manipulate Data II

In this step, we have created a new variable called Density using the mutate function. Density represents the total number of enrolments of a school divided by the school’s floor area.

df_new <- df_tidy %>% mutate(Density = Total_enrol/Total_area)
head(df_new)

Scan I

We used colSums and is.na to have a look at the number of missing values in each attribute. After looking at which columns had missing values, we removed the missing values for the column Total Area because there is only one in our dataframe (<5%). In the end we used the colSums function again to show that we have dealt with the missing values and there are none of them in our dataframe. The dimension of the new dataframe (df_rmNA) is also showed.

colSums(is.na(df_new))
##   School_ID  Total_area School_Name School_Type     Address Total_enrol 
##           0           1           0           0           0           0 
##     Density 
##           1
df_rmNA <- na.omit(df_new)
colSums(is.na(df_rmNA))
##   School_ID  Total_area School_Name School_Type     Address Total_enrol 
##           0           0           0           0           0           0 
##     Density 
##           0
dim(df_rmNA)
## [1] 1492    7

Scan II

We have used mvn function to scan our data for possible outliers as there is a relation (Density) between Total_area and Total_enrol. At the beginning, we found the relation is similar to logarithm, so we used mvn with log function to find the possible outliers. After figuring out outliers in our dataset, we deleted the outliers because there only 90 outliers in the dataframe (90/1493 ~= 5%).

area_enrol <- df_rmNA %>% select(Total_area, Total_enrol)
results1 <- mvn(data = area_enrol, multivariateOutlierMethod = "quan", showOutliers = TRUE)

results2 <- mvn(data = log(area_enrol), multivariateOutlierMethod = "quan", showOutliers = TRUE, showNewData = TRUE)

df2 <- exp(results2$newData)
df2 <- df2 %>% mutate(Density = Total_enrol/Total_area)

Transform

The variable we have used for transformation is Density. The transformation we have used in this step is the log transformation and we have used it to eliminate the skewness of Density. In the end we have normalised the histogram for better representation.

hist(df2$Density)

density_log <- log(df2$Density)
hist(density_log)

z_density_log <- scale(density_log, center = TRUE, scale = TRUE)
hist(z_density_log)