library(readr)
library(dplyr)
library(knitr)
library(stringr)
library(tidyr)
library(readxl)
library(MVN)
In our analysis, we have three datasets at hand regarding schools in the state of Victoria which represent the number of enrolments in February 2013, locations in 2019 and building floor areas of the schools in 2013 respectively. Initially we had a lot of variables of each dataset, so we filtered out the ones required for our analysis and then joined the three datasets on the basis of common attributes. After having a final dataset to work on, we factorised the variables that were necessary and divided those variables on the basis of categories. After that we tidied the dataset, uniting the address of schools which was split across four columns under one variable. We then formed a new variable as Density so that we could have an idea of the population density (people per square metre) of each school.
Next we checked for missing values per each column and removed them. After that, we checked for possible outliers with mvn function and excluded those outliers. The last step we have taken is to transform the Density attribute in our dataset. We have done this transformation via the log transformation and z-score technique and we have used it to eliminate the skewness of Density.
We have three datasets at hand which we have gathered from the Victorian Government website. The first dataset tells us about the location of schools in the state of Victoria and the second dataset tells us about the total number of enrolments across schools in Victoria in the month of February, 2013. The third dataset tells us about the building floor area of each of the state (Government) schools. We have created a new dataframe where we are joining all three datasets on the basis of common attribute (School_ID/School_No). We had a lot of variables in both datasets which we did not need, so we have filtered out only those that were necessary for our analysis.
In the locations dataset, we also have School_No and Entity_type to uniquely identify each school. The rest of the variables make up the complete Address of the schools. In the enrolments dataset, we have the names and types of schools along with the Education sector they fall in. We had the enrolments distributed according to classes which we have removed, since we only wanted the Grand Total of enrolments across the schools. From the Area dataset, we use group_by and summarise funtions to get exact total area of each school, as one school may have multiple campuses. Only included the Floor Area variable which shows us the area of schools per square metre was included to our dataframe.
Because we only have government school area dataset, we join it with the other two dataset with Entity_type = 1 (means government school) to form our dataframe. Finally, we use inner_join function to ensure that our dataframe can retrieve the enrolment number, address and area of one school.
Source:
School Locations
school_location <- read_csv("dv279-schoollocations2019.csv")
head(school_location)
| Variables | Description | Data Type |
|---|---|---|
| Education_sector | Education Sector | Char (Catholic, Government, Independent) |
| Entity_Type | Entity Type | Double (1, 2) 1 for government, 2 for non-government |
| School_No | School Number | Double |
| School_Name | School Name | Char |
| School_Type | School Type | Char (Language, Primary, Secondary, Pri/Sec, Special) |
| School_Status | School Status | Char (C, O) C for close, O for open |
| Address_Line_1 | Address Line 1 | Char |
| Address_Line_2 | Address Line 2 | Char |
| Address_Town | Address Town | Char |
| Address_State | Address_State | Char (only one value: VIC) |
| Address_Postcode | Address Postcode | Double |
| Postal_Address_Line_1 | Postal Address Line_1 | Char |
| Postal_Address_Line_2 | Postal Address Line_2 | Char |
| Postal_Town | Postal Town | Char |
| Postal_State | Postal | Char (only one value: VIC) |
| Postal_Postcode | Postal Postcode | Double |
| Full_Phone_No | Full Phone Number | Char |
| LGA_ID | Local Government Areas ID | Double |
| LGA_Name | Local Government Areas Name | Char |
| X | Longitude | Double |
| Y | Latitude | Double |
school_area <- read_excel("DV3-StateSchoolsBuildingFloorArea2013.xlsx", skip = 5, n_max = 1916)
head(school_area)
| Variables | Description | Data Type |
|---|---|---|
| School ID | School ID | Double |
| School_Name | School Name | Char |
| Campus ID | Campus ID | Double |
| School/Campus Type | School/Campus Type | Char (LANGUAGE, PRIMARY, SECONDARY, TECHNICAL, OTHER, SPECIAL DEVELOPMENTAL SCHOOL, RECREATION CAMP, PRIMARY/SECONDARY, ) |
| Total Floor Area (sqm)2 | Total Floor Area (\(m^2\)) | Double |
school_enroll <- read_csv("37-dv2013fte-enrolments.csv")
head(school_enroll)
| Variables | Description | Data Type |
|---|---|---|
| Education_sector | Education Sector | Char (Catholic, Government, Independent) |
| Entity_Type | Entity Type | Double (1, 2) 1 for government, 2 for non-government |
| School_No | School Number | Double |
| School_Name | School Name | Char |
| School_Type | School Type | Char (Language, Primary, Secondary, Pri/Sec, Special) |
| School_Status | School Status | Char (C, O) C for close, O for open |
| Prep Males Total, Prep Females Total, Prep Total | Number of Male/Female/All Students Enrolling in Preparatory Level | Double |
| Primary Ungraded Males Total, Primary Ungraded Females Total, Primary Ungraded Total | Number of Male/Female/All Students Enrolling in Primary Ungraded Level | Double |
| Postprimary Ungraded Males, Postprimary Ungraded Females, Postprimary Ungraded. otal | Number of Male/Female/All Students Enrolling in Postprimary Ungraded Level | Double |
| Secondary Ungraded Males, Secondary Ungraded Females, Secondary Ungraded. otal | Number of Male/Female/All Students Enrolling in Ungraded Secondary Level | Double |
| Secondary Total | Number of All Students Enrolling in Secondary Level | Double |
| Grand Total | Number of All Students Enrolling in the School | Double |
| Year 1 Males Total ⋮ Year 12 Total |
Number of Male Students in Year 1 ⋮ Number of All Students in Year 12 |
Double ⋮ Double |
school_area <- school_area %>% group_by(`School ID`) %>% summarise(Total_area = sum(`Total Floor Area (sqm)2`))
school_enroll <- school_enroll %>% filter(Entity_Type == 1) %>% select(c(3, 54))
school_location <- school_location %>% filter(Entity_Type == 1) %>% select(c(3:5, 7:11))
df <- school_area %>% inner_join(school_location, by = c("School ID" = "School_No")) %>%
inner_join(school_enroll, by = c("School ID" = "School_No"))
head(df)
| Variables | Description | Data Type |
|---|---|---|
| School ID | School ID | Double |
| School_Name | School Name | Char |
| Total_area | Total Floor Area (\(m^2\)) | Double |
| School_Type | School Type | Char (Language, Primary, Secondary, Pri/Sec, Special) |
| Address_Line_1 | Address Line 1 | Char |
| Address_Line_2 | Address Line 2 | Char |
| Address_Town | Address Town | Char |
| Address_State | Address_State | Char (only one value: VIC) |
| Address_Postcode | Address Postcode | Double |
| Address_Postcode | Address Postcode | Double |
| Grand Total | Number of All Students Enrolling in the School | Double |
First we have calculated the dimensions of the new dataframe after joining both datasets, we have 1494 observations across 10 variables. Next we have the structure of the dataframe which tells us the class of all our variables. Based on this, we factorized School_Type accordingly to show its categories of the variables. We have renamed variables to proper names. In the end, we used the structure command again to have a look at the changes we’ve made and the new classes of the variables.
dim(df)
## [1] 1493 10
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1493 obs. of 10 variables:
## $ School ID : num 1 3 4 8 12 26 28 33 37 40 ...
## $ Total_area : num 1379 2551 1565 958 2888 ...
## $ School_Name : chr "Alberton Primary School" "Allansford and District Primary School" "Avoca Primary School" "Avenel Primary School" ...
## $ School_Type : chr "Primary" "Primary" "Primary" "Primary" ...
## $ Address_Line_1 : chr "21 Thomson Street" "Frank Street" "118 Barnett Street" "40 Anderson Street" ...
## $ Address_Line_2 : chr NA NA NA NA ...
## $ Address_Town : chr "Alberton" "Allansford" "Avoca" "Avenel" ...
## $ Address_State : chr "VIC" "VIC" "VIC" "VIC" ...
## $ Address_Postcode: num 3971 3277 3467 3664 3113 ...
## $ Grand Total : num 66 157 95 107 298 ...
df <- df %>% mutate(School_Type = factor(School_Type, levels = c("Language", "Primary", "Secondary", "Pri/Sec", "Special")))
levels(df$School_Type)
## [1] "Language" "Primary" "Secondary" "Pri/Sec" "Special"
names(df)[c(1, 10)] <- c("School_ID", "Total_enrol")
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1493 obs. of 10 variables:
## $ School_ID : num 1 3 4 8 12 26 28 33 37 40 ...
## $ Total_area : num 1379 2551 1565 958 2888 ...
## $ School_Name : chr "Alberton Primary School" "Allansford and District Primary School" "Avoca Primary School" "Avenel Primary School" ...
## $ School_Type : Factor w/ 5 levels "Language","Primary",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Address_Line_1 : chr "21 Thomson Street" "Frank Street" "118 Barnett Street" "40 Anderson Street" ...
## $ Address_Line_2 : chr NA NA NA NA ...
## $ Address_Town : chr "Alberton" "Allansford" "Avoca" "Avenel" ...
## $ Address_State : chr "VIC" "VIC" "VIC" "VIC" ...
## $ Address_Postcode: num 3971 3277 3467 3664 3113 ...
## $ Total_enrol : num 66 157 95 107 298 ...
We had the address for schools distributed across five different variables, so in this step we have united all the those variables into a single column called “Address”. After that, we tidied our data to not show the string “NA” in the Address column.
df_tidy <- df %>% unite(Address, Address_Line_1, Address_Line_2, Address_Town,Address_Postcode, Address_State, sep = ",") %>%
mutate(Address = str_replace_all(Address, 'NA,?', ''))
head(df_tidy)
In this step, we have created a new variable called Density using the mutate function. Density represents the total number of enrolments of a school divided by the school’s floor area.
df_new <- df_tidy %>% mutate(Density = Total_enrol/Total_area)
head(df_new)
We used colSums and is.na to have a look at the number of missing values in each attribute. After looking at which columns had missing values, we removed the missing values for the column Total Area because there is only one in our dataframe (<5%). In the end we used the colSums function again to show that we have dealt with the missing values and there are none of them in our dataframe. The dimension of the new dataframe (df_rmNA) is also showed.
colSums(is.na(df_new))
## School_ID Total_area School_Name School_Type Address Total_enrol
## 0 1 0 0 0 0
## Density
## 1
df_rmNA <- na.omit(df_new)
colSums(is.na(df_rmNA))
## School_ID Total_area School_Name School_Type Address Total_enrol
## 0 0 0 0 0 0
## Density
## 0
dim(df_rmNA)
## [1] 1492 7
We have used mvn function to scan our data for possible outliers as there is a relation (Density) between Total_area and Total_enrol. At the beginning, we found the relation is similar to logarithm, so we used mvn with log function to find the possible outliers. After figuring out outliers in our dataset, we deleted the outliers because there only 90 outliers in the dataframe (90/1493 ~= 5%).
area_enrol <- df_rmNA %>% select(Total_area, Total_enrol)
results1 <- mvn(data = area_enrol, multivariateOutlierMethod = "quan", showOutliers = TRUE)
results2 <- mvn(data = log(area_enrol), multivariateOutlierMethod = "quan", showOutliers = TRUE, showNewData = TRUE)
df2 <- exp(results2$newData)
df2 <- df2 %>% mutate(Density = Total_enrol/Total_area)
The variable we have used for transformation is Density. The transformation we have used in this step is the log transformation and we have used it to eliminate the skewness of Density. In the end we have normalised the histogram for better representation.
hist(df2$Density)
density_log <- log(df2$Density)
hist(density_log)
z_density_log <- scale(density_log, center = TRUE, scale = TRUE)
hist(z_density_log)