library(readr)
library(readxl)
library(tidyr)
library(dplyr)
library(stringr)
library(forecast)
library(magrittr)
Please note that while magrittr was not listed in the assignment specification, Dr Taheri confirmed on the discussion board that explicitly loading the magrittr package to enable the use of the %<>% pipe is acceptable.
The data pre-processing in this assignment intends to bring together measures of housing affordability for Victorian locations with measures of the socio-economic advantage of those areas.
Two primary data sets are used, the Index of Relative Socio-economic Advantage and Disadvantage (IRSAD) published by the ABS, and the Victorian Government’s annual data on median unit prices for the past decade. To enable the join between these data sets, a third source is used that contains a list of all postcodes and the suburb they are associated with (from the GeoNames geographical database).
This data is loaded and inspected, and the unit price data is tidied using the tidyr and dplyr packages. Data types are examined and amended where appropriate, and the sources are subsetted to only the relevant information. Following string manipulations to ensure consistency between the different data sets, the data is successfully joined. The data is scanned for missing and special values, and none are found. Univariate outliers are identified using the boxplot function, and an example is provided of how these outliers could be excluded if that proved advantageous for the type of analysis intended to follow the preparation of the data set.
New variables are created based on the compiled data, creating a percentile view of the median unit price in each suburb, and then showing the gap between each area’s relative unit price and the area’s relative socio-economic advantage or disadvantage.
Finally, the Box-Cox transformation is used to create a more-normalised variable based on the median unit price. Following the pre-processing steps described above, the output data is ready for further analysis and investigation; however, no such analysis is included, as it is beyond the scope of this report.
The aim of the data preprocessing in this assignment is to create a dataset that will enable analysis to be undertaken to investigate the relationship between housing affordability and wellbeing, through comparison of housing prices in Victoria with governmental measures of socio-economic advantage or disadvantage.
Three data sets will be used as the source data for this preprocessing. These sources will be imported as tibbles, a data structure comparable to a data frame in that it allows columns of different data types to coexist in the same table. As the data sets contain a mix of numeric, string and categorical variables, a flexible data structure such as a data frame or tibble is essential to ensure critical information is not lost.
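As a brief illustration only (not drawn from the assignment data), the sketch below shows how a tibble can hold numeric, character and factor columns side by side. The values are hypothetical, and the tibble constructor comes from the tibble package.

library(tibble)
# Illustrative only: a tibble mixing numeric, character and factor columns
illustrative_tbl <- tibble(postcode = c(3000, 3068),                   # numeric
                           suburb = c("MELBOURNE", "CLIFTON HILL"),    # character string
                           state = factor(c("VIC", "VIC")))            # categorical (factor)
str(illustrative_tbl)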
SOCIO-ECONOMIC INDICATORS
The measure that will be used to analyse socio-economic advantage and disadvantage is the Index of Relative Socio-economic Advantage and Disadvantage (IRSAD), published by the Australian Bureau of Statistics (ABS). This data was most recently published in 2018, based on analysis of the 2016 Australian census. The IRSAD is an index designed to compare an area’s relative socio-economic advantage (or disadvantage), defined by the ABS as “people’s access to material and social resources, and their ability to participate in society” (ABS, 2018).
SOURCE OF DATA
Australian Bureau of Statistics. (2018). Index of Relative Socio-economic Advantage and Disadvantage (IRSAD) [Data cube]. Census of Population and Housing: Socio-Economic Indexes for Areas (SEIFA), Australia, 2016. Retrieved September 30, 2020, https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/2033.0.55.0012016?OpenDocument. Released under a Creative Commons Attribution 4.0 license.
LOADING THE DATA
As the ABS releases this data in a rather messy Excel file (with multiple blank columns, rows of commentary, etc.), the data will be read in using the readxl package, skipping the blank columns and rows of commentary.
IRSAD2016 <- read_excel("2033055001 - poa indexes.xls",
sheet = "Table 3", col_types = c("numeric",
"numeric", "numeric", "skip", "numeric",
"numeric", "numeric", "skip", "text",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"text", "text"),
col_names = c("2016PostCode","UsualResidentPopltn","Score","AusRank",
"AusDecile","AusPercentile","State","StateRank",
"StateDecile","StatePercentile","MinScoreSA1s",
"MaxScoreSA1s","PropWithoutSA1Score","CautionNeeded",
"CrossesBoundaries"),skip = 6)
| Variable | Description |
|---|---|
| 2016PostCode | The postcode of an area based on the Australian Statistical Geographical Standard (ASGS) |
| UsualResidentPopltn | Usual resident population - count of population based on the census night reporting of usual residency location |
| Score | The IRSAD index, a lower score indicates relative disadvantage compared to areas with higher scores. The index is standardised with a mean of 1000 and a standard deviation of 100 |
| AusRank | Each area’s rank against all other postcode areas in Australia, with the lowest number representing the lowest rank (i.e. the lowest score in Australia would be #1) |
| AusDecile | The decile into which the area’s national ranking falls, from the lowest 10% in decile “1” up to the top 10% in decile “10” |
| AusPercentile | The percentile into which the area’s national ranking falls, from the lowest percentile of “1” up to the top of “100” |
| State | Abbreviated code for the area’s state |
| StateRank | Each area’s rank against all other postcode areas in their same state, with the lowest number representing the lowest rank (i.e. the lowest score would be #1) |
| StateDecile | The decile into which the area’s state ranking falls, from the lowest 10% in decile “1” up to the top 10% in decile “10” |
| StatePercentile | The percentile into which the area’s state ranking falls, from the lowest percentile of “1” up to the top of “100” |
| MinScoreSA1s | Where a postal area is based on multiple Statistical Area Level 1 (SA1) locations, the lowest IRSAD score for an SA1 within the postcode |
| MaxScoreSA1s | Where a postal area is based on multiple Statistical Area Level 1 (SA1) locations, the highest IRSAD score for an SA1 within the postcode |
| PropWithoutSA1Score | The percentage of the usual residents in the postal area who did not have an associated SA1 score (due to low populations, high levels of non-response to the census, etc.) |
| CautionNeeded | “Y” if caution should be used for the area - if it is not considered by the ABS to be well represented by the SA1s that contributed to the calculations of the index (null values provided if no concern flagged) |
| CrossesBoundaries | “Y” if the area crosses state or territory boundaries (null values otherwise) |
HEAD OF THE DATA SET
head(IRSAD2016, n=4L)
SUBSETTING
As the scope of the intended analysis is only for Victoria, the dataset will be filtered to only include Victorian data.
IRSAD2016 %<>% filter(State == "VIC")
MEDIAN VICTORIAN UNIT PRICES
The Victorian State Government’s Department of Environment, Land, Water and Planning collates and publishes sales data for residential and commercial properties across the state. For this preprocessing, the pricing data that will be referred to is the median unit price, by suburb, by year. As the intent is to analyse affordability, it makes sense to base the analysis on unit prices rather than house prices, as units tend to be substantially more affordable than standalone houses while providing the same access to the relative advantage or disadvantage of the area where they are located.
SOURCE
Annual property sales data, Units by Suburb 2009-2019 (2020), State of Victoria (Department of Environment, Land, Water and Planning), accessed October 10, 2020, https://www.propertyandlandtitles.vic.gov.au/property-information/property-prices#navigation. Released under a Creative Commons Attribution 4.0 license.
LOADING THE DATA
As with the ABS data, the readxl package will again be used to load the data, skipping the rows of commentary that sit above the data in the spreadsheet.
MedianUnitPrices <- read_excel("Suburb_Unit_Final.xls",
col_names = c("locality",c(2009:2019),"prelim2020",
"Change_18-19","change09-19","GrowthPA"),
skip = 3)
| Variable | Description |
|---|---|
| locality | Suburb Name |
| Year (from 2009-2019) | The median unit sale price for that suburb during the year indicated in the column name |
| prelim2020 | The median unit sale price for that suburb based on 2020 data available prior to release |
| Change_18-19 | % change in median price year on year |
| change09-19 | % change in median price from 2009 to 2019 |
| GrowthPA | % growth rate in unit prices per annum |
HEAD OF THE DATA SET
head(MedianUnitPrices, n=4L)
POSTCODES DATA
To join the Victorian Government’s suburban unit pricing data with the ABS IRSAD measures, a third data set is required to bridge the suburb names with their respective postcodes.
The dataset with the highest fidelity for this purpose would be the official Australian postcode data, which is sold and licensed by Australia Post. As an alternative, the global postal code data compiled by the GeoNames geographical database will be used, which, unlike the Australia Post data, is released for free under a Creative Commons license.
SOURCE
GeoNames geographical database - All Countries Postal Codes - https://download.geonames.org/export/zip/, retrieved 10 October 2020. Released under a Creative Commons Attribution 4.0 license.
LOADING THE DATA
The GeoNames extract of all global postcodes is a very large file (over 1.5 million rows of data), so the read_tsv function from the readr package will be used to load the data, as it is considerably faster than the base R functions for reading tab-separated files.
postcodesall <- read_tsv('allCountriespostcodes.txt',
col_names = c("country_code","postal_code","place_name",
"admin_nm_1","admin_cd_1","admin_nm_2",
"admin_cd_2","admin_nm_3","admin_cd_3",
"latitude","longitude","accuracy"))
| Variable | Description |
|---|---|
| country_code | International Organization for Standardization (ISO) country code |
| postal_code | The post code for each area (based on each country’s formatting) |
| place_name | The name of the region associated with the post code (essentially the suburb in Australian terms) |
| admin_nm_1 | State |
| admin_cd_1 | Abbreviated code for state |
| admin_nm_2 | County/province (where relevant) |
| admin_cd_2 | Code for county/province (where relevant) |
| admin_nm_3 | Community (where relevant) |
| admin_cd_3 | Code for community (where relevant) |
| latitude | Estimated latitude |
| longitude | Estimated longitude |
| accuracy | Category of estimated accuracy of latitude and longitude |
HEAD OF THE DATA SET
head(postcodesall, n=4L)
SUBSETTING
As the scope of the intended analysis is only for Victoria, the dataset will be filtered to only include Victorian data.
postcodesaus <- filter(postcodesall, country_code == "AU")
vicpostcodes <- filter(postcodesaus, admin_nm_1 == "Victoria")
Additionally, a number of the variables are unnecessary for the join, so the dataset will now be simplified to include only the critical information: the Victorian postcodes and their suburb names.
vicpostcodes <- vicpostcodes[,c(2:3)]
head(MedianUnitPrices, n=4L)
The Median Unit Prices Data is presently not tidy. For a data set to be tidy each variable must have its own column, each observation must have its own row, and each value must have its own cell (Wickham, 2014 p.4), whereas the pricing data currently has 12 observations of price on each row, as well as three calculated fields showing rates of change and growth for each suburb.
As the analysis will only be looking at the pricing data, the data will first be subsetted to include only the median price measures.
MedianUnitPrices <- MedianUnitPrices[,c(1:13)]
head(MedianUnitPrices, n=4L)
The pricing data is now isolated in wide format, and can be tidied using the gather function from the tidyr package.
tidymedianunits <- gather(MedianUnitPrices, c(2:13), key = "Year", value = "MedianUnitPrice")
head(tidymedianunits, n=4L)
The data is now successfully tidied in long format. As only the most recent full-year data (the 2019 observation) will be used for the analysis, the data will be filtered to only the observations from 2019 using the filter function from the dplyr package.
tidymedianunits %<>% filter(Year == 2019)
head(tidymedianunits, n=4L)
This data set is now appropriately tidied, but the variables are not yet in their optimal format, which shall be rectified below using the as.numeric and as.integer functions.
tidymedianunits$MedianUnitPrice <- as.numeric(tidymedianunits$MedianUnitPrice)
tidymedianunits$Year <- as.integer(tidymedianunits$Year)
head(tidymedianunits, n=4L)
With the median unit price data now tidied and ready, the other data sets will be examined.
head(IRSAD2016, n=4L)
The data types are largely appropriate for the IRSAD observations, with the exception of the State attribute, which would ordinarily be converted to a factor. However, the processing will proceed by subsetting this data and removing that attribute: as all observations are from Victoria, it is redundant.
As the analysis will only be looking at a comparison within Victoria, the data will be cut to remove the information about each region’s relative Australian ranking, keeping only the postcode, the raw IRSAD score, and the Victorian ranking, decile and percentile.
VicIRSAD <- IRSAD2016[,c(1,3,8:10)]
Finally, the postcode data will be examined.
head(vicpostcodes, n=4L)
The postal_code attribute appears as a character variable rather than a number. While postcodes are technically a categorical variable and not a numeric value, the type mismatch would cause problems when the datasets are joined and filtered in later steps, so it will be converted to a numeric value now. This is rectified later, when the variable is converted to a labelled factor following the joins.
vicpostcodes$postal_code %<>% as.numeric()
As the joining of the datasets will rely on the suburb names, this field will be converted to upper case and trimmed in both the postcode data and the median unit price data (using toupper from base R and str_trim from the stringr package), to ensure that no rogue case differences or white space inconsistencies prevent the success of the join.
vicpostcodes$place_name %<>% toupper()
vicpostcodes$place_name %<>% str_trim(side = "both")
tidymedianunits$locality %<>% toupper()
tidymedianunits$locality %<>% str_trim(side = "both")
head(vicpostcodes, n=4L)
head(tidymedianunits, n=4L)
With the data sets now appropriately formatted and simplified, the data is almost ready to be joined. First, the nrow and complete.cases functions will be used to confirm that there are no missing values in any of the data sets, by comparing the total number of rows with the number of rows that are complete cases (i.e. rows with no missing values). If the following comparisons return “TRUE”, there are no missing values in the corresponding data frame.
nrow(vicpostcodes) == (complete.cases(vicpostcodes) %>% sum())
## [1] TRUE
nrow(tidymedianunits) == (complete.cases(tidymedianunits) %>% sum())
## [1] TRUE
nrow(VicIRSAD) == (complete.cases(VicIRSAD) %>% sum())
## [1] TRUE
This has confirmed that there is a complete case for each row, so the process can confidently continue on to joining the data.
The first join to be undertaken will attach the postcode associated with each of the suburbs in the median unit price data. Prior to attempting the join, the anti_join function will be used to check whether there are any suburb values in the pricing data that cannot be found in the postcode data.
Whatsmissing <- anti_join(tidymedianunits,vicpostcodes, by = c("locality" = "place_name"))
head(Whatsmissing, n=4L)
nrow(Whatsmissing)
## [1] 23
Manual inspection of the anti-join reveals that 23 suburbs could not be matched to a postcode, largely due to formatting differences.
The missing postcodes have been manually compiled, using the GeoNames data where the mismatches were due to obvious name formatting differences, and using manual lookups by the author in the whitepages.com.au business address directory to find current postcodes for the few suburbs where values were still missing.
This information will be loaded using the readxl package and combined into the postcodes dataset using the dplyr bind_rows function. The dplyr filter function will then be used to remove commercial postcodes (values of 8000+) as they are irrelevant to analysis of residential areas.
SuppPostcodes <- read_excel("Supplementary Vic Postcodes.xlsx")
head(SuppPostcodes, n=4L)
fullPostCodes <- bind_rows(vicpostcodes,SuppPostcodes)
fullPostCodes %<>% filter(postal_code <7999)
summary(fullPostCodes$postal_code)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3000 3294 3485 3506 3723 3996
Now that there is a full set of Victorian residential postcodes, the postcode data can be joined to the unit price data using a left join.
FirstJoin <- tidymedianunits %>% left_join(fullPostCodes, by = c("locality" = "place_name"))
head(FirstJoin, n=4L)
With the first join successfully completed, the associated IRSAD data for each suburb can now be joined in. This will use an inner_join function to ensure that any suburbs that are associated with many postcodes are only retained for further analysis where there is a match with a residential area in the IRSAD data.
COMPILEDVICDATA <- FirstJoin %>% inner_join(VicIRSAD, by = c("postal_code" = "2016PostCode"))
nrow(COMPILEDVICDATA)
## [1] 452
The final output has two more rows than the initial median unit price data set (452 vs the initial 450), which will be investigated to understand what has caused this variance, and whether it indicates anything concerning.
Firstly, the all.equal function will be used in combination with the unique function to see whether the suburb values themselves have remained identical through the joins, or whether any have either changed or been lost through the process.
all.equal(unique(tidymedianunits$locality),unique(COMPILEDVICDATA$locality))
## [1] TRUE
This confirms that the values are the same, which implies that some values have been duplicated through the process, likely due to multiple postcodes being associated with a single suburb. This will be confirmed by using the dplyr group_by, summarise and arrange functions to show which values have been duplicated.
COMPILEDVICDATA %>% group_by(locality) %>% summarise(n = n()) %>% arrange(desc(n)) %>% head(n = 4L)
The above code confirms that the only differences in the suburb values between the initial dataset and the output of the joins is that there are now two rows each for Benalla and Melbourne. This is due to these suburbs spanning two postcodes while retaining the same suburb name. As each postcode has a discrete IRSAD score, and the median unit price for the suburb as a whole is likely still the best available indicator of expected house prices for the area, these additional two rows will be retained.
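For completeness, the specific duplicated rows can also be inspected directly; the short check below uses the dplyr filter function, with the two suburb names taken from the duplication summary above.

# Inspect the rows for the two suburbs that span multiple postcodes
COMPILEDVICDATA %>% filter(locality %in% c("BENALLA", "MELBOURNE"))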
The data has been successfully joined, and the remainder of the preprocessing activities can now take place on the combined dataset.
head(COMPILEDVICDATA, n=4L)
While postcodes have been treated as numbers in previous steps to assist with joining the data and filtering out non-residential postcodes, a postcode is technically a categorical variable. Accordingly, it will now be converted into a labelled factor to ensure it is not incorrectly used in future analysis for nonsensical numeric calculations or transformations.
COMPILEDVICDATA$postal_code <- factor(COMPILEDVICDATA$postal_code,
levels = c(3000:3999),labels = c(3000:3999))
In the process of joining the data sets together, the “Year” variable has become unclear and potentially misleading - it refers to the year of the median unit price observation, not to the IRSAD score (which is drawn from 2016 census data). This will be addressed by removing the standalone variable and renaming the price variable so that the year of record is not lost from the data.
COMPILEDVICDATA <- COMPILEDVICDATA[,c(1,3:8)]
COMPILEDVICDATA %<>% rename(MedianUnPrce19 = MedianUnitPrice)
head(COMPILEDVICDATA, n=4L)
While each of the data sets was individually checked for missing values prior to undertaking the joins, as a matter of caution the compiled dataset will now be checked using the same methodology (comparing the count of rows with the count of complete cases).
nrow(COMPILEDVICDATA) == (complete.cases(COMPILEDVICDATA) %>% sum())
## [1] TRUE
Having confirmed there are no missing values, the next check will be to see whether there are any special values (infinite numbers or non-number/NaN values). The sapply function is used to apply the cell-level “is.infinite” and “is.nan” functions to the whole dataset.
sapply(COMPILEDVICDATA, is.infinite) %>% sum()
## [1] 0
sapply(COMPILEDVICDATA, is.nan) %>% sum()
## [1] 0
Having confirmed that there are no special values of concern, the preprocessing can continue.
The data will be scanned for outliers by using the summary function to output quartiles, minimum and maximum values, and measures of central tendency, which will be checked visually for anything that looks particularly odd.
summary(COMPILEDVICDATA)
## locality MedianUnPrce19 postal_code Score
## Length:452 Min. : 135000 3350 : 12 Min. : 802
## Class :character 1st Qu.: 380000 3550 : 10 1st Qu.: 956
## Mode :character Median : 510000 3977 : 7 Median :1011
## Mean : 511225 3219 : 6 Mean :1011
## 3rd Qu.: 627625 3012 : 4 3rd Qu.:1071
## Max. :1210000 3020 : 4 Max. :1160
## (Other):409
## StateRank StateDecile StatePercentile
## Min. : 2.0 Min. : 1.000 Min. : 1.0
## 1st Qu.:187.0 1st Qu.: 3.000 1st Qu.: 27.0
## Median :421.0 Median : 7.000 Median : 61.0
## Mean :388.6 Mean : 6.055 Mean : 56.4
## 3rd Qu.:591.5 3rd Qu.: 9.000 3rd Qu.: 86.0
## Max. :694.0 Max. :10.000 Max. :100.0
##
No concerning trends are immediately apparent in the quick summary - in particular, all decile and percentile values appear within the expected range. While there is some duplication of postcodes, this is to be expected due to the reality of many suburbs sharing postcodes, and is not due to any error in the data.
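As a supplement to the visual inspection, the expected ranges of the decile and percentile variables can also be verified programmatically; the small sketch below uses the between function from dplyr and is illustrative rather than part of the required checks.

# Confirm deciles fall within 1-10 and percentiles within 1-100
all(between(COMPILEDVICDATA$StateDecile, 1, 10))
all(between(COMPILEDVICDATA$StatePercentile, 1, 100))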
Next, each of the numeric variables will be checked for univariate outliers using Tukey’s method, as built into the boxplot function, which also provides a quick visual check of whether any outliers are present in the distribution of each variable.
par(mfrow = c(2,2))
boxplot(COMPILEDVICDATA$Score, main = "IRSAD Score")
boxplot(COMPILEDVICDATA$StatePercentile, main = "State Percentile")
boxplot(COMPILEDVICDATA$StateDecile, main = "State Decile")
boxplot(COMPILEDVICDATA$StateRank, main = "State Rank")
par(mfrow = c(1,1))
bplot <- boxplot(COMPILEDVICDATA$MedianUnPrce19, main = "Median Unit Price 2019", ylab = "Median Price ($)")
No outliers were found in the values associated with the IRSAD outputs, but a small number were found in the median price data. The inbuilt boxplot attributes will be used to see what the calculated boundaries are for the data and which values exactly were interpreted as outliers.
bplot$stats
## [,1]
## [1,] 135000
## [2,] 380000
## [3,] 510000
## [4,] 627750
## [5,] 990000
bplot$out
## [1] 1144500 1010000 1038000 1210000 1060000 1041500 1085500
The plot indicates that there are seven outliers in the data, all of which display median unit prices above the upper whisker of $990,000. A view of the data will be created using the filter function (from the dplyr package) to look at the specific suburbs associated with these values.
PriceOutliers <- filter(COMPILEDVICDATA, MedianUnPrce19 >990000)
PriceOutliers
While these values are notably high, it appears that they are indeed not in error, and the analysis to be undertaken with this data could be detrimentally impacted by removing these values. While the rest of the preprocessing steps in this assignment will proceed with the full data set, the below code is an example to show how these values could be easily excluded using the filter function if it was so desired.
DataNoOutliers <- filter(COMPILEDVICDATA, MedianUnPrce19 <=990000)
head(DataNoOutliers, n=4L)
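Equivalently, the fences used by Tukey’s method can be computed directly from the quartiles. The sketch below is illustrative only, applying the 1.5 × IQR rule with the default quantile calculation (which may differ very slightly from the hinges used by boxplot), and should flag the same seven suburbs.

# Tukey's rule: values beyond Q3 + 1.5*IQR (or below Q1 - 1.5*IQR) are treated as outliers
q1 <- quantile(COMPILEDVICDATA$MedianUnPrce19, 0.25)
q3 <- quantile(COMPILEDVICDATA$MedianUnPrce19, 0.75)
iqr_price <- q3 - q1
COMPILEDVICDATA %>% filter(MedianUnPrce19 > q3 + 1.5 * iqr_price |
                           MedianUnPrce19 < q1 - 1.5 * iqr_price)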
The data set at present contains a relative measure of socio-economic advantage, but only an absolute value for unit prices. To enable comparative analysis, a relative comparison point for the property values will be created - a percentile ranking of the median unit price - using the mutate and percent_rank functions from the dplyr package.
COMPILEDVICDATA %<>% mutate(UnitPricePercentile = (percent_rank(MedianUnPrce19)*100))
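For context, percent_rank rescales each value’s minimum rank to the interval [0, 1]; an equivalent manual calculation is sketched below for illustration only (assuming no missing prices, as confirmed earlier), using dplyr’s min_rank.

# Manual equivalent of percent_rank: (min_rank - 1) / (n - 1), scaled to 0-100
manual_percentile <- (min_rank(COMPILEDVICDATA$MedianUnPrce19) - 1) /
  (nrow(COMPILEDVICDATA) - 1) * 100
all.equal(manual_percentile, COMPILEDVICDATA$UnitPricePercentile)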
Next, a variable will be created (again using mutate) to compare the newly created percentile ranking of the median unit price in each area with the suburb’s associated ranking of relative socio-economic advantage and disadvantage.
This will provide a comparison point for the difference between the relative cost of purchasing a unit in an area vs the relative socio-economic advantage/disadvantage associated with that locality.
COMPILEDVICDATA %<>% mutate(PercentileDifference = (StatePercentile - UnitPricePercentile))
head(COMPILEDVICDATA, n=4L)
hist(COMPILEDVICDATA$MedianUnPrce19, main = "Histogram of non-transformed data")
The median unit price distribution is right-skewed, which would make many statistical analysis techniques that rely on a normal distribution problematic.
A selection of transformation methods intended to bring this variable closer to a normal distribution will be illustrated. The outputs will be compared to determine which method is most successful, and that transformation will be added to the dataset.
As the source data is right-skewed, the three transformations that will be compared are a Box-Cox transformation, a reciprocal transformation, and a natural logarithm transformation.
par(mfrow = c(2,2))
hist(COMPILEDVICDATA$MedianUnPrce19, main = "Histogram of non-transformed data")
boxcoxunitprice <- BoxCox(COMPILEDVICDATA$MedianUnPrce19,lambda = "auto")
hist(boxcoxunitprice)
attr(boxcoxunitprice, which = "lambda")
## [1] 0.2478451
reciprocalunitprice <- (COMPILEDVICDATA$MedianUnPrce19^(-1))
hist(reciprocalunitprice)
ln_unitprice <- log(COMPILEDVICDATA$MedianUnPrce19)
hist(ln_unitprice)
From a comparison of the outputs it appears that the Box-Cox transformation is the most effective option, so the Box-Cox transformed data will be added to the data set.
COMPILEDVICDATA$TransformedUnitPrice <- BoxCox(COMPILEDVICDATA$MedianUnPrce19,lambda = "auto")
par(mfrow = c(1,1))
hist(COMPILEDVICDATA$TransformedUnitPrice, main = "Box-Cox Transformed Unit Price (lambda = 0.2478451)")
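For reference, for strictly positive values and a non-zero lambda, the Box-Cox transformation in the forecast package corresponds to (y^lambda - 1) / lambda. The sketch below is a manual verification for illustration only, reading lambda from the attribute shown earlier and assuming this parameterisation.

# Manual Box-Cox for positive y and lambda != 0: (y^lambda - 1) / lambda
lambda <- attr(boxcoxunitprice, which = "lambda")
manual_boxcox <- (COMPILEDVICDATA$MedianUnPrce19^lambda - 1) / lambda
all.equal(manual_boxcox, as.vector(COMPILEDVICDATA$TransformedUnitPrice))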
The transformation of the variable is completed, and the data set is now fully compiled and ready for further analysis, exploration and visualisation.
head(COMPILEDVICDATA, n=4L)
Australian Bureau of Statistics. (2018). 2033.0.55.001 - SEIFA 2016 Technical Paper. Retrieved October 18, 2020, from https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/2033.0.55.0012016?OpenDocument
Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1 - 23.