library(readxl) # for xls files
library(dplyr) # for pipe, select, filter etc
library(tidyr) # for replacing missing data
library(infotheo) # for binning
library(knitr) # for tables
A dataframe has been created, uniting median house prices with the number of active retailing liquor licences, in Victorian suburbs as of March 2020.
Two source datasets have been obtained from government sites. They have been scanned for accuracy, and read in, applying changes to variable types and column names as required. Observations and variables have been filtered out that are not applicable to the final purpose.
Before merging by suburb, considerable effort has been made to identify and, where appropriate, replace missing Suburb values, both standard and non-standard, using summarising, factoring, filtering joins, and subsetting methods. The datasets have been merged, and then grouped by suburb and summarised to produce a merged dataframe, consisting of 744 observations of 5 variables.
The source dataset for house prices has been discussed as relates to tidy data principles. Manipulation of the dataset has been made to demonstrate a tidy dataframe.
Two new numeric variables have been created and added to the merged dataframe, creating the final dataframe of 744 observations of 7 variables. All variables of this dataframe have been scanned for missing values, and replaced as appropriate. Numeric variables have been scanned for special values and validity checks have been applied.
Numeric variables have been checked for outliers with boxplots, and the distribution shape of the data has been identified as right-skewed.
Two transformation techniques have been applied to the Quarterly Median House Price variable. A natural logarithmic transformation has brought the data into a more symmetrical distribution, as shown in a histogram. A discretisation technique has distributed the values evenly between categories, as shown in a dotplot.
The datasets chosen are “Quarterly Property Sales: Houses by Suburb” and “Victorian Liquor Licences by Location”.
The dataset reports on the median sale prices of houses by suburb in Victoria, over a period of 15 months.
The data is updated quarterly by the Victorian Valuer-General, as part of the Victorian Property Sales Report. The latest figures are current up to March 2020 (as released in September 2020). It is made available by the Victorian State Government, Department of Environment, Land, Water and Planning (DELWP 2020).
The dataset contains 754 observations over 10 variables:
Suburb - A suburb within Victoria.
Jan - Mar 19 - Median sale price of houses within the 1st quarter of 2019
Apr - Jun 19 - Median sale price of houses within the 2nd quarter of 2019
Jul - Sep 19 - Median sale price of houses within the 3rd quarter of 2019
Oct - Dec 19 - Median sale price of houses within the 4th quarter of 2019
Jan - Mar 19 - Median sale price of houses within the 1st quarter of 2020
No of Sales Jan - Mar 2020 - Total number of sales within the first quarter of 2020
No of Sales YTD - Cummulative total number of sales in the current year
Change % Jan - Mar 19 Jan - Mar 2020 - Percentage change in house prices between the 1st quarters of 2019 and 2020
% Change Dec 2019 - Mar 2020 - Percentage change in house prices between the 4th quarter of 2019 and 1st quarter of 2020
NB. The last variable name is incorrect in the source dataset. The correct variable name has been obtained from the published report, Victorian Property Sales Report: March 2020 Quarter (Department of Environment, Land, Water and Planning 2020, p.3), and is corrected when reading in the data using the col_names argument.
The dataset is provided as an Excel file with one sheet of data. Although the data begins on row 1, the variable names span rows 1 to 4. A footnote is given at the end of the data on row 759. To import the data efficiently, the range and col_names arguments are used, and variable names have been simplified.
The dataframe is saved as House_Sales.
# Read in data, limiting to a range, and replacing variable names.
House_Sales <- read_excel("C:/R/Learning R/MATH2349/Datasets/Houses_1st_Quarter_2020.xls", range = "A5:J758",
col_names = c("Suburb", "Q1 2019", "Q2 2019", "Q3 2019", "Q4 2019", "Q1 2020",
"Sales Q1 2020", "Sales YTD", "% Change 2019-2020", "% Change Q1 2020"))
head(House_Sales)
The dataset is a stocktake of all active liquor licences across Victoria, current as of 31 March 2020. It is made available by the Victorian Commission for Gambling and Liquor Regulation (VCGLR 2020).
The dataset contains 23448 observations over 19 variables:
Licence Num - An 8 digit licence number
Licensee - Name of licence holder, which may be an individual or business
Trading As - Trading name of licensee
Category - Type of licence issued
Trading Hours - Licensed trading hours
After 11pm - Number of hours licensed to trade after 11pm, where 0 is none and 99 is not applicable
Maximum Capacity - Maximum capacity of business
Address - Street address of business
Suburb - Suburb of business
Postcode - Postcode of business
Latitude - Location of business, co-ordinate of latitude
Longitude - Location of business, co-ordinate of longitude
Postal Address - Licensee’s postal address, street
Postal Suburb - Licensee’s postal address, suburb
Postal Postcode - Licensee’s postal address, postcode
Council - Local Government Area
Region - Region within Victoria
Metro/Regional - Region category, whether Metro or Regional
Gaming - Gaming venue, (Y)es or (N)o
The dataset is provided as an Excel file with one sheet of data, starting at row 4. Columns “M” and “N” are hidden, merged within column “L”. They contain no data, and are therefore excluded using the col_types argument.
The dataframe is saved as LLicences.
# Read in data, starting at row 4, and skipping columns M and N
LLicences <- read_excel("C:/R/Learning R/MATH2349/Datasets/liquor_licences_march_2020.xlsx", skip = 3,
col_types = c("text", "text", "text", "text", "text", "numeric", "numeric", "text", "text", "text",
"numeric", "numeric", "skip", "skip", "text", "text", "text", "text", "text", "text",
"text"))
head(LLicences)
The datasets will be merged in Data - Part 2.
House_Sales has 754 observations over 10 variables. Columns 2-6 and 8-10 contain numeric values, but have been read in as character types. This corresponds to the formating of the columns in the source spreadsheet, where column 7 is formatted as Numeric, and all else are formatted as General.
Sales Q1 2020 has been read in as a double-type numeric. As it is a count of house sales, it can be converted to an integer-type numeric.
The variables Q1 2019, Q2 2019, Q3 2019, Q4 2019, Q1 2020, % Change 2019-2020, and % Change Q1 2020 have been converted to double numeric. Sales Q1 2020 and Sales YTD have been converted to integer numeric.
Suburb variable should not be factored, as each of its values in this dataset is unique.
# convert from character type to numeric
House_Sales$`Q1 2019` <- as.numeric(House_Sales$`Q1 2019`)
House_Sales$`Q2 2019` <- as.numeric(House_Sales$`Q2 2019`)
House_Sales$`Q3 2019` <- as.numeric(House_Sales$`Q3 2019`)
House_Sales$`Q4 2019` <- as.numeric(House_Sales$`Q4 2019`)
House_Sales$`Q1 2020` <- as.numeric(House_Sales$`Q1 2020`)
House_Sales$`Sales Q1 2020` <- as.integer(House_Sales$`Sales Q1 2020`)
House_Sales$`Sales YTD` <- as.integer(House_Sales$`Sales YTD`)
House_Sales$`% Change 2019-2020` <- as.numeric(House_Sales$`% Change 2019-2020`)
House_Sales$`% Change Q1 2020` <- as.numeric(House_Sales$`% Change Q1 2020`)
str(House_Sales)
## tibble [754 x 10] (S3: tbl_df/tbl/data.frame)
## $ Suburb : chr [1:754] "ABBOTSFORD" "ABERFELDIE" "AINTREE" "AIREYS INLET" ...
## $ Q1 2019 : num [1:754] 992500 1420000 575000 925000 749000 ...
## $ Q2 2019 : num [1:754] 982500 1172500 570000 719000 758000 ...
## $ Q3 2019 : num [1:754] 1140000 1352500 575000 700000 810500 ...
## $ Q4 2019 : num [1:754] 1027500 1650000 620400 1400000 851000 ...
## $ Q1 2020 : num [1:754] 1175000 1665000 675000 1005000 797000 ...
## $ Sales Q1 2020 : int [1:754] 15 7 21 8 21 14 24 7 18 51 ...
## $ Sales YTD : int [1:754] 15 7 21 8 21 14 24 7 18 51 ...
## $ % Change 2019-2020: num [1:754] 18.4 17.3 17.4 8.6 6.4 6.1 0.6 10.8 9.6 7.5 ...
## $ % Change Q1 2020 : num [1:754] 14.4 0.9 8.8 -28.2 -6.3 6.7 -22.6 8.4 19.5 3.1 ...
LLicences has 23,448 observations of 19 variables. Variables have been converted to the correct types when reading in the data.
Category, Region, Metro/Regional and Gaming variables have been factored. The Gaming variable has also been labelled.
The Suburb variable will be factored last, before merging, as a step in finding non-standard missing values (See Scan I).
# factor variables
LLicences$Category <- factor(LLicences$Category)
LLicences$Region <- factor(LLicences$Region)
LLicences$`Metro/Regional` <- factor(LLicences$`Metro/Regional`)
LLicences$Gaming <- factor(LLicences$Gaming, levels = c("N", "Y"), labels = c("No", "Yes"))
# get structure
str(LLicences)
## tibble [23,276 x 19] (S3: tbl_df/tbl/data.frame)
## $ Licence Num : chr [1:23276] "31100099" "31100120" "31100138" "31100154" ...
## $ Licensee : chr [1:23276] "ARNOLD BRUCE" "OAKLAND HUNT CLUB" "C S I R SKI CLUB" "SKILIB ALPINE CLUB" ...
## $ Trading As : chr [1:23276] "UNIVERSITY SKI CLUB" "OAKLAND HUNT CLUB" "C S I R SKI CLUB" "SKILIB ALPINE CLUB" ...
## $ Category : Factor w/ 14 levels "BYO Permit","Full Club Licence",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Trading Hours : chr [1:23276] "BYO Permit" "BYO Permit" "BYO Permit" "BYO Permit" ...
## $ After 11 pm : num [1:23276] 99 99 99 99 99 99 99 99 99 99 ...
## $ Maximum Capacity: num [1:23276] NA NA NA NA NA NA NA NA NA NA ...
## $ Address : chr [1:23276] "U S C LODGE ALLOTMENTS 15 & 16 ALPINE VILL" "SOMERTON ROAD" "SITE 65 1 SCHUSS STREET" "2 GOAL POSTS RD" ...
## $ Suburb : chr [1:23276] "MT BULLER" "GREENVALE" "FALLS CREEK" "MT BULLER" ...
## $ Postcode : chr [1:23276] "3723" "3047" "3699" "3723" ...
## $ Latitude : num [1:23276] -37.1 -37.6 -36.9 -37.1 -37.9 ...
## $ Longitude : num [1:23276] 146 145 147 146 145 ...
## $ Postal Address : chr [1:23276] "PO BOX 93" "SOMERTON ROAD" "PO BOX 15" "PO BOX 140" ...
## $ Postal Suburb : chr [1:23276] "SANDOWN VILLAGE" "GREENVALE" "CARNEGIE" "CANTERBURY" ...
## $ Postal Postcode : chr [1:23276] "3171" "3047" "3163" "3126" ...
## $ Council : chr [1:23276] "MANSFIELD SHIRE COUNCIL" "HUME CITY COUNCIL" "ALPINE SHIRE COUNCIL" "MANSFIELD SHIRE COUNCIL" ...
## $ Region : Factor w/ 11 levels "Barwon South-West",..: 4 7 4 4 8 6 4 1 4 4 ...
## $ Metro/Regional : Factor w/ 2 levels "Metro","Regional": 2 1 2 2 1 1 2 2 2 2 ...
## $ Gaming : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
Factoring Category has revealed all possible values of licence type.
As licences that don’t permit sale or service of alcohol to the public will not be useful in the final dataframe, licences of that type (Pre-retail) have been filtered out. The filtered result is saved as LL_Retail.
The attributes of the merged data will be examined in Understand - Part 2
# filter out Pre-retail Licence rows and check result with a table
LL_Retail <- LLicences %>% filter(Category != "Pre-retail Licence")
table(LL_Retail$Category)
##
## BYO Permit Full Club Licence
## 1110 738
## General Licence Late night (general) Licence
## 1695 470
## Late night (on-premises) Licence Late night (packaged liquor) Licence
## 355 3
## Limited Licence On-premises Licence
## 5402 1
## On-Premises Licence Packaged Liquor Licence
## 2024 2141
## Pre-retail Licence Producer's Licence
## 0 913
## Restaurant and cafe Licence Restricted Club Licence
## 6503 862
The datasets will be matched by Suburb. Therefore, the suburb values in both datasets have been made as complete and accurate as possible.
A count has been taken of the number of standard missing values in Suburb (Sullivan 2019). A count of unique values is included as an additional way to understand the variable.
R has found 0 missing suburb values in House_Sales, and 2 missing suburb values in LL_Retail.
# summarise unique values and missing values
House_Sales %>% summarise('data' = "House_Sales -- Suburb",
'unique values' = n_distinct(`Suburb`, na.rm = TRUE),
'missing values' = sum(is.na(`Suburb`)))
# summarise unique values and missing values
LL_Retail %>% summarise('data' = "LL_Retail -- Suburb",
'unique values' = n_distinct(`Suburb`, na.rm = TRUE),
'missing values' = sum(is.na(`Suburb`)))
Factoring the suburb variable can be useful to look for misspellings and missing values that are not obvious (Cai 2015). It isn’t useful to factor the suburb variable in House_Sales, as each suburb is only included once. But it is useful in *LL_Retail**, as there are far fewer unique values than observations.
Suburb has been factored, and then sorted, bringing to the top [Not applicable] -, and [NOT APPLICABLE] -.
A visual scan of the dataframe has also revealed NO FIXED ADDRESS as a missing value.
These have been replaced with NA, so that R can recognise them as missing values.
# find all possible values of Suburb
LL_Retail$Suburb <- factor(LL_Retail$Suburb)
# sort by suburb and display
LL_Retail %>% arrange(Suburb) %>% select(Suburb, `Trading As`, Address, Category) %>% head(6)
# subset to find all instances of missing values and convert to NA
LL_Retail$Suburb[LL_Retail$Suburb == "[Not applicable] -" | LL_Retail$Suburb == "[NOT APPLICABLE] -" |
LL_Retail$Suburb == "NO FIXED ADDRESS"] <- NA
LL_Retail %>% summarise('data' = "LL_Retail -- Suburb",
'unique values' = n_distinct(`Suburb`),
'missing values' = sum(is.na(`Suburb`)))
The data has been filtered to examine the rows with missing suburbs, and appropriate values found by searching the business names online. A visual scan has also found inconsistencies and irregularities in Suburb values. These have been corrected.
To prevent loss of data when inserting values, the data has been filtered and checked for existing suburb values in other rows. MORNINGTON PENINSULA BREWERY has two rows, one with a suburb value which is used to fill the missing value in the other row as well.
Any values that are new to the Suburbs variable have been added as a new factor level (DataMentor n.d).
Rows that cannot be appropriately given a suburb value, due to having no fixed location or point of sale, will necessarily be omitted when merging the datasets.
# Examine rows with missing values
LL_Retail %>% filter(., is.na(Suburb)) %>% select(Suburb, `Trading As`, Category, Address) %>% arrange(`Trading As`)
# filter data to check for duplicate trading names with existing suburbs
LL_Retail %>% filter(`Trading As` == "FESTIVAL CITY WINES" | `Trading As` == "CHIRPING BIRD" | `Trading As` == "GENTLE ANNIE WINES" | `Trading As` == "WAGTAIL WINES" | `Trading As` == "LAURISTON VINEYARD" | `Trading As` == "NAPPA MERRI VINEYARDS" | `Trading As` == "BELLARINE BREWING COMPANY" | `Trading As` == "SAVARA IN BREWERY" | `Trading As` == "BELLINGHAM ESTATE" | `Trading As` == "MORNINGTON PENINSULA BREWERY" | `Trading As` == "DOLPHIN BREWERY" | `Trading As` == "BELVOIR WINES" | `Trading As` == "CAVALIER BREWING" ) %>% select(Suburb, `Trading As`, Category, Address ) %>% arrange(Suburb) %>% head(6)
# add new factor levels
levels(LL_Retail$Suburb) <- c(levels(LL_Retail$Suburb), "RAVENSWOOD", "ARTHURS SEAT", "MORNINGTON", "LAURISTON", "CORADJIL")
# insert suburb values where missing
LL_Retail$Suburb[LL_Retail$`Trading As` == "FESTIVAL CITY WINES"] <- "TRUGANINA"
LL_Retail$Suburb[LL_Retail$`Trading As` == "CHIRPING BIRD"] <- "MOOROODUC"
LL_Retail$Suburb[LL_Retail$`Trading As` == "GENTLE ANNIE WINES"] <- "DOOKIE"
LL_Retail$Suburb[LL_Retail$`Trading As` == "WAGTAIL WINES"] <- "TAGGERTY"
LL_Retail$Suburb[LL_Retail$`Trading As` == "LAURISTON VINEYARD"] <- "LAURISTON"
LL_Retail$Suburb[LL_Retail$`Trading As` == "NAPPA MERRI VINEYARDS"] <- "BUSHFIELD"
LL_Retail$Suburb[LL_Retail$`Trading As` == "BELLARINE BREWING COMPANY"] <- "SOUTH GEELONG"
LL_Retail$Suburb[LL_Retail$`Trading As` == "SAVARA IN BREWERY"] <- "SALE"
LL_Retail$Suburb[LL_Retail$`Trading As` == "BELLINGHAM ESTATE"] <- "ARTHURS SEAT"
LL_Retail$Suburb[LL_Retail$`Trading As` == "MORNINGTON PENINSULA BREWERY"] <- "MORNINGTON"
LL_Retail$Suburb[LL_Retail$`Trading As` == "DOLPHIN BREWERY"] <- "DAYLESFORD"
LL_Retail$Suburb[LL_Retail$`Trading As` == "BELVOIR WINES"] <- "RAVENSWOOD"
LL_Retail$Suburb[LL_Retail$`Trading As` == "CAVALIER BREWING"] <- "DERRIMUT"
# adjust suburb values where appropriate
LL_Retail$Suburb[LL_Retail$Suburb == "GLEN AIRE"] <- "GLENAIRE"
LL_Retail$Suburb[LL_Retail$Suburb == "QUEENSCLIFFE"] <- "QUEENSCLIFF"
LL_Retail$Suburb[LL_Retail$Suburb == "WYNDHAMVALE"] <- "WYNDHAM VALE"
LL_Retail$Suburb[LL_Retail$Suburb == "CORADJIL VIA COBDEN"] <- "CORADJIL"
LL_Retail$Suburb[LL_Retail$Suburb == "CURLEWIS VIA DRYSDALE"] <- "CURLEWIS"
LL_Retail$Suburb[LL_Retail$Suburb == "GEELONG SOUTH"] <- "SOUTH GEELONG"
LL_Retail$Suburb[LL_Retail$Suburb == "WEST SUNSHINE"] <- "SUNSHINE WEST"
LL_Retail$Suburb[LL_Retail$Suburb == "FOOTSCRAY WEST"] <- "WEST FOOTSCRAY"
LL_Retail$Suburb[LL_Retail$Suburb == "NORTH WILLIAMSTOWN"] <- "WILLIAMSTOWN NORTH"
LL_Retail$Suburb[LL_Retail$Suburb == "NORTH SUNSHINE"] <- "SUNSHINE NORTH"
LL_Retail$Suburb[LL_Retail$Suburb == "GEELONG NORTH"] <- "NORTH GEELONG"
LL_Retail$Suburb[LL_Retail$Suburb == "NORTH ESSENDON"] <- "ESSENDON NORTH"
LL_Retail$Suburb[LL_Retail$Suburb == "EAST KEILOR"] <- "KEILOR EAST"
LL_Retail$Suburb[LL_Retail$Suburb == "GEELONG EAST"] <- "EAST GEELONG"
LL_Retail$Suburb[LL_Retail$Suburb == "CHILLWELL"] <- "NEWTOWN"
LL_Retail$Suburb[LL_Retail$Suburb == "DRUMCONDA"] <- "DRUMCONDRA"
LL_Retail$Suburb[LL_Retail$Suburb == "ESSENDON AIRPORT"] <- "ESSENDON FIELDS"
LL_Retail$Suburb[LL_Retail$Suburb == "FOOTSCRAY SOUTH"] <- "FOOTSCRAY"
LL_Retail$Suburb[LL_Retail$Suburb == "GELLIBRAND RIVER"] <- "GELLIBRAND"
LL_Retail$Suburb[LL_Retail$Suburb == "LAKE GILLEAR"] <- "ALLANSFORD"
LL_Retail$Suburb[LL_Retail$Suburb == "NORLANE WEST"] <- "NORLANE"
LL_Retail$Suburb[LL_Retail$Suburb == "WERRIBEE PARK"] <- "WERRIBEE SOUTH"
# show result
LL_Retail %>% summarise('data' = "LL_Retail -- Suburb",
'unique values' = n_distinct(`Suburb`, na.rm = TRUE),
'missing values' = sum(is.na(`Suburb`)))
To scan for mismatches, a filtering join has been used to find all rows in House_Sales that don’t have a match in LL_Retail. There are 75 suburbs in House_Sales without a match in LL_Retail.
Irregular and values unrecognised by Australia Post have been replaced, and the match between datasets has been checked again.
The output shows that after cleaning Suburb values, there are only 58 out of 744 suburbs in House_Sales with no match in LL_Retail.
Missing values will be examined again after merging the data in SCAN I - Part 2
# scan for suburbs without a match
House_Sales %>% anti_join(LL_Retail, by = "Suburb")
# replace problem suburbs
House_Sales$Suburb[House_Sales$Suburb == "ASCOT (GREATER BENDIGO)"] <- "ASCOT"
House_Sales$Suburb[House_Sales$Suburb == "BELLFIELD (BANYULE)"] <- "BELLFIELD"
House_Sales$Suburb[House_Sales$Suburb == "CHELTENHAM NORTH"] <- "CHELTENHAM"
House_Sales$Suburb[House_Sales$Suburb == "COBURG EAST"] <- "COBURG"
House_Sales$Suburb[House_Sales$Suburb == "COONANS HILL"] <- "PASCOE VALE SOUTH"
House_Sales$Suburb[House_Sales$Suburb == "COWES WEST"] <- "COWES"
House_Sales$Suburb[House_Sales$Suburb == "GOLDEN POINT (BALLARAT)"] <- "GOLDEN POINT"
House_Sales$Suburb[House_Sales$Suburb == "HASTINGS WEST"] <- "HASTINGS"
House_Sales$Suburb[House_Sales$Suburb == "HILLSIDE (MELTON)"] <- "HILLSIDE"
House_Sales$Suburb[House_Sales$Suburb == "JEERALANG NORTH"] <- "JEERALANG"
House_Sales$Suburb[House_Sales$Suburb == "KANGAROO GROUND SOUTH"] <- "KANGAROO GROUND"
House_Sales$Suburb[House_Sales$Suburb == "KEW NORTH"] <- "KEW"
House_Sales$Suburb[House_Sales$Suburb == "KILLARA (WODONGA)"] <- "KILLARA"
House_Sales$Suburb[House_Sales$Suburb == "LAVERTON SOUTH"] <- "LAVERTON"
House_Sales$Suburb[House_Sales$Suburb == "MERINDA PARK"] <- "CRANBOURNE"
House_Sales$Suburb[House_Sales$Suburb == "NEWTOWN (GREATER GEELONG)"] <- "NEWTOWN"
House_Sales$Suburb[House_Sales$Suburb == "PATTERSON GARDENS"] <- "KEILOR"
House_Sales$Suburb[House_Sales$Suburb == "THOMSON (GREATER GEELONG)"] <- "THOMSON"
House_Sales$Suburb[House_Sales$Suburb == "WENDOUREE WEST"] <- "WENDOUREE"
House_Sales$Suburb[House_Sales$Suburb == "WOOLAMAI WATERS"] <- "CAPE WOOLAMAI"
# summarise unique values and missing values
House_Sales %>% anti_join(LL_Retail, by = "Suburb") %>% summarise('unique values in House_Sales' = n_distinct(House_Sales$Suburb, na.rm = TRUE),
'unmatched values in House_Sales' = n_distinct(`Suburb`, na.rm = TRUE),
'missing values in House_Sales' = sum(is.na(`Suburb`)))
To add matching LL_Retail variables to House_Sales, and keeping all rows in House_Sales, the datasets have been left-joined to create a new dataframe Houses_and_Licences.
# Merge House_Sales and LL_Retail to keep all observations in House Sales
Houses_and_Licences <- left_join(House_Sales, LL_Retail, by = "Suburb")
head(Houses_and_Licences)
Variables of interest have been selected to create the basis of the final merged dataframe, saved as Suburbs_Summary_March2020. The data has been grouped by suburb and arranged by number of liquor licences in descending order, greatest to least.
Variables have been selected for the following reasons:
Suburb as a key variable
Metro/Regional as a clue to the probable density of the population
Q1 2020 as the most recently available median house price
Sales Q1 2020 as a clue to the reliability of the median house price (more sales = more reliable)
License Num to provide a count of the number of licences in each suburb
Suburb_Summary_March2020 <- Houses_and_Licences %>% group_by(Suburb) %>%
summarise('Metro/Regional' = first(`Metro/Regional`),
'Quarterly Median House Price' = first(`Q1 2020`),
'Quarterly Sales' = first(`Sales Q1 2020`),
'Retail Liquor Licences' = n_distinct(`Licence Num`, na.rm = TRUE)) %>%
arrange(desc(`Retail Liquor Licences`))
head(Suburb_Summary_March2020)
The resulting dataframe has 744 observations of 5 variables:
Suburb - the suburb within Victoria (character, and should not be factored as each value is unique)
Metro/Regional - classification as metropolitan Melbourne or regional Victoria, according to the VCGLR (factor with 2 levels: Metro, Regional)
Quarterly Median House Price - the median price of houses sold from January to March 2020 (double numeric)
Quarterly Sales - the number of houses sold from January to March 2020 (integer numeric)
Retail Liquor Licences - the number of active liquor licences permitting sale or service of liquor to the public in March 2020 (integer numeric)
The required conversions of data types were completed in Understand I
# show structure
str(Suburb_Summary_March2020)
## tibble [744 x 5] (S3: tbl_df/tbl/data.frame)
## $ Suburb : chr [1:744] "RICHMOND" "FITZROY" "ST KILDA" "SOUTH YARRA" ...
## $ Metro/Regional : Factor w/ 2 levels "Metro","Regional": 1 1 1 1 2 2 1 1 1 1 ...
## $ Quarterly Median House Price: num [1:744] 1243000 1305500 1771800 1993800 795000 ...
## $ Quarterly Sales : int [1:744] 43 18 8 18 11 37 8 15 29 29 ...
## $ Retail Liquor Licences : int [1:744] 295 253 249 234 208 191 183 182 169 162 ...
House_Sales has many variables with dates in column names. To adhere to tidy data principles, the quarter and year should be variables.
For example, Jan - Mar 19 (read in as Q1 2019) would be three variables; “Quarter”, “Year”, “Median Sale Price”.
The Sales values for 2019 are not available in the data supplied. But the % Change values for 2019 could be imputed based on the Median values.
# Select and gather median sale price values, into "Period" and "Median" variables
HS_Tidy_Quarters <- House_Sales %>% select(`Suburb`, `Q1 2019`, `Q2 2019`, `Q3 2019`,
`Q4 2019`, `Q1 2020`) %>%
gather(`Q1 2019`, `Q2 2019`, `Q3 2019`, `Q4 2019`, `Q1 2020`,
key = "Period", value = "Median") %>% arrange(Suburb)
# In a new df, select and gather sales values, into "Period" and "Sales"
HS_Tidy_Sales <- House_Sales %>% select(`Suburb`, `Sales Q1 2020`) %>%
gather(`Sales Q1 2020`, key = "Period", value = "Sales") %>% arrange(Suburb)
# Rename "Period" values
HS_Tidy_Sales$Period <- HS_Tidy_Sales$Period <- "Q1 2020"
# In a new df, select and gather % change values, into "Period" and "% Change"
HS_Tidy_Change <- House_Sales %>% select(`Suburb`, `% Change Q1 2020`) %>%
gather(`% Change Q1 2020`, key = "Period", value = "% Change") %>% arrange(Suburb)
# Rename "Period" values
HS_Tidy_Change$Period <- "Q1 2020"
# Now merge the three dataframes, starting with Sales and Change
HS_Tidy_1 <- full_join(HS_Tidy_Sales, HS_Tidy_Change, by = "Suburb")
# Create a key to preserve NA values
HS_Tidy_2 <- HS_Tidy_1 %>% unite(Key, Suburb, Period.x, sep = "/")
HS_Tidy_Quarters_2 <- HS_Tidy_Quarters %>% unite(Key, Suburb, Period, sep = "/")
# Join by the created key
HS_Tidy_3 <- full_join(HS_Tidy_2, HS_Tidy_Quarters_2, by = "Key")
# Separate the created key
HS_Tidy_4 <- HS_Tidy_3 %>% separate(Key, into = c("Suburb", "Period"), sep = "/")
# Select variables and separate "Period" into "Quarter" and "Year"
HS_Tidy_Complete <- HS_Tidy_4 %>% select(Suburb, Period, Median, Sales, `% Change`) %>%
separate(Period, into = c("Quarter", "Year"), sep = " ") %>% arrange(Suburb, Year, Quarter)
head(HS_Tidy_Complete)
Returning to Suburb_Summary_March2020, two new column variables have been created:
% Share - each suburb’s share of retailing licences*
Rank - each suburb’s rank in terms of median house price, from most expensive (rank 1) to least expensive (rank 562)
Of the possible ranking functions, dense_rank has been chosen because it does not prioritise one observation over another when they share the same position relative to the whole (Github n.d). For this data, dense_rank is the more intuitive than row_number or min_rank which preserve the total number of rows in the ranking algorithm, and would cause the least expensive suburb to be ranked 744 instead of 562.
*It should be remembered that the data does not include suburbs with licenses where there was no median house price information.
# create two new variables
SubS_March_2020 <- Suburb_Summary_March2020 %>%
mutate(., "% Share" = `Retail Liquor Licences` / sum(`Retail Liquor Licences`, na.rm = TRUE) * 100,
"Rank" = dense_rank(desc(`Quarterly Median House Price`)))
head(select(SubS_March_2020, Suburb, `% Share`, Rank))
SubS_March_2020 has been scanned for standard missing values.
Because there were 58 unmatched suburbs in House_Sales before the data was merged, there should be 58 incomplete rows of data. Metro/Regional shows 58 missing values, as expected.
Retail Liquor Licenses has 0 missing values, but this is because they appropriately contain a value of 0, obtained by counting distinct values of the Licence Num variable.
Missing Metro/Regional values can only be reliably imputed by Suburb, as house price can vary greatly. A comprehensive list of Metro suburbs would be a useful resource for writing a function to impute these values.
# check dataframe for missing values
colSums(is.na(SubS_March_2020))
## Suburb Metro/Regional
## 0 58
## Quarterly Median House Price Quarterly Sales
## 0 0
## Retail Liquor Licences % Share
## 0 0
## Rank
## 0
# find incomplete cases
SubS_March_2020[!complete.cases(SubS_March_2020), ]
# add missing Metro/Regional values
SubS_March_2020$`Metro/Regional`[SubS_March_2020$Suburb == "ALBANVALE" | SubS_March_2020$Suburb == "BALCOMBE" |
SubS_March_2020$Suburb == "BALNARRING BEACH" | SubS_March_2020$Suburb == "BELGRAVE HEIGHTS" |
SubS_March_2020$Suburb == "BOTANIC RIDGE" | SubS_March_2020$Suburb == "BROOKFIELD" |
SubS_March_2020$Suburb == "CAPEL SOUND" | SubS_March_2020$Suburb == "CHELTENHAM EAST" |
SubS_March_2020$Suburb == "COBBLEBANK" | SubS_March_2020$Suburb == "EAST WARBURTON" |
SubS_March_2020$Suburb == "ESSENDON WEST" | SubS_March_2020$Suburb == "HARKNESS" |
SubS_March_2020$Suburb == "KINGS PARK" | SubS_March_2020$Suburb == "KURUNJANG" |
SubS_March_2020$Suburb == "OFFICER SOUTH" | SubS_March_2020$Suburb == "OSBORNE" |
SubS_March_2020$Suburb == "STUDFIELD" | SubS_March_2020$Suburb == "THORNHILL PARK" |
SubS_March_2020$Suburb == "WATSONIA NORTH" | SubS_March_2020$Suburb == "WEIR VIEWS"] <- "Metro"
SubS_March_2020$`Metro/Regional`[SubS_March_2020$Suburb == "ASCOT" | SubS_March_2020$Suburb == "BALLARAT NORTH" |
SubS_March_2020$Suburb == "BANDIANA" | SubS_March_2020$Suburb == "BONSHAW" |
SubS_March_2020$Suburb == "CANADIAN" | SubS_March_2020$Suburb == "CHARLEMONT" |
SubS_March_2020$Suburb == "COONGULLA" | SubS_March_2020$Suburb == "" |
SubS_March_2020$Suburb == "EAST BAIRNSDALE" | SubS_March_2020$Suburb == "EAST BENDIGO" |
SubS_March_2020$Suburb == "EUREKA" | SubS_March_2020$Suburb == "FLORA HILL" |
SubS_March_2020$Suburb == "INVERMAY PARK" | SubS_March_2020$Suburb == "JACKASS FLAT" |
SubS_March_2020$Suburb == "JEERALANG" | SubS_March_2020$Suburb == "KILLARA" |
SubS_March_2020$Suburb == "LAKE GARDENS" | SubS_March_2020$Suburb == "LAKE WENDOUREE" |
SubS_March_2020$Suburb == "LENEVA" | SubS_March_2020$Suburb == "MARSHALL" |
SubS_March_2020$Suburb == "MCKENZIE HILL" | SubS_March_2020$Suburb == "MITCHELL PARK" |
SubS_March_2020$Suburb == "NEWINGTON" | SubS_March_2020$Suburb == "NEWLANDS ARM" |
SubS_March_2020$Suburb == "NORTH WONTHAGGI" | SubS_March_2020$Suburb == "PARADISE BEACH" |
SubS_March_2020$Suburb == "PIONEER BAY" | SubS_March_2020$Suburb == "RAYMOND ISLAND" |
SubS_March_2020$Suburb == "SAILORS GULLY" | SubS_March_2020$Suburb == "SHEPPARTON NORTH" |
SubS_March_2020$Suburb == "SUNDERLAND BAY" | SubS_March_2020$Suburb == "SUNSET STRIP" |
SubS_March_2020$Suburb == "SURF BEACH" | SubS_March_2020$Suburb == "THE HONEYSUCKLES" |
SubS_March_2020$Suburb == "WALKERVILLE" | SubS_March_2020$Suburb == "WANDANA HEIGHTS" |
SubS_March_2020$Suburb == "WHITTINGTON" | SubS_March_2020$Suburb == "WIMBLEDON HEIGHTS" |
SubS_March_2020$Suburb == "WINTER VALLEY"] <- "Regional"
# check Metro/Regional for missing values
sum(is.na(SubS_March_2020$`Metro/Regional`))
## [1] 0
Numeric variables have been checked for special values
# check for infinite and nan values
sum(is.infinite(SubS_March_2020$`Quarterly Median House Price`))
## [1] 0
sum(is.nan(SubS_March_2020$`Quarterly Median House Price`))
## [1] 0
sum(is.infinite(SubS_March_2020$`Quarterly Sales`))
## [1] 0
sum(is.nan(SubS_March_2020$`Quarterly Sales`))
## [1] 0
sum(is.infinite(SubS_March_2020$`Retail Liquor Licences`))
## [1] 0
sum(is.nan(SubS_March_2020$`Retail Liquor Licences`))
## [1] 0
sum(is.infinite(SubS_March_2020$`% Share`))
## [1] 0
sum(is.nan(SubS_March_2020$`% Share`))
## [1] 0
sum(is.infinite(SubS_March_2020$Rank))
## [1] 0
sum(is.nan(SubS_March_2020$Rank))
## [1] 0
Numeric values have been checked for obvious errors, using the following parameters:
Quarterly Median House Price is greater than 50,000.
Quarterly Sales are not less than 0
Retail Liquor Licences are not less than 0
# validate price, sales, and licences
valid_price <- SubS_March_2020$`Quarterly Median House Price` >= 50000
valid_sales <- SubS_March_2020$`Quarterly Sales` >= 0
valid_licences <- SubS_March_2020$`Retail Liquor Licences` >= 0
table(valid_price)
## valid_price
## TRUE
## 744
table(valid_sales)
## valid_sales
## TRUE
## 744
table(valid_licences)
## valid_licences
## TRUE
## 744
The numeric values in SubS_March_2020 have been checked for outliers.
There are many outliers indicating all numeric variables are strongly right-skewed. They are likely to be accurate reflections of the data.
There are two extreme outliers in the Metro median house price. These have been addressed in the next step, Transform.
# create boxplots
boxplot(SubS_March_2020$`Quarterly Median House Price` ~ SubS_March_2020$`Metro/Regional`,
main="Median House Price by Region", ylab="Price", col = "beige")
# other variables show similar right-skew
par(mfrow = c(1, 3))
boxplot(SubS_March_2020$`Quarterly Sales` ~ SubS_March_2020$`Metro/Regional`,
main="Quarterly Sales by Region", ylab="Sales", col = "beige")
boxplot(SubS_March_2020$`Retail Liquor Licences` ~ SubS_March_2020$`Metro/Regional`,
main="Liquor Licences by Region", ylab="Licences", col = "beige")
boxplot(SubS_March_2020$`% Share` ~ SubS_March_2020$`Metro/Regional`,
main="% Share of Licences by Region", ylab="% Share", col = "beige")
Quarterly Median House Price has been transformed to reduce right-skewedness and suppress outliers.
Two methods have been applied.
The house price variable has been plotted as it occurs in the data to show the right skew, and then a logarithmic transformation has been applied.
The transformed data has been plotted to show the effect of the transformation, approximating to a more normal distribution.
par(mfrow = c(2,1))
# plot house prices as normal
hist(SubS_March_2020$`Quarterly Median House Price`, main = "Median House Price", col = "beige")
# transform values
ln_median <- log(SubS_March_2020$`Quarterly Median House Price`)
# plot transformed values
hist(ln_median, main = "Natural Logarithm", col = "beige")
This method has suppressed outliers by evenly distributing the values among categories. The extreme outliers have been placed within the last category.
A dotplot of the binned values reveals a gentle curve trending from least expensive to most expensive. To help make sense of the values, the median price is shown in thousands.
As each bin contains the same number of values, it is easily seen that most suburbs had a median house price of less than 1 million in March 2020. A few suburbs had a median house price from 1 million to about 1.3 million, and an equally few number from about 1.3 million to over 4 million.
# save median price as a vector in thousands AUD
median_price <- SubS_March_2020$`Quarterly Median House Price` / 1000
# bin values by equal frequency
price_binned <- discretize(median_price, disc = "equalfreq")
# join price values to bins
price_and_bins <- median_price %>% bind_cols(price_binned)
## New names:
## * NA -> ...1
# show distribution of bins
table(price_and_bins$X)
##
## 1 2 3 4 5 6 7 8 9
## 83 82 83 84 79 82 82 82 87
# draw plot of price by bins
price_and_bins %>% plot(...1 ~ X, data = ., main = "Median House Price March 2020",
xlab = "Bins", ylab = "Thousands, AUD")
grid(nx = NULL)
Australia Post 2020, Find a Postcode, viewed 14 October 2020, https://auspost.com.au/postcode
Australian Bureau of Statistics (ABS) 2020, 2016 Census QuickStats: Port Campbell, viewed 14 October 2020, https://quickstats.censusdata.abs.gov.au/census_services/getproduct/census/2016/quickstat/SSC22092
Cai, Eric 2015, How to Get the Frequency Table of a Categorical Variable as a Data Frame in R, viewed 12 October 2020, https://chemicalstatistician.wordpress.com/2015/02/03/how-to-get-the-frequency-table-of-a-categorical-variable-as-a-data-frame-in-r/
DataMentor n.d., R Factors, viewed 13 October 2020, https://www.datamentor.io/r-programming/factor/
Department of Environment, Land, Water and Planning (DELWP) 2020, March 2020 quarter: Houses by Suburb, viewed 10 October 2020, https://www.propertyandlandtitles.vic.gov.au/__data/assets/excel_doc/0035/485936/Houses_1st_Quarter_2020.xls
Department of Environment, Land, Water and Planning (DELWP) 2020, Property Prices, viewed 10 October 2020, https://www.propertyandlandtitles.vic.gov.au/property-information/property-prices
Department of Environment, Land, Water and Planning (DELWP) 2020, Victorian Property Sales Report: March 2020 Quarter, viewed 11 October 2020, https://www.propertyandlandtitles.vic.gov.au/__data/assets/pdf_file/0032/485915/VPSR_Mar2020_final.pdf
GitHub n.d., Count/tally observations by group, viewed 12 October 2020, https://dplyr.tidyverse.org/reference/tally.html
GitHub n.d., Windowed Rank Functions, viewed 18 October 2020, https://dplyr.tidyverse.org/reference/ranking.html
Municipal Association of Victoria (MAV) n.d., Victorian Councils Map, viewed 12 October 2020, https://www.viccouncils.asn.au/find-your-council/council-map
Sullivan, John 2019, Data Cleaning with R and the Tidyverse: Detecting Missing Values, viewed 12 October 2020, https://towardsdatascience.com/data-cleaning-with-r-and-the-tidyverse-detecting-missing-values-ea23c519bc62
Victorian Commission for Gambling and Liquor Regulation (VCGLR) 2020, Current Victorian Liquor Licences as of 31/03/2020, viewed 10 October 2020, https://www.vcglr.vic.gov.au/sites/default/files/current_victorian_licences_by_location_geo-coded_march_2020.xlsx
Victorian Commission for Gambling and Liquor Regulation (VCGLR) 2020, Designated Areas, viewed 17 October 2020, https://www.vcglr.vic.gov.au/community-services/government-initiatives/designated-areas
Victorian Places 2015, Newtown and Chilwell, viewed 14 October 2020, https://www.victorianplaces.com.au/newtown-and-chilwell