Required packages

library(readxl)       # for xls files
library(dplyr)        # for pipe, select, filter etc
library(tidyr)        # for replacing missing data
library(infotheo)     # for binning
library(knitr)        # for tables

Executive Summary

A dataframe has been created, uniting median house prices with the number of active retailing liquor licences, in Victorian suburbs as of March 2020.

Two source datasets have been obtained from government sites. They have been scanned for accuracy, and read in, applying changes to variable types and column names as required. Observations and variables have been filtered out that are not applicable to the final purpose.

Before merging by suburb, considerable effort has been made to identify and, where appropriate, replace missing Suburb values, both standard and non-standard, using summarising, factoring, filtering joins, and subsetting methods. The datasets have been merged, and then grouped by suburb and summarised to produce a merged dataframe, consisting of 744 observations of 5 variables.

The source dataset for house prices has been discussed as relates to tidy data principles. Manipulation of the dataset has been made to demonstrate a tidy dataframe.

Two new numeric variables have been created and added to the merged dataframe, creating the final dataframe of 744 observations of 7 variables. All variables of this dataframe have been scanned for missing values, and replaced as appropriate. Numeric variables have been scanned for special values and validity checks have been applied.

Numeric variables have been checked for outliers with boxplots, and the distribution shape of the data has been identified as right-skewed.

Two transformation techniques have been applied to the Quarterly Median House Price variable. A natural logarithmic transformation has brought the data into a more symmetrical distribution, as shown in a histogram. A discretisation technique has distributed the values evenly between categories, as shown in a dotplot.

Data - Part 1

Read in source data

The datasets chosen are “Quarterly Property Sales: Houses by Suburb” and “Victorian Liquor Licences by Location”.

Quarterly Property Sales: Houses by Suburb

The dataset reports on the median sale prices of houses by suburb in Victoria, over a period of 15 months.

The data is updated quarterly by the Victorian Valuer-General, as part of the Victorian Property Sales Report. The latest figures are current up to March 2020 (as released in September 2020). It is made available by the Victorian State Government, Department of Environment, Land, Water and Planning (DELWP 2020).

The dataset contains 754 observations over 10 variables:

  1. Suburb - A suburb within Victoria.

  2. Jan - Mar 19 - Median sale price of houses within the 1st quarter of 2019

  3. Apr - Jun 19 - Median sale price of houses within the 2nd quarter of 2019

  4. Jul - Sep 19 - Median sale price of houses within the 3rd quarter of 2019

  5. Oct - Dec 19 - Median sale price of houses within the 4th quarter of 2019

  6. Jan - Mar 19 - Median sale price of houses within the 1st quarter of 2020

  7. No of Sales Jan - Mar 2020 - Total number of sales within the first quarter of 2020

  8. No of Sales YTD - Cummulative total number of sales in the current year

  9. Change % Jan - Mar 19 Jan - Mar 2020 - Percentage change in house prices between the 1st quarters of 2019 and 2020

  10. % Change Dec 2019 - Mar 2020 - Percentage change in house prices between the 4th quarter of 2019 and 1st quarter of 2020

NB. The last variable name is incorrect in the source dataset. The correct variable name has been obtained from the published report, Victorian Property Sales Report: March 2020 Quarter (Department of Environment, Land, Water and Planning 2020, p.3), and is corrected when reading in the data using the col_names argument.

The dataset is provided as an Excel file with one sheet of data. Although the data begins on row 1, the variable names span rows 1 to 4. A footnote is given at the end of the data on row 759. To import the data efficiently, the range and col_names arguments are used, and variable names have been simplified.

The dataframe is saved as House_Sales.

# Read in data, limiting to a range, and replacing variable names.

House_Sales <- read_excel("C:/R/Learning R/MATH2349/Datasets/Houses_1st_Quarter_2020.xls", range = "A5:J758", 
                          col_names = c("Suburb", "Q1 2019", "Q2 2019", "Q3 2019", "Q4 2019", "Q1 2020", 
                                        "Sales Q1 2020", "Sales YTD", "% Change 2019-2020", "% Change Q1 2020"))

head(House_Sales) 

Victorian Liquor Licences by Location

The dataset is a stocktake of all active liquor licences across Victoria, current as of 31 March 2020. It is made available by the Victorian Commission for Gambling and Liquor Regulation (VCGLR 2020).

The dataset contains 23448 observations over 19 variables:

  1. Licence Num - An 8 digit licence number

  2. Licensee - Name of licence holder, which may be an individual or business

  3. Trading As - Trading name of licensee

  4. Category - Type of licence issued

  5. Trading Hours - Licensed trading hours

  6. After 11pm - Number of hours licensed to trade after 11pm, where 0 is none and 99 is not applicable

  7. Maximum Capacity - Maximum capacity of business

  8. Address - Street address of business

  9. Suburb - Suburb of business

  10. Postcode - Postcode of business

  11. Latitude - Location of business, co-ordinate of latitude

  12. Longitude - Location of business, co-ordinate of longitude

  13. Postal Address - Licensee’s postal address, street

  14. Postal Suburb - Licensee’s postal address, suburb

  15. Postal Postcode - Licensee’s postal address, postcode

  16. Council - Local Government Area

  17. Region - Region within Victoria

  18. Metro/Regional - Region category, whether Metro or Regional

  19. Gaming - Gaming venue, (Y)es or (N)o

The dataset is provided as an Excel file with one sheet of data, starting at row 4. Columns “M” and “N” are hidden, merged within column “L”. They contain no data, and are therefore excluded using the col_types argument.

The dataframe is saved as LLicences.

# Read in data, starting at row 4, and skipping columns M and N

LLicences <- read_excel("C:/R/Learning R/MATH2349/Datasets/liquor_licences_march_2020.xlsx",  skip = 3,
    col_types = c("text", "text", "text", "text", "text", "numeric", "numeric", "text", "text", "text", 
                  "numeric", "numeric", "skip", "skip", "text", "text", "text", "text", "text", "text", 
                  "text"))

head(LLicences)  

The datasets will be merged in Data - Part 2.

Understand - Part 1

Understand source data

Attributes and conversions in House_Sales

House_Sales has 754 observations over 10 variables. Columns 2-6 and 8-10 contain numeric values, but have been read in as character types. This corresponds to the formating of the columns in the source spreadsheet, where column 7 is formatted as Numeric, and all else are formatted as General.

Sales Q1 2020 has been read in as a double-type numeric. As it is a count of house sales, it can be converted to an integer-type numeric.

The variables Q1 2019, Q2 2019, Q3 2019, Q4 2019, Q1 2020, % Change 2019-2020, and % Change Q1 2020 have been converted to double numeric. Sales Q1 2020 and Sales YTD have been converted to integer numeric.

Suburb variable should not be factored, as each of its values in this dataset is unique.

# convert from character type to numeric

House_Sales$`Q1 2019` <- as.numeric(House_Sales$`Q1 2019`)
House_Sales$`Q2 2019` <- as.numeric(House_Sales$`Q2 2019`)
House_Sales$`Q3 2019` <- as.numeric(House_Sales$`Q3 2019`)
House_Sales$`Q4 2019` <- as.numeric(House_Sales$`Q4 2019`)
House_Sales$`Q1 2020` <- as.numeric(House_Sales$`Q1 2020`)
House_Sales$`Sales Q1 2020` <- as.integer(House_Sales$`Sales Q1 2020`)
House_Sales$`Sales YTD` <- as.integer(House_Sales$`Sales YTD`)
House_Sales$`% Change 2019-2020` <- as.numeric(House_Sales$`% Change 2019-2020`)
House_Sales$`% Change Q1 2020` <- as.numeric(House_Sales$`% Change Q1 2020`)

str(House_Sales)
## tibble [754 x 10] (S3: tbl_df/tbl/data.frame)
##  $ Suburb            : chr [1:754] "ABBOTSFORD" "ABERFELDIE" "AINTREE" "AIREYS INLET" ...
##  $ Q1 2019           : num [1:754] 992500 1420000 575000 925000 749000 ...
##  $ Q2 2019           : num [1:754] 982500 1172500 570000 719000 758000 ...
##  $ Q3 2019           : num [1:754] 1140000 1352500 575000 700000 810500 ...
##  $ Q4 2019           : num [1:754] 1027500 1650000 620400 1400000 851000 ...
##  $ Q1 2020           : num [1:754] 1175000 1665000 675000 1005000 797000 ...
##  $ Sales Q1 2020     : int [1:754] 15 7 21 8 21 14 24 7 18 51 ...
##  $ Sales YTD         : int [1:754] 15 7 21 8 21 14 24 7 18 51 ...
##  $ % Change 2019-2020: num [1:754] 18.4 17.3 17.4 8.6 6.4 6.1 0.6 10.8 9.6 7.5 ...
##  $ % Change Q1 2020  : num [1:754] 14.4 0.9 8.8 -28.2 -6.3 6.7 -22.6 8.4 19.5 3.1 ...

Attributes and conversions in LLicences

LLicences has 23,448 observations of 19 variables. Variables have been converted to the correct types when reading in the data.

Category, Region, Metro/Regional and Gaming variables have been factored. The Gaming variable has also been labelled.

The Suburb variable will be factored last, before merging, as a step in finding non-standard missing values (See Scan I).

# factor variables

LLicences$Category <- factor(LLicences$Category)
LLicences$Region <- factor(LLicences$Region)
LLicences$`Metro/Regional` <- factor(LLicences$`Metro/Regional`)
LLicences$Gaming <- factor(LLicences$Gaming, levels = c("N", "Y"), labels = c("No", "Yes"))

# get structure

str(LLicences)
## tibble [23,276 x 19] (S3: tbl_df/tbl/data.frame)
##  $ Licence Num     : chr [1:23276] "31100099" "31100120" "31100138" "31100154" ...
##  $ Licensee        : chr [1:23276] "ARNOLD BRUCE" "OAKLAND HUNT CLUB" "C S I R SKI CLUB" "SKILIB ALPINE CLUB" ...
##  $ Trading As      : chr [1:23276] "UNIVERSITY SKI CLUB" "OAKLAND HUNT CLUB" "C S I R SKI CLUB" "SKILIB ALPINE CLUB" ...
##  $ Category        : Factor w/ 14 levels "BYO Permit","Full Club Licence",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Trading Hours   : chr [1:23276] "BYO Permit" "BYO Permit" "BYO Permit" "BYO Permit" ...
##  $ After 11 pm     : num [1:23276] 99 99 99 99 99 99 99 99 99 99 ...
##  $ Maximum Capacity: num [1:23276] NA NA NA NA NA NA NA NA NA NA ...
##  $ Address         : chr [1:23276] "U S C LODGE ALLOTMENTS 15 & 16 ALPINE VILL" "SOMERTON ROAD" "SITE 65 1 SCHUSS STREET" "2 GOAL POSTS RD" ...
##  $ Suburb          : chr [1:23276] "MT BULLER" "GREENVALE" "FALLS CREEK" "MT BULLER" ...
##  $ Postcode        : chr [1:23276] "3723" "3047" "3699" "3723" ...
##  $ Latitude        : num [1:23276] -37.1 -37.6 -36.9 -37.1 -37.9 ...
##  $ Longitude       : num [1:23276] 146 145 147 146 145 ...
##  $ Postal Address  : chr [1:23276] "PO BOX 93" "SOMERTON ROAD" "PO BOX 15" "PO BOX 140" ...
##  $ Postal Suburb   : chr [1:23276] "SANDOWN VILLAGE" "GREENVALE" "CARNEGIE" "CANTERBURY" ...
##  $ Postal Postcode : chr [1:23276] "3171" "3047" "3163" "3126" ...
##  $ Council         : chr [1:23276] "MANSFIELD SHIRE COUNCIL" "HUME CITY COUNCIL" "ALPINE SHIRE COUNCIL" "MANSFIELD SHIRE COUNCIL" ...
##  $ Region          : Factor w/ 11 levels "Barwon South-West",..: 4 7 4 4 8 6 4 1 4 4 ...
##  $ Metro/Regional  : Factor w/ 2 levels "Metro","Regional": 2 1 2 2 1 1 2 2 2 2 ...
##  $ Gaming          : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...

Filter out non-retail licences

Factoring Category has revealed all possible values of licence type.

As licences that don’t permit sale or service of alcohol to the public will not be useful in the final dataframe, licences of that type (Pre-retail) have been filtered out. The filtered result is saved as LL_Retail.

The attributes of the merged data will be examined in Understand - Part 2

# filter out Pre-retail Licence rows and check result with a table

LL_Retail <- LLicences %>% filter(Category != "Pre-retail Licence")

table(LL_Retail$Category)
## 
##                           BYO Permit                    Full Club Licence 
##                                 1110                                  738 
##                      General Licence         Late night (general) Licence 
##                                 1695                                  470 
##     Late night (on-premises) Licence Late night (packaged liquor) Licence 
##                                  355                                    3 
##                      Limited Licence                  On-premises Licence 
##                                 5402                                    1 
##                  On-Premises Licence              Packaged Liquor Licence 
##                                 2024                                 2141 
##                   Pre-retail Licence                   Producer's Licence 
##                                    0                                  913 
##          Restaurant and cafe Licence              Restricted Club Licence 
##                                 6503                                  862

Scan I - Part 1

Find and replace missing values in merge key

The datasets will be matched by Suburb. Therefore, the suburb values in both datasets have been made as complete and accurate as possible.

Standard missing values

A count has been taken of the number of standard missing values in Suburb (Sullivan 2019). A count of unique values is included as an additional way to understand the variable.

R has found 0 missing suburb values in House_Sales, and 2 missing suburb values in LL_Retail.

# summarise unique values and missing values

House_Sales %>% summarise('data' = "House_Sales -- Suburb",
                          'unique values' = n_distinct(`Suburb`, na.rm = TRUE),
                          'missing values' = sum(is.na(`Suburb`)))
# summarise unique values and missing values

LL_Retail %>% summarise('data' = "LL_Retail -- Suburb",
                        'unique values' = n_distinct(`Suburb`, na.rm = TRUE),
                        'missing values' = sum(is.na(`Suburb`)))

Non-standard missing values

Factoring the suburb variable can be useful to look for misspellings and missing values that are not obvious (Cai 2015). It isn’t useful to factor the suburb variable in House_Sales, as each suburb is only included once. But it is useful in *LL_Retail**, as there are far fewer unique values than observations.

Suburb has been factored, and then sorted, bringing to the top [Not applicable] -, and [NOT APPLICABLE] -.

A visual scan of the dataframe has also revealed NO FIXED ADDRESS as a missing value.

These have been replaced with NA, so that R can recognise them as missing values.

# find all possible values of Suburb

LL_Retail$Suburb <- factor(LL_Retail$Suburb) 

# sort by suburb and display

LL_Retail %>% arrange(Suburb) %>% select(Suburb, `Trading As`, Address, Category) %>% head(6)
# subset to find all instances of missing values and convert to NA

LL_Retail$Suburb[LL_Retail$Suburb == "[Not applicable] -" | LL_Retail$Suburb == "[NOT APPLICABLE] -" |
                   LL_Retail$Suburb == "NO FIXED ADDRESS"] <- NA

LL_Retail %>% summarise('data' = "LL_Retail -- Suburb",
                        'unique values' = n_distinct(`Suburb`),
                        'missing values' = sum(is.na(`Suburb`)))

The data has been filtered to examine the rows with missing suburbs, and appropriate values found by searching the business names online. A visual scan has also found inconsistencies and irregularities in Suburb values. These have been corrected.

To prevent loss of data when inserting values, the data has been filtered and checked for existing suburb values in other rows. MORNINGTON PENINSULA BREWERY has two rows, one with a suburb value which is used to fill the missing value in the other row as well.

Any values that are new to the Suburbs variable have been added as a new factor level (DataMentor n.d).

Rows that cannot be appropriately given a suburb value, due to having no fixed location or point of sale, will necessarily be omitted when merging the datasets.

# Examine rows with missing values

LL_Retail %>% filter(., is.na(Suburb)) %>% select(Suburb, `Trading As`, Category, Address) %>% arrange(`Trading As`)
# filter data to check for duplicate trading names with existing suburbs

LL_Retail %>% filter(`Trading As` == "FESTIVAL CITY WINES" | `Trading As` == "CHIRPING BIRD" | `Trading As` == "GENTLE ANNIE WINES" | `Trading As` == "WAGTAIL WINES" | `Trading As` == "LAURISTON VINEYARD" | `Trading As` == "NAPPA MERRI VINEYARDS" | `Trading As` == "BELLARINE BREWING COMPANY" | `Trading As` == "SAVARA IN BREWERY" | `Trading As` == "BELLINGHAM ESTATE" | `Trading As` == "MORNINGTON PENINSULA BREWERY" | `Trading As` == "DOLPHIN BREWERY" | `Trading As` == "BELVOIR WINES" | `Trading As` == "CAVALIER BREWING" ) %>% select(Suburb, `Trading As`, Category, Address ) %>% arrange(Suburb) %>% head(6)
# add new factor levels

levels(LL_Retail$Suburb) <- c(levels(LL_Retail$Suburb), "RAVENSWOOD", "ARTHURS SEAT", "MORNINGTON", "LAURISTON", "CORADJIL") 

# insert suburb values where missing

LL_Retail$Suburb[LL_Retail$`Trading As` == "FESTIVAL CITY WINES"] <- "TRUGANINA"
LL_Retail$Suburb[LL_Retail$`Trading As` == "CHIRPING BIRD"] <- "MOOROODUC"
LL_Retail$Suburb[LL_Retail$`Trading As` == "GENTLE ANNIE WINES"] <- "DOOKIE"
LL_Retail$Suburb[LL_Retail$`Trading As` == "WAGTAIL WINES"] <- "TAGGERTY"
LL_Retail$Suburb[LL_Retail$`Trading As` == "LAURISTON VINEYARD"] <- "LAURISTON"
LL_Retail$Suburb[LL_Retail$`Trading As` == "NAPPA MERRI VINEYARDS"] <- "BUSHFIELD"
LL_Retail$Suburb[LL_Retail$`Trading As` == "BELLARINE BREWING COMPANY"] <- "SOUTH GEELONG"
LL_Retail$Suburb[LL_Retail$`Trading As` == "SAVARA IN BREWERY"] <- "SALE"
LL_Retail$Suburb[LL_Retail$`Trading As` == "BELLINGHAM ESTATE"] <- "ARTHURS SEAT"
LL_Retail$Suburb[LL_Retail$`Trading As` == "MORNINGTON PENINSULA BREWERY"] <- "MORNINGTON"
LL_Retail$Suburb[LL_Retail$`Trading As` == "DOLPHIN BREWERY"] <- "DAYLESFORD"
LL_Retail$Suburb[LL_Retail$`Trading As` == "BELVOIR WINES"] <- "RAVENSWOOD"
LL_Retail$Suburb[LL_Retail$`Trading As` == "CAVALIER BREWING"] <- "DERRIMUT"

# adjust suburb values where appropriate

LL_Retail$Suburb[LL_Retail$Suburb == "GLEN AIRE"] <- "GLENAIRE"
LL_Retail$Suburb[LL_Retail$Suburb == "QUEENSCLIFFE"] <- "QUEENSCLIFF"
LL_Retail$Suburb[LL_Retail$Suburb == "WYNDHAMVALE"] <- "WYNDHAM VALE"
LL_Retail$Suburb[LL_Retail$Suburb == "CORADJIL VIA COBDEN"] <- "CORADJIL"
LL_Retail$Suburb[LL_Retail$Suburb == "CURLEWIS VIA DRYSDALE"] <- "CURLEWIS"
LL_Retail$Suburb[LL_Retail$Suburb == "GEELONG SOUTH"] <- "SOUTH GEELONG"
LL_Retail$Suburb[LL_Retail$Suburb == "WEST SUNSHINE"] <- "SUNSHINE WEST"
LL_Retail$Suburb[LL_Retail$Suburb == "FOOTSCRAY WEST"] <- "WEST FOOTSCRAY"
LL_Retail$Suburb[LL_Retail$Suburb == "NORTH WILLIAMSTOWN"] <- "WILLIAMSTOWN NORTH"
LL_Retail$Suburb[LL_Retail$Suburb == "NORTH SUNSHINE"] <- "SUNSHINE NORTH"
LL_Retail$Suburb[LL_Retail$Suburb == "GEELONG NORTH"] <- "NORTH GEELONG"
LL_Retail$Suburb[LL_Retail$Suburb == "NORTH ESSENDON"] <- "ESSENDON NORTH"
LL_Retail$Suburb[LL_Retail$Suburb == "EAST KEILOR"] <- "KEILOR EAST"
LL_Retail$Suburb[LL_Retail$Suburb == "GEELONG EAST"] <- "EAST GEELONG"
LL_Retail$Suburb[LL_Retail$Suburb == "CHILLWELL"] <- "NEWTOWN"
LL_Retail$Suburb[LL_Retail$Suburb == "DRUMCONDA"] <- "DRUMCONDRA"
LL_Retail$Suburb[LL_Retail$Suburb == "ESSENDON AIRPORT"] <- "ESSENDON FIELDS"
LL_Retail$Suburb[LL_Retail$Suburb == "FOOTSCRAY SOUTH"] <- "FOOTSCRAY"
LL_Retail$Suburb[LL_Retail$Suburb == "GELLIBRAND RIVER"] <- "GELLIBRAND"
LL_Retail$Suburb[LL_Retail$Suburb == "LAKE GILLEAR"] <- "ALLANSFORD"
LL_Retail$Suburb[LL_Retail$Suburb == "NORLANE WEST"] <- "NORLANE"
LL_Retail$Suburb[LL_Retail$Suburb == "WERRIBEE PARK"] <- "WERRIBEE SOUTH"

# show result 

LL_Retail %>% summarise('data' = "LL_Retail -- Suburb",
                        'unique values' = n_distinct(`Suburb`, na.rm = TRUE),
                        'missing values' = sum(is.na(`Suburb`)))

Final check before merge

To scan for mismatches, a filtering join has been used to find all rows in House_Sales that don’t have a match in LL_Retail. There are 75 suburbs in House_Sales without a match in LL_Retail.

Irregular and values unrecognised by Australia Post have been replaced, and the match between datasets has been checked again.

The output shows that after cleaning Suburb values, there are only 58 out of 744 suburbs in House_Sales with no match in LL_Retail.

Missing values will be examined again after merging the data in SCAN I - Part 2

# scan for suburbs without a match

House_Sales %>% anti_join(LL_Retail, by = "Suburb") 
# replace problem suburbs

House_Sales$Suburb[House_Sales$Suburb == "ASCOT (GREATER BENDIGO)"] <- "ASCOT"
House_Sales$Suburb[House_Sales$Suburb == "BELLFIELD (BANYULE)"] <- "BELLFIELD"
House_Sales$Suburb[House_Sales$Suburb == "CHELTENHAM NORTH"] <- "CHELTENHAM"     
House_Sales$Suburb[House_Sales$Suburb == "COBURG EAST"] <- "COBURG"              
House_Sales$Suburb[House_Sales$Suburb == "COONANS HILL"] <- "PASCOE VALE SOUTH"  
House_Sales$Suburb[House_Sales$Suburb == "COWES WEST"] <- "COWES"
House_Sales$Suburb[House_Sales$Suburb == "GOLDEN POINT (BALLARAT)"] <- "GOLDEN POINT"
House_Sales$Suburb[House_Sales$Suburb == "HASTINGS WEST"] <- "HASTINGS"
House_Sales$Suburb[House_Sales$Suburb == "HILLSIDE (MELTON)"] <- "HILLSIDE"
House_Sales$Suburb[House_Sales$Suburb == "JEERALANG NORTH"] <- "JEERALANG"
House_Sales$Suburb[House_Sales$Suburb == "KANGAROO GROUND SOUTH"] <- "KANGAROO GROUND"
House_Sales$Suburb[House_Sales$Suburb == "KEW NORTH"] <- "KEW"
House_Sales$Suburb[House_Sales$Suburb == "KILLARA (WODONGA)"] <- "KILLARA"
House_Sales$Suburb[House_Sales$Suburb == "LAVERTON SOUTH"] <- "LAVERTON"
House_Sales$Suburb[House_Sales$Suburb == "MERINDA PARK"] <- "CRANBOURNE"
House_Sales$Suburb[House_Sales$Suburb == "NEWTOWN (GREATER GEELONG)"] <- "NEWTOWN"
House_Sales$Suburb[House_Sales$Suburb == "PATTERSON GARDENS"] <- "KEILOR"
House_Sales$Suburb[House_Sales$Suburb == "THOMSON (GREATER GEELONG)"] <- "THOMSON"
House_Sales$Suburb[House_Sales$Suburb == "WENDOUREE WEST"] <- "WENDOUREE"
House_Sales$Suburb[House_Sales$Suburb == "WOOLAMAI WATERS"] <- "CAPE WOOLAMAI"

# summarise unique values and missing values

House_Sales %>% anti_join(LL_Retail, by = "Suburb")  %>% summarise('unique values in House_Sales' = n_distinct(House_Sales$Suburb, na.rm = TRUE),
                          'unmatched values in House_Sales' = n_distinct(`Suburb`, na.rm = TRUE),
                          'missing values in House_Sales' = sum(is.na(`Suburb`)))

Data - Part 2

Merge datasets

To add matching LL_Retail variables to House_Sales, and keeping all rows in House_Sales, the datasets have been left-joined to create a new dataframe Houses_and_Licences.

# Merge House_Sales and LL_Retail to keep all observations in House Sales

Houses_and_Licences <- left_join(House_Sales, LL_Retail, by = "Suburb") 
head(Houses_and_Licences)

Variables of interest have been selected to create the basis of the final merged dataframe, saved as Suburbs_Summary_March2020. The data has been grouped by suburb and arranged by number of liquor licences in descending order, greatest to least.

Variables have been selected for the following reasons:

  • Suburb as a key variable

  • Metro/Regional as a clue to the probable density of the population

  • Q1 2020 as the most recently available median house price

  • Sales Q1 2020 as a clue to the reliability of the median house price (more sales = more reliable)

  • License Num to provide a count of the number of licences in each suburb

Suburb_Summary_March2020 <- Houses_and_Licences %>% group_by(Suburb) %>% 
  summarise('Metro/Regional' = first(`Metro/Regional`), 
            'Quarterly Median House Price' = first(`Q1 2020`), 
            'Quarterly Sales' = first(`Sales Q1 2020`), 
            'Retail Liquor Licences' = n_distinct(`Licence Num`, na.rm = TRUE)) %>% 
  arrange(desc(`Retail Liquor Licences`))

head(Suburb_Summary_March2020)

Understand - Part 2

Understand merged data

The resulting dataframe has 744 observations of 5 variables:

  1. Suburb - the suburb within Victoria (character, and should not be factored as each value is unique)

  2. Metro/Regional - classification as metropolitan Melbourne or regional Victoria, according to the VCGLR (factor with 2 levels: Metro, Regional)

  3. Quarterly Median House Price - the median price of houses sold from January to March 2020 (double numeric)

  4. Quarterly Sales - the number of houses sold from January to March 2020 (integer numeric)

  5. Retail Liquor Licences - the number of active liquor licences permitting sale or service of liquor to the public in March 2020 (integer numeric)

The required conversions of data types were completed in Understand I

# show structure

str(Suburb_Summary_March2020)
## tibble [744 x 5] (S3: tbl_df/tbl/data.frame)
##  $ Suburb                      : chr [1:744] "RICHMOND" "FITZROY" "ST KILDA" "SOUTH YARRA" ...
##  $ Metro/Regional              : Factor w/ 2 levels "Metro","Regional": 1 1 1 1 2 2 1 1 1 1 ...
##  $ Quarterly Median House Price: num [1:744] 1243000 1305500 1771800 1993800 795000 ...
##  $ Quarterly Sales             : int [1:744] 43 18 8 18 11 37 8 15 29 29 ...
##  $ Retail Liquor Licences      : int [1:744] 295 253 249 234 208 191 183 182 169 162 ...

Tidy and Manipulate Data I

Tidy data

House_Sales has many variables with dates in column names. To adhere to tidy data principles, the quarter and year should be variables.

For example, Jan - Mar 19 (read in as Q1 2019) would be three variables; “Quarter”, “Year”, “Median Sale Price”.

The Sales values for 2019 are not available in the data supplied. But the % Change values for 2019 could be imputed based on the Median values.

# Select and gather median sale price values, into "Period" and "Median" variables

HS_Tidy_Quarters <- House_Sales %>% select(`Suburb`, `Q1 2019`, `Q2 2019`, `Q3 2019`, 
                                           `Q4 2019`, `Q1 2020`) %>% 
  gather(`Q1 2019`, `Q2 2019`, `Q3 2019`, `Q4 2019`, `Q1 2020`, 
         key = "Period", value = "Median") %>% arrange(Suburb)

# In a new df, select and gather sales values, into "Period" and "Sales"

HS_Tidy_Sales <- House_Sales %>% select(`Suburb`, `Sales Q1 2020`) %>% 
  gather(`Sales Q1 2020`, key = "Period", value = "Sales") %>% arrange(Suburb)

# Rename "Period" values

HS_Tidy_Sales$Period <- HS_Tidy_Sales$Period <- "Q1 2020"

# In a new df,  select and gather % change values, into "Period" and "% Change"

HS_Tidy_Change <- House_Sales %>% select(`Suburb`, `% Change Q1 2020`) %>% 
  gather(`% Change Q1 2020`, key = "Period", value = "% Change") %>% arrange(Suburb)

# Rename "Period" values

HS_Tidy_Change$Period <- "Q1 2020"

# Now merge the three dataframes, starting with Sales and Change

HS_Tidy_1 <- full_join(HS_Tidy_Sales, HS_Tidy_Change, by = "Suburb")

# Create a key to preserve NA values

HS_Tidy_2 <- HS_Tidy_1 %>% unite(Key, Suburb, Period.x, sep = "/")

HS_Tidy_Quarters_2 <- HS_Tidy_Quarters %>% unite(Key, Suburb, Period, sep = "/")

# Join by the created key

HS_Tidy_3 <- full_join(HS_Tidy_2, HS_Tidy_Quarters_2, by = "Key")

# Separate the created key

HS_Tidy_4 <- HS_Tidy_3 %>% separate(Key, into = c("Suburb", "Period"), sep = "/")

# Select variables and separate "Period" into "Quarter" and "Year"

HS_Tidy_Complete <- HS_Tidy_4 %>% select(Suburb, Period, Median, Sales, `% Change`) %>% 
  separate(Period, into = c("Quarter", "Year"), sep = " ") %>% arrange(Suburb, Year, Quarter)

head(HS_Tidy_Complete)

Tidy and Manipulate Data II

Returning to Suburb_Summary_March2020, two new column variables have been created:

  1. % Share - each suburb’s share of retailing licences*

  2. Rank - each suburb’s rank in terms of median house price, from most expensive (rank 1) to least expensive (rank 562)

Of the possible ranking functions, dense_rank has been chosen because it does not prioritise one observation over another when they share the same position relative to the whole (Github n.d). For this data, dense_rank is the more intuitive than row_number or min_rank which preserve the total number of rows in the ranking algorithm, and would cause the least expensive suburb to be ranked 744 instead of 562.

*It should be remembered that the data does not include suburbs with licenses where there was no median house price information.

# create two new variables

SubS_March_2020 <- Suburb_Summary_March2020 %>% 
  mutate(., "% Share" = `Retail Liquor Licences` / sum(`Retail Liquor Licences`, na.rm = TRUE) * 100, 
         "Rank" = dense_rank(desc(`Quarterly Median House Price`)))

head(select(SubS_March_2020, Suburb, `% Share`, Rank))

Scan I - Part 2

Find and replace missing values in merged data

SubS_March_2020 has been scanned for standard missing values.

Because there were 58 unmatched suburbs in House_Sales before the data was merged, there should be 58 incomplete rows of data. Metro/Regional shows 58 missing values, as expected.

Retail Liquor Licenses has 0 missing values, but this is because they appropriately contain a value of 0, obtained by counting distinct values of the Licence Num variable.

Missing Metro/Regional values can only be reliably imputed by Suburb, as house price can vary greatly. A comprehensive list of Metro suburbs would be a useful resource for writing a function to impute these values.

# check dataframe for missing values

colSums(is.na(SubS_March_2020))
##                       Suburb               Metro/Regional 
##                            0                           58 
## Quarterly Median House Price              Quarterly Sales 
##                            0                            0 
##       Retail Liquor Licences                      % Share 
##                            0                            0 
##                         Rank 
##                            0
# find incomplete cases

SubS_March_2020[!complete.cases(SubS_March_2020), ]
# add missing Metro/Regional values

SubS_March_2020$`Metro/Regional`[SubS_March_2020$Suburb == "ALBANVALE" | SubS_March_2020$Suburb == "BALCOMBE" |
                                   SubS_March_2020$Suburb == "BALNARRING BEACH" | SubS_March_2020$Suburb == "BELGRAVE HEIGHTS" | 
                                   SubS_March_2020$Suburb == "BOTANIC RIDGE" | SubS_March_2020$Suburb == "BROOKFIELD" | 
                                   SubS_March_2020$Suburb == "CAPEL SOUND" | SubS_March_2020$Suburb == "CHELTENHAM EAST" | 
                                   SubS_March_2020$Suburb == "COBBLEBANK" | SubS_March_2020$Suburb == "EAST WARBURTON" | 
                                   SubS_March_2020$Suburb == "ESSENDON WEST" | SubS_March_2020$Suburb == "HARKNESS" | 
                                   SubS_March_2020$Suburb == "KINGS PARK" | SubS_March_2020$Suburb == "KURUNJANG" | 
                                   SubS_March_2020$Suburb == "OFFICER SOUTH" | SubS_March_2020$Suburb == "OSBORNE" |
                                   SubS_March_2020$Suburb == "STUDFIELD" | SubS_March_2020$Suburb == "THORNHILL PARK" | 
                                   SubS_March_2020$Suburb == "WATSONIA NORTH" | SubS_March_2020$Suburb == "WEIR VIEWS"] <- "Metro"
SubS_March_2020$`Metro/Regional`[SubS_March_2020$Suburb == "ASCOT" | SubS_March_2020$Suburb == "BALLARAT NORTH" |
                                   SubS_March_2020$Suburb == "BANDIANA" | SubS_March_2020$Suburb == "BONSHAW" | 
                                   SubS_March_2020$Suburb == "CANADIAN" | SubS_March_2020$Suburb == "CHARLEMONT" | 
                                   SubS_March_2020$Suburb == "COONGULLA" | SubS_March_2020$Suburb == "" |
                                   SubS_March_2020$Suburb == "EAST BAIRNSDALE" | SubS_March_2020$Suburb == "EAST BENDIGO" | 
                                   SubS_March_2020$Suburb == "EUREKA" | SubS_March_2020$Suburb == "FLORA HILL" | 
                                   SubS_March_2020$Suburb == "INVERMAY PARK" | SubS_March_2020$Suburb == "JACKASS FLAT" | 
                                   SubS_March_2020$Suburb == "JEERALANG" | SubS_March_2020$Suburb == "KILLARA" | 
                                   SubS_March_2020$Suburb == "LAKE GARDENS" | SubS_March_2020$Suburb == "LAKE WENDOUREE" | 
                                   SubS_March_2020$Suburb == "LENEVA" | SubS_March_2020$Suburb == "MARSHALL" |
                                   SubS_March_2020$Suburb == "MCKENZIE HILL" | SubS_March_2020$Suburb == "MITCHELL PARK" | 
                                   SubS_March_2020$Suburb == "NEWINGTON" | SubS_March_2020$Suburb == "NEWLANDS ARM" | 
                                   SubS_March_2020$Suburb == "NORTH WONTHAGGI" | SubS_March_2020$Suburb == "PARADISE BEACH" | 
                                   SubS_March_2020$Suburb == "PIONEER BAY" | SubS_March_2020$Suburb == "RAYMOND ISLAND" | 
                                   SubS_March_2020$Suburb == "SAILORS GULLY" | SubS_March_2020$Suburb == "SHEPPARTON NORTH" | 
                                   SubS_March_2020$Suburb == "SUNDERLAND BAY" | SubS_March_2020$Suburb == "SUNSET STRIP" |
                                   SubS_March_2020$Suburb == "SURF BEACH" | SubS_March_2020$Suburb == "THE HONEYSUCKLES" | 
                                   SubS_March_2020$Suburb == "WALKERVILLE" | SubS_March_2020$Suburb == "WANDANA HEIGHTS" | 
                                   SubS_March_2020$Suburb == "WHITTINGTON" | SubS_March_2020$Suburb == "WIMBLEDON HEIGHTS" |
                                   SubS_March_2020$Suburb == "WINTER VALLEY"] <- "Regional"

# check Metro/Regional for missing values

sum(is.na(SubS_March_2020$`Metro/Regional`))
## [1] 0

Special values and obvious errors

Numeric variables have been checked for special values

# check for infinite and nan values

sum(is.infinite(SubS_March_2020$`Quarterly Median House Price`))
## [1] 0
sum(is.nan(SubS_March_2020$`Quarterly Median House Price`))
## [1] 0
sum(is.infinite(SubS_March_2020$`Quarterly Sales`))
## [1] 0
sum(is.nan(SubS_March_2020$`Quarterly Sales`))
## [1] 0
sum(is.infinite(SubS_March_2020$`Retail Liquor Licences`))
## [1] 0
sum(is.nan(SubS_March_2020$`Retail Liquor Licences`))
## [1] 0
sum(is.infinite(SubS_March_2020$`% Share`))
## [1] 0
sum(is.nan(SubS_March_2020$`% Share`))
## [1] 0
sum(is.infinite(SubS_March_2020$Rank))
## [1] 0
sum(is.nan(SubS_March_2020$Rank))
## [1] 0

Numeric values have been checked for obvious errors, using the following parameters:

  • Quarterly Median House Price is greater than 50,000.

  • Quarterly Sales are not less than 0

  • Retail Liquor Licences are not less than 0

# validate price, sales, and licences

valid_price <- SubS_March_2020$`Quarterly Median House Price` >= 50000

valid_sales <- SubS_March_2020$`Quarterly Sales` >= 0

valid_licences <- SubS_March_2020$`Retail Liquor Licences` >= 0

table(valid_price) 
## valid_price
## TRUE 
##  744
table(valid_sales) 
## valid_sales
## TRUE 
##  744
table(valid_licences)
## valid_licences
## TRUE 
##  744

Scan II

The numeric values in SubS_March_2020 have been checked for outliers.

There are many outliers indicating all numeric variables are strongly right-skewed. They are likely to be accurate reflections of the data.

There are two extreme outliers in the Metro median house price. These have been addressed in the next step, Transform.

# create boxplots

boxplot(SubS_March_2020$`Quarterly Median House Price` ~ SubS_March_2020$`Metro/Regional`, 
        main="Median House Price by Region", ylab="Price", col = "beige")

# other variables show similar right-skew

par(mfrow = c(1, 3))
boxplot(SubS_March_2020$`Quarterly Sales` ~ SubS_March_2020$`Metro/Regional`,
        main="Quarterly Sales by Region", ylab="Sales", col = "beige")
boxplot(SubS_March_2020$`Retail Liquor Licences` ~ SubS_March_2020$`Metro/Regional`, 
        main="Liquor Licences by Region", ylab="Licences", col = "beige")
boxplot(SubS_March_2020$`% Share` ~ SubS_March_2020$`Metro/Regional`, 
        main="% Share of Licences by Region", ylab="% Share", col = "beige")

Transform

Quarterly Median House Price has been transformed to reduce right-skewedness and suppress outliers.

Two methods have been applied.

1) Natural logarithm transformation

The house price variable has been plotted as it occurs in the data to show the right skew, and then a logarithmic transformation has been applied.

The transformed data has been plotted to show the effect of the transformation, approximating to a more normal distribution.

par(mfrow = c(2,1))

# plot house prices as normal

hist(SubS_March_2020$`Quarterly Median House Price`, main = "Median House Price", col = "beige")

# transform values

ln_median <- log(SubS_March_2020$`Quarterly Median House Price`)

# plot transformed values

hist(ln_median, main = "Natural Logarithm", col = "beige")

2) Binning by equal frequency

This method has suppressed outliers by evenly distributing the values among categories. The extreme outliers have been placed within the last category.
A dotplot of the binned values reveals a gentle curve trending from least expensive to most expensive. To help make sense of the values, the median price is shown in thousands.

As each bin contains the same number of values, it is easily seen that most suburbs had a median house price of less than 1 million in March 2020. A few suburbs had a median house price from 1 million to about 1.3 million, and an equally few number from about 1.3 million to over 4 million.

# save median price as a vector in thousands AUD

median_price <- SubS_March_2020$`Quarterly Median House Price` / 1000

# bin values by equal frequency

price_binned <- discretize(median_price, disc = "equalfreq")

# join price values to bins

price_and_bins <- median_price %>% bind_cols(price_binned)
## New names:
## * NA -> ...1
# show distribution of bins

table(price_and_bins$X)
## 
##  1  2  3  4  5  6  7  8  9 
## 83 82 83 84 79 82 82 82 87
# draw plot of price by bins

price_and_bins %>% plot(...1 ~ X, data = ., main = "Median House Price March 2020", 
                        xlab = "Bins", ylab = "Thousands, AUD") 

grid(nx = NULL)

References

Australia Post 2020, Find a Postcode, viewed 14 October 2020, https://auspost.com.au/postcode

Australian Bureau of Statistics (ABS) 2020, 2016 Census QuickStats: Port Campbell, viewed 14 October 2020, https://quickstats.censusdata.abs.gov.au/census_services/getproduct/census/2016/quickstat/SSC22092

Cai, Eric 2015, How to Get the Frequency Table of a Categorical Variable as a Data Frame in R, viewed 12 October 2020, https://chemicalstatistician.wordpress.com/2015/02/03/how-to-get-the-frequency-table-of-a-categorical-variable-as-a-data-frame-in-r/

DataMentor n.d., R Factors, viewed 13 October 2020, https://www.datamentor.io/r-programming/factor/

Department of Environment, Land, Water and Planning (DELWP) 2020, March 2020 quarter: Houses by Suburb, viewed 10 October 2020, https://www.propertyandlandtitles.vic.gov.au/__data/assets/excel_doc/0035/485936/Houses_1st_Quarter_2020.xls

Department of Environment, Land, Water and Planning (DELWP) 2020, Property Prices, viewed 10 October 2020, https://www.propertyandlandtitles.vic.gov.au/property-information/property-prices

Department of Environment, Land, Water and Planning (DELWP) 2020, Victorian Property Sales Report: March 2020 Quarter, viewed 11 October 2020, https://www.propertyandlandtitles.vic.gov.au/__data/assets/pdf_file/0032/485915/VPSR_Mar2020_final.pdf

GitHub n.d., Count/tally observations by group, viewed 12 October 2020, https://dplyr.tidyverse.org/reference/tally.html

GitHub n.d., Windowed Rank Functions, viewed 18 October 2020, https://dplyr.tidyverse.org/reference/ranking.html

Municipal Association of Victoria (MAV) n.d., Victorian Councils Map, viewed 12 October 2020, https://www.viccouncils.asn.au/find-your-council/council-map

Sullivan, John 2019, Data Cleaning with R and the Tidyverse: Detecting Missing Values, viewed 12 October 2020, https://towardsdatascience.com/data-cleaning-with-r-and-the-tidyverse-detecting-missing-values-ea23c519bc62

Victorian Commission for Gambling and Liquor Regulation (VCGLR) 2020, Current Victorian Liquor Licences as of 31/03/2020, viewed 10 October 2020, https://www.vcglr.vic.gov.au/sites/default/files/current_victorian_licences_by_location_geo-coded_march_2020.xlsx

Victorian Commission for Gambling and Liquor Regulation (VCGLR) 2020, Designated Areas, viewed 17 October 2020, https://www.vcglr.vic.gov.au/community-services/government-initiatives/designated-areas

Victorian Places 2015, Newtown and Chilwell, viewed 14 October 2020, https://www.victorianplaces.com.au/newtown-and-chilwell