Introduction

Problem Statement

We are looking at prices of Victorian two bedroom units that were sold on the last week of the 2018/19 financial year. We are also looking at the median incomes for Victorian postcodes during that same financial year. We’d like to know if a postcode’s median income correlates with property price.
We will use simple linear regression to answer this question.

Important variables descriptions:

Beds - how many bedrooms does the property have. {ordinal}
Postcode - what is the postcode the property is in. {nominal}
Price - how much did the property sell for. {scale $}
Median_income - what is the median income of the postcode. {scale $}
Property - what is the property type. Five different types of dwelling.
- levels: “House” “Terrace” “Townhouse” “Unit” “Villa”
Sale_outcome
- levels: “Passed in” “Private exchange” “Sold” “Sold after auction” “Sold prior to auction” “Withdrawn”

Data - Rvest Part 1/3

Source: Domain.com.au, 2019, Melbourne Auction Results, domain.com.au, Melbourne, viewed 01 June 2022, https://www.domain.com.au/auction-results/melbourne/2019-06-29

url <- "https://www.domain.com.au/auction-results/melbourne/2019-06-29"
my_session <- session(url, add_headers('user-agent' = 'arbitrary-string'))
response <- my_session$response
content_parsed <- content(response, as = "parsed")

parta <- content_parsed %>%
    html_elements(".css-3xqrp1")
partb <- parta %>%
    html_elements(".css-1b9txmm") 
partc <- parta %>%
  html_elements("li") %>%
  html_text() %>%
  .[c(FALSE, FALSE, TRUE, FALSE)]

address <- partb %>%
    html_text()

house_sales <- tibble(address) 

house_sales$postcode <- partb %>%
  html_attr("href") %>%
  gsub(".*(-vic-)", "",.) %>%
  gsub("(-2).*", "",.) %>%
  gsub("(http).*", "",.)

Data - Rvest Part 2/3

house_sales$property <- parta %>%
  html_elements(".css-dpwygs span") %>%
  html_text() %>%
  .[c(TRUE, FALSE)] %>%
  as.factor()

house_sales$beds <- parta %>%
  html_elements("span.css-1g2pbs1") %>%
  html_text() %>%
  gsub("[^0-9.-]", "", .) %>%
  as.numeric

house_sales$sale_outcome <- partc %>%
  gsub("\\$.*","", .) %>%
  gsub("Price.*", "", .) %>%
  as.factor()

house_sales$price <- partc %>%
  gsub(" max bid", "", .) %>%
  gsub("max bid", "", .)

house_sales <- house_sales %>% mutate(
    mult_char = str_sub(house_sales$price, start = -1),
    mult = case_when(
      mult_char == "m" ~ 1000000,
      mult_char == "k" ~ 1000),
    price = suppressWarnings(parse_number(house_sales$price)) * mult) %>%
  select(-mult, -mult_char, )

Data - Rvest Part 3/3 & Median Import

print(head(house_sales))

## # A tibble: 6 × 6
##   address               postcode property  beds sale_outcome        price
##   <chr>                 <chr>    <fct>    <dbl> <fct>               <dbl>
## 1 202/10-16 Trenerry Cr 3067     Unit         2 Sold               501000
## 2 14 Kardinia Dr        3021     House        4 Sold                   NA
## 3 176 Kerferd Rd        3206     House        4 Sold after auction     NA
## 4 25 Ogrady St          3206     House        4 Sold                   NA
## 5 95 St Vincent St      3206     House        2 Sold               855000
## 6 22 Lowther St         3078     House        4 Sold                   NA

Median Income Import

Source: ATO, 2020, table 25, Taxation Statistics 2018-19, Individuals - table25, Australian Government, Australian Tax office, Melbourne, viewed 01 June 2022, https://data.gov.au/data/dataset/2805b28d-2c3b-47e2-87c3-50aacc6ea212/resource/467aeb6c-a779-4308-8b16-323a2ff66b75/download/ts19individual25countaveragemedianbypostcode.csv

median_income <- read_csv("data/ts19individual25countaveragemedianbypostcode.csv",
                          col_select = c("Postcode", "Median taxable income or loss"),
                          col_types = cols())

median_income <- median_income[min(grep(3000,median_income$Postcode)
  ):max(grep(4000, median_income$Postcode)),]

names(median_income)[1] <- tolower(names(median_income)[1])
names(median_income)[2] <- "median_income"

median_income$postcode <- as.character(median_income$postcode)

Data - Median Income Join & Data Clean

house_sales <- left_join(house_sales, median_income, by = "postcode")

names(house_sales)[7] <- "median_income"

print(head(house_sales))

## # A tibble: 6 × 7
##   address           postcode property  beds sale_outcome     price median_income
##   <chr>             <chr>    <fct>    <dbl> <fct>            <dbl>         <dbl>
## 1 202/10-16 Trener… 3067     Unit         2 Sold            501000         60164
## 2 14 Kardinia Dr    3021     House        4 Sold                NA         38543
## 3 176 Kerferd Rd    3206     House        4 Sold after auc…     NA         69969
## 4 25 Ogrady St      3206     House        4 Sold                NA         69969
## 5 95 St Vincent St  3206     House        2 Sold            855000         69969
## 6 22 Lowther St     3078     House        4 Sold                NA         56600

Data Clean
We are investigating if median_income correlates with price. Observations missing either, are of little use.

Steps to preprocess the data:

Remove NAs from postcode. There are two in the dataset.
Remove NAs from price. There are 134 in the dataset.
Subset for 2 bedroom units.

Data Clean

NAs from postcode
Justification:
Without a postcode the join function couldn’t locate a relevent median_income. Which means that obseervation can’t tell us anything about it’s relationship to house prices.

Investigating the missing values
The two expired links we rvested didn’t feature postcode or suburb name. - Both bedroom counts != 2 so will be removed from the final analysis.

house_sales_clean <- house_sales[complete.cases(house_sales$median_income),]

NAs from price
Justification:
We will remove observations missing a price value because the options to treat these missing prices aren’t appropriate. For example, we could impute mean prices grouped by suburb, beds and property_type. But there are so few values in each of these buckets, this wouldn’t be suitable. I will instead remove these rows. Note: The missing price constitutes 30% of our observations. This is an unusually large portion of a data set to remove. Acknowledging this, I will now remove them.

house_sales_clean <- house_sales_clean[complete.cases(house_sales_clean$price),]

Subset for 2 bedroom units

house_sales_2bed_unit <- filter(house_sales_clean, beds == 2 & property == "Unit")
nrow(house_sales_2bed_unit)

## [1] 48

Descriptive Statistics and Visualisation

options(scipen = 999)
plot(price ~ median_income, data = house_sales_2bed_unit, 
     main = "2 Bed Unit Price by Postcode's Median Income", xlab = "Median Income", ylab = "Sale Price")

cor(house_sales_2bed_unit$price, house_sales_2bed_unit$median_income)

## [1] 0.4519093

The data appears to have a moderate positive correlation. We’ve quickly calculated it at 0.45.

Hypothesis Testing

income_price_model <- lm(price ~ median_income, data = house_sales_2bed_unit)
income_price_model %>% summary()

## 
## Call:
## lm(formula = price ~ median_income, data = house_sales_2bed_unit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -264705  -82480   -3191   75659  239255 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   228618.050 105782.819   2.161  0.03592 * 
## median_income      6.659      1.938   3.436  0.00126 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 124000 on 46 degrees of freedom
## Multiple R-squared:  0.2042, Adjusted R-squared:  0.1869 
## F-statistic: 11.81 on 1 and 46 DF,  p-value: 0.001262

Let’s start by looking at R² = 0.204. this is the proportion of variability in the price variable that can be explained by a linear relationship with the predictor median_income variable.

Next, the F-test
$H0$ : The data do not fit the linear regression model.
$H_A$ : The data fit the linear regression model.
So we will start by confirming the p-value from the lm() summary().

Hypothesis Testing F-test

pf(q = 11.81, 1, 46, lower.tail = FALSE)

## [1] 0.001259827

We confirm the p-value reported in the summary of 0.00126. Meaning p < .01. As p is less than the 0.05 level of significance, we reject $H0$. There was statistically significant evidence that the data fit a linear regression model.

The coefficients table

income_price_model %>% summary() %>% coef()

##                   Estimate    Std. Error  t value   Pr(>|t|)
## (Intercept)   228618.04986 105782.819336 2.161202 0.03592033
## median_income      6.65851      1.937951 3.435850 0.00126248

The intercept/constant is reported as a = 228618.05. The constant, or intercept, is the average value for y when x = 0. To test the statistical significance of the constant, we set the following statistical hypotheses:
$H0 : a = 0$
$H_A : a \ne 0$

income_price_model %>% confint()

##                      2.5 %      97.5 %
## (Intercept)   15688.278373 441547.8213
## median_income     2.757617     10.5594

Hypothesis Testing - The Intercept

R reports the 95% CI for a to be [15688.278, 441547.821]. $H0 : a = 0$ is clearly not captured by this interval, so was rejected.

The slope of the regression line was reported as b = 6.659. The slope represents the average increase in y following a one unit increase in x. In our case an additional dollar in median income for the postcode.

The hypothesis test of the slope, b, was as follows:
$H0 : \beta = 0$
$H_A : \beta \ne 0$

The slope was also tested using the t statistic which was reported as t = 3.436 at a p < .01. To calculate the two-tailed p-value for the slope in this example, we compute:

2*pt(q = 3.44 ,df = 46 - 2, lower.tail = FALSE)

## [1] 0.001284864

and confirm that p < .01. As p < .05.

Looking back to the confint() function, R reports the 95% CI for b to be [2.758, 10.559]. This 95% CI does not capture $H0$, therefore it was rejected. There was a statistically significant positive relationship between median_income of a postcode and price of a sold 2 bed unit in that postcode.

Hypothesis Testing - Last Re-Test

A hypothesis test for r has the following statistical hypotheses:
$H_0: r = 0$
$H_A: r \ne 0$

bivariate<-as.matrix(select(house_sales_2bed_unit, price, median_income))
rcorr(bivariate, type = "pearson")

##               price median_income
## price          1.00          0.45
## median_income  0.45          1.00
## 
## n= 48 
## 
## 
## P
##               price  median_income
## price                0.0013       
## median_income 0.0013

rcorr confirms the correlation between price and median_income to be r = .45. The p-value = .0013, which we write as p < .01. Which means we must reject $H_0$. There is a fatalistically significant positive correlation between a postcode’s median income and a 2 bedroom unit’s sale price.

Hypothesis Testing - Summary Plot

options(scipen = 999)
plot(price ~ median_income, data = house_sales_2bed_unit, main = "2 Bed Unit Price by Postcode's Median Income", xlab = "Median Income", ylab = "Sale Price")
abline(income_price_model, col = "red")

Assumptions

plot(income_price_model, which = 1)

Residual vs. Fitted This plot shows no non-linear trends. The relationship looks flat which is a good sign. We can have some evidence here of homoscedasticity.

Assumptions - QQ Plot

plot(income_price_model, which = 2)

We need to check that ther have not been any great departures from data being normally distributed. There’s nothing here to say there has been.

Assumptions

plot(income_price_model, which = 3)

More evidence for homoscedasticity.

Assumptions

plot(income_price_model, which = 5)

This plot indicates that there may be a value unduly influencing the fit of the regression model.

Discussion

median_income of a postcode is correlated with 2 bedroom unit sale prices.
A limitation was the amount of sales observed.
A future investigation could look at all the 2 bedroom unit sales for the 2019/20 financial year.

References

Source: Domain.com.au, 2019, Melbourne Auction Results, domain.com.au, Melbourne, viewed 01 June 2022, https://www.domain.com.au/auction-results/melbourne/2019-06-29

James Baglin, 2022, Module 9, Simple Linear Regression and Correlation, RMIT, Melbourne, viewed 05 June 2022, https://astral-theory-157510.appspot.com/secured/MATH1324_Module_09.html

Property Prices & Postcode Incomes Linked?

Two bedroom unit prices correlated with postcode’s median incomes for the 2018/19 financial year?

Introduction

Data - Rvest Part 1/3

Data - Rvest Part 2/3

Data - Rvest Part 3/3 & Median Import

Data - Median Income Join & Data Clean

Data Clean

Descriptive Statistics and Visualisation

Hypothesis Testing

Hypothesis Testing F-test

Hypothesis Testing - The Intercept

Hypothesis Testing - Last Re-Test

Hypothesis Testing - Summary Plot

Assumptions

Assumptions - QQ Plot

Assumptions

Assumptions

Discussion

References