Luke Davey s3235330
Last updated: 07 June, 2022
Rpubs link: www.rpubslinkgoeshere.com
Problem Statement
We are looking at prices of Victorian two bedroom units that were
sold on the last week of the 2018/19 financial year. We are also looking
at the median incomes for Victorian postcodes during that same financial
year. We’d like to know if a postcode’s median income correlates with
property price.
We will use simple linear regression to answer this
question.
Important variables descriptions:
Source: Domain.com.au, 2019, Melbourne Auction Results, domain.com.au, Melbourne, viewed 01 June 2022, https://www.domain.com.au/auction-results/melbourne/2019-06-29
url <- "https://www.domain.com.au/auction-results/melbourne/2019-06-29"
my_session <- session(url, add_headers('user-agent' = 'arbitrary-string'))
response <- my_session$response
content_parsed <- content(response, as = "parsed")
parta <- content_parsed %>%
html_elements(".css-3xqrp1")
partb <- parta %>%
html_elements(".css-1b9txmm")
partc <- parta %>%
html_elements("li") %>%
html_text() %>%
.[c(FALSE, FALSE, TRUE, FALSE)]
address <- partb %>%
html_text()
house_sales <- tibble(address)
house_sales$postcode <- partb %>%
html_attr("href") %>%
gsub(".*(-vic-)", "",.) %>%
gsub("(-2).*", "",.) %>%
gsub("(http).*", "",.)house_sales$property <- parta %>%
html_elements(".css-dpwygs span") %>%
html_text() %>%
.[c(TRUE, FALSE)] %>%
as.factor()
house_sales$beds <- parta %>%
html_elements("span.css-1g2pbs1") %>%
html_text() %>%
gsub("[^0-9.-]", "", .) %>%
as.numeric
house_sales$sale_outcome <- partc %>%
gsub("\\$.*","", .) %>%
gsub("Price.*", "", .) %>%
as.factor()
house_sales$price <- partc %>%
gsub(" max bid", "", .) %>%
gsub("max bid", "", .)
house_sales <- house_sales %>% mutate(
mult_char = str_sub(house_sales$price, start = -1),
mult = case_when(
mult_char == "m" ~ 1000000,
mult_char == "k" ~ 1000),
price = suppressWarnings(parse_number(house_sales$price)) * mult) %>%
select(-mult, -mult_char, )## # A tibble: 6 × 6
## address postcode property beds sale_outcome price
## <chr> <chr> <fct> <dbl> <fct> <dbl>
## 1 202/10-16 Trenerry Cr 3067 Unit 2 Sold 501000
## 2 14 Kardinia Dr 3021 House 4 Sold NA
## 3 176 Kerferd Rd 3206 House 4 Sold after auction NA
## 4 25 Ogrady St 3206 House 4 Sold NA
## 5 95 St Vincent St 3206 House 2 Sold 855000
## 6 22 Lowther St 3078 House 4 Sold NA
Median Income Import
Source: ATO, 2020, table 25, Taxation Statistics 2018-19, Individuals - table25, Australian Government, Australian Tax office, Melbourne, viewed 01 June 2022, https://data.gov.au/data/dataset/2805b28d-2c3b-47e2-87c3-50aacc6ea212/resource/467aeb6c-a779-4308-8b16-323a2ff66b75/download/ts19individual25countaveragemedianbypostcode.csv
median_income <- read_csv("data/ts19individual25countaveragemedianbypostcode.csv",
col_select = c("Postcode", "Median taxable income or loss"),
col_types = cols())
median_income <- median_income[min(grep(3000,median_income$Postcode)
):max(grep(4000, median_income$Postcode)),]
names(median_income)[1] <- tolower(names(median_income)[1])
names(median_income)[2] <- "median_income"
median_income$postcode <- as.character(median_income$postcode)house_sales <- left_join(house_sales, median_income, by = "postcode")
names(house_sales)[7] <- "median_income"
print(head(house_sales))## # A tibble: 6 × 7
## address postcode property beds sale_outcome price median_income
## <chr> <chr> <fct> <dbl> <fct> <dbl> <dbl>
## 1 202/10-16 Trener… 3067 Unit 2 Sold 501000 60164
## 2 14 Kardinia Dr 3021 House 4 Sold NA 38543
## 3 176 Kerferd Rd 3206 House 4 Sold after auc… NA 69969
## 4 25 Ogrady St 3206 House 4 Sold NA 69969
## 5 95 St Vincent St 3206 House 2 Sold 855000 69969
## 6 22 Lowther St 3078 House 4 Sold NA 56600
Data Clean
We are investigating if median_income
correlates with price. Observations missing either, are of little
use.
Steps to preprocess the data:
NAs from postcode
Justification:
Without a
postcode the join function couldn’t locate a relevent median_income.
Which means that obseervation can’t tell us anything about it’s
relationship to house prices.
Investigating the missing values
The two
expired links we rvested didn’t feature postcode or suburb name. - Both
bedroom counts != 2 so will be removed from the final analysis.
NAs from price
Justification:
We will
remove observations missing a price value because the options to treat
these missing prices aren’t appropriate. For example, we could impute
mean prices grouped by suburb, beds and property_type. But there are so
few values in each of these buckets, this wouldn’t be suitable. I will
instead remove these rows. Note: The missing price
constitutes 30% of our observations. This is an unusually large portion
of a data set to remove. Acknowledging this, I will now remove them.
Subset for 2 bedroom units
house_sales_2bed_unit <- filter(house_sales_clean, beds == 2 & property == "Unit")
nrow(house_sales_2bed_unit)## [1] 48
options(scipen = 999)
plot(price ~ median_income, data = house_sales_2bed_unit,
main = "2 Bed Unit Price by Postcode's Median Income", xlab = "Median Income", ylab = "Sale Price")## [1] 0.4519093
The data appears to have a moderate positive correlation. We’ve quickly calculated it at 0.45.
income_price_model <- lm(price ~ median_income, data = house_sales_2bed_unit)
income_price_model %>% summary() ##
## Call:
## lm(formula = price ~ median_income, data = house_sales_2bed_unit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -264705 -82480 -3191 75659 239255
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 228618.050 105782.819 2.161 0.03592 *
## median_income 6.659 1.938 3.436 0.00126 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 124000 on 46 degrees of freedom
## Multiple R-squared: 0.2042, Adjusted R-squared: 0.1869
## F-statistic: 11.81 on 1 and 46 DF, p-value: 0.001262
Let’s start by looking at R2 = 0.204. this is the proportion of variability in the price variable that can be explained by a linear relationship with the predictor median_income variable.
Next, the F-test
\(H0\) : The
data do not fit the linear regression model.
\(H_A\) : The data fit the linear regression
model.
So we will start by confirming the p-value from the lm()
summary().
## [1] 0.001259827
We confirm the p-value reported in the summary of 0.00126. Meaning p < .01. As p is less than the 0.05 level of significance, we reject \(H0\). There was statistically significant evidence that the data fit a linear regression model.
The coefficients table
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 228618.04986 105782.819336 2.161202 0.03592033
## median_income 6.65851 1.937951 3.435850 0.00126248
The intercept/constant is reported as a = 228618.05. The constant, or
intercept, is the average value for y when x = 0. To test the
statistical significance of the constant, we set the following
statistical hypotheses:
\(H0 : a =
0\)
\(H_A : a \ne 0\)
## 2.5 % 97.5 %
## (Intercept) 15688.278373 441547.8213
## median_income 2.757617 10.5594
R reports the 95% CI for a to be [15688.278, 441547.821]. \(H0 : a = 0\) is clearly not captured by this interval, so was rejected.
The slope of the regression line was reported as b = 6.659. The slope represents the average increase in y following a one unit increase in x. In our case an additional dollar in median income for the postcode.
The hypothesis test of the slope, b, was as follows:
\(H0 : \beta = 0\)
\(H_A : \beta \ne 0\)
The slope was also tested using the t statistic which was reported as t = 3.436 at a p < .01. To calculate the two-tailed p-value for the slope in this example, we compute:
## [1] 0.001284864
and confirm that p < .01. As p < .05.
Looking back to the confint() function, R reports the 95% CI for b to be [2.758, 10.559]. This 95% CI does not capture \(H0\), therefore it was rejected. There was a statistically significant positive relationship between median_income of a postcode and price of a sold 2 bed unit in that postcode.
A hypothesis test for r has the following statistical hypotheses:
\(H_0: r = 0\)
\(H_A: r \ne 0\)
bivariate<-as.matrix(select(house_sales_2bed_unit, price, median_income))
rcorr(bivariate, type = "pearson")## price median_income
## price 1.00 0.45
## median_income 0.45 1.00
##
## n= 48
##
##
## P
## price median_income
## price 0.0013
## median_income 0.0013
rcorr confirms the correlation between price and median_income to be r = .45. The p-value = .0013, which we write as p < .01. Which means we must reject \(H_0\). There is a fatalistically significant positive correlation between a postcode’s median income and a 2 bedroom unit’s sale price.
options(scipen = 999)
plot(price ~ median_income, data = house_sales_2bed_unit, main = "2 Bed Unit Price by Postcode's Median Income", xlab = "Median Income", ylab = "Sale Price")
abline(income_price_model, col = "red")
Residual vs. Fitted This plot shows no non-linear
trends. The relationship looks flat which is a good sign. We can have
some evidence here of homoscedasticity.
We need to check that ther have not been any great departures from data
being normally distributed. There’s nothing here to say there has
been.
This plot indicates that there may be a value unduly influencing the fit
of the regression model.
Source: Domain.com.au, 2019, Melbourne Auction Results, domain.com.au, Melbourne, viewed 01 June 2022, https://www.domain.com.au/auction-results/melbourne/2019-06-29
Source: ATO, 2020, table 25, Taxation Statistics 2018-19, Individuals - table25, Australian Government, Australian Tax office, Melbourne, viewed 01 June 2022, https://data.gov.au/data/dataset/2805b28d-2c3b-47e2-87c3-50aacc6ea212/resource/467aeb6c-a779-4308-8b16-323a2ff66b75/download/ts19individual25countaveragemedianbypostcode.csv
James Baglin, 2022, Module 9, Simple Linear Regression and Correlation, RMIT, Melbourne, viewed 05 June 2022, https://astral-theory-157510.appspot.com/secured/MATH1324_Module_09.html