1 Melbourne house prices regression model

1.1 Load the data

setwd("E:/Master/Sydney/Semester2-2020/STAT5003_Computational Statistical Methods/Week02")
melbdata <- read.csv("Melbourne_housing_FULL.csv", header = TRUE)
dim(melbdata)
## [1] 34857    21
colnames(melbdata)
##  [1] "Suburb"        "Address"       "Rooms"         "Type"         
##  [5] "Price"         "Method"        "SellerG"       "Date"         
##  [9] "Distance"      "Postcode"      "Bedroom2"      "Bathroom"     
## [13] "Car"           "Landsize"      "BuildingArea"  "YearBuilt"    
## [17] "CouncilArea"   "Lattitude"     "Longtitude"    "Regionname"   
## [21] "Propertycount"

1.2 InitiaL data analysis

subset the data to only look at 3 suburbs - Brunswick, Craigieburn and Hawthorn. And use unique() to check if the filter is right.

melsuburbs <- subset(melbdata, Suburb == "Brunswick"|Suburb == "Craigieburn"|Suburb == "Hawthorn")
dim(melsuburbs)
## [1] 1127   21
unique(melsuburbs$Suburb)
## [1] "Brunswick"   "Hawthorn"    "Craigieburn"

Then we can view the data summary through str() and summary().

str(melsuburbs)
summary(melsuburbs)

Now let’s determine the average price in each of these three suburbs. Firstly, remove NA data in price.

melsuburbs <- melsuburbs[!is.na(melsuburbs$Price), ]
dim(melsuburbs)
## [1] 908  21
aggregate(Price~Suburb, data=melsuburbs, mean)
##        Suburb     Price
## 1   Brunswick  977988.8
## 2 Craigieburn  566173.5
## 3    Hawthorn 1238074.2

Here let us view some numeral variables’ relationship with price.

library(GGally)
library(plotly)

melsuburbs$Pricethousand<-melsuburbs$Price/1000

p2<-ggscatmat(
  data = melsuburbs, 
  columns = c("YearBuilt","Rooms","Car", "Pricethousand"),color = "Suburb" 
)+  theme(legend.position = "bottom") +
 labs(title="Multivariate scatter plot matrix", subtitle="By three suburbs:Brunswick,Craigieburn,Hawthorn")

ggplotly(p2 ,height = 600, width = 800) %>%
  layout(legend = list(
      orientation = "h",  y =-0.05
    )
  )

Note: Pricethousand=Price/1000, we use this new variable in the following models

From the charts above, we can see that Rooms and Car have high correlation with Price.

1.3 Finding association I

Firstly, we construct a simple linear regression using only BuildingArea as a predictor.

lmfit <- lm(Pricethousand ~ BuildingArea, data = melsuburbs)
summary(lmfit)
## 
## Call:
## lm(formula = Pricethousand ~ BuildingArea, data = melsuburbs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3421.8  -463.9  -148.0   259.1  6052.4 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  518.1921    66.5956   7.781 5.74e-14 ***
## BuildingArea   3.8007     0.4082   9.311  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 730 on 416 degrees of freedom
##   (490 observations deleted due to missingness)
## Multiple R-squared:  0.1725, Adjusted R-squared:  0.1705 
## F-statistic: 86.69 on 1 and 416 DF,  p-value: < 2.2e-16
plot(melsuburbs$Pricethousand ~ melsuburbs$BuildingArea,xlab="BuildingArea",ylab="Pricethousand")
abline(lmfit,col="red")

Using BuildingArea as predictor in our lmfit1 model, it shows “Adjusted R-squared: 0.1705”.

some other plots for regression diagnosis:

par(mfrow=c(2,2)) 
plot(lmfit)

Then, we would also like to try YearBuilt as a predictor.

lmfit <- lm(Pricethousand ~ YearBuilt, data = melsuburbs)
summary(lmfit)
## 
## Call:
## lm(formula = Pricethousand ~ YearBuilt, data = melsuburbs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1174.3  -355.3  -127.4   116.4  5824.5 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 16992.0752  1339.5406   12.69   <2e-16 ***
## YearBuilt      -8.1409     0.6832  -11.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 683.8 on 472 degrees of freedom
##   (434 observations deleted due to missingness)
## Multiple R-squared:  0.2313, Adjusted R-squared:  0.2296 
## F-statistic:   142 on 1 and 472 DF,  p-value: < 2.2e-16
plot(melsuburbs$Pricethousand ~ melsuburbs$YearBuilt,xlab="YearBuilt",ylab="Pricethousand")
abline(lmfit,col="red")

We can see Adjusted R-squared: 0.2296, larger than 0.1705. In lmfit2 model, using YearBuilt seems have a better fit.

1.4 Finding association II

a

library(car)
## Loading required package: carData
lmfit2 <- lm(Pricethousand~ BuildingArea+Suburb, data = melsuburbs)
summary(lmfit2)
## 
## Call:
## lm(formula = Pricethousand ~ BuildingArea + Suburb, data = melsuburbs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4074.1  -248.5   -26.7   166.0  5479.9 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        509.7567    62.4510   8.163 3.99e-15 ***
## BuildingArea         4.4354     0.3469  12.786  < 2e-16 ***
## SuburbCraigieburn -660.2773    72.9115  -9.056  < 2e-16 ***
## SuburbHawthorn     400.7159    72.5435   5.524 5.87e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 609.9 on 414 degrees of freedom
##   (490 observations deleted due to missingness)
## Multiple R-squared:  0.425,  Adjusted R-squared:  0.4208 
## F-statistic:   102 on 3 and 414 DF,  p-value: < 2.2e-16
crPlots(lmfit2)

The Pr(>t) acronym found in the model output relates to the probability of observing any value equal or larger than t. A small p-value indicates that it is unlikely we will observe a relationship between the predictor (BuildingAreaSuburb) and response (Price) variables due to chance. Typically, a p-value of 5% or less is a good cut-off point. In our model, the p-values are very close to zero. Consequently, a small p-value for the intercept and the slope indicates that we can reject the null hypothesis which allows us to conclude that there is a relationship between BuildingAreaSuburb and Price.

par(mfrow=c(2,2)) 
plot(lmfit2)

b

lmfit3 <- lm(Pricethousand ~ BuildingArea+Suburb+Car, data = melsuburbs)
summary(lmfit3)
## 
## Call:
## lm(formula = Pricethousand ~ BuildingArea + Suburb + Car, data = melsuburbs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3417.3  -273.4   -59.2   252.5  5001.7 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        333.5607    67.0114   4.978 9.56e-07 ***
## BuildingArea         3.7617     0.3513  10.708  < 2e-16 ***
## SuburbCraigieburn -781.1077    73.7764 -10.588  < 2e-16 ***
## SuburbHawthorn     363.4368    71.1395   5.109 5.02e-07 ***
## Car                220.7405    34.6168   6.377 4.98e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 588.5 on 402 degrees of freedom
##   (501 observations deleted due to missingness)
## Multiple R-squared:  0.477,  Adjusted R-squared:  0.4718 
## F-statistic: 91.66 on 4 and 402 DF,  p-value: < 2.2e-16

Compared lmfit3’s Adjusted R-squared with lmfit2, 0.4718 > 0.4208. So, adding the number of car spaces as a predictor improved the prediction model.

1.5 Impact of outliers

outliers <- which(melsuburbs$BuildingArea <= 5 | melsuburbs$BuildingArea  > 300)
melsuburbs.nooutliers <- melsuburbs[-outliers,]
dim(melsuburbs.nooutliers)
## [1] 891  22
lmfit4 <- lm(Pricethousand ~ BuildingArea, data = melsuburbs.nooutliers)
summary(lmfit4)
## 
## Call:
## lm(formula = Pricethousand ~ BuildingArea, data = melsuburbs.nooutliers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1186.4  -416.9   -41.9   273.8  5719.3 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  190.6445    81.5379   2.338   0.0199 *  
## BuildingArea   6.1269     0.5668  10.809   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 643.5 on 399 degrees of freedom
##   (490 observations deleted due to missingness)
## Multiple R-squared:  0.2265, Adjusted R-squared:  0.2246 
## F-statistic: 116.8 on 1 and 399 DF,  p-value: < 2.2e-16
plot(melsuburbs.nooutliers$Pricethousand ~ melsuburbs.nooutliers$BuildingArea,xlab="BuildingArea",ylab="Pricethousand")
abline(lmfit4,col="red")

When we compared the Adjusted R-squared between lmfit1 and lmfit4, we can see that after removing the outliers, we have improved the Adjusted R-squared from 0.1705 to 0.2246. And we can also see from the chart, lmfit4 has obvious better fit.

1.6 Prediction

predictData <- data.frame(BuildingArea = 100, Suburb= "Hawthorn",Car = 2)
lm3.pred <- predict(lmfit3, predictData, interval = "prediction", level = 0.95)
knitr::kable(lm3.pred)
fit lwr upr
1514.653 351.5393 2677.767

Note: Pricethousand=Price/1000, we use Pricethousand to train our model in lmfit3, so we need to multiply 1000 to return the original predict price

We believe that the result is 1514653, and the probability that the result falls within (351539.3, 2677767) is 95%.

2 Shiny app of a regression

click to view the shiny app:

Melbourne House Prices Linear Regression Model (Univariate)

APP. Student Info

Course: STAT5003_Computational Statistical Methods
Assignment: Lab Week 2
Student Name: Yujun Yao(June Yao)
SID: 500316995
Email: