Assignment_week02_June Yao_500316995
1 Melbourne house prices regression model
1.1 Load the data
setwd("E:/Master/Sydney/Semester2-2020/STAT5003_Computational Statistical Methods/Week02")
melbdata <- read.csv("Melbourne_housing_FULL.csv", header = TRUE)
dim(melbdata)
## [1] 34857 21
colnames(melbdata)
## [1] "Suburb" "Address" "Rooms" "Type"
## [5] "Price" "Method" "SellerG" "Date"
## [9] "Distance" "Postcode" "Bedroom2" "Bathroom"
## [13] "Car" "Landsize" "BuildingArea" "YearBuilt"
## [17] "CouncilArea" "Lattitude" "Longtitude" "Regionname"
## [21] "Propertycount"1.2 InitiaL data analysis
Subset the data to look at only three suburbs: Brunswick, Craigieburn and Hawthorn, and use unique() to check that the filter is correct.
melsuburbs <- subset(melbdata, Suburb == "Brunswick"|Suburb == "Craigieburn"|Suburb == "Hawthorn")
dim(melsuburbs)
## [1] 1127 21
unique(melsuburbs$Suburb)
## [1] "Brunswick" "Hawthorn" "Craigieburn"Then we can view the data summary through str() and summary().
Now let’s determine the average price in each of these three suburbs. First, rows with missing Price need to be excluded; the formula interface of aggregate() does this automatically through its default na.action = na.omit.
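If you prefer to drop the missing prices explicitly first, a minimal sketch (melsuburbs.priced is a hypothetical name; the aggregate() call below gives the same result on the full subset):
# Keep only rows where Price is observed
melsuburbs.priced <- melsuburbs[!is.na(melsuburbs$Price), ]
aggregate(Price ~ Suburb, data = melsuburbs.priced, mean)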
aggregate(Price~Suburb, data=melsuburbs, mean)
## Suburb Price
## 1 Brunswick 977988.8
## 2 Craigieburn 566173.5
## 3 Hawthorn 1238074.2
Here let us view the relationship between some numeric variables and Price.
library(GGally)
library(plotly)
melsuburbs$Pricethousand <- melsuburbs$Price / 1000
p2 <- ggscatmat(
  data = melsuburbs,
  columns = c("YearBuilt", "Rooms", "Car", "Pricethousand"), color = "Suburb"
) + theme(legend.position = "bottom") +
  labs(title = "Multivariate scatter plot matrix",
       subtitle = "By three suburbs: Brunswick, Craigieburn, Hawthorn")
ggplotly(p2, height = 600, width = 800) %>%
  layout(legend = list(orientation = "h", y = -0.05))
Note: Pricethousand = Price/1000; we use this new variable in the models that follow.
From the charts above, Rooms and Car appear to have relatively high correlations with Price.
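To quantify what the scatter matrix suggests, a minimal sketch of the pairwise correlations (pairwise.complete.obs handles the missing values):
# Pairwise Pearson correlations, ignoring missing values pair by pair
cor(melsuburbs[, c("YearBuilt", "Rooms", "Car", "Pricethousand")],
    use = "pairwise.complete.obs")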
1.3 Finding association I
First, we construct a simple linear regression using only BuildingArea as a predictor (stored as lmfit1 so we can refer back to it later).
lmfit1 <- lm(Pricethousand ~ BuildingArea, data = melsuburbs)
summary(lmfit1)
##
## Call:
## lm(formula = Pricethousand ~ BuildingArea, data = melsuburbs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3421.8 -463.9 -148.0 259.1 6052.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 518.1921 66.5956 7.781 5.74e-14 ***
## BuildingArea 3.8007 0.4082 9.311 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 730 on 416 degrees of freedom
## (490 observations deleted due to missingness)
## Multiple R-squared: 0.1725, Adjusted R-squared: 0.1705
## F-statistic: 86.69 on 1 and 416 DF, p-value: < 2.2e-16
plot(melsuburbs$Pricethousand ~ melsuburbs$BuildingArea, xlab = "BuildingArea", ylab = "Pricethousand")
abline(lmfit1, col = "red")
Using BuildingArea as the only predictor, lmfit1 achieves an adjusted R-squared of 0.1705.
Some other plots for regression diagnosis:
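A minimal sketch of the standard lm diagnostics (residuals vs fitted, normal Q-Q, scale-location and leverage), assuming the fitted model object from above:
# Draw the four standard diagnostic plots in a 2x2 grid
par(mfrow = c(2, 2))
plot(lmfit1)
par(mfrow = c(1, 1))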
Then, we would also like to try YearBuilt as a predictor.
lmfit <- lm(Pricethousand ~ YearBuilt, data = melsuburbs)
summary(lmfit)
##
## Call:
## lm(formula = Pricethousand ~ YearBuilt, data = melsuburbs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1174.3 -355.3 -127.4 116.4 5824.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16992.0752 1339.5406 12.69 <2e-16 ***
## YearBuilt -8.1409 0.6832 -11.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 683.8 on 472 degrees of freedom
## (434 observations deleted due to missingness)
## Multiple R-squared: 0.2313, Adjusted R-squared: 0.2296
## F-statistic: 142 on 1 and 472 DF, p-value: < 2.2e-16
plot(melsuburbs$Pricethousand ~ melsuburbs$YearBuilt, xlab = "YearBuilt", ylab = "Pricethousand")
abline(lmfit, col = "red")
The adjusted R-squared is 0.2296, larger than lmfit1's 0.1705, so the model using YearBuilt seems to fit better.
1.4 Finding association II
a
library(car)
## Loading required package: carData
lmfit2 <- lm(Pricethousand~ BuildingArea+Suburb, data = melsuburbs)
summary(lmfit2)
##
## Call:
## lm(formula = Pricethousand ~ BuildingArea + Suburb, data = melsuburbs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4074.1 -248.5 -26.7 166.0 5479.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 509.7567 62.4510 8.163 3.99e-15 ***
## BuildingArea 4.4354 0.3469 12.786 < 2e-16 ***
## SuburbCraigieburn -660.2773 72.9115 -9.056 < 2e-16 ***
## SuburbHawthorn 400.7159 72.5435 5.524 5.87e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 609.9 on 414 degrees of freedom
## (490 observations deleted due to missingness)
## Multiple R-squared: 0.425, Adjusted R-squared: 0.4208
## F-statistic: 102 on 3 and 414 DF, p-value: < 2.2e-16
crPlots(lmfit2)
The Pr(>|t|) column in the model output gives the probability of observing a t value at least as extreme as the one computed, under the null hypothesis of no association. A small p-value means it is unlikely that the observed relationship between the predictors (BuildingArea and Suburb) and the response (Price) arose by chance; a cut-off of 5% is typical. In our model the p-values are all very close to zero, so we reject the null hypothesis and conclude that BuildingArea and Suburb are associated with Price.
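Those p-values can be pulled straight out of the fit; a minimal sketch:
# Coefficient table; the "Pr(>|t|)" column holds the p-values
summary(lmfit2)$coefficients[, "Pr(>|t|)"]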
b
lmfit3 <- lm(Pricethousand ~ BuildingArea+Suburb+Car, data = melsuburbs)
summary(lmfit3)
##
## Call:
## lm(formula = Pricethousand ~ BuildingArea + Suburb + Car, data = melsuburbs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3417.3 -273.4 -59.2 252.5 5001.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 333.5607 67.0114 4.978 9.56e-07 ***
## BuildingArea 3.7617 0.3513 10.708 < 2e-16 ***
## SuburbCraigieburn -781.1077 73.7764 -10.588 < 2e-16 ***
## SuburbHawthorn 363.4368 71.1395 5.109 5.02e-07 ***
## Car 220.7405 34.6168 6.377 4.98e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 588.5 on 402 degrees of freedom
## (501 observations deleted due to missingness)
## Multiple R-squared: 0.477, Adjusted R-squared: 0.4718
## F-statistic: 91.66 on 4 and 402 DF, p-value: < 2.2e-16
Comparing lmfit3's adjusted R-squared with lmfit2's, 0.4718 > 0.4208, so adding the number of car spaces as a predictor improves the model.
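A minimal sketch of that comparison (note that a formal anova() F-test would not be valid here, because missing Car values mean the two models are fitted on different subsets of the data):
# Adjusted R-squared, side by side
c(lmfit2 = summary(lmfit2)$adj.r.squared,
  lmfit3 = summary(lmfit3)$adj.r.squared)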
1.5 Impact of outliers
outliers <- which(melsuburbs$BuildingArea <= 5 | melsuburbs$BuildingArea > 300)
melsuburbs.nooutliers <- melsuburbs[-outliers,]
dim(melsuburbs.nooutliers)
## [1] 891 22
lmfit4 <- lm(Pricethousand ~ BuildingArea, data = melsuburbs.nooutliers)
summary(lmfit4)
##
## Call:
## lm(formula = Pricethousand ~ BuildingArea, data = melsuburbs.nooutliers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1186.4 -416.9 -41.9 273.8 5719.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 190.6445 81.5379 2.338 0.0199 *
## BuildingArea 6.1269 0.5668 10.809 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 643.5 on 399 degrees of freedom
## (490 observations deleted due to missingness)
## Multiple R-squared: 0.2265, Adjusted R-squared: 0.2246
## F-statistic: 116.8 on 1 and 399 DF, p-value: < 2.2e-16
plot(melsuburbs.nooutliers$Pricethousand ~ melsuburbs.nooutliers$BuildingArea, xlab = "BuildingArea", ylab = "Pricethousand")
abline(lmfit4, col = "red")
Comparing the adjusted R-squared of lmfit1 and lmfit4, removing the outliers improved it from 0.1705 to 0.2246, and the chart shows that lmfit4 fits noticeably better.
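The hard thresholds on BuildingArea are one choice; influential points can also be flagged from the fit itself. A minimal sketch using Cook's distance with the common 4/n rule of thumb (an assumption, not part of the original analysis):
# Flag observations whose Cook's distance exceeds 4/n
cd <- cooks.distance(lmfit1)
influential <- which(cd > 4 / length(cd))
head(influential)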
1.6 Prediction
predictData <- data.frame(BuildingArea = 100, Suburb= "Hawthorn",Car = 2)
lm3.pred <- predict(lmfit3, predictData, interval = "prediction", level = 0.95)
knitr::kable(lm3.pred)

| fit | lwr | upr |
|---|---|---|
| 1514.653 | 351.5393 | 2677.767 |
Note: Pricethousand = Price/1000. Because lmfit3 was trained on Pricethousand, we multiply the prediction by 1000 to recover the price in dollars.
The predicted price is therefore 1,514,653, with a 95% prediction interval of (351,539, 2,677,767).
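A minimal sketch of that back-transformation:
# Scale the point prediction and interval from thousands back to dollars
lm3.pred * 1000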
2 Shiny app of a regression
Click to view the Shiny app: APP.
Student Info
Course: STAT5003_Computational Statistical Methods
Assignment: Lab Week 2
Student Name: Yujun Yao(June Yao)
SID: 500316995
Email: yyao2983@uni.sydney.edu.au