1 Introduction

The data set that will be used to conduct this analysis consists of data concerning the housing market in Melbourne, Australia, collected in January 2016 from publicly available information posted on a real estate website. It is comprised of 34,857 total observations and 21 different variables. Of these, 8 are categorical: Suburb, Address, Type, Method of Sale, Seller, Date Sold, Council Area, and Region Name. The 13 numerical variables are: Number of Rooms, Selling Price (in Australian Dollars), Distance from Melbourne’s Central Business District (in kilometers), Postcode, Number of Bedrooms, Number of Bathrooms, Number of Carspots, Land Size (in meters), Building Size (in meters), Year Built, Latitude, Longitude, and the Number of Properties that exist in the suburb.

This data set will be analyzed via exploratory data analysis and the construction of several multiple linear regression models to try to identify an association between selling price (the response variable) and several factors such as number of rooms and distance from the city’s Central Business District (the explanatory variables). One of the candidate models will then be chosen as best based on residual analysis and goodness-of-fit measures.

2 Exploratory Data Analysis

2.1 Creating Group Location Variable Using Longitude and Latitude Data

setwd("C:/Users/eh738/OneDrive/Documents/STA321")
HousingData <- read.csv("MelbourneHousing.csv")
lon <- HousingData$Longtitude
lat <- HousingData$Lattitude 
plot(lon, lat, main = "Sites of houses sold")
abline(v = 144.8, h = -38.0, col = "blue")
abline(v = 145.3, h = -37.6, col = "blue")

Looking at the scatter plot representing the geographic location of each house based on the latitude and longitude data, the area in which the vast majority of houses are located can be roughly defined. This can be used to create a logical variable which indicated whether or not each house falls into this main cluster or not, which can then be converted into a categorical value. Also, the “distance” variable can be converted from character to numeric values.

main.group = (lon > 144.8) & (lon < 145.3) & (lat > -38.0) & (lat < -37.6)
HousingData$main.group <- as.character(main.group)
HousingData$Distance.num <- as.numeric(HousingData$Distance)
## Warning: NAs introduced by coercion

2.2 Creating Final Data Set

Now a final data set including this location variable and the other variables which may be factors that affect a house’s selling price can be created on which to perform multiple linear regression.

HousingData.Final <- HousingData[, c(3,4,5,11,12,13,14,15,22,23)]

3 Performing Multiple Linear Regression On Data

3.1 Creating and Analyzing Initial Full Model

full.model = lm(Price ~ ., data = HousingData.Final)
kable(summary(full.model)$coef, caption ="Statistics of Regression Coefficients")
Statistics of Regression Coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 341598.26372 35441.748430 9.6383017 0.0000000
Rooms 191050.76381 20835.328167 9.1695587 0.0000000
Typet -330310.57353 19774.823888 -16.7035912 0.0000000
Typeu -397338.37437 17058.481520 -23.2927165 0.0000000
Bedroom2 -10296.44660 20751.931053 -0.4961681 0.6197876
Bathroom 259327.02431 9512.669773 27.2612243 0.0000000
Car 38405.27626 6011.405633 6.3887348 0.0000000
Landsize 23.36419 4.723378 4.9464991 0.0000008
BuildingArea 34.83513 11.819982 2.9471386 0.0032153
main.groupTRUE 204924.13563 23100.687825 8.8709106 0.0000000
Distance.num -36587.73611 1000.907423 -36.5545657 0.0000000
par(mfrow=c(2,2), mar = c(2, 2, 2, 2))
plot(full.model)

Based on the versus fits and Q-Q plots, this full model appears to contain serious violations of assumptions of both constant variance and normality of residuals, as well as some potential outliers.

vif(full.model)
##                   GVIF Df GVIF^(1/(2*Df))
## Rooms        14.743929  1        3.839782
## Type          1.550591  2        1.115898
## Bedroom2     14.697387  1        3.833717
## Bathroom      1.739321  1        1.318833
## Car           1.257059  1        1.121186
## Landsize      1.200601  1        1.095719
## BuildingArea  1.182319  1        1.087345
## main.group    1.551974  1        1.245782
## Distance.num  1.733691  1        1.316697

The vif values of the model suggests a serious multicollinearity issue between the variables “Rooms” and “Bedroom2”, or a house’s total number of rooms and its number of bedrooms, respectively. For lack of a better remedy at this time, “Bedroom2” will simply be omitted from subsequent models to avoid this issue. The data set will be updated to reflect this.

HousingData.Final.Reduced <- HousingData[, c(3,4,5,12,13,14,15,22,23)]

3.2 Box-Cox Transformation of Full Model

Because the residual plots of the full model suggested violations of assumptions of constant variance and normality, it is reasonable to employ a Box-Cox transformation of the response variable to attempt to produce a better model.

boxcox(lm(Price ~ ., data = HousingData.Final.Reduced), seq(-.3, -.2, length = 10))

lambda <- -.24
HousingData.Final.Reduced$Price.BC <- (HousingData.Final.Reduced$Price^lambda-1)/lambda
Box.Cox.Model <-  lm(Price.BC ~ Rooms + Type + Bathroom + Car + Landsize + BuildingArea + main.group + Distance.num, data = HousingData.Final.Reduced)
par(mfrow=c(2,2), mar = c(2, 2, 2, 2))
plot(Box.Cox.Model)

The residual plots of the model produced by the Box-Cox transformation of the response variable appear to align much more closely with model assumptions of constant variance and normality than those of the original full model. The versus fits plot exhibits a fairly random scatter of points about the horizontal, and the points of the Q-Q plot generally adhere to a straight line. There are however some aberrations at either extreme of both plots, so more transformations will be performed to attempt to find an even better model.

3.3 Square-Root Transformation of Price

Another model can be created by taking the square root of the response variable price.

Sqrt.Model <- lm((Price)^.5 ~ Rooms + Type + Bathroom + Car + Landsize + BuildingArea + main.group + Distance.num, data = HousingData.Final.Reduced)
par(mfrow=c(2,2), mar = c(2,2,2,2))
plot(Sqrt.Model)

Though exhibiting similar issues to those of the original full model suggesting violations of constant variance and normality, this square root transformed model’s residual plots suggest some slight improvements in both constant variance and normality. However, based on their respective residual plots, the Box-Cox model appears to be a preferable model.

3.4 Log Transformation of Price

A log transformation of the response variable Price can also be attempted to correct for the violations of assumptions of the original full model.

Log.Model = lm(log(Price) ~ Rooms + Type + Bathroom + Car + Landsize + BuildingArea + main.group + Distance.num, data = HousingData.Final.Reduced)
par(mfrow=c(2,2), mar = c(2, 2, 2, 2))
plot(Log.Model)

The residual plots of this log-transformed model largely correct the full model’s violations of model assumptions as the Box-Cox model did, with a fairly random scatter of points about the horizontal in the versus fits plot and a general straight line pattern in the Q-Q plot. However, the aberrations exhibited at either extreme of the plots are even more pronounced than in the plots of the Box-Cox model, and there is a slight curvature noticeable in this model’s versus fits plot. Thus, the Box-Cox model still appears to be the most trustworthy of all the models generated based on residual analysis.

3.5 Comparing Candidate Models by Goodness-of-Fit

Besides residual analysis, the candidate models can also be compared by computing their respective goodness-of-fit measures.

select=function(m){ # m is an object: model
 e = m$resid                           # residuals
 n0 = length(e)                        # sample size
 SSE=(m$df)*(summary(m)$sigma)^2       # sum of squared error
 R.sq=summary(m)$r.squared             # Coefficient of determination: R square!
 R.adj=summary(m)$adj.r                # Adjusted R square
 MSE=(summary(m)$sigma)^2              # square error
 Cp=(SSE/MSE)-(n0-2*(n0-m$df))         # Mellow's p
 AIC=n0*log(SSE)-n0*log(n0)+2*(n0-m$df)          # Akaike information criterion
 SBC=n0*log(SSE)-n0*log(n0)+(log(n0))*(n0-m$df)  # Schwarz Bayesian Information criterion
 X=model.matrix(m)                     # design matrix of the model
 H=X%*%solve(t(X)%*%X)%*%t(X)          # hat matrix
 d=e/(1-diag(H))                       
 PRESS=t(d)%*%d   # predicted residual error sum of squares (PRESS)- a cross-validation measure
 tbl = as.data.frame(cbind(SSE=SSE, R.sq=R.sq, R.adj = R.adj, Cp = Cp, AIC = AIC, SBC = SBC, PRD = PRESS))
 names(tbl)=c("SSE", "R.sq", "R.adj", "Cp", "AIC", "SBC", "PRESS")
 tbl
}

output.sum = rbind(select(full.model), select(Box.Cox.Model), select(Sqrt.Model), select(Log.Model))
row.names(output.sum) = c("full.model","Box.Cox.Model", "Sqrt.Model", "Log.Model")
kable(output.sum, caption = "Goodness-of-fit Measures of Candidate Models")
Goodness-of-fit Measures of Candidate Models
SSE R.sq R.adj Cp AIC SBC PRESS
full.model 2.330619e+15 0.4538147 0.4532226 11 242479.12 242557.56 2.997066e+15
Box.Cox.Model 1.504029e+00 0.5730458 0.5726293 10 -80533.29 -80461.98 1.849876e+00
Sqrt.Model 3.546316e+08 0.5257842 0.5253215 10 97503.13 97574.44 4.586422e+08
Log.Model 1.143915e+03 0.5659606 0.5655372 10 -19267.70 -19196.40 1.439668e+03

As seen in this table, the SSE, R-Squared, and R-Squared Adjusted values of the four candidate models largely reflect the inferences drawn from residual analysis. The Box-Cox model generally explains the most variation in the data set with an adjusted R-Squared value of 57.26%, followed very closely by the log model at 56.55%, then the square root model at 52.58%, and finally the original full model at 45.32%. Based on having the best goodness-of-fit measurements and the closest adherence to model assumptions according to residual analysis, the Box-Cox model should be chosen as the best of all the candidate models.

3.6 Final Model

The inferential statistics of the final working model are displayed below:

kable(summary(Box.Cox.Model)$coef, caption = "Inferential Statistics of Final Model")
Inferential Statistics of Final Model
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.9880855 0.0008989 4436.736954 0.0000000
Rooms 0.0060549 0.0002128 28.455261 0.0000000
Typet -0.0088665 0.0005019 -17.665360 0.0000000
Typeu -0.0191099 0.0004327 -44.163583 0.0000000
Bathroom 0.0054608 0.0002397 22.779941 0.0000000
Car 0.0008148 0.0001525 5.342843 0.0000001
Landsize 0.0000007 0.0000001 6.116052 0.0000000
BuildingArea 0.0000010 0.0000003 3.213317 0.0013166
main.groupTRUE 0.0107272 0.0005868 18.280685 0.0000000
Distance.num -0.0010369 0.0000254 -40.827076 0.0000000

The final model can be written as

\[ BoxCox(Price) = 3.9881 + 0.0061\times Rooms - 0.0089\times Typet - 0.0191\times Typeu + 0.0055\times Bathroom + 0.0008\times Car +0.0000007\times Landsize +0.000001\times BuildingArea + 0.0107\times main.groupTRUE - 0.0010\times Distance.num \] where BoxCox(Price) represents the value of price after the Box-Cox transformation is applied where lambda = -0.24. The regression coefficient for each variable represents the increase or decrease (depending on the sign of the coefficient) in Box-Cox transformed price per a one-unit increase (or change from baseline level) in the variable with all other variables being held constant. So, holding all other variables constant, for each additional room a house has, the model predicts a .0061 increase in the Box-Cox-transformed price, the house being a townhouse (Type T) as opposed to a house (Type H, the baseline) is estimated to result in a .0089 decrease in Box-Cox transformed price, and so on. Algebra can then be employed to reverse the Box-Cox transformation and derive the actual selling price value.

4 Conclusion & Discussion

Though the goodness-of-fit measures for the final model are far from ideal, a general sense of how the various factors affect a house’s selling price can be gleaned from it. The regression coefficients generally align with intuitive notions of the associations between the response variable and each explanatory variable, e.g., the selling price of a house is positively associated with the number of rooms based on the model, and negatively associated with its distance from a desirable part of the city, etc. That said, considering this model’s modest goodness-of-fit measures as well as the still-questionable adherence to model assumptions, exploration of different types of models and/or variable transformations is highly recommended to produce a more accurate and trustworthy model of the relationship between selling price and these various factors.