1 Introduction

The data set that will be used to conduct this analysis concerns the housing market in Melbourne, Australia, and was collected in January 2016 from publicly available information posted on a real estate website. It comprises 34,857 total observations and 21 variables. Of these, 8 are categorical: Suburb, Address, Type, Method of Sale, Seller, Date Sold, Council Area, and Region Name. The 13 numerical variables are: Number of Rooms, Selling Price (in Australian dollars), Distance from Melbourne’s Central Business District (in kilometers), Postcode, Number of Bedrooms, Number of Bathrooms, Number of Car Spots, Land Size (in square meters), Building Size (in square meters), Year Built, Latitude, Longitude, and the Number of Properties in the suburb.

This data set will be analyzed via exploratory data analysis and the construction of several multiple linear regression models to try to identify an association between selling price (the response variable) and several factors such as number of rooms and distance from the city’s Central Business District (the explanatory variables). One of the candidate models will then be chosen as best based on residual analysis and goodness-of-fit measures.

2 Exploratory Data Analysis

2.1 Creating Group Location Variable Using Longitude and Latitude Data

HousingData <- read.csv("https://pengdsci.github.io/datasets/MelbourneHousingMarket/MelbourneHousing.csv", header = TRUE)
lon <- HousingData$Longtitude
lat <- HousingData$Lattitude 
plot(lon, lat, main = "Sites of houses sold")
abline(v = 144.8, h = -38.0, col = "blue")
abline(v = 145.3, h = -37.6, col = "blue")

Looking at the scatter plot of the geographic location of each house based on the latitude and longitude data, the area containing the vast majority of houses can be roughly defined. This region can be used to create a logical variable indicating whether each house falls into the main cluster, which can then be converted into a factor variable with two levels. “Type” can also be converted to a factor variable. Finally, the “Distance” variable can be converted from character to numeric values.

main.group = (lon > 144.8) & (lon < 145.3) & (lat > -38.0) & (lat < -37.6)
HousingData$main.group <- factor(main.group)
HousingData$Type <- factor(HousingData$Type)
HousingData$Distance.num <- as.numeric(HousingData$Distance)
## Warning: NAs introduced by coercion
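The coercion warning indicates that some entries in the original Distance column are not numeric (likely blanks or placeholder strings in the source file). A quick check, sketched below, shows how many values failed to convert and what those entries were:

sum(is.na(HousingData$Distance.num))                              # number of values that failed to convert
unique(HousingData$Distance[is.na(HousingData$Distance.num)])     # the offending entries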

2.2 Creating Final Data Set

A final data set, containing this location variable along with the other variables that may affect a house’s selling price, can now be created for the multiple linear regression.

HousingData.Final <- HousingData[, c(3,4,5,11,12,13,14,15,22,23)]
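The numeric column indices above are compact but opaque. Assuming the column names match those that appear in the regression output below, the same subset can be selected by name, which is easier to audit:

HousingData.Final <- HousingData[, c("Rooms", "Type", "Price", "Bedroom2", "Bathroom",
                                     "Car", "Landsize", "BuildingArea",
                                     "main.group", "Distance.num")]   # equivalent selection by name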

3 Performing Multiple Linear Regression On Data

3.1 Creating and Analyzing Initial Full Model

full.model = lm(Price ~ ., data = HousingData.Final)
kable(summary(full.model)$coef, caption ="Statistics of Regression Coefficients")
Statistics of Regression Coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 341598.26372 35441.748430 9.6383017 0.0000000
Rooms 191050.76381 20835.328167 9.1695587 0.0000000
Typet -330310.57353 19774.823888 -16.7035912 0.0000000
Typeu -397338.37437 17058.481520 -23.2927165 0.0000000
Bedroom2 -10296.44660 20751.931053 -0.4961681 0.6197876
Bathroom 259327.02431 9512.669773 27.2612243 0.0000000
Car 38405.27626 6011.405633 6.3887348 0.0000000
Landsize 23.36419 4.723378 4.9464991 0.0000008
BuildingArea 34.83513 11.819982 2.9471386 0.0032153
main.groupTRUE 204924.13563 23100.687825 8.8709106 0.0000000
Distance.num -36587.73611 1000.907423 -36.5545657 0.0000000
par(mfrow=c(2,2), mar = c(2, 2, 2, 2))
plot(full.model)

Based on the versus fits and Q-Q plots, this full model appears to seriously violate the assumptions of both constant variance and normality of the residuals, and several potential outliers are also visible.
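As a complement to these visual diagnostics, a formal test of constant variance could also be run. The sketch below uses the Breusch-Pagan test from the lmtest package, which is not loaded elsewhere in this report, so it should be viewed as an optional extra check rather than part of the main analysis.

library(lmtest)       # not used elsewhere in this report
bptest(full.model)    # Breusch-Pagan test: a small p-value indicates non-constant variance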

kable(vif(full.model), caption = "VIF Values for Each Explanatory Variable")   # vif() is provided by the car package
VIF Values for Each Explanatory Variable
GVIF Df GVIF^(1/(2*Df))
Rooms 14.743929 1 3.839782
Type 1.550591 2 1.115898
Bedroom2 14.697387 1 3.833717
Bathroom 1.739321 1 1.318833
Car 1.257059 1 1.121186
Landsize 1.200600 1 1.095719
BuildingArea 1.182319 1 1.087345
main.group 1.551974 1 1.245782
Distance.num 1.733691 1 1.316697

The VIF values of the model suggest a serious multicollinearity issue between the variables “Rooms” and “Bedroom2” (a house’s total number of rooms and its number of bedrooms, respectively). For lack of a better remedy at this time, “Bedroom2” will simply be omitted from subsequent models to avoid this issue, and the data set will be updated to reflect this.

HousingData.Final.Reduced <- HousingData[, c(3,4,5,12,13,14,15,22,23)]
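As a quick sanity check on this collinearity, the pairwise correlation between the two variables can be computed directly (a small sketch; complete cases only):

cor(HousingData$Rooms, HousingData$Bedroom2, use = "complete.obs")   # high correlation explains the large VIFs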

3.2 Box-Cox Transformation of Full Model

Because the residual plots of the full model suggested violations of assumptions of constant variance and normality, it is reasonable to employ a Box-Cox transformation of the response variable to attempt to produce a better model.

boxcox(lm(Price ~ ., data = HousingData.Final.Reduced), lambda = seq(-.3, -.2, length = 10))   # boxcox() is provided by the MASS package

lambda <- -.24
HousingData.Final.Reduced$Price.BC <- (HousingData.Final.Reduced$Price^lambda-1)/lambda
Box.Cox.Model <-  lm(Price.BC ~ Rooms + Type + Bathroom + Car + Landsize + BuildingArea + main.group + Distance.num, data = HousingData.Final.Reduced)
par(mfrow=c(2,2), mar = c(2, 2, 2, 2))
plot(Box.Cox.Model)

The residual plots of the model produced by the Box-Cox transformation of the response variable appear to align much more closely with the model assumptions of constant variance and normality than those of the original full model. The versus fits plot exhibits a fairly random scatter of points about the horizontal, and the points of the Q-Q plot generally adhere to a straight line. Also, while the potential outliers have not been completely removed, many of them appear to have been reduced in severity. There are, however, some aberrations at either extreme of both the versus fits and Q-Q plots, so further transformations will be attempted in search of an even better model.

3.3 Square-Root Transformation of Price

Another model can be created by taking the square root of the response variable price.

Sqrt.Model <- lm((Price)^.5 ~ Rooms + Type + Bathroom + Car + Landsize + BuildingArea + main.group + Distance.num, data = HousingData.Final.Reduced)
par(mfrow=c(2,2), mar = c(2,2,2,2))
plot(Sqrt.Model)

Though this square-root-transformed model’s residual plots exhibit issues similar to those of the original full model, suggesting violations of constant variance and normality, they do show slight improvements in both, as well as a reduction in the severity of the potential outliers. However, the Box-Cox model still appears preferable based on the residual plots, due to the clear curvature in the square root model’s versus fits plot and the more severe divergence of points from the straight line at the extremes of its Q-Q plot.

3.4 Log Transformation of Price

A log transformation of the response variable Price can also be attempted to correct for the violations of assumptions of the original full model.

Log.Model = lm(log(Price) ~ Rooms + Type + Bathroom + Car + Landsize + BuildingArea + main.group + Distance.num, data = HousingData.Final.Reduced)
par(mfrow=c(2,2), mar = c(2, 2, 2, 2))
plot(Log.Model)

The residual plots of this log-transformed model largely correct the full model’s assumption violations, much as the Box-Cox model did, with a fairly random scatter of points about the horizontal in the versus fits plot, a general straight-line pattern in the Q-Q plot, and a reduction in the severity of the outliers. However, this model’s residual plots also exhibit aberrations similar to those seen in the square root model’s, albeit somewhat less severe. Thus, the Box-Cox model still appears to be the most trustworthy of the models generated, based on residual analysis.

3.5 Comparing Candidate Models by Goodness-of-Fit

Besides residual analysis, the candidate models can also be compared by computing their respective goodness-of-fit measures.

select=function(m){ # m is an object: model
 e = m$resid                           # residuals
 n0 = length(e)                        # sample size
 SSE=(m$df)*(summary(m)$sigma)^2       # sum of squared error
 R.sq=summary(m)$r.squared             # Coefficient of determination: R square!
 R.adj=summary(m)$adj.r                # Adjusted R square
 MSE=(summary(m)$sigma)^2              # mean squared error
 Cp=(SSE/MSE)-(n0-2*(n0-m$df))         # Mallows' Cp
 AIC=n0*log(SSE)-n0*log(n0)+2*(n0-m$df)          # Akaike information criterion
 SBC=n0*log(SSE)-n0*log(n0)+(log(n0))*(n0-m$df)  # Schwarz Bayesian Information criterion
 X=model.matrix(m)                     # design matrix of the model
 H=X%*%solve(t(X)%*%X)%*%t(X)          # hat matrix
 d=e/(1-diag(H))                       # deleted (PRESS) residuals
 PRESS=t(d)%*%d   # predicted residual error sum of squares (PRESS)- a cross-validation measure
 tbl = as.data.frame(cbind(SSE=SSE, R.sq=R.sq, R.adj = R.adj, Cp = Cp, AIC = AIC, SBC = SBC, PRD = PRESS))
 names(tbl)=c("SSE", "R.sq", "R.adj", "Cp", "AIC", "SBC", "PRESS")
 tbl
}

output.sum = rbind(select(full.model), select(Box.Cox.Model), select(Sqrt.Model), select(Log.Model))
row.names(output.sum) = c("full.model","Box.Cox.Model", "Sqrt.Model", "Log.Model")
kable(output.sum, caption = "Goodness-of-fit Measures of Candidate Models")
Goodness-of-fit Measures of Candidate Models
SSE R.sq R.adj Cp AIC SBC PRESS
full.model 2.330619e+15 0.4538147 0.4532226 11 242479.12 242557.56 2.997066e+15
Box.Cox.Model 1.504029e+00 0.5730458 0.5726293 10 -80533.29 -80461.98 1.849876e+00
Sqrt.Model 3.546316e+08 0.5257842 0.5253215 10 97503.13 97574.44 4.586422e+08
Log.Model 1.143915e+03 0.5659606 0.5655372 10 -19267.70 -19196.40 1.439668e+03

As seen in this table, the SSE, R-Squared, and R-Squared Adjusted values of the four candidate models largely reflect the inferences drawn from residual analysis. The Box-Cox model generally explains the most variation in the data set with an adjusted R-Squared value of 57.26%, followed very closely by the log model at 56.55%, then the square root model at 52.58%, and finally the original full model at 45.32%. Based on having the best goodness-of-fit measurements and the closest adherence to model assumptions according to residual analysis, the Box-Cox model should be chosen as the best of all the candidate models.

4 Bootstrap Multiple Linear Regression

Because the final model generated by a parametric approach still exhibited some violations of model assumptions, it makes sense to employ bootstrapping methods to construct confidence intervals for the regression coefficients which do not rely on assumptions of residual normality and constant variance. These methods can also significantly reduce the impact of the potential outliers on the regression coefficients.

4.1 Bootstrap Cases

The first bootstrapping method that will be used to construct confidence intervals for the regression coefficients is the cases method, in which a large number of bootstrap samples are taken by sampling from the original observations with replacement, a bootstrap regression model is fitted to each bootstrap sample of observations, and the regression coefficients of each bootstrap model are extracted and used to construct 95% confidence intervals.

It should be noted here that in order to generate bootstrap regression models, it is necessary to modify the data set further by removing any observations with a missing value for the Box-Cox transformed price variable. However, this modification still leaves a sufficient number of observations from which to draw inferences. Therefore, a new Box-Cox model was generated using this new cleaned data set in order to make comparisons to the bootstrap coefficient confidence intervals.

Clean_Data <- subset(HousingData.Final.Reduced, !is.na(Price.BC))
BC.Model.Clean <- lm(Price.BC ~ Rooms + Type + Bathroom + Car + Landsize + BuildingArea + main.group + Distance.num, data = Clean_Data)
B = 1000                                              # number of bootstrap replicates
num.p = dim(model.frame(BC.Model.Clean))[2]           # number of columns in the model frame
smpl.n = dim(model.frame(BC.Model.Clean))[1]          # sample size used to fit the model
coef.mtrx = matrix(0, nrow = B, ncol = 10)            # one row of bootstrap coefficients per replicate
for (i in 1:B){
  bootc.id = sample(1:smpl.n, smpl.n, replace = TRUE) # resample observation indices with replacement
  Box.Cox.Boot <- lm(Price.BC ~ Rooms + Type + Bathroom + Car + Landsize + BuildingArea + main.group + Distance.num, data = Clean_Data[bootc.id,])
  coef.mtrx[i,] = coef(Box.Cox.Boot)
}

With the bootstrap regression models generated and the corresponding regression coefficients extracted, histograms can be used to compare the distributions of the bootstrap coefficients for each explanatory variable to a normal density curve centered around the estimated coefficient values from the regression output.

boot.hist = function(bt.coef.mtrx, var.id, var.nm){   # histogram of bootstrap coefficients with a normal reference curve
  x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
  y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))
  highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density) 
  ylimit <- max(c(y1.1,highestbar))
  hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="", 
       col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
  lines(x = x1.1, y = y1.1, col = "red3")
  lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue") 
}
par(mfrow=c(3,4), mar = c(2, 2, 2, 2))
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=2, var.nm ="Rooms" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=3, var.nm ="Typet" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=4, var.nm ="Typeu" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=5, var.nm ="Bathroom" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=6, var.nm ="Car" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=7, var.nm ="Landsize" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=8, var.nm ="BuildingArea" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=9, var.nm ="main.group" )
boot.hist(bt.coef.mtrx=coef.mtrx, var.id=10, var.nm ="Distance.num" )

For most of the explanatory variables, the bootstrap coefficient distribution generally matches the normal density curve. However, the histogram for the “main.group” variable exhibits a distinct discrepancy between the two curves. Based on this discrepancy, the p-values of the parametric model may not be entirely valid, and it is therefore advisable to use the bootstrap confidence intervals instead when making inferences about the relationships between the variables.

The results of the two different methods can also be compared via table.

cmtrx <- summary(BC.Model.Clean)$coef
num.p = dim(coef.mtrx)[2] 
btc.ci = NULL
btc.wd = NULL
for (i in 1:num.p){
  lci.025 = round(quantile(coef.mtrx[, i], 0.025, type = 2),8)
  uci.975 = round(quantile(coef.mtrx[, i],0.975, type = 2 ),8)
  btc.wd[i] =  uci.975 - lci.025
  btc.ci[i] = paste("[", round(lci.025,4),", ", round(uci.975,4),"]")
 }
kable(as.data.frame(cbind(formatC(cmtrx,4,format="f"), btc.ci.95=btc.ci)), 
      caption = "Regression Coefficient Matrix")
Regression Coefficient Matrix
Estimate Std. Error t value Pr(>|t|) btc.ci.95
(Intercept) 3.9881 0.0009 4436.7370 0.0000 [ 3.9696 , 3.9801 ]
Rooms 0.0061 0.0002 28.4553 0.0000 [ 0.0059 , 0.0079 ]
Typet -0.0089 0.0005 -17.6654 0.0000 [ -0.0114 , -0.0088 ]
Typeu -0.0191 0.0004 -44.1636 0.0000 [ -0.0203 , -0.0176 ]
Bathroom 0.0055 0.0002 22.7799 0.0000 [ 0.004 , 0.0062 ]
Car 0.0008 0.0002 5.3428 0.0000 [ 4e-04 , 0.0015 ]
Landsize 0.0000 0.0000 6.1161 0.0000 [ 0 , 0 ]
BuildingArea 0.0000 0.0000 3.2133 0.0013 [ 0 , 0 ]
main.groupTRUE 0.0107 0.0006 18.2807 0.0000 [ 0.0183 , 0.0277 ]
Distance.num -0.0010 0.0000 -40.8271 0.0000 [ -0.0016 , -0.0014 ]

As seen in this table, the estimates of the parametric model generally fall within or at least close to the corresponding 95% confidence interval for each regression coefficient, with the exception of that of the “main.group” variable. This is consistent with the above histograms, and again suggests that the parametric model’s p-values may be invalid.

4.2 Bootstrap Residuals

The residuals version of the bootstrapping method can also be employed to construct non-parametric confidence intervals for the regression coefficients. In this method, a large number of bootstrap samples are drawn with replacement from the residuals of the original model, and the resampled residuals are added to the fitted values of the corresponding observations to produce new sets of response values. Each set of new response values, together with the original explanatory variables, is used to fit a new regression model, and the regression coefficients of each are extracted and used to construct the confidence intervals.

In order to perform this bootstrapping procedure, the data set once again needs to be modified to remove missing values. In this instance, all observations with a missing value for any variable will be removed. However, the central limit theorem still holds based on the sample size for this resulting data set. A new Box-Cox model will be generated using this new data set to use in the bootstrapping procedure.

Cleaner_Data <- na.omit(Clean_Data)
BC.Model.Cleaner <- lm(Price.BC ~ Rooms + Type + Bathroom + Car + Landsize + BuildingArea + main.group + Distance.num, data = Cleaner_Data)
model.resid = BC.Model.Cleaner$residuals
B = 1000                                          # number of bootstrap replicates
num.p = dim(model.matrix(BC.Model.Cleaner))[2]    # number of regression coefficients
samp.n = dim(model.matrix(BC.Model.Cleaner))[1]   # sample size
btr.mtrx = matrix(0, nrow = B, ncol = num.p)      # one row of bootstrap coefficients per replicate
for (i in 1:B){
  Boot.R.Price = BC.Model.Cleaner$fitted.values + 
        sample(BC.Model.Cleaner$residuals, samp.n, replace = TRUE)  # resampled residuals added to fitted values
  Cleaner_Data$Boot.R.Price = Boot.R.Price
  Box.Cox.Boot.R = lm(Boot.R.Price ~ Rooms + Type + Bathroom + Car + Landsize + BuildingArea + main.group + Distance.num, data = Cleaner_Data)
  btr.mtrx[i,] = Box.Cox.Boot.R$coefficients
}

Once again, the distributions of the bootstrap regression coefficients can be compared to normal curves via histogram.

boot.hist = function(bt.coef.mtrx, var.id, var.nm){
  x1.1 <- seq(min(bt.coef.mtrx[,var.id]), max(bt.coef.mtrx[,var.id]), length=300 )
  y1.1 <- dnorm(x1.1, mean(bt.coef.mtrx[,var.id]), sd(bt.coef.mtrx[,var.id]))
  highestbar = max(hist(bt.coef.mtrx[,var.id], plot = FALSE)$density) 
  ylimit <- max(c(y1.1,highestbar))
  hist(bt.coef.mtrx[,var.id], probability = TRUE, main = var.nm, xlab="", 
       col = "azure1",ylim=c(0,ylimit), border="lightseagreen")
  lines(x = x1.1, y = y1.1, col = "red3")               
  lines(density(bt.coef.mtrx[,var.id], adjust=2), col="blue") 
} 
par(mfrow=c(3,4), mar = c(2, 2, 2, 2))
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=1, var.nm ="Intercept" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=2, var.nm ="Rooms" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=3, var.nm ="Typet" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=4, var.nm ="Typeu" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=5, var.nm ="Bathroom" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=6, var.nm ="Car" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=7, var.nm ="Landsize" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=8, var.nm ="BuildingArea" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=9, var.nm ="main.group" )
boot.hist(bt.coef.mtrx=btr.mtrx, var.id=10, var.nm ="Distance.num" )

Unlike in the cases bootstrap, the distributions of all of the residual-bootstrap regression coefficients generally match the normal curves, including that for the “main.group” variable. In isolation, this suggests that the p-values of the regression model are valid. However, given the clear discrepancy between the curves for the “main.group” variable in the cases bootstrap histograms, bootstrapping may still be the safer approach.

Again, the estimated regression coefficients and bootstrap confidence intervals can be compared in table form.

cmtrx.r <- summary(BC.Model.Cleaner)$coef
num.p = dim(btr.mtrx)[2]   # number of regression coefficients
btr.ci = NULL
btr.wd = NULL
for (i in 1:num.p){
  lci.025 = round(quantile(btr.mtrx[, i], 0.025, type = 2),8)
  uci.975 = round(quantile(btr.mtrx[, i],0.975, type = 2 ),8)
  btr.wd[i] = uci.975 - lci.025
  btr.ci[i] = paste("[", round(lci.025,4),", ", round(uci.975,4),"]")
}
kable(as.data.frame(cbind(formatC(cmtrx.r,4,format="f"), btr.ci.95=btr.ci)), 
      caption = "Regression Coefficient Matrix with 95% Residual Bootstrap CI")
Regression Coefficient Matrix with 95% Residual Bootstrap CI
Estimate Std. Error t value Pr(>|t|) btr.ci.95
(Intercept) 3.9881 0.0009 4436.7370 0.0000 [ 3.9857 , 3.9907 ]
Rooms 0.0061 0.0002 28.4553 0.0000 [ 0.0054 , 0.0067 ]
Typet -0.0089 0.0005 -17.6654 0.0000 [ -0.0103 , -0.0075 ]
Typeu -0.0191 0.0004 -44.1636 0.0000 [ -0.0204 , -0.018 ]
Bathroom 0.0055 0.0002 22.7799 0.0000 [ 0.0048 , 0.0061 ]
Car 0.0008 0.0002 5.3428 0.0000 [ 4e-04 , 0.0012 ]
Landsize 0.0000 0.0000 6.1161 0.0000 [ 0 , 0 ]
BuildingArea 0.0000 0.0000 3.2133 0.0013 [ 0 , 0 ]
main.groupTRUE 0.0107 0.0006 18.2807 0.0000 [ 0.0091 , 0.0123 ]
Distance.num -0.0010 0.0000 -40.8271 0.0000 [ -0.0011 , -0.001 ]

These results are largely similar to those of the cases bootstrap procedure, with the key difference being the estimated regression coefficient for the “main.group” variable falling within the range of the bootstrap confidence interval this time.

It should be mentioned that the differences between the results of the two bootstrapping methods may be due at least in part to the different data sets used to perform each.

5 Final Model

The original version of the Box-Cox model, which was generated using the largest number of observations, will be used as the final model for the discussion. The inferential statistics for this model are as follows:

kable(summary(Box.Cox.Model)$coef, caption = "Inferential Statistics of Final Model")
Inferential Statistics of Final Model
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.9880855 0.0008989 4436.736954 0.0000000
Rooms 0.0060549 0.0002128 28.455261 0.0000000
Typet -0.0088665 0.0005019 -17.665360 0.0000000
Typeu -0.0191099 0.0004327 -44.163583 0.0000000
Bathroom 0.0054608 0.0002397 22.779941 0.0000000
Car 0.0008148 0.0001525 5.342843 0.0000001
Landsize 0.0000007 0.0000001 6.116052 0.0000000
BuildingArea 0.0000010 0.0000003 3.213317 0.0013166
main.groupTRUE 0.0107272 0.0005868 18.280685 0.0000000
Distance.num -0.0010369 0.0000254 -40.827076 0.0000000

The final model can be written as

\[ BoxCox(Price) = 3.9881 + 0.0061\times Rooms - 0.0089\times Typet - 0.0191\times Typeu + 0.0055\times Bathroom + 0.0008\times Car + 0.0000007\times Landsize + 0.000001\times BuildingArea + 0.0107\times main.groupTRUE - 0.0010\times Distance.num \] where BoxCox(Price) represents the value of price after the Box-Cox transformation with lambda = -0.24 is applied. The regression coefficient for each variable represents the increase or decrease (depending on the sign of the coefficient) in the Box-Cox-transformed price per one-unit increase in the variable (or change from the baseline level, for factors), with all other variables held constant. So, for example, holding all other variables constant, for each additional room a house has, the model predicts a 0.0061 increase in the Box-Cox-transformed price. Algebra can be employed to map transformed values back to the original price scale. Given that the formula for the Box-Cox transformation is:

\[ y(\lambda) = \frac{y^\lambda - 1}{\lambda} \] Algebraic manipulation gives the inverse function

\[ y = \left( y(\lambda) \times \lambda + 1 \right)^{\frac{1}{\lambda}} \] where y(lambda) is the transformed value and y is the corresponding value on the original price scale. Applying this inverse to the Rooms coefficient of 0.0061 with lambda = -0.24 gives

\[ \left( 0.0061 \times -0.24 + 1 \right)^{\frac{1}{-0.24}} \approx 1.006 \] However, because the transformation is nonlinear, this number is not a dollar amount: the coefficient alone cannot be converted into a fixed change in price. To obtain price effects on the original scale, the inverse transformation must be applied to the entire predicted transformed price, so the dollar effect of an additional room depends on the values of the other variables. What does carry over directly is the sign: a positive coefficient indicates a positive association between the variable and price, and a negative coefficient indicates a negative association.
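The back-transformation can be wrapped in a small helper for later use; inv.boxcox below is a hypothetical name introduced only for illustration and is not defined elsewhere in this analysis.

inv.boxcox <- function(y.bc, lambda) (y.bc * lambda + 1)^(1 / lambda)   # inverse Box-Cox: transformed value back to the original scale
inv.boxcox(0.0061, -0.24)   # roughly 1.006, reproducing the hand calculation above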

6 Conclusion & Discussion

Though the goodness-of-fit measures for the final model are far from ideal, a general sense of how the various factors affect a house’s selling price can be gleaned from it. The regression coefficients generally align with intuitive notions of the associations between the response variable and each explanatory variable, e.g., the selling price of a house is positively associated with the number of rooms based on the model, and negatively associated with its distance from a desirable part of the city, etc. That said, considering this model’s modest goodness-of-fit measures as well as the still-questionable adherence to model assumptions, inferences drawn from the bootstrap procedures are likely more trustworthy than the p-values of the final regression model.

Because the final model involves a transformed response variable, it is not ideal for making inferences about the relationships between the response variable and each explanatory variable beyond the direction of the potential correlation. It may however be somewhat useful for predicting the selling price of a given house based on the factors in the model.
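As an illustration of that predictive use, the sketch below predicts the transformed price for a single hypothetical house and converts it back to dollars with the inv.boxcox helper defined above; the input values are invented purely for the example.

new.house <- data.frame(Rooms = 3, Type = factor("h", levels = c("h", "t", "u")),
                        Bathroom = 2, Car = 1, Landsize = 450, BuildingArea = 150,
                        main.group = factor("TRUE", levels = c("FALSE", "TRUE")),
                        Distance.num = 10)                     # hypothetical house
pred.bc <- predict(Box.Cox.Model, newdata = new.house)         # prediction on the Box-Cox scale
inv.boxcox(pred.bc, -0.24)                                     # predicted selling price in dollars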

The bootstrap regression results generally aligned with the estimates of the final model and with intuitive notions of the relationships between the variables, but they can be trusted more than the p-values of the final regression model because violations of the normality and constant-variance assumptions are not an issue for a non-parametric approach.

Ideally, the issue of missing values interfering with the bootstrapping procedures would have been identified at the outset of the analysis and a fully cleaned data set would have been produced at the start to use across all the different regression models and procedures. Different data sets being used to construct different regression models and confidence intervals may have contributed to the discrepancies between them. Useful insights can still be taken away from these procedures, but using the same data set for each would have provided a more solid foundation for drawing conclusions about the data.