1 Introduction

The data set used in this analysis concerns the housing market in Melbourne, Australia, and was collected in January 2016 from publicly available information posted on a real estate website. It comprises 34,857 observations of 21 variables. Of these, 8 are categorical: Suburb, Address, Type, Method of Sale, Seller, Date Sold, Council Area, and Region Name. The 13 numerical variables are: Number of Rooms, Selling Price (in Australian dollars), Distance from Melbourne's Central Business District (in kilometers), Postcode, Number of Bedrooms, Number of Bathrooms, Number of Carspots, Land Size (in square meters), Building Size (in square meters), Year Built, Latitude, Longitude, and the Number of Properties in the suburb. Given the fairly large number of observations and the variety of variables of each type, there should be ample information from which to draw the desired inferences.
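As a point of reference, loading and inspecting the data in R might look like the following minimal sketch, assuming the file is stored locally as Melbourne_housing.csv (the actual file name and path may differ):

# loading and inspecting the data set (file name is an assumption)
MelbourneHousing <- read.csv("Melbourne_housing.csv")
dim(MelbourneHousing)   # should report 34857 rows and 21 columns
str(MelbourneHousing)   # variable names and types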

This data set will be used to fit a parametric simple linear regression (SLR) model by the method of least squares, with selling price as the response variable and, as the explanatory variable, whichever variable appears most nearly linearly correlated with selling price in a pairwise scatter plot. The p-value and regression coefficients of this model will then be compared with confidence intervals constructed via a bootstrapping algorithm applied to the same data, and a choice will be made as to which of the two methods is preferable for drawing conclusions about the relationship between the two variables.

2 Finding A Potential Linear Relationship

library(psych)   # provides pairs.panels()

# pairwise scatter plots of the numerical variables of interest
pairs.panels(MelbourneHousing[, c(5, 3, 9, 11, 12, 13, 14, 15)],
             main = "Pairwise Scatter Plot of Numerical Variables", cex.main = 0.75)

Based on the pairwise scatter plot, number of rooms is the explanatory variable whose relationship with selling price (the chosen response variable) comes closest to a linear correlation, albeit not a particularly strong one. It will therefore serve as the explanatory variable in this report's linear regression models.

3 Creating Parametric Simple Linear Regression Models and Analyzing Residuals

Residual analysis of a least squares simple linear regression model between selling price and number of rooms suggests some clear violations of model assumptions:

# fitting a least squares SLR model of selling price on number of rooms
SLR.model <- lm(MelbourneHousing$Price ~ MelbourneHousing$Rooms)
plot(SLR.model)   # residual diagnostic plots

For instance, the residuals-versus-fitted plot exhibits clear heteroskedasticity, indicating a violation of the assumption of constant error variance, and the normal Q-Q plot suggests a violation of the normality assumption.
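A formal test can complement these visual diagnostics. For example, a Breusch-Pagan test for heteroskedasticity could be run as in the following sketch, assuming the lmtest package is available:

library(lmtest)   # provides bptest()

# Breusch-Pagan test: a small p-value indicates heteroskedasticity
bptest(SLR.model)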

3.1 Box-Cox Transformation

To attempt to rectify these violations, a Box-Cox transformation can be applied to the response variable (rather than the explanatory variable, to preserve ease of interpretation of the model).
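For reference, the Box-Cox family of transformations is defined as y^(lambda) = (y^lambda - 1)/lambda for lambda not equal to 0, and y^(lambda) = log(y) for lambda = 0. The boxcox() function below plots the profile log-likelihood over a grid of lambda values, and a value near the peak is chosen.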

library(MASS)   # provides boxcox()

# determining the ideal lambda value from the profile log-likelihood
boxcox(lm(MelbourneHousing$Price ~ MelbourneHousing$Rooms), lambda = seq(-1, 1, 1/10))

Based on this graph, in which the profile log-likelihood is maximized near -1/4, that value will be used for lambda in the Box-Cox transformation. A new simple linear regression model can then be fitted to the transformed selling price variable, and residual analysis performed on this model to once again check for violations of model assumptions.

lambda <- -1/4
# applying the Box-Cox transformation to the response variable
Price.25 <- (MelbourneHousing$Price^lambda - 1)/lambda
# generating a new SLR model using the transformed response variable
boxcox.model <- lm(Price.25 ~ MelbourneHousing$Rooms)
plot(boxcox.model)   # residual diagnostic plots for the transformed model

While this transformed model still displays many of the same assumption violations seen in the residual analysis of the original SLR model, it does show slight improvements in the constant variance and normality of the errors. This parametric model based on the transformed response variable will therefore be used for further inference, alongside the bootstrapping method, which is warranted by the persisting violations of model assumptions.
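One practical consequence is that the model's fitted values now live on the transformed scale. To express a prediction back in Australian dollars, the transformation can be inverted, as in this illustrative sketch using the first fitted value:

# back-transforming a fitted value from the Box-Cox scale to dollars
pred.t <- fitted(boxcox.model)[1]   # fitted value on the transformed scale
pred.dollars <- (lambda*pred.t + 1)^(1/lambda)
pred.dollars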

3.2 Inferential Statistics on Coefficients of Final SLR Model

reg.table <- coef(summary(boxcox.model))
reg.table
##                          Estimate   Std. Error    t value Pr(>|t|)
## (Intercept)            3.84175407 2.795141e-04 13744.4019        0
## MelbourneHousing$Rooms 0.00926271 8.899098e-05   104.0859        0

The estimated slope of the final parametric SLR model is 0.00926271, with an estimated y-intercept of 3.84175407. Both coefficients are highly statistically significant (intercept: t = 13744.4019, p < .001; slope: t = 104.0859, p < .001); however, the violations of model assumptions cast doubt on the validity of this significance.
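For comparison with the bootstrap intervals constructed below, parametric 95% confidence intervals for these coefficients could also be computed directly from the fitted model, though they inherit the same questionable assumptions:

# parametric 95% confidence intervals for the regression coefficients
confint(boxcox.model, level = 0.95)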

4 Employing Bootstrapping Method

4.1 Generating A Bootstrap Simple Linear Regression Model

A bootstrap SLR model can be generated by repeatedly resampling, with replacement, the observations used for the previous SLR model and refitting the model to each resample.

B <- 1000    # number of bootstrap SLR models to generate

# vectors to store the bootstrap regression coefficients
boot.beta0 <- numeric(B)
boot.beta1 <- numeric(B)

vector.id <- 1:length(Price.25)   # vector of observation IDs

for(i in 1:B){
  # sampling observation IDs with replacement, same size as the original sample
  boot.id <- sample(vector.id, length(Price.25), replace = TRUE)

  # matching response and explanatory variable values to the bootstrap sample IDs
  boot.price <- Price.25[boot.id]
  boot.rooms <- MelbourneHousing$Rooms[boot.id]

  # fitting the bootstrap SLR model for this bootstrap sample
  boot.SLR <- lm(boot.price ~ boot.rooms)

  # storing the regression coefficient values for each bootstrap SLR
  boot.beta0[i] <- coef(boot.SLR)[1]
  boot.beta1[i] <- coef(boot.SLR)[2]
}
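As an aside, the same resampling scheme could be written more compactly with the boot package; the following is a sketch under the assumption that the package is available, with the helper function boot.fn introduced here purely for illustration:

library(boot)   # provides boot() and boot.ci()

# statistic function: refit the SLR on the resampled rows and return its coefficients
boot.fn <- function(data, idx) coef(lm(price ~ rooms, data = data[idx, ]))

boot.df <- data.frame(price = Price.25, rooms = MelbourneHousing$Rooms)
boot.out <- boot(boot.df, boot.fn, R = 1000)
boot.ci(boot.out, type = "perc", index = 2)   # percentile CI for the slope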

4.2 Constructing Bootstrap Confidence Intervals for Regression Coefficients

Now that large bootstrap samples of both regression coefficients have been generated, a 95% confidence interval can be constructed for each by taking the 2.5% and 97.5% quantiles of the corresponding bootstrap distribution.

# determining and displaying the 2.5% and 97.5% quantiles for both bootstrap
# regression coefficient samples
(boot.beta0.95 <- quantile(boot.beta0, c(0.025, 0.975), type = 2))
(boot.beta1.95 <- quantile(boot.beta1, c(0.025, 0.975), type = 2))
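The bootstrap sampling distribution of the slope can also be visualized as a quick sanity check on the interval:

# histogram of the bootstrap slope estimates with the 95% interval endpoints marked
hist(boot.beta1, breaks = 30, main = "Bootstrap Distribution of the Slope",
     xlab = "Slope Estimate")
abline(v = boot.beta1.95, lty = 2)   # dashed lines at the 2.5% and 97.5% quantiles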

The 95% bootstrap confidence interval for the y-intercept is [3.841062, 3.842453], which contains the y-intercept estimate of the final parametric SLR. The 95% bootstrap confidence interval for the slope is [0.009027927, 0.009488906], which likewise contains the slope estimate of the final parametric SLR. Because this bootstrap confidence interval for the slope does not contain 0, the bootstrapping method also suggests a statistically significant positive association between the response variable and the explanatory variable.

5 Conclusion

Thus, the inferences drawn from the bootstrapping method essentially echo those drawn from the analysis of the parametric SLR. That said, the bootstrap inferences provide more robust and trustworthy evidence for the statistical significance of the association, since their validity does not rely on the constant variance and normality assumptions that the parametric SLR appeared to seriously violate. Reporting the inferences of the bootstrapping method is therefore recommended in this case.