1. Present the study design and aims (5%)

Study Design - A cross-sectional investigation of housing conditions in 506 suburbs in the Boston area was conducted to develop and validate a model for predicting housing prices.

The current analysis aims to develop the prediction model and to validate its performance internally.

2. Describe the characteristics of the study sample (5%)

2.1 Import the data

# Select the CSV file interactively, then read it into a data frame
f = file.choose()
housing_prices = read.csv(f)

2.2 Characteristics of the study sample

library(table1)

## 
## Attaching package: 'table1'

## The following objects are masked from 'package:base':
## 
##     units, units<-

table1(~ crime + zone + industry + river + nox + rooms + age + distance + radial + ptratio + lstat + price, data = housing_prices)

Overall (N=506)

Variable     Mean (SD)         Median [Min, Max]
crime        3.61 (8.60)       0.257 [0.00632, 89.0]
zone         11.4 (23.3)       0 [0, 100]
industry     11.1 (6.86)       9.69 [0.460, 27.7]
river        0.0692 (0.254)    0 [0, 1.00]
nox          0.555 (0.116)     0.538 [0.385, 0.871]
rooms        6.28 (0.703)      6.21 [3.56, 8.78]
age          68.6 (28.1)       77.5 [2.90, 100]
distance     3.80 (2.11)       3.21 [1.13, 12.1]
radial       9.55 (8.71)       5.00 [1.00, 24.0]
ptratio      18.5 (2.16)       19.1 [12.6, 22.0]
lstat        12.7 (7.14)       11.4 [1.73, 38.0]
price        22.5 (9.20)       21.2 [5.00, 50.0]

The 506 suburbs vary widely on most candidate predictors. Crime is strongly right-skewed: the median rate is 0.257, but the mean is 3.61 and the maximum reaches 89.0, so a small number of suburbs have very high crime. Zoning is similarly skewed, with a median of 0 and a maximum of 100. Proximity to the river is rare, applying to only 6.9% of suburbs. NOx levels, reflecting air quality, range from 0.385 to 0.871. The number of rooms per dwelling is relatively consistent (mean 6.28, SD 0.70), whereas the age variable spans a broad range (2.9 to 100, median 77.5). Distance to employment centers (1.13 to 12.1) and radial accessibility (1 to 24) point to differences in commuting convenience. The pupil-teacher ratio varies moderately (mean 18.5, SD 2.16), and the percentage of lower-status population ranges from 1.73 to 38.0. Housing prices themselves range from 5.0 to 50.0, with a mean of 22.5 and an SD of 9.2. These eleven variables are carried forward as candidate predictors of price in the model development.

3. Develop a model for predicting housing prices (30%)

3.1 Examine the distribution of housing prices and the correlation among its candidate predictors. Interpret the graph.

library(ggplot2)
p <- ggplot(data = housing_prices, aes(x = price))
p1 <- p + geom_histogram(aes(y = after_stat(density)), color = "white", fill = "blue", bins = 30)
p2 <- p1 + geom_density(col = "red")
p2 + ggtitle("Distribution of Boston Housing Prices") + theme_bw()

The histogram with an overlaid density curve shows a right-skewed distribution of housing prices. Most prices fall between $15,000 and $25,000, with a peak around $20,000. A smaller secondary peak sits at $50,000, which is also the maximum value in the data, so these suburbs may reflect an upper limit on recorded prices rather than a distinct luxury segment. Because of the skew and the cluster at the top of the range, a linear model may fit the upper tail less well, which makes the residual diagnostics worth checking.
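The heading of this step also asks for the correlation among the candidate predictors. A minimal sketch of one way to examine it is given here. It assumes that the assignment CSV mirrors the classic Boston data shipped with the MASS package under renamed columns; that mapping is an assumption made so the example runs on its own, and `housing_prices` could be substituted for the stand-in `boston`. The 0.7 cut-off is an arbitrary illustrative threshold.

```r
# Stand-in data (assumption): MASS::Boston renamed to match the report's columns
boston <- with(MASS::Boston, data.frame(
  crime = crim, zone = zn, industry = indus, river = chas, nox = nox,
  rooms = rm, age = age, distance = dis, radial = rad,
  ptratio = ptratio, lstat = lstat, price = medv))

# Pearson correlation matrix of the eleven candidate predictors
predictors <- setdiff(names(boston), "price")
cor_mat <- cor(boston[, predictors])

# List each predictor pair with |r| > 0.7 once, strongest first
idx <- which(abs(cor_mat) > 0.7 & upper.tri(cor_mat), arr.ind = TRUE)
strong <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = round(cor_mat[idx], 2))
strong[order(-abs(strong$r)), ]
```

Strongly correlated pairs are worth noting before model building, because BMA may then split posterior probability between predictors that carry similar information.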

3.2 Develop a model for predicting housing prices using a Bayesian Model Averaging (BMA) approach.

library(caret)

## Loading required package: lattice

set.seed(42)

index = createDataPartition(housing_prices$price, p = 0.75, list = FALSE)
training = housing_prices[index, ]
dim(training)

## [1] 381  13

testing = housing_prices[-index, ]
dim(testing)

## [1] 125  13

library(BMA)

## Loading required package: survival

## 
## Attaching package: 'survival'

## The following object is masked from 'package:caret':
## 
##     cluster

## Loading required package: leaps

## Loading required package: robustbase

## 
## Attaching package: 'robustbase'

## The following object is masked from 'package:survival':
## 
##     heart

## Loading required package: inline

## Loading required package: rrcov

## Scalable Robust Estimators with High Breakdown Point (version 1.7-5)

xvars = training[, c("crime", "zone", "industry", "river", "nox", "rooms", "age", "distance", "radial", "ptratio", "lstat")]
yvar = training[,c("price")]
bma = bicreg(x = xvars, y = yvar, strict = FALSE, OR = 20)
summary(bma)

## 
## Call:
## bicreg(x = xvars, y = yvar, strict = FALSE, OR = 20)
## 
## 
##   14  models were selected
##  Best  5  models (cumulative posterior probability =  0.6702 ): 
## 
##            p!=0    EV         SD        model 1     model 2     model 3   
## Intercept  100.0   3.630e+01  6.269017    33.67982    34.25556    39.66277
## crime       91.7  -1.017e-01  0.046737    -0.10147    -0.08842    -0.12671
## zone        46.7   1.738e-02  0.021434     0.03912       .           .    
## industry     2.3  -1.236e-03  0.013041       .           .           .    
## river       63.7   1.666e+00  1.490535     2.60519     2.54394     2.60610
## nox        100.0  -1.912e+01  4.187089   -17.78912   -18.07426   -21.85753
## rooms      100.0   4.551e+00  0.483530     4.52218     4.69457     4.50766
## age          0.9   1.311e-05  0.001503       .           .           .    
## distance   100.0  -1.459e+00  0.246003    -1.58849    -1.32289    -1.35600
## radial      36.0   3.818e-02  0.058343       .           .         0.11238
## ptratio    100.0  -1.061e+00  0.176398    -0.92343    -1.03890    -1.19494
## lstat      100.0  -5.346e-01  0.056451    -0.53056    -0.52780    -0.53258
##                                                                           
## nVar                                         8           7           8    
## r2                                         0.744       0.740       0.744  
## BIC                                     -472.15082  -471.56220  -471.52535
## post prob                                  0.184       0.137       0.134  
##            model 4     model 5   
## Intercept    34.29571    34.84329
## crime        -0.10701    -0.09415
## zone          0.03814       .    
## industry        .           .    
## river           .           .    
## nox         -17.26900   -17.55912
## rooms         4.55775     4.72511
## age             .           .    
## distance     -1.60624    -1.34671
## radial          .           .    
## ptratio      -0.96748    -1.07913
## lstat        -0.53297    -0.53023
##                                  
## nVar            7           6    
## r2            0.740       0.735  
## BIC        -471.16683  -470.99724
## post prob     0.112       0.103

imageplot.bma(bma)
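The `post prob` row in the summary can be related to the `BIC` row with a short calculation. The sketch below assumes the usual BIC approximation, in which a model's posterior weight is proportional to exp(-BIC / 2); the BIC values and the cumulative probability of 0.6702 are copied from the output above, and the variable names are illustrative.

```r
# BIC values of the five best models, copied from the summary(bma) output
bic <- c(model1 = -472.15082, model2 = -471.56220, model3 = -471.52535,
         model4 = -471.16683, model5 = -470.99724)

# Relative weights proportional to exp(-BIC / 2); subtracting the minimum
# first keeps the exponentials numerically stable
w <- exp(-0.5 * (bic - min(bic)))

# The five models jointly hold 0.6702 of the posterior mass (from the output),
# so rescale the relative weights to that total
post_prob <- 0.6702 * w / sum(w)
round(post_prob, 3)
```

The rounded values agree with the reported posterior probabilities (0.184, 0.137, 0.134, 0.112, 0.103). The differences between the top models are small, which is why no single model dominates.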

m1 = lm(price ~ nox+rooms+distance+ptratio+lstat, data = training)
summary(m1)

## 
## Call:
## lm(formula = price ~ nox + rooms + distance + ptratio + lstat, 
##     data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.741  -3.066  -0.692   2.160  28.154 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.04576    5.59084   6.805 4.01e-11 ***
## nox         -18.86576    3.72441  -5.065 6.41e-07 ***
## rooms         4.55543    0.47407   9.609  < 2e-16 ***
## distance     -1.31671    0.20204  -6.517 2.32e-10 ***
## ptratio      -1.15474    0.13432  -8.597 2.25e-16 ***
## lstat        -0.56929    0.05519 -10.315  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.012 on 375 degrees of freedom
## Multiple R-squared:  0.729,  Adjusted R-squared:  0.7254 
## F-statistic: 201.8 on 5 and 375 DF,  p-value: < 2.2e-16

# Get the model coefficients
coef_m1 = coef(m1)

# Present the model equation
cat("The final model equation is:\n")

## The final model equation is:

cat("price =", coef_m1[1], "\n")  # Intercept

## price = 38.04576

# Append the remaining terms, printing each sign explicitly
for (i in 2:length(coef_m1)) {
  sign_i = if (coef_m1[i] < 0) " -" else " +"
  cat(sign_i, abs(coef_m1[i]), "*", names(coef_m1)[i], "\n")
}

##  - 18.86576 * nox 
##  + 4.555432 * rooms 
##  - 1.31671 * distance 
##  - 1.154739 * ptratio 
##  - 0.5692901 * lstat

par(mfrow = c(2,2))
plot(m1)

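Interpret the findings

Across the 14 models retained by BMA, five predictors (nox, rooms, distance, ptratio and lstat) have a posterior inclusion probability of 100%. Crime (91.7%), river (63.7%), zone (46.7%) and radial (36.0%) are less certain, and industry (2.3%) and age (0.9%) are rarely included. The final model therefore keeps the five predictors that appear in every selected model. Fitted to the training set (n = 381), this linear regression explains about 72.9% of the variance in housing prices (adjusted R-squared 0.725), only slightly below the 74.4% of the best eight-variable BMA model. All five coefficients are statistically significant (p < 0.001). With price recorded in thousands of dollars, they are interpreted as follows, each holding the other predictors constant:

Nitrogen Oxide (nox): Higher pollution is associated with lower prices; a 0.1-unit increase in NOx corresponds to a decrease of about 1.89 (roughly $1,890).
Number of Rooms: Each additional room is associated with an increase of about 4.56 (roughly $4,560), reflecting buyer preference for larger homes.
Distance to Employment Centers: Each additional unit of distance is associated with a decrease of about 1.32, underscoring the importance of location and commute times.
Pupil-Teacher Ratio (ptratio): Each additional pupil per teacher is associated with a decrease of about 1.15, consistent with the value placed on school quality.
Percentage of Lower Status Population (lstat): Each additional percentage point is associated with a decrease of about 0.57.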

While the model fits well overall, the diagnostic plots suggest mild heteroscedasticity and a few influential outliers; the largest residual (28.2) is more than five times the residual standard error (5.01). Robust regression or a closer inspection of the flagged observations would be reasonable refinements.
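The visual impressions from the diagnostic plots can be backed by simple numerical checks. The sketch below is one possible approach using base R only: a Breusch-Pagan-style auxiliary regression (the n times R-squared form, with fitted values as the single regressor) and Cook's distance with the common 4/n rule of thumb. It assumes the CSV mirrors MASS::Boston under renamed columns and fits the model to all rows rather than to the report's `training` split, so the numbers will not match `m1` exactly.

```r
# Stand-in data (assumption): MASS::Boston renamed to match the report's columns
boston <- with(MASS::Boston, data.frame(
  nox = nox, rooms = rm, distance = dis,
  ptratio = ptratio, lstat = lstat, price = medv))

fit <- lm(price ~ nox + rooms + distance + ptratio + lstat, data = boston)

# Heteroscedasticity: regress squared residuals on the fitted values;
# n * R-squared is approximately chi-square(1) under constant variance
aux <- lm(resid(fit)^2 ~ fitted(fit))
bp_stat <- nrow(boston) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

# Influence: Cook's distance, flagged with the 4/n rule of thumb
cd <- cooks.distance(fit)
flagged <- which(cd > 4 / nrow(boston))

c(bp_stat = bp_stat, bp_p = bp_p, n_flagged = length(flagged))
```

A small p-value would support the impression of non-constant variance, and the `flagged` indices identify the suburbs to inspect before deciding on robust regression.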

In summary, air quality, dwelling size, distance to employment centers, pupil-teacher ratio and socio-economic status are the most consistent predictors of housing prices in these data, which makes the model a useful starting point for stakeholders in real estate and urban planning.

4. Validate the model’s performance internally (30%)

Identify the optimal internal validation method used to validate the model’s performance. Explain the reason(s).

The optimal internal validation method for this model is bootstrap validation. The bootstrap repeatedly draws samples of the same size as the original data, with replacement, and refits the model in each sample. Each refitted model is evaluated twice: on the bootstrap sample it was fitted to (training performance) and on the original data (test performance). The average difference between the two is the optimism, and subtracting it from the apparent performance gives an optimism-corrected estimate. This approach has three advantages here. First, all 506 suburbs are used for both model fitting and validation, whereas a single split would leave only about 125 observations for testing. Second, averaging over many resamples gives more stable estimates of performance metrics such as R-squared and MSE than one split does. Third, the optimism estimate directly quantifies overfitting. The bootstrap also suits a model selected through BMA, because resampling shows how stable the model's performance is when the data change.
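The optimism correction described above can be written out by hand. The following is a minimal base R sketch of the idea, not the exact computation performed by `rms::validate()`. It assumes the CSV mirrors MASS::Boston under renamed columns, uses 200 replicates as an arbitrary choice, and validates only the final five-predictor model; the helper `r2()` is hypothetical.

```r
# Stand-in data (assumption): MASS::Boston renamed to match the report's columns
boston <- with(MASS::Boston, data.frame(
  nox = nox, rooms = rm, distance = dis,
  ptratio = ptratio, lstat = lstat, price = medv))
form <- price ~ nox + rooms + distance + ptratio + lstat

# R-squared of a fitted model evaluated on an arbitrary data set
r2 <- function(fit, data) {
  pred <- predict(fit, newdata = data)
  1 - sum((data$price - pred)^2) / sum((data$price - mean(data$price))^2)
}

set.seed(42)
B <- 200                                   # bootstrap replicates (arbitrary)
apparent <- r2(lm(form, data = boston), boston)

optimism <- replicate(B, {
  boot  <- boston[sample(nrow(boston), replace = TRUE), ]
  fit_b <- lm(form, data = boot)
  r2(fit_b, boot) - r2(fit_b, boston)      # training minus test performance
})

corrected <- apparent - mean(optimism)
round(c(apparent = apparent, optimism = mean(optimism),
        corrected = corrected), 4)
```

Repeating the variable selection inside each replicate would give a stricter estimate, at the cost of running BMA once per bootstrap sample.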

Conduct an internal validation to assess the model’s performance.

library(rms)

## Loading required package: Hmisc

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:table1':
## 
##     label, label<-, units

## The following objects are masked from 'package:base':
## 
##     format.pval, units

fit.m2 = ols(price ~ nox+rooms+distance+ptratio+lstat, data = housing_prices, x = TRUE, y = TRUE)
fit.m2

## Linear Regression Model
## 
## ols(formula = price ~ nox + rooms + distance + ptratio + lstat, 
##     data = housing_prices, x = TRUE, y = TRUE)
## 
##                 Model Likelihood    Discrimination    
##                       Ratio Test           Indexes    
## Obs     506    LR chi2    623.04    R2       0.708    
## sigma4.9939    d.f.            5    R2 adj   0.705    
## d.f.    500    Pr(> chi2) 0.0000    g        8.699    
## 
## Residuals
## 
##      Min       1Q   Median       3Q      Max 
## -12.7765  -3.0186  -0.6481   1.9752  27.7625 
## 
## 
##           Coef     S.E.   t      Pr(>|t|)
## Intercept  37.4992 4.6129   8.13 <0.0001 
## nox       -17.9966 3.2610  -5.52 <0.0001 
## rooms       4.1633 0.4120  10.10 <0.0001 
## distance   -1.1847 0.1684  -7.03 <0.0001 
## ptratio    -1.0458 0.1135  -9.21 <0.0001 
## lstat      -0.5811 0.0479 -12.12 <0.0001

Present and interpret the findings.

set.seed(42)
v.m2 = validate(fit.m2, B = 500)
v.m2

##           index.orig training    test optimism index.corrected   n
## R-square      0.7081   0.7126  0.7022   0.0104          0.6977 500
## MSE          24.6430  24.1745 25.1419  -0.9674         25.6104 500
## g             8.6986   8.7001  8.6687   0.0314          8.6672 500
## Intercept     0.0000   0.0000  0.0998  -0.0998          0.0998 500
## Slope         1.0000   1.0000  0.9960   0.0040          0.9960 500

The bootstrap validation (500 resamples) of the five-predictor linear model selected through BMA shows little optimism. The apparent R-square of 0.708 falls to an optimism-corrected 0.698, so the model is expected to explain about 69.8% of the variance in housing prices in new data from the same population. The corrected mean squared error (MSE) is 25.61, which corresponds to a root mean squared error of about 5.06 on the price scale, compared with a standard deviation of 9.20 for price itself. The optimism estimates for R-square (0.0104) and MSE (-0.9674) are small, indicating a low risk of overfitting. The calibration intercept (0.10) and slope (0.996) are close to their ideal values of 0 and 1, so the predictions are well calibrated on average. Note that the validation covers the fitted linear model only: the BMA selection step was not repeated within each bootstrap sample, so the true optimism may be somewhat larger. Overall, these findings support the stability of the model for predicting housing prices in the Boston area.
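The corrected indexes in the table follow from a simple subtraction, index.corrected = index.orig - optimism. The short worked example below reproduces them from the values printed in the `validate()` output and converts the corrected MSE to a root mean squared error; the variable names are illustrative.

```r
# Values copied from the validate() output
r2_orig  <- 0.7081;  r2_optimism  <- 0.0104
mse_orig <- 24.6430; mse_optimism <- -0.9674

r2_corrected   <- r2_orig - r2_optimism      # optimism-corrected R-square
mse_corrected  <- mse_orig - mse_optimism    # a negative optimism raises the MSE
rmse_corrected <- sqrt(mse_corrected)        # back on the scale of price

round(c(r2 = r2_corrected, mse = mse_corrected, rmse = rmse_corrected), 4)
```

The RMSE of roughly 5.06 is easier to interpret than the MSE because it is expressed in the same units as price.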

5. Summarise and conclude the findings of this study. (5%)

This study found wide variation across the 506 suburbs in factors related to housing prices, such as crime rates, zoning values, air quality (NOx levels), number of rooms, age of buildings, and distance to employment centers. Using Bayesian Model Averaging (BMA) to select predictors, the final linear model retained nitrogen oxide levels, number of rooms, distance to employment centers, pupil-teacher ratio, and percentage of lower-status population, and explained about 72.9% of the variance in housing prices in the training data.

Bootstrap validation with 500 resamples gave an optimism-corrected R-square of 0.698 (69.8%) and a mean squared error of 25.61. The small optimism and the near-ideal calibration slope (0.996) indicate a low risk of overfitting, so the model should predict housing prices in similar Boston-area suburbs with comparable accuracy.

Assignment Intermediate

Justin Sia

10/05/2024