f = file.choose()
housing_prices = read.csv(f)
library(table1)
##
## Attaching package: 'table1'
## The following objects are masked from 'package:base':
##
## units, units<-
table1(~ crime + zone + industry + river + nox + rooms + age + distance + radial + ptratio + lstat + price, data = housing_prices)
| | Overall (N=506) |
|---|---|
| crime | |
| Mean (SD) | 3.61 (8.60) |
| Median [Min, Max] | 0.257 [0.00632, 89.0] |
| zone | |
| Mean (SD) | 11.4 (23.3) |
| Median [Min, Max] | 0 [0, 100] |
| industry | |
| Mean (SD) | 11.1 (6.86) |
| Median [Min, Max] | 9.69 [0.460, 27.7] |
| river | |
| Mean (SD) | 0.0692 (0.254) |
| Median [Min, Max] | 0 [0, 1.00] |
| nox | |
| Mean (SD) | 0.555 (0.116) |
| Median [Min, Max] | 0.538 [0.385, 0.871] |
| rooms | |
| Mean (SD) | 6.28 (0.703) |
| Median [Min, Max] | 6.21 [3.56, 8.78] |
| age | |
| Mean (SD) | 68.6 (28.1) |
| Median [Min, Max] | 77.5 [2.90, 100] |
| distance | |
| Mean (SD) | 3.80 (2.11) |
| Median [Min, Max] | 3.21 [1.13, 12.1] |
| radial | |
| Mean (SD) | 9.55 (8.71) |
| Median [Min, Max] | 5.00 [1.00, 24.0] |
| ptratio | |
| Mean (SD) | 18.5 (2.16) |
| Median [Min, Max] | 19.1 [12.6, 22.0] |
| lstat | |
| Mean (SD) | 12.7 (7.14) |
| Median [Min, Max] | 11.4 [1.73, 38.0] |
| price | |
| Mean (SD) | 22.5 (9.20) |
| Median [Min, Max] | 21.2 [5.00, 50.0] |
The summary of the 506 observations shows substantial variability in several factors that may influence housing prices. Crime rates are strongly right-skewed: the median is 0.257, but the mean is 3.61 and the maximum reaches 89.0. Zoning values are similarly skewed, with a median of 0 and a range of 0 to 100, which points to very uneven residential density. River proximity is rare (the mean of the 0/1 indicator is 0.0692, so about 7% of observations are near the river) but could affect property values because waterfront locations tend to be desirable. NOx levels, which reflect air quality, range from 0.385 to 0.871. The number of rooms per dwelling is comparatively consistent (mean 6.28, SD 0.703), whereas the age variable spans 2.90 to 100, which may affect maintenance costs. Distance to employment centers (1.13 to 12.1) and radial accessibility (1 to 24) reflect differences in commuting convenience. The pupil-teacher ratio varies moderately (12.6 to 22.0), and the percentage of lower-status population ranges widely, from 1.73 to 38.0. Housing prices themselves vary considerably (mean 22.5, SD 9.20, range 5.00 to 50.0). This variability motivates the predictive model developed and validated in the rest of the analysis.
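To compare spread across variables measured on different scales, one option is the coefficient of variation (SD divided by mean). The sketch below is only an illustration: it uses the rounded means and SDs copied by hand from the summary table, so the results are approximate, and the data frame name `stats` is introduced here for the example.

```r
# Means and SDs copied from the summary table (rounded as printed there).
stats <- data.frame(
  variable = c("crime", "zone", "industry", "river", "nox", "rooms",
               "age", "distance", "radial", "ptratio", "lstat", "price"),
  mean = c(3.61, 11.4, 11.1, 0.0692, 0.555, 6.28,
           68.6, 3.80, 9.55, 18.5, 12.7, 22.5),
  sd   = c(8.60, 23.3, 6.86, 0.254, 0.116, 0.703,
           28.1, 2.11, 8.71, 2.16, 7.14, 9.20)
)

# Coefficient of variation: SD relative to the mean (unitless).
stats$cv <- stats$sd / stats$mean

# Sort from most to least dispersed relative to the mean.
stats <- stats[order(-stats$cv), ]
print(stats, row.names = FALSE, digits = 3)
```

On this measure river, crime, and zone are the most dispersed, while rooms and ptratio are the least, which agrees with the description above.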
library(ggplot2)
p <- ggplot(data = housing_prices, aes(x = price))
p1 <- p + geom_histogram(aes(y = after_stat(density)), color = "white", fill = "blue", bins = 30)
p2 <- p1 + geom_density(col = "red")
p2 + ggtitle("Distribution of Boston Housing Prices") + theme_bw()
The histogram with the overlaid density curve shows a right-skewed distribution of housing prices. Reading price in thousands of dollars, most values fall between $15,000 and $25,000, with the peak near $20,000. A smaller secondary peak appears at about $50,000, which is also the maximum value in the summary table (50.0). These observations may represent higher-priced, possibly luxury, properties, but a cluster sitting exactly at the maximum could also mean that prices were capped at that value when recorded, so it should be interpreted with caution. Either way, the upper tail shows that the data span several market segments, which a predictive model needs to accommodate.
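One quick way to examine the peak at the top of the price range is to count how many observations sit exactly at the maximum. The following is a minimal sketch: `share_at_max` is a helper name introduced here, and `toy_price` is a made-up vector used only so the example runs on its own.

```r
# Share of observations sitting exactly at the maximum of a numeric vector.
# A large share hints at top-coding (capping) rather than a genuine peak.
share_at_max <- function(x) {
  x <- x[!is.na(x)]
  mean(x == max(x))
}

# Hypothetical toy vector, NOT the study data: two of eight values at the cap.
toy_price <- c(18.2, 21.0, 22.5, 19.9, 24.1, 33.0, 50.0, 50.0)
share_at_max(toy_price)

# With the study data loaded, the same check would be:
# share_at_max(housing_prices$price)
```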
library(caret)
## Loading required package: lattice
set.seed(42)
index = createDataPartition(housing_prices$price, p = 0.75, list = FALSE)
training = housing_prices[index, ]
dim(training)
## [1] 381 13
testing = housing_prices[-index, ]
dim(testing)
## [1] 125 13
library(BMA)
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: leaps
## Loading required package: robustbase
##
## Attaching package: 'robustbase'
## The following object is masked from 'package:survival':
##
## heart
## Loading required package: inline
## Loading required package: rrcov
## Scalable Robust Estimators with High Breakdown Point (version 1.7-5)
xvars = training[, c("crime", "zone", "industry", "river", "nox", "rooms", "age", "distance", "radial", "ptratio", "lstat")]
yvar = training[,c("price")]
bma = bicreg(x = xvars, y = yvar, strict = FALSE, OR = 20)
summary(bma)
##
## Call:
## bicreg(x = xvars, y = yvar, strict = FALSE, OR = 20)
##
##
## 14 models were selected
## Best 5 models (cumulative posterior probability = 0.6702 ):
##
## p!=0 EV SD model 1 model 2 model 3
## Intercept 100.0 3.630e+01 6.269017 33.67982 34.25556 39.66277
## crime 91.7 -1.017e-01 0.046737 -0.10147 -0.08842 -0.12671
## zone 46.7 1.738e-02 0.021434 0.03912 . .
## industry 2.3 -1.236e-03 0.013041 . . .
## river 63.7 1.666e+00 1.490535 2.60519 2.54394 2.60610
## nox 100.0 -1.912e+01 4.187089 -17.78912 -18.07426 -21.85753
## rooms 100.0 4.551e+00 0.483530 4.52218 4.69457 4.50766
## age 0.9 1.311e-05 0.001503 . . .
## distance 100.0 -1.459e+00 0.246003 -1.58849 -1.32289 -1.35600
## radial 36.0 3.818e-02 0.058343 . . 0.11238
## ptratio 100.0 -1.061e+00 0.176398 -0.92343 -1.03890 -1.19494
## lstat 100.0 -5.346e-01 0.056451 -0.53056 -0.52780 -0.53258
##
## nVar 8 7 8
## r2 0.744 0.740 0.744
## BIC -472.15082 -471.56220 -471.52535
## post prob 0.184 0.137 0.134
## model 4 model 5
## Intercept 34.29571 34.84329
## crime -0.10701 -0.09415
## zone 0.03814 .
## industry . .
## river . .
## nox -17.26900 -17.55912
## rooms 4.55775 4.72511
## age . .
## distance -1.60624 -1.34671
## radial . .
## ptratio -0.96748 -1.07913
## lstat -0.53297 -0.53023
##
## nVar 7 6
## r2 0.740 0.735
## BIC -471.16683 -470.99724
## post prob 0.112 0.103
imageplot.bma(bma)
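The regression fitted next keeps the five predictors whose posterior inclusion probability (the p!=0 column in the summary) is printed as 100.0. The sketch below shows that selection rule with the probabilities copied by hand from the summary. The cutoff of 100 is an assumption that reproduces the five-variable model, not a rule stated in the analysis. With the fitted object, the same values should be available as bma$probne0, assuming the usual structure of the bicreg return value.

```r
# Posterior inclusion probabilities (p!=0, in percent) copied from summary(bma).
# These are the rounded printed values, not the exact stored ones.
pip <- c(crime = 91.7, zone = 46.7, industry = 2.3, river = 63.7,
         nox = 100.0, rooms = 100.0, age = 0.9, distance = 100.0,
         radial = 36.0, ptratio = 100.0, lstat = 100.0)

# Keep predictors at or above the cutoff (assumed threshold, see lead-in).
cutoff <- 100
selected <- names(pip)[pip >= cutoff]

# Build the model formula from the selected names.
final_formula <- reformulate(selected, response = "price")
print(final_formula)
```

Lowering the cutoff to 90 would also bring in crime (91.7), so the choice of threshold is a modelling decision worth stating explicitly.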
m1 = lm(price ~ nox+rooms+distance+ptratio+lstat, data = training)
summary(m1)
##
## Call:
## lm(formula = price ~ nox + rooms + distance + ptratio + lstat,
## data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.741 -3.066 -0.692 2.160 28.154
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.04576 5.59084 6.805 4.01e-11 ***
## nox -18.86576 3.72441 -5.065 6.41e-07 ***
## rooms 4.55543 0.47407 9.609 < 2e-16 ***
## distance -1.31671 0.20204 -6.517 2.32e-10 ***
## ptratio -1.15474 0.13432 -8.597 2.25e-16 ***
## lstat -0.56929 0.05519 -10.315 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.012 on 375 degrees of freedom
## Multiple R-squared: 0.729, Adjusted R-squared: 0.7254
## F-statistic: 201.8 on 5 and 375 DF, p-value: < 2.2e-16
# Get the model coefficients
coef_m1 = coef(m1)
# Present the model equation
cat("The final model equation is:\n")
## The final model equation is:
cat("price =", coef_m1[1], "\n") # Intercept
## price = 38.04576
# Loop through the remaining coefficients
for (i in 2:length(coef_m1)) {
cat(" +", coef_m1[i], "*", names(coef_m1)[i], "\n")
}
## + -18.86576 * nox
## + 4.555432 * rooms
## + -1.31671 * distance
## + -1.154739 * ptratio
## + -0.5692901 * lstat
par(mfrow = c(2,2))
plot(m1)
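### Interpret the findings
Our linear regression model explains about 72.9% of the variance in housing prices in the training set (adjusted R-squared 0.7254, residual standard error 5.012 on 375 degrees of freedom), indicating a reasonably strong fit. All five predictors are statistically significant (p < 0.001). Holding the other predictors fixed, the estimated associations with price, in price units, are:
- nox: a 0.1 increase is associated with a price about 1.89 lower (coefficient -18.87).
- rooms: each additional room is associated with a price about 4.56 higher.
- distance: each one-unit increase in distance to employment centers is associated with a price about 1.32 lower.
- ptratio: each one-unit increase in the pupil-teacher ratio is associated with a price about 1.15 lower.
- lstat: each additional percentage point of lower-status population is associated with a price about 0.57 lower.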
While the model performs well, the diagnostic plots suggest minor heteroscedasticity and some influential outliers. The residuals are also asymmetric: the largest is 28.154, while the smallest is -11.741. Possible refinements include robust regression techniques or a closer examination of the outlying observations.
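In summary, the model identifies air quality, number of rooms, distance to employment centers, pupil-teacher ratio, and lower-status population share as the main factors associated with housing prices in the Boston area, which may be useful to stakeholders in real estate and urban planning.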
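The held-out testing set created earlier offers a direct check of these training-set figures. The sketch below shows one way to compute out-of-sample RMSE and R-squared. The helper `holdout_metrics` is a name introduced here, and the simulated data frame is a stand-in so the example runs on its own; it is not the housing data and its numbers say nothing about the study.

```r
# Out-of-sample RMSE and R-squared for a fitted lm on new data.
holdout_metrics <- function(model, newdata, outcome) {
  pred <- predict(model, newdata = newdata)
  obs <- newdata[[outcome]]
  rmse <- sqrt(mean((obs - pred)^2))
  r2 <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
  c(RMSE = rmse, R2 = r2)
}

# Simulated stand-in data (hypothetical, NOT the housing data).
set.seed(1)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 2 + 1.5 * sim$x1 - 0.8 * sim$x2 + rnorm(n)
train <- sim[1:150, ]
test <- sim[151:200, ]

fit <- lm(y ~ x1 + x2, data = train)
holdout_metrics(fit, test, "y")

# With the objects from this report, the call would be:
# holdout_metrics(m1, testing, "price")
```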
Bootstrap validation was chosen as the internal validation method. Note that the model validated here is the final five-predictor linear model whose predictors were selected with BMA, refit with ols() on all 506 observations, not the model average itself. The bootstrap repeatedly samples the data with replacement, refits the model on each resample, and compares its performance on the resample with its performance on the original data. The average difference is the optimism, which is subtracted from the apparent performance to give a corrected estimate. This yields optimism-corrected values of metrics such as R-squared and MSE and helps detect overfitting, while every observation contributes to both fitting and evaluation. One limitation applies: because the predictors are fixed before resampling, the procedure does not repeat the BMA selection step, so it does not assess the stability of the posterior inclusion probabilities, and the corrected estimates may still be somewhat optimistic.
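The optimism correction can be written out by hand for R-squared. The sketch below is an illustration of the general idea for a plain lm fit, not the rms implementation; the function names `r2` and `optimism_r2` are introduced here, and the simulated data are a stand-in for the housing data.

```r
# R-squared of predictions against observed values.
r2 <- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

# Optimism-corrected R-squared for a linear model via the bootstrap.
optimism_r2 <- function(formula, data, B = 200) {
  y_name <- all.vars(formula)[1]
  # Apparent performance: model fitted and evaluated on the same data.
  apparent <- r2(data[[y_name]], fitted(lm(formula, data = data)))
  optimism <- replicate(B, {
    boot <- data[sample(nrow(data), replace = TRUE), ]
    fit_b <- lm(formula, data = boot)
    r2_boot <- r2(boot[[y_name]], fitted(fit_b))                   # on the resample
    r2_orig <- r2(data[[y_name]], predict(fit_b, newdata = data))  # on the original data
    r2_boot - r2_orig
  })
  c(apparent = apparent, optimism = mean(optimism),
    corrected = apparent - mean(optimism))
}

# Simulated stand-in data (hypothetical, NOT the housing data).
set.seed(42)
n <- 150
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(n)
res <- optimism_r2(y ~ x1 + x2, sim, B = 200)
res
```

The three returned values play the same roles as the index.orig, optimism, and index.corrected columns in the validation output.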
library(rms)
## Loading required package: Hmisc
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:table1':
##
## label, label<-, units
## The following objects are masked from 'package:base':
##
## format.pval, units
fit.m2 = ols(price ~ nox+rooms+distance+ptratio+lstat, data = housing_prices, x = TRUE, y = TRUE)
fit.m2
## Linear Regression Model
##
## ols(formula = price ~ nox + rooms + distance + ptratio + lstat,
## data = housing_prices, x = TRUE, y = TRUE)
##
## Model Likelihood Discrimination
## Ratio Test Indexes
## Obs 506 LR chi2 623.04 R2 0.708
## sigma4.9939 d.f. 5 R2 adj 0.705
## d.f. 500 Pr(> chi2) 0.0000 g 8.699
##
## Residuals
##
## Min 1Q Median 3Q Max
## -12.7765 -3.0186 -0.6481 1.9752 27.7625
##
##
## Coef S.E. t Pr(>|t|)
## Intercept 37.4992 4.6129 8.13 <0.0001
## nox -17.9966 3.2610 -5.52 <0.0001
## rooms 4.1633 0.4120 10.10 <0.0001
## distance -1.1847 0.1684 -7.03 <0.0001
## ptratio -1.0458 0.1135 -9.21 <0.0001
## lstat -0.5811 0.0479 -12.12 <0.0001
set.seed(42)
v.m2 = validate(fit.m2, B = 500)
v.m2
## index.orig training test optimism index.corrected n
## R-square 0.7081 0.7126 0.7022 0.0104 0.6977 500
## MSE 24.6430 24.1745 25.1419 -0.9674 25.6104 500
## g 8.6986 8.7001 8.6687 0.0314 8.6672 500
## Intercept 0.0000 0.0000 0.0998 -0.0998 0.0998 500
## Slope 1.0000 1.0000 0.9960 0.0040 0.9960 500
The bootstrap validation results (500 resamples) indicate that the final linear model, whose predictors were selected via Bayesian Model Averaging (BMA), performs consistently. The optimism-corrected R-square is 0.6977, so the model explains approximately 69.8% of the variance in housing prices, and the corrected mean squared error (MSE) is 25.61, which corresponds to a root mean squared error of about 5.06 in the units of price. The optimism is small for both R-square (0.0104) and MSE (-0.9674), suggesting a low risk of overfitting for this fixed set of predictors. The calibration metrics, a corrected intercept of 0.0998 and a corrected slope of 0.9960, indicate that the predictions are well calibrated. Overall, these findings support the stability of the model for predicting housing prices in the Boston area.
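The corrected indexes follow directly from the validation output: each is the original index minus its optimism. A short check of that arithmetic, using values copied by hand from the output (the variable names are introduced here for illustration):

```r
# Values copied from the validate() output.
r2_orig <- 0.7081
r2_optimism <- 0.0104
mse_orig <- 24.6430
mse_optimism <- -0.9674

# Corrected index = original index - optimism.
r2_corrected <- r2_orig - r2_optimism      # matches the reported 0.6977
mse_corrected <- mse_orig - mse_optimism   # matches the reported 25.6104

# RMSE puts the error back on the scale of price.
rmse_corrected <- sqrt(mse_corrected)

round(c(R2 = r2_corrected, MSE = mse_corrected, RMSE = rmse_corrected), 4)
```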
This study identified substantial variability in factors that may influence housing prices, such as crime rates, zoning values, air quality (NOx levels), number of rooms, age of buildings, and proximity to employment centers. The predictive model, a linear regression whose predictors were selected using Bayesian Model Averaging (BMA), explains about 72.9% of the variance in housing prices in the training set. Its predictors are nitrogen oxide levels, number of rooms, distance to employment centers, pupil-teacher ratio, and the percentage of lower-status population.
The model’s performance was further assessed using bootstrap validation on the full data set, which gave an optimism-corrected R-square of 0.6977 and a corrected mean squared error of 25.61. The small optimism values and well-calibrated predictions indicate a low risk of overfitting for the selected predictors, supporting the model as a reasonable tool for predicting housing prices.