Question

Can housing prices be predicted by location factors in Boston, MA?

Background


      Boston, MA, is one of the largest cities in the United States [1]. Its popularity likely stems from its highly desirable quality of life, ranking 18th in the nation and among the top 40 cities internationally [2].


      Despite Boston’s clear allure, an alarming number of Bostonians are relocating outside the city’s borders [3], likely because of Boston’s extremely high cost of living. The strain is especially visible in the housing market: the average cost of housing is 124% higher than the national average [4], and Boston was declared the second-most expensive city in the US for renters in 2022 [5].


      Inflated housing values of this kind can be dangerous for the economy, but they also create an opportunity to model housing value. The purpose of our research is to develop a model that predicts housing values, which has a variety of applications: it could help sellers price their homes based on neighborhood, and it could help buyers or investors determine whether a neighborhood is over- or under-valued.


      Our dataset was sourced from the U.S. Census Service for Boston and its surrounding areas. It contains data on individual neighborhoods in the Boston area across 14 features, including the target variable, MEDV, the median house value for the neighborhood. We have opted for a categorical approach in which each of five buckets represents median house value in a $10,000 increment. The increment is this small because the dataset was created in 1978; adjusting for inflation and the House Price Index (HPI) puts it closer to $125,000, which is reasonable for urban homes in today’s market.
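
      As a rough illustration of this recoding, the sketch below assumes the raw MEDV column holds median values in thousands of dollars (as in the original 1978 data) and that the data frame is named boston; the exact code used in our pipeline is not shown here.

```r
library(dplyr)

# Hypothetical recoding (object name and breakpoints assumed): five
# $10,000-wide classes labelled 0-4, cut from the raw median value in $1000s.
boston <- boston %>%
  mutate(MEDV = cut(MEDV,
                    breaks = c(0, 10, 20, 30, 40, Inf),
                    labels = 0:4,
                    right = FALSE))
```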


      Using a Random Forest model, we aimed to predict housing prices from neighborhood factors. If successful, the model could be used to compare the actual value of homes in a neighborhood to the predicted value and thereby identify over- or under-valued neighborhoods.

Dataset Structure:

## spc_tbl_ [1,195 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ CRIM  : num [1:1195] 0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ ZN    : num [1:1195] 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ INDUS : num [1:1195] 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ CHAS  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ NOX   : num [1:1195] 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ RM    : num [1:1195] 6.58 6.42 7.18 7 7.15 ...
##  $ AGE   : num [1:1195] 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ DIS   : num [1:1195] 4.09 4.97 4.97 6.06 6.06 ...
##  $ RAD   : num [1:1195] 1 2 2 3 3 3 5 5 5 5 ...
##  $ TAX   : num [1:1195] 296 242 242 222 222 222 311 311 311 311 ...
##  $ PRATIO: num [1:1195] 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ B     : num [1:1195] 397 397 393 395 397 ...
##  $ LSTAT : num [1:1195] 4.98 9.14 4.03 2.94 5.33 ...
##  $ MEDV  : num [1:1195] 2 1 3 3 3 2 1 2 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   CRIM = col_double(),
##   ..   ZN = col_double(),
##   ..   INDUS = col_double(),
##   ..   CHAS = col_double(),
##   ..   NOX = col_double(),
##   ..   RM = col_double(),
##   ..   AGE = col_double(),
##   ..   DIS = col_double(),
##   ..   RAD = col_double(),
##   ..   TAX = col_double(),
##   ..   PRATIO = col_double(),
##   ..   B = col_double(),
##   ..   LSTAT = col_double(),
##   ..   MEDV = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Exploratory Data Analysis

At first glance, the most important variables appear to be the following (their five-number summaries are produced below):
- CRIM: Per Capita Crime Rate Per Town
- DIS: Weighted Distance to Boston Employment Centers
- LSTAT: Percentage of Low Income Residents Per Town
- RM: Average Number of Rooms Per Residence
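
The summaries that follow were produced with base-R summary() calls along these lines (the data-frame name boston is an assumption):

```r
# Five-number summaries for the candidate predictors (output shown below)
summary(boston$CRIM)
summary(boston$DIS)
summary(boston$LSTAT)
summary(boston$RM)
```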


CRIM

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00632  0.06876  0.31823  4.08227  4.26623 88.97620

DIS


##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.130   2.000   3.181   3.751   5.221  12.127

LSTAT


##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.73    4.79    7.67   10.88   16.23   37.97

RM


##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.561   6.017   6.532   6.629   7.173   8.780

Methods

      The model that we use to predict median housing prices is a Random Forest. We optimize the final model through iterative model creation, tuning hyperparameters (mtry, sample size, node size) and comparing the full dataset against a sparse one (selected via variable-importance testing). Ultimately, we conclude that the best-performing model is built on the full (unsparse) data with mtry = 4, sample size = 700, and nodesize = 2, which is our sixth model iteration (RF6).

Splitting the data into train and test
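
A minimal sketch of how the roughly 90/10 split (1,076 training rows, 119 test rows out of 1,195) could be produced; the data-frame name boston and the seed are assumptions, as the original sampling code is not shown.

```r
set.seed(42)  # hypothetical seed; the original is not shown

train_idx <- sample(seq_len(nrow(boston)), size = 1076)
train <- boston[train_idx, ]
test  <- boston[-train_idx, ]
```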

Training Data:

## tibble [1,076 × 14] (S3: tbl_df/tbl/data.frame)
##  $ CRIM  : num [1:1076] 0.0273 0.0273 0.0324 0.0691 0.0299 ...
##  $ ZN    : num [1:1076] 0 0 0 0 0 12.5 12.5 12.5 12.5 12.5 ...
##  $ INDUS : num [1:1076] 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 7.87 ...
##  $ CHAS  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ NOX   : num [1:1076] 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 0.524 ...
##  $ RM    : num [1:1076] 6.42 7.18 7 7.15 6.43 ...
##  $ AGE   : num [1:1076] 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 94.3 ...
##  $ DIS   : num [1:1076] 4.97 4.97 6.06 6.06 6.06 ...
##  $ RAD   : num [1:1076] 2 2 3 3 3 5 5 5 5 5 ...
##  $ TAX   : num [1:1076] 242 242 222 222 222 311 311 311 311 311 ...
##  $ PRATIO: num [1:1076] 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 15.2 ...
##  $ B     : num [1:1076] 397 393 395 397 394 ...
##  $ LSTAT : num [1:1076] 9.14 4.03 2.94 5.33 5.21 ...
##  $ MEDV  : num [1:1076] 1 3 3 3 2 1 2 1 1 1 ...

Testing Data

## tibble [119 × 14] (S3: tbl_df/tbl/data.frame)
##  $ CRIM  : num [1:119] 9.4005 0.4527 0.0944 23.7496 0.0538 ...
##  $ ZN    : num [1:119] 0 16.6 0 0 32 ...
##  $ INDUS : num [1:119] 18.1 4.5 13.22 18.1 2.27 ...
##  $ CHAS  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ NOX   : num [1:119] 0.74 0.617 0.443 0.673 0.47 ...
##  $ RM    : num [1:119] 5.62 7.2 6.16 6.35 7.22 ...
##  $ AGE   : num [1:119] 93.9 78.3 18 96.5 40.8 ...
##  $ DIS   : num [1:119] 1.82 2.6 5.49 1.4 4.11 ...
##  $ RAD   : num [1:119] 24 4 4 24 6 3 3 1 5 14 ...
##  $ TAX   : num [1:119] 666 260 288 666 221 193 402 296 403 539 ...
##  $ PRATIO: num [1:119] 20.2 13.8 16.4 20.2 18.1 ...
##  $ B     : num [1:119] 397 393 397 397 394 ...
##  $ LSTAT : num [1:119] 22.89 8.64 8.2 23.91 6.76 ...
##  $ MEDV  : num [1:119] 0 3 2 0 3 3 3 2 4 4 ...

Choosing the mtry value

We find that the optimal mtry value is 4. The conventional starting point for a classification forest is the square root of the number of predictors; with 13 predictors this is √13 ≈ 3.61 (printed below), which rounds up to 4.

[Figure: mtry tuning]

## [1] 3.605551
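
One way such a search could be run (our exact search code is not shown, so this is a hedged sketch) is to fit a small forest for each candidate mtry and compare out-of-bag error:

```r
library(randomForest)

# Hypothetical mtry search: compare OOB error across candidate values
oob_err <- sapply(2:6, function(m) {
  rf <- randomForest(as.factor(MEDV) ~ ., data = train, ntree = 500, mtry = m)
  rf$err.rate[nrow(rf$err.rate), "OOB"]  # OOB error after the final tree
})
names(oob_err) <- 2:6
oob_err
```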

Model Tuning: Trying different sample sizes and node sizes

RF1 sample size = 500, nodesize = 5
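
For reference, the fitting call for RF1, reconstructed from the Call: line printed below (the object name rf1 is an assumption). The later iterations, RF2 through RF6 and the sparse models, change only sampsize, nodesize, mtry, and the training data.

```r
library(randomForest)

# RF1: 1000 trees, mtry = 4, bootstrap samples of 500 rows, nodesize = 5
rf1 <- randomForest(as.factor(MEDV) ~ ., data = train,
                    ntree = 1000, mtry = 4, replace = TRUE,
                    sampsize = 500, nodesize = 5,
                    importance = TRUE, proximity = FALSE,
                    norm.votes = TRUE, do.trace = TRUE,
                    keep.forest = TRUE, keep.inbag = TRUE)
rf1
```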

## 
## Call:
##  randomForest(formula = as.factor(MEDV) ~ ., data = train, ntree = 1000,      mtry = 4, replace = TRUE, sampsize = 500, nodesize = 5, importance = TRUE,      proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE,      keep.inbag = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 10.69%
## Confusion matrix:
##     0   1   2   3   4 class.error
## 0 200  15   0   0   0 0.069767442
## 1  23 166  25   0   2 0.231481481
## 2   4  23 175  11   0 0.178403756
## 3   0   0   2 208   8 0.045871560
## 4   0   0   1   1 212 0.009345794
## [1] 0.8926789

RF2 sample size = 100, nodesize = 5

## 
## Call:
##  randomForest(formula = as.factor(MEDV) ~ ., data = train, ntree = 1000,      mtry = 4, replace = TRUE, sampsize = 100, nodesize = 5, importance = TRUE,      proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE,      keep.inbag = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 13.48%
## Confusion matrix:
##     0   1   2   3   4 class.error
## 0 201  14   0   0   0 0.065116279
## 1  31 156  28   0   1 0.277777778
## 2   4  32 158  13   6 0.258215962
## 3   0   0   2 203  13 0.068807339
## 4   0   0   0   1 213 0.004672897
## [1] 0.8646995

RF3 sample size = 300, nodesize = 5

## 
## Call:
##  randomForest(formula = as.factor(MEDV) ~ ., data = train, ntree = 1000,      mtry = 4, replace = TRUE, sampsize = 300, nodesize = 5, importance = TRUE,      proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE,      keep.inbag = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 11.43%
## Confusion matrix:
##     0   1   2   3   4 class.error
## 0 198  17   0   0   0 0.079069767
## 1  24 165  25   0   2 0.236111111
## 2   2  27 172  12   0 0.192488263
## 3   0   0   3 205  10 0.059633028
## 4   0   0   0   1 213 0.004672897
## [1] 0.8852172

RF4 sample size = 700, nodesize = 5

## 
## Call:
##  randomForest(formula = as.factor(MEDV) ~ ., data = train, ntree = 1000,      mtry = 4, replace = TRUE, sampsize = 700, nodesize = 5, importance = TRUE,      proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE,      keep.inbag = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 10.13%
## Confusion matrix:
##     0   1   2   3   4 class.error
## 0 200  15   0   0   0 0.069767442
## 1  22 168  24   0   2 0.222222222
## 2   0  26 178   9   0 0.164319249
## 3   0   0   2 209   7 0.041284404
## 4   0   0   1   1 212 0.009345794
## [1] 0.8982757

RF5 sample size = 700, nodesize = 3

## 
## Call:
##  randomForest(formula = as.factor(MEDV) ~ ., data = train, ntree = 1000,      mtry = 4, replace = TRUE, sampsize = 700, nodesize = 3, importance = TRUE,      proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE,      keep.inbag = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 9.94%
## Confusion matrix:
##     0   1   2   3   4 class.error
## 0 200  15   0   0   0 0.069767442
## 1  22 168  24   0   2 0.222222222
## 2   1  24 180   8   0 0.154929577
## 3   0   0   2 209   7 0.041284404
## 4   0   0   1   1 212 0.009345794
## [1] 0.9001414

RF6 Sample size = 700, nodesize = 2

## 
## Call:
##  randomForest(formula = as.factor(MEDV) ~ ., data = train, ntree = 1000,      mtry = 4, replace = TRUE, sampsize = 700, nodesize = 2, importance = TRUE,      proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE,      keep.inbag = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 9.67%
## Confusion matrix:
##     0   1   2   3   4 class.error
## 0 201  14   0   0   0 0.065116279
## 1  21 169  24   0   2 0.217592593
## 2   1  22 181   9   0 0.150234742
## 3   0   0   2 209   7 0.041284404
## 4   0   0   1   1 212 0.009345794
## [1] 0.9029399

It seems the best parameters are mtry = 4, sample size = 700, and nodesize = 2 (RF6), which gives the lowest OOB error of the six iterations (9.67%).

Visualizing Variable Importance in RF6
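
A minimal sketch of how the importance figure could be generated with the randomForest package, assuming the RF6 fit is stored in an object named rf6:

```r
# Mean decrease in accuracy and in Gini impurity for each predictor in RF6
varImpPlot(rf6, main = "Variable Importance (RF6)")
importance(rf6)
```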

Sparse Model Tuning

Creating a new model with a sparse dataset containing only RM, LSTAT, CRIM, and NOX (plus the target, MEDV)
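
A minimal sketch of the column selection (the object names train and sparse_train are assumptions); its structure is printed below.

```r
library(dplyr)

# Keep only the four most important predictors plus the target
sparse_train <- train %>% select(RM, LSTAT, CRIM, NOX, MEDV)
str(sparse_train)
```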

## tibble [1,076 × 5] (S3: tbl_df/tbl/data.frame)
##  $ RM   : num [1:1076] 6.42 7.18 7 7.15 6.43 ...
##  $ LSTAT: num [1:1076] 9.14 4.03 2.94 5.33 5.21 ...
##  $ CRIM : num [1:1076] 0.0273 0.0273 0.0324 0.0691 0.0299 ...
##  $ NOX  : num [1:1076] 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 0.524 ...
##  $ MEDV : num [1:1076] 1 3 3 3 2 1 2 1 1 1 ...

RFS1 Sample size = 700, nodesize = 3

## 
## Call:
##  randomForest(formula = as.factor(MEDV) ~ ., data = sparse_train,      ntree = 1000, mtry = 2, replace = TRUE, sampsize = 700, nodesize = 3,      importance = TRUE, proximity = FALSE, norm.votes = TRUE,      do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 12.36%
## Confusion matrix:
##     0   1   2   3   4 class.error
## 0 198  17   0   0   0 0.079069767
## 1  25 163  27   0   1 0.245370370
## 2   2  30 168  13   0 0.211267606
## 3   0   0   7 202   9 0.073394495
## 4   0   0   1   1 212 0.009345794
## [1] 0.8758906

RFS2 Sample size = 500, nodesize = 3

## 
## Call:
##  randomForest(formula = as.factor(MEDV) ~ ., data = sparse_train,      ntree = 1000, mtry = 2, replace = TRUE, sampsize = 500, nodesize = 3,      importance = TRUE, proximity = FALSE, norm.votes = TRUE,      do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 13.29%
## Confusion matrix:
##     0   1   2   3   4 class.error
## 0 194  21   0   0   0 0.097674419
## 1  29 160  26   0   1 0.259259259
## 2   3  29 169  12   0 0.206572770
## 3   0   0  10 198  10 0.091743119
## 4   0   0   1   1 212 0.009345794
## [1] 0.8665651

RFS3 Sample size = 700, nodesize = 5

## 
## Call:
##  randomForest(formula = as.factor(MEDV) ~ ., data = sparse_train,      ntree = 1000, mtry = 2, replace = TRUE, sampsize = 700, nodesize = 5,      importance = TRUE, proximity = FALSE, norm.votes = TRUE,      do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 12.83%
## Confusion matrix:
##     0   1   2   3   4 class.error
## 0 194  21   0   0   0 0.097674419
## 1  27 163  25   0   1 0.245370370
## 2   2  31 168  12   0 0.211267606
## 3   0   0   7 201  10 0.077981651
## 4   0   0   1   1 212 0.009345794
## [1] 0.8712277

RFS4 Sample size = 500, nodesize = 2

## 
## Call:
##  randomForest(formula = as.factor(MEDV) ~ ., data = sparse_train,      ntree = 1000, mtry = 2, replace = TRUE, sampsize = 500, nodesize = 2,      importance = TRUE, proximity = FALSE, norm.votes = TRUE,      do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 13.48%
## Confusion matrix:
##     0   1   2   3   4 class.error
## 0 196  19   0   0   0 0.088372093
## 1  30 158  27   0   1 0.268518519
## 2   2  32 165  13   1 0.225352113
## 3   0   0   9 199  10 0.087155963
## 4   0   0   1   0 213 0.004672897
## [1] 0.8646999

Overall, it appears using a sparse dataset does not improve the model.

RF6 remains the most accurate model.

Model evaluation

Using the model to make predictions on the test dataset
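
The confusion matrix and per-class statistics below are in the format produced by caret's confusionMatrix() with mode = "everything"; a hedged sketch of the evaluation step, assuming the final model object is named rf6:

```r
library(caret)

# Predict the held-out test set with RF6 and summarize the results
pred <- predict(rf6, newdata = test)
confusionMatrix(data = pred,
                reference = as.factor(test$MEDV),
                mode = "everything",               # adds precision/recall/F1 per class
                dnn = c("Prediction", "Actual"))
```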

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1  2  3  4
##          0 22  3  0  0  0
##          1  2 16  3  0  0
##          2  0  4 22  0  0
##          3  0  0  1 21  0
##          4  0  0  0  0 25
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8908          
##                  95% CI : (0.8204, 0.9405)
##     No Information Rate : 0.2185          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8633          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            0.9167   0.6957   0.8462   1.0000   1.0000
## Specificity            0.9684   0.9479   0.9570   0.9898   1.0000
## Pos Pred Value         0.8800   0.7619   0.8462   0.9545   1.0000
## Neg Pred Value         0.9787   0.9286   0.9570   1.0000   1.0000
## Precision              0.8800   0.7619   0.8462   0.9545   1.0000
## Recall                 0.9167   0.6957   0.8462   1.0000   1.0000
## F1                     0.8980   0.7273   0.8462   0.9767   1.0000
## Prevalence             0.2017   0.1933   0.2185   0.1765   0.2101
## Detection Rate         0.1849   0.1345   0.1849   0.1765   0.2101
## Detection Prevalence   0.2101   0.1765   0.2185   0.1849   0.2101
## Balanced Accuracy      0.9425   0.8218   0.9016   0.9949   1.0000

As noted above, we determined that RF6 was our optimal model, using overall accuracy as the final determining factor. The model reached an accuracy of 90.29% on the training data (reported above) and 89.08% on the held-out test data.

Conclusion

      We were able to use a random forest ensemble to predict neighborhood housing-price classes with 89% accuracy. The fact that the sparse dataset made the models perform worse indicates that a number of factors jointly influence median home value. The utility of our result lies in the ability to use measurable factors of a neighborhood to predict the value of property within that neighborhood. An investor could treat neighborhoods predicted to be in a higher class than their actual class as potentially undervalued, and therefore sound real-estate investments; conversely, neighborhoods whose median price is predicted to be lower than their actual price should likely be avoided. To take this project a step further, it would be interesting to look within neighborhoods at individual properties rather than just median values, accounting for within-neighborhood variation that is a limitation of our current model.

References

[1] World Population Review, “The 200 Largest Cities in the United States by Population 2020,” worldpopulationreview.com, 2021. https://worldpopulationreview.com/us-cities

[2] Boston Globe Staff, “Here’s where Boston ranks among the nation’s top places to live, according to US News & World Report,” BostonGlobe.com, May 19, 2022. https://www.bostonglobe.com/2022/05/19/lifestyle/heres-where-boston-ranks-among-top-25-places-live-according-us-news-world-report/#:~:text=US%20News%20%26%20World%20Report%20released (accessed Apr. 26, 2023).

[3] WBZ-News Staff, “Massachusetts once again ranked among ‘most moved from’ states,” www.cbsnews.com, Jan. 02, 2023. https://www.cbsnews.com/boston/news/massachusetts-most-moved-from-states-movers/#:~:text=BOSTON%20%2D%20An%20annual%20study%20from (accessed Apr. 26, 2023).

[4] “Cost of Living in Boston, MA | PayScale,” www.payscale.com. https://www.payscale.com/cost-of-living-calculator/Massachusetts-Boston

[5] C. DeVon, “Boston is now America’s second most expensive city for renters,” CNBC, Nov. 23, 2022. https://www.cnbc.com/2022/11/10/boston-beats-san-francisco-as-2nd-most-expensive-city-for-renters.html#:~:text=Move%20over%20San%20Francisco%20%E2%80%94%20Boston