Can housing prices be predicted by location factors in Boston, MA?
Dataset Structure:
## spc_tbl_ [1,195 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ CRIM : num [1:1195] 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ ZN : num [1:1195] 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ INDUS : num [1:1195] 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ CHAS : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ NOX : num [1:1195] 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ RM : num [1:1195] 6.58 6.42 7.18 7 7.15 ...
## $ AGE : num [1:1195] 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ DIS : num [1:1195] 4.09 4.97 4.97 6.06 6.06 ...
## $ RAD : num [1:1195] 1 2 2 3 3 3 5 5 5 5 ...
## $ TAX : num [1:1195] 296 242 242 222 222 222 311 311 311 311 ...
## $ PRATIO: num [1:1195] 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ B : num [1:1195] 397 397 393 395 397 ...
## $ LSTAT : num [1:1195] 4.98 9.14 4.03 2.94 5.33 ...
## $ MEDV : num [1:1195] 2 1 3 3 3 2 1 2 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. CRIM = col_double(),
## .. ZN = col_double(),
## .. INDUS = col_double(),
## .. CHAS = col_double(),
## .. NOX = col_double(),
## .. RM = col_double(),
## .. AGE = col_double(),
## .. DIS = col_double(),
## .. RAD = col_double(),
## .. TAX = col_double(),
## .. PRATIO = col_double(),
## .. B = col_double(),
## .. LSTAT = col_double(),
## .. MEDV = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
Upon first glance, the most important variables seem to be:

- CRIM: Per Capita Crime Rate Per Town
- DIS: Weighted Distance to Boston Employment Centers
- LSTAT: Percentage of Low Income Residents Per Town
- RM: Average Number of Rooms Per Residence
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00632 0.06876 0.31823 4.08227 4.26623 88.97620
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.130 2.000 3.181 3.751 5.221 12.127
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.73 4.79 7.67 10.88 16.23 37.97
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.561 6.017 6.532 6.629 7.173 8.780
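The four five-number summaries above correspond, in order, to CRIM, DIS, LSTAT, and RM. Assuming the full tibble is named `boston`, they can be reproduced with:

```r
# Five-number summaries for the variables of interest, in the order shown above
summary(boston$CRIM)
summary(boston$DIS)
summary(boston$LSTAT)
summary(boston$RM)
```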
Training Data:
## tibble [1,076 × 14] (S3: tbl_df/tbl/data.frame)
## $ CRIM : num [1:1076] 0.0273 0.0273 0.0324 0.0691 0.0299 ...
## $ ZN : num [1:1076] 0 0 0 0 0 12.5 12.5 12.5 12.5 12.5 ...
## $ INDUS : num [1:1076] 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 7.87 ...
## $ CHAS : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ NOX : num [1:1076] 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 0.524 ...
## $ RM : num [1:1076] 6.42 7.18 7 7.15 6.43 ...
## $ AGE : num [1:1076] 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 94.3 ...
## $ DIS : num [1:1076] 4.97 4.97 6.06 6.06 6.06 ...
## $ RAD : num [1:1076] 2 2 3 3 3 5 5 5 5 5 ...
## $ TAX : num [1:1076] 242 242 222 222 222 311 311 311 311 311 ...
## $ PRATIO: num [1:1076] 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 15.2 ...
## $ B : num [1:1076] 397 393 395 397 394 ...
## $ LSTAT : num [1:1076] 9.14 4.03 2.94 5.33 5.21 ...
## $ MEDV : num [1:1076] 1 3 3 3 2 1 2 1 1 1 ...
Testing Data:
## tibble [119 × 14] (S3: tbl_df/tbl/data.frame)
## $ CRIM : num [1:119] 9.4005 0.4527 0.0944 23.7496 0.0538 ...
## $ ZN : num [1:119] 0 16.6 0 0 32 ...
## $ INDUS : num [1:119] 18.1 4.5 13.22 18.1 2.27 ...
## $ CHAS : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ NOX : num [1:119] 0.74 0.617 0.443 0.673 0.47 ...
## $ RM : num [1:119] 5.62 7.2 6.16 6.35 7.22 ...
## $ AGE : num [1:119] 93.9 78.3 18 96.5 40.8 ...
## $ DIS : num [1:119] 1.82 2.6 5.49 1.4 4.11 ...
## $ RAD : num [1:119] 24 4 4 24 6 3 3 1 5 14 ...
## $ TAX : num [1:119] 666 260 288 666 221 193 402 296 403 539 ...
## $ PRATIO: num [1:119] 20.2 13.8 16.4 20.2 18.1 ...
## $ B : num [1:119] 397 393 397 397 394 ...
## $ LSTAT : num [1:119] 22.89 8.64 8.2 23.91 6.76 ...
## $ MEDV : num [1:119] 0 3 2 0 3 3 3 2 4 4 ...
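The 1,076/119 row counts are consistent with holding out roughly 10% of the 1,195 observations for testing. A sketch of such a split (the seed, and whether rows were sampled at random, are assumptions):

```r
set.seed(123)  # assumed; the seed used in the report is not shown

# Roughly 90/10 split: 1,076 training rows and 119 test rows out of 1,195
train_idx <- sample(nrow(boston), size = round(0.9 * nrow(boston)))
train <- boston[train_idx, ]
test  <- boston[-train_idx, ]
```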
We find that the optimal mtry value is 4: the square root of the number of predictors is √13 ≈ 3.61 (printed below), which we round up to 4.
## [1] 3.605551
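The value printed above is the square root of the number of predictors, the usual starting point for mtry in classification forests; rounding up gives 4. A sketch:

```r
# Square-root-of-p heuristic for classification random forests
p <- ncol(train) - 1         # 13 predictors (everything except MEDV)
sqrt(p)                      # 3.605551
mtry_val <- ceiling(sqrt(p)) # 4
```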
RF1 sample size = 500, nodesize = 5
##
## Call:
## randomForest(formula = as.factor(MEDV) ~ ., data = train, ntree = 1000, mtry = 4, replace = TRUE, sampsize = 500, nodesize = 5, importance = TRUE, proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 10.69%
## Confusion matrix:
## 0 1 2 3 4 class.error
## 0 200 15 0 0 0 0.069767442
## 1 23 166 25 0 2 0.231481481
## 2 4 23 175 11 0 0.178403756
## 3 0 0 2 208 8 0.045871560
## 4 0 0 1 1 212 0.009345794
## [1] 0.8926789
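The single number printed after each confusion matrix is an overall accuracy figure for that forest. The report does not show how it was computed; one plausible approach, assuming the fitted object is named `rf1`, is the share of training rows whose out-of-bag prediction is correct:

```r
library(randomForest)

# RF1, matching the call printed above (tracing and in-bag bookkeeping omitted)
rf1 <- randomForest(as.factor(MEDV) ~ ., data = train,
                    ntree = 1000, mtry = 4, replace = TRUE,
                    sampsize = 500, nodesize = 5, importance = TRUE)

# Out-of-bag accuracy: fraction of training rows whose OOB prediction matches
mean(rf1$predicted == as.factor(train$MEDV))
```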
RF2 sample size = 100, nodesize = 5
##
## Call:
## randomForest(formula = as.factor(MEDV) ~ ., data = train, ntree = 1000, mtry = 4, replace = TRUE, sampsize = 100, nodesize = 5, importance = TRUE, proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 13.48%
## Confusion matrix:
## 0 1 2 3 4 class.error
## 0 201 14 0 0 0 0.065116279
## 1 31 156 28 0 1 0.277777778
## 2 4 32 158 13 6 0.258215962
## 3 0 0 2 203 13 0.068807339
## 4 0 0 0 1 213 0.004672897
## [1] 0.8646995
RF3 sample size = 300, nodesize = 5
##
## Call:
## randomForest(formula = as.factor(MEDV) ~ ., data = train, ntree = 1000, mtry = 4, replace = TRUE, sampsize = 300, nodesize = 5, importance = TRUE, proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 11.43%
## Confusion matrix:
## 0 1 2 3 4 class.error
## 0 198 17 0 0 0 0.079069767
## 1 24 165 25 0 2 0.236111111
## 2 2 27 172 12 0 0.192488263
## 3 0 0 3 205 10 0.059633028
## 4 0 0 0 1 213 0.004672897
## [1] 0.8852172
RF4 sample size = 700, nodesize = 5
##
## Call:
## randomForest(formula = as.factor(MEDV) ~ ., data = train, ntree = 1000, mtry = 4, replace = TRUE, sampsize = 700, nodesize = 5, importance = TRUE, proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 10.13%
## Confusion matrix:
## 0 1 2 3 4 class.error
## 0 200 15 0 0 0 0.069767442
## 1 22 168 24 0 2 0.222222222
## 2 0 26 178 9 0 0.164319249
## 3 0 0 2 209 7 0.041284404
## 4 0 0 1 1 212 0.009345794
## [1] 0.8982757
RF5 sample size = 700, nodesize = 3
##
## Call:
## randomForest(formula = as.factor(MEDV) ~ ., data = train, ntree = 1000, mtry = 4, replace = TRUE, sampsize = 700, nodesize = 3, importance = TRUE, proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 9.94%
## Confusion matrix:
## 0 1 2 3 4 class.error
## 0 200 15 0 0 0 0.069767442
## 1 22 168 24 0 2 0.222222222
## 2 1 24 180 8 0 0.154929577
## 3 0 0 2 209 7 0.041284404
## 4 0 0 1 1 212 0.009345794
## [1] 0.9001414
RF6 sample size = 700, nodesize = 2
##
## Call:
## randomForest(formula = as.factor(MEDV) ~ ., data = train, ntree = 1000, mtry = 4, replace = TRUE, sampsize = 700, nodesize = 2, importance = TRUE, proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 9.67%
## Confusion matrix:
## 0 1 2 3 4 class.error
## 0 201 14 0 0 0 0.065116279
## 1 21 169 24 0 2 0.217592593
## 2 1 22 181 9 0 0.150234742
## 3 0 0 2 209 7 0.041284404
## 4 0 0 1 1 212 0.009345794
## [1] 0.9029399
The best parameters appear to be mtry = 4, sample size = 700, and nodesize = 2, which corresponds to RF6.
Visualizing Variable Importance in RF6
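Assuming the fitted RF6 object is named `rf6`, a variable-importance plot can be drawn with randomForest's built-in helpers:

```r
# Mean decrease in accuracy and mean decrease in Gini for each predictor;
# requires the forest to have been fit with importance = TRUE.
varImpPlot(rf6, main = "Variable Importance - RF6")
importance(rf6)  # the underlying importance scores as a matrix
```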
Creating a new model from a sparse dataset containing only RM, LSTAT, CRIM, and NOX
## tibble [1,076 × 5] (S3: tbl_df/tbl/data.frame)
## $ RM : num [1:1076] 6.42 7.18 7 7.15 6.43 ...
## $ LSTAT: num [1:1076] 9.14 4.03 2.94 5.33 5.21 ...
## $ CRIM : num [1:1076] 0.0273 0.0273 0.0324 0.0691 0.0299 ...
## $ NOX : num [1:1076] 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 0.524 ...
## $ MEDV : num [1:1076] 1 3 3 3 2 1 2 1 1 1 ...
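A sketch of how the reduced training (and matching test) set could be built, assuming dplyr and the `train`/`test` tibbles from above:

```r
library(dplyr)

# Keep only the four strongest predictors plus the response
sparse_train <- train %>% select(RM, LSTAT, CRIM, NOX, MEDV)
sparse_test  <- test  %>% select(RM, LSTAT, CRIM, NOX, MEDV)
str(sparse_train)
```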
RFS1 sample size = 700, nodesize = 3
##
## Call:
## randomForest(formula = as.factor(MEDV) ~ ., data = sparse_train, ntree = 1000, mtry = 2, replace = TRUE, sampsize = 700, nodesize = 3, importance = TRUE, proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 12.36%
## Confusion matrix:
## 0 1 2 3 4 class.error
## 0 198 17 0 0 0 0.079069767
## 1 25 163 27 0 1 0.245370370
## 2 2 30 168 13 0 0.211267606
## 3 0 0 7 202 9 0.073394495
## 4 0 0 1 1 212 0.009345794
## [1] 0.8758906
RFS2 sample size = 500, nodesize = 3
##
## Call:
## randomForest(formula = as.factor(MEDV) ~ ., data = sparse_train, ntree = 1000, mtry = 2, replace = TRUE, sampsize = 500, nodesize = 3, importance = TRUE, proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 13.29%
## Confusion matrix:
## 0 1 2 3 4 class.error
## 0 194 21 0 0 0 0.097674419
## 1 29 160 26 0 1 0.259259259
## 2 3 29 169 12 0 0.206572770
## 3 0 0 10 198 10 0.091743119
## 4 0 0 1 1 212 0.009345794
## [1] 0.8665651
RFS3 sample size = 700, nodesize = 5
##
## Call:
## randomForest(formula = as.factor(MEDV) ~ ., data = sparse_train, ntree = 1000, mtry = 2, replace = TRUE, sampsize = 700, nodesize = 5, importance = TRUE, proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 12.83%
## Confusion matrix:
## 0 1 2 3 4 class.error
## 0 194 21 0 0 0 0.097674419
## 1 27 163 25 0 1 0.245370370
## 2 2 31 168 12 0 0.211267606
## 3 0 0 7 201 10 0.077981651
## 4 0 0 1 1 212 0.009345794
## [1] 0.8712277
RFS4 sample size = 500, nodesize = 2
##
## Call:
## randomForest(formula = as.factor(MEDV) ~ ., data = sparse_train, ntree = 1000, mtry = 2, replace = TRUE, sampsize = 500, nodesize = 2, importance = TRUE, proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 13.48%
## Confusion matrix:
## 0 1 2 3 4 class.error
## 0 196 19 0 0 0 0.088372093
## 1 30 158 27 0 1 0.268518519
## 2 2 32 165 13 1 0.225352113
## 3 0 0 9 199 10 0.087155963
## 4 0 0 1 0 213 0.004672897
## [1] 0.8646999
Overall, using a sparse dataset does not appear to improve performance; RF6 remains the most accurate model.
Using RF6 to make predictions on the test dataset.
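The output below is consistent with caret's confusionMatrix() applied to RF6's class predictions on the held-out rows; a minimal sketch, with object names assumed:

```r
library(caret)

# Predict price classes for the held-out rows with the chosen model (RF6)
test_pred <- predict(rf6, newdata = test)

# Cross-tabulate predictions against the true classes; mode = "everything"
# also reports precision, recall, and F1 as in the output below.
confusionMatrix(table(Prediction = test_pred,
                      Actual = as.factor(test$MEDV)),
                mode = "everything")
```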
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1 2 3 4
## 0 22 3 0 0 0
## 1 2 16 3 0 0
## 2 0 4 22 0 0
## 3 0 0 1 21 0
## 4 0 0 0 0 25
##
## Overall Statistics
##
## Accuracy : 0.8908
## 95% CI : (0.8204, 0.9405)
## No Information Rate : 0.2185
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8633
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity 0.9167 0.6957 0.8462 1.0000 1.0000
## Specificity 0.9684 0.9479 0.9570 0.9898 1.0000
## Pos Pred Value 0.8800 0.7619 0.8462 0.9545 1.0000
## Neg Pred Value 0.9787 0.9286 0.9570 1.0000 1.0000
## Precision 0.8800 0.7619 0.8462 0.9545 1.0000
## Recall 0.9167 0.6957 0.8462 1.0000 1.0000
## F1 0.8980 0.7273 0.8462 0.9767 1.0000
## Prevalence 0.2017 0.1933 0.2185 0.1765 0.2101
## Detection Rate 0.1849 0.1345 0.1849 0.1765 0.2101
## Detection Prevalence 0.2101 0.1765 0.2185 0.1849 0.2101
## Balanced Accuracy 0.9425 0.8218 0.9016 0.9949 1.0000
As discussed above, we determined that RF6 was our optimal model, using overall accuracy as the final criterion of model optimality. The model achieved an out-of-bag accuracy of 90.29% on the training data and an accuracy of 89.08% on the test data.