Housing data for 506 census tracts of Boston from the 1970 census. The dataframe BostonHousing contains the original data by Harrison and Rubinfeld (1979), the dataframe BostonHousing2 the corrected version with additional spatial information (see references below).
The original data are 506 observations on 14 variables, medv being the target variable:
crim per capita crime rate by town
zn proportion of residential land zoned for lots over 25,000 sq. ft.
indus proportion of non-retail business acres per town
chas Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox nitric oxides concentration (parts per 10 million)
rm average number of rooms per dwelling
age proportion of owner-occupied units built prior to 1940
dis weighted distances to five Boston employment centres
rad index of accessibility to radial highways
tax full-value property-tax rate per USD 10,000
ptratio pupil-teacher ratio by town
b 1000(B - 0.63)^2 where B is the proportion of blacks by town
lstat percentage of lower status of the population
medv median value of owner-occupied homes in USD 1000’s
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(mlbench)
Boston Housing Data - BostonHousing
data("BostonHousing")
glimpse(BostonHousing)
## Rows: 506
## Columns: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad <dbl> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ b <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…
if (all(complete.cases(BostonHousing))) "data complete" else "data needs cleaning"
## [1] "data complete"
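Had the check reported incomplete data, a per-column count of missing values would show where cleaning is needed. The following is a minimal base-R sketch on a small made-up data frame; the name `toy` and its values are illustrative assumptions, not part of the Boston data.

```r
# Hypothetical toy data with two missing values (not the Boston data)
toy <- data.frame(
  rm   = c(6.5, NA, 7.1, 6.0),
  medv = c(24.0, 21.6, NA, 33.4)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Share of rows with no missing values at all
complete_share <- mean(complete.cases(toy))
complete_share
```

A share below 1 means at least one row has a missing value, which is the condition the check above tests for.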
set.seed(55)
id <- createDataPartition(y = BostonHousing$medv, p = 0.8, list = FALSE)
train_data <- BostonHousing[id, ]
test_data <- BostonHousing[-id, ]
nrow(train_data)
## [1] 407
nrow(test_data)
## [1] 99
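For a numeric outcome, createDataPartition() is documented as sampling within groups defined by percentiles of y, so that the training and test sets cover a similar range of values. The sketch below imitates that idea in base R on simulated data. The variable names, the choice of five groups, and the simulated outcome are assumptions made for illustration; this is not caret's actual implementation.

```r
set.seed(55)

# Hypothetical numeric outcome standing in for medv (simulated values)
y <- rnorm(200, mean = 22, sd = 9)

# Bin the outcome into 5 groups at its quantiles
breaks <- quantile(y, probs = seq(0, 1, length.out = 6))
groups <- cut(y, breaks = breaks, include.lowest = TRUE)

# Draw 80% of the row indices within each group
rows_by_group <- split(seq_along(y), groups)
train_id <- unlist(lapply(rows_by_group, function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.8 * length(idx)))]
}), use.names = FALSE)

train_y <- y[train_id]
test_y  <- y[-train_id]
c(train = length(train_y), test = length(test_y))
```

Because the 80% is taken within each group and rounded up, the training share can come out slightly above 0.8, as it does in the 407/99 split above.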
Linear Regression
K-Nearest Neighbors (knn)
Random Forest
set.seed(55)
control <- trainControl(method = "repeatedcv",
                        repeats = 5,
                        number = 5,
                        verboseIter = TRUE)
lm_model <- train(medv ~ .,
                  data = train_data,
                  method = "lm",
                  trControl = control)
## + Fold1.Rep1: intercept=TRUE
## - Fold1.Rep1: intercept=TRUE
## + Fold2.Rep1: intercept=TRUE
## - Fold2.Rep1: intercept=TRUE
## + Fold3.Rep1: intercept=TRUE
## - Fold3.Rep1: intercept=TRUE
## + Fold4.Rep1: intercept=TRUE
## - Fold4.Rep1: intercept=TRUE
## + Fold5.Rep1: intercept=TRUE
## - Fold5.Rep1: intercept=TRUE
## + Fold1.Rep2: intercept=TRUE
## - Fold1.Rep2: intercept=TRUE
## + Fold2.Rep2: intercept=TRUE
## - Fold2.Rep2: intercept=TRUE
## + Fold3.Rep2: intercept=TRUE
## - Fold3.Rep2: intercept=TRUE
## + Fold4.Rep2: intercept=TRUE
## - Fold4.Rep2: intercept=TRUE
## + Fold5.Rep2: intercept=TRUE
## - Fold5.Rep2: intercept=TRUE
## + Fold1.Rep3: intercept=TRUE
## - Fold1.Rep3: intercept=TRUE
## + Fold2.Rep3: intercept=TRUE
## - Fold2.Rep3: intercept=TRUE
## + Fold3.Rep3: intercept=TRUE
## - Fold3.Rep3: intercept=TRUE
## + Fold4.Rep3: intercept=TRUE
## - Fold4.Rep3: intercept=TRUE
## + Fold5.Rep3: intercept=TRUE
## - Fold5.Rep3: intercept=TRUE
## + Fold1.Rep4: intercept=TRUE
## - Fold1.Rep4: intercept=TRUE
## + Fold2.Rep4: intercept=TRUE
## - Fold2.Rep4: intercept=TRUE
## + Fold3.Rep4: intercept=TRUE
## - Fold3.Rep4: intercept=TRUE
## + Fold4.Rep4: intercept=TRUE
## - Fold4.Rep4: intercept=TRUE
## + Fold5.Rep4: intercept=TRUE
## - Fold5.Rep4: intercept=TRUE
## + Fold1.Rep5: intercept=TRUE
## - Fold1.Rep5: intercept=TRUE
## + Fold2.Rep5: intercept=TRUE
## - Fold2.Rep5: intercept=TRUE
## + Fold3.Rep5: intercept=TRUE
## - Fold3.Rep5: intercept=TRUE
## + Fold4.Rep5: intercept=TRUE
## - Fold4.Rep5: intercept=TRUE
## + Fold5.Rep5: intercept=TRUE
## - Fold5.Rep5: intercept=TRUE
## Aggregating results
## Fitting final model on full training set
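The log above shows 5 folds repeated 5 times, that is, 25 fits on resampled training data before the final fit. The following base-R sketch shows roughly what such a repeated cross-validation loop does. It uses the built-in mtcars data rather than the Boston data, assigns folds purely at random, and all variable names are illustrative, so it should be read as an approximation of the idea rather than caret's internals.

```r
set.seed(55)

# Built-in example data; mpg stands in for the outcome (not the Boston data)
dat <- mtcars[, c("mpg", "wt", "hp")]

n_folds   <- 5
n_repeats <- 5

rmse_per_resample <- c()
for (r in seq_len(n_repeats)) {
  # Randomly assign each row to one of the folds
  fold <- sample(rep(seq_len(n_folds), length.out = nrow(dat)))
  for (k in seq_len(n_folds)) {
    # Fit on all folds except k, then predict the held-out fold k
    fit  <- lm(mpg ~ ., data = dat[fold != k, ])
    pred <- predict(fit, newdata = dat[fold == k, ])
    err  <- dat$mpg[fold == k] - pred
    rmse_per_resample <- c(rmse_per_resample, sqrt(mean(err^2)))
  }
}

# 25 held-out RMSE values, averaged into one estimate
length(rmse_per_resample)
mean(rmse_per_resample)
```

Averaging over repeats reduces the dependence of the estimate on any single random assignment of rows to folds.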
set.seed(55)
control <- trainControl(method = "repeatedcv",
                        repeats = 5,
                        number = 5,
                        verboseIter = TRUE)
knn_model <- train(medv ~ .,
                   data = train_data,
                   method = "knn",
                   trControl = control)
## + Fold1.Rep1: k=5
## - Fold1.Rep1: k=5
## + Fold1.Rep1: k=7
## - Fold1.Rep1: k=7
## + Fold1.Rep1: k=9
## - Fold1.Rep1: k=9
## + Fold2.Rep1: k=5
## - Fold2.Rep1: k=5
## + Fold2.Rep1: k=7
## - Fold2.Rep1: k=7
## + Fold2.Rep1: k=9
## - Fold2.Rep1: k=9
## + Fold3.Rep1: k=5
## - Fold3.Rep1: k=5
## + Fold3.Rep1: k=7
## - Fold3.Rep1: k=7
## + Fold3.Rep1: k=9
## - Fold3.Rep1: k=9
## + Fold4.Rep1: k=5
## - Fold4.Rep1: k=5
## + Fold4.Rep1: k=7
## - Fold4.Rep1: k=7
## + Fold4.Rep1: k=9
## - Fold4.Rep1: k=9
## + Fold5.Rep1: k=5
## - Fold5.Rep1: k=5
## + Fold5.Rep1: k=7
## - Fold5.Rep1: k=7
## + Fold5.Rep1: k=9
## - Fold5.Rep1: k=9
## + Fold1.Rep2: k=5
## - Fold1.Rep2: k=5
## + Fold1.Rep2: k=7
## - Fold1.Rep2: k=7
## + Fold1.Rep2: k=9
## - Fold1.Rep2: k=9
## + Fold2.Rep2: k=5
## - Fold2.Rep2: k=5
## + Fold2.Rep2: k=7
## - Fold2.Rep2: k=7
## + Fold2.Rep2: k=9
## - Fold2.Rep2: k=9
## + Fold3.Rep2: k=5
## - Fold3.Rep2: k=5
## + Fold3.Rep2: k=7
## - Fold3.Rep2: k=7
## + Fold3.Rep2: k=9
## - Fold3.Rep2: k=9
## + Fold4.Rep2: k=5
## - Fold4.Rep2: k=5
## + Fold4.Rep2: k=7
## - Fold4.Rep2: k=7
## + Fold4.Rep2: k=9
## - Fold4.Rep2: k=9
## + Fold5.Rep2: k=5
## - Fold5.Rep2: k=5
## + Fold5.Rep2: k=7
## - Fold5.Rep2: k=7
## + Fold5.Rep2: k=9
## - Fold5.Rep2: k=9
## + Fold1.Rep3: k=5
## - Fold1.Rep3: k=5
## + Fold1.Rep3: k=7
## - Fold1.Rep3: k=7
## + Fold1.Rep3: k=9
## - Fold1.Rep3: k=9
## + Fold2.Rep3: k=5
## - Fold2.Rep3: k=5
## + Fold2.Rep3: k=7
## - Fold2.Rep3: k=7
## + Fold2.Rep3: k=9
## - Fold2.Rep3: k=9
## + Fold3.Rep3: k=5
## - Fold3.Rep3: k=5
## + Fold3.Rep3: k=7
## - Fold3.Rep3: k=7
## + Fold3.Rep3: k=9
## - Fold3.Rep3: k=9
## + Fold4.Rep3: k=5
## - Fold4.Rep3: k=5
## + Fold4.Rep3: k=7
## - Fold4.Rep3: k=7
## + Fold4.Rep3: k=9
## - Fold4.Rep3: k=9
## + Fold5.Rep3: k=5
## - Fold5.Rep3: k=5
## + Fold5.Rep3: k=7
## - Fold5.Rep3: k=7
## + Fold5.Rep3: k=9
## - Fold5.Rep3: k=9
## + Fold1.Rep4: k=5
## - Fold1.Rep4: k=5
## + Fold1.Rep4: k=7
## - Fold1.Rep4: k=7
## + Fold1.Rep4: k=9
## - Fold1.Rep4: k=9
## + Fold2.Rep4: k=5
## - Fold2.Rep4: k=5
## + Fold2.Rep4: k=7
## - Fold2.Rep4: k=7
## + Fold2.Rep4: k=9
## - Fold2.Rep4: k=9
## + Fold3.Rep4: k=5
## - Fold3.Rep4: k=5
## + Fold3.Rep4: k=7
## - Fold3.Rep4: k=7
## + Fold3.Rep4: k=9
## - Fold3.Rep4: k=9
## + Fold4.Rep4: k=5
## - Fold4.Rep4: k=5
## + Fold4.Rep4: k=7
## - Fold4.Rep4: k=7
## + Fold4.Rep4: k=9
## - Fold4.Rep4: k=9
## + Fold5.Rep4: k=5
## - Fold5.Rep4: k=5
## + Fold5.Rep4: k=7
## - Fold5.Rep4: k=7
## + Fold5.Rep4: k=9
## - Fold5.Rep4: k=9
## + Fold1.Rep5: k=5
## - Fold1.Rep5: k=5
## + Fold1.Rep5: k=7
## - Fold1.Rep5: k=7
## + Fold1.Rep5: k=9
## - Fold1.Rep5: k=9
## + Fold2.Rep5: k=5
## - Fold2.Rep5: k=5
## + Fold2.Rep5: k=7
## - Fold2.Rep5: k=7
## + Fold2.Rep5: k=9
## - Fold2.Rep5: k=9
## + Fold3.Rep5: k=5
## - Fold3.Rep5: k=5
## + Fold3.Rep5: k=7
## - Fold3.Rep5: k=7
## + Fold3.Rep5: k=9
## - Fold3.Rep5: k=9
## + Fold4.Rep5: k=5
## - Fold4.Rep5: k=5
## + Fold4.Rep5: k=7
## - Fold4.Rep5: k=7
## + Fold4.Rep5: k=9
## - Fold4.Rep5: k=9
## + Fold5.Rep5: k=5
## - Fold5.Rep5: k=5
## + Fold5.Rep5: k=7
## - Fold5.Rep5: k=7
## + Fold5.Rep5: k=9
## - Fold5.Rep5: k=9
## Aggregating results
## Selecting tuning parameters
## Fitting k = 5 on full training set
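The train() call for knn passes no preProcess argument, so distances are computed on the raw predictor scales, where a variable such as tax (values in the hundreds) can outweigh one such as nox (values below 1). Whether this affects the knn result here is not established by the output. The sketch below only illustrates, on made-up data, why scale can matter for nearest-neighbour methods. The helper `knn_predict` and the toy variables are hypothetical and are not caret's code.

```r
# k-nearest-neighbour regression: average the outcome of the k closest
# training rows (Euclidean distance). Hypothetical helper, not caret's code.
knn_predict <- function(train_x, train_y, new_x, k = 5) {
  train_x <- as.matrix(train_x)
  new_x   <- as.matrix(new_x)
  apply(new_x, 1, function(row) {
    d <- sqrt(colSums((t(train_x) - row)^2))
    mean(train_y[order(d)[seq_len(k)]])
  })
}

# Toy data: x1 drives y, x2 is noise on a much larger scale
set.seed(55)
x <- data.frame(x1 = runif(100), x2 = runif(100, 0, 1000))
y <- 10 * x$x1

# Predictions from raw predictors
raw_pred <- knn_predict(x[1:80, ], y[1:80], x[81:100, ], k = 5)

# Predictions after centring and scaling with training-row statistics
x_scaled <- scale(x, center = colMeans(x[1:80, ]),
                  scale = apply(x[1:80, ], 2, sd))
scaled_pred <- knn_predict(x_scaled[1:80, ], y[1:80], x_scaled[81:100, ], k = 5)

c(raw    = sqrt(mean((y[81:100] - raw_pred)^2)),
  scaled = sqrt(mean((y[81:100] - scaled_pred)^2)))
```

In caret, the analogous step would be to pass preProcess = c("center", "scale") to train(); that variant was not run in this analysis.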
set.seed(55)
control <- trainControl(method = "repeatedcv",
                        repeats = 5,
                        number = 5,
                        verboseIter = TRUE)
rf_model <- train(medv ~ .,
                  data = train_data,
                  method = "rf",
                  trControl = control)
## + Fold1.Rep1: mtry= 2
## - Fold1.Rep1: mtry= 2
## + Fold1.Rep1: mtry= 7
## - Fold1.Rep1: mtry= 7
## + Fold1.Rep1: mtry=13
## - Fold1.Rep1: mtry=13
## + Fold2.Rep1: mtry= 2
## - Fold2.Rep1: mtry= 2
## + Fold2.Rep1: mtry= 7
## - Fold2.Rep1: mtry= 7
## + Fold2.Rep1: mtry=13
## - Fold2.Rep1: mtry=13
## + Fold3.Rep1: mtry= 2
## - Fold3.Rep1: mtry= 2
## + Fold3.Rep1: mtry= 7
## - Fold3.Rep1: mtry= 7
## + Fold3.Rep1: mtry=13
## - Fold3.Rep1: mtry=13
## + Fold4.Rep1: mtry= 2
## - Fold4.Rep1: mtry= 2
## + Fold4.Rep1: mtry= 7
## - Fold4.Rep1: mtry= 7
## + Fold4.Rep1: mtry=13
## - Fold4.Rep1: mtry=13
## + Fold5.Rep1: mtry= 2
## - Fold5.Rep1: mtry= 2
## + Fold5.Rep1: mtry= 7
## - Fold5.Rep1: mtry= 7
## + Fold5.Rep1: mtry=13
## - Fold5.Rep1: mtry=13
## + Fold1.Rep2: mtry= 2
## - Fold1.Rep2: mtry= 2
## + Fold1.Rep2: mtry= 7
## - Fold1.Rep2: mtry= 7
## + Fold1.Rep2: mtry=13
## - Fold1.Rep2: mtry=13
## + Fold2.Rep2: mtry= 2
## - Fold2.Rep2: mtry= 2
## + Fold2.Rep2: mtry= 7
## - Fold2.Rep2: mtry= 7
## + Fold2.Rep2: mtry=13
## - Fold2.Rep2: mtry=13
## + Fold3.Rep2: mtry= 2
## - Fold3.Rep2: mtry= 2
## + Fold3.Rep2: mtry= 7
## - Fold3.Rep2: mtry= 7
## + Fold3.Rep2: mtry=13
## - Fold3.Rep2: mtry=13
## + Fold4.Rep2: mtry= 2
## - Fold4.Rep2: mtry= 2
## + Fold4.Rep2: mtry= 7
## - Fold4.Rep2: mtry= 7
## + Fold4.Rep2: mtry=13
## - Fold4.Rep2: mtry=13
## + Fold5.Rep2: mtry= 2
## - Fold5.Rep2: mtry= 2
## + Fold5.Rep2: mtry= 7
## - Fold5.Rep2: mtry= 7
## + Fold5.Rep2: mtry=13
## - Fold5.Rep2: mtry=13
## + Fold1.Rep3: mtry= 2
## - Fold1.Rep3: mtry= 2
## + Fold1.Rep3: mtry= 7
## - Fold1.Rep3: mtry= 7
## + Fold1.Rep3: mtry=13
## - Fold1.Rep3: mtry=13
## + Fold2.Rep3: mtry= 2
## - Fold2.Rep3: mtry= 2
## + Fold2.Rep3: mtry= 7
## - Fold2.Rep3: mtry= 7
## + Fold2.Rep3: mtry=13
## - Fold2.Rep3: mtry=13
## + Fold3.Rep3: mtry= 2
## - Fold3.Rep3: mtry= 2
## + Fold3.Rep3: mtry= 7
## - Fold3.Rep3: mtry= 7
## + Fold3.Rep3: mtry=13
## - Fold3.Rep3: mtry=13
## + Fold4.Rep3: mtry= 2
## - Fold4.Rep3: mtry= 2
## + Fold4.Rep3: mtry= 7
## - Fold4.Rep3: mtry= 7
## + Fold4.Rep3: mtry=13
## - Fold4.Rep3: mtry=13
## + Fold5.Rep3: mtry= 2
## - Fold5.Rep3: mtry= 2
## + Fold5.Rep3: mtry= 7
## - Fold5.Rep3: mtry= 7
## + Fold5.Rep3: mtry=13
## - Fold5.Rep3: mtry=13
## + Fold1.Rep4: mtry= 2
## - Fold1.Rep4: mtry= 2
## + Fold1.Rep4: mtry= 7
## - Fold1.Rep4: mtry= 7
## + Fold1.Rep4: mtry=13
## - Fold1.Rep4: mtry=13
## + Fold2.Rep4: mtry= 2
## - Fold2.Rep4: mtry= 2
## + Fold2.Rep4: mtry= 7
## - Fold2.Rep4: mtry= 7
## + Fold2.Rep4: mtry=13
## - Fold2.Rep4: mtry=13
## + Fold3.Rep4: mtry= 2
## - Fold3.Rep4: mtry= 2
## + Fold3.Rep4: mtry= 7
## - Fold3.Rep4: mtry= 7
## + Fold3.Rep4: mtry=13
## - Fold3.Rep4: mtry=13
## + Fold4.Rep4: mtry= 2
## - Fold4.Rep4: mtry= 2
## + Fold4.Rep4: mtry= 7
## - Fold4.Rep4: mtry= 7
## + Fold4.Rep4: mtry=13
## - Fold4.Rep4: mtry=13
## + Fold5.Rep4: mtry= 2
## - Fold5.Rep4: mtry= 2
## + Fold5.Rep4: mtry= 7
## - Fold5.Rep4: mtry= 7
## + Fold5.Rep4: mtry=13
## - Fold5.Rep4: mtry=13
## + Fold1.Rep5: mtry= 2
## - Fold1.Rep5: mtry= 2
## + Fold1.Rep5: mtry= 7
## - Fold1.Rep5: mtry= 7
## + Fold1.Rep5: mtry=13
## - Fold1.Rep5: mtry=13
## + Fold2.Rep5: mtry= 2
## - Fold2.Rep5: mtry= 2
## + Fold2.Rep5: mtry= 7
## - Fold2.Rep5: mtry= 7
## + Fold2.Rep5: mtry=13
## - Fold2.Rep5: mtry=13
## + Fold3.Rep5: mtry= 2
## - Fold3.Rep5: mtry= 2
## + Fold3.Rep5: mtry= 7
## - Fold3.Rep5: mtry= 7
## + Fold3.Rep5: mtry=13
## - Fold3.Rep5: mtry=13
## + Fold4.Rep5: mtry= 2
## - Fold4.Rep5: mtry= 2
## + Fold4.Rep5: mtry= 7
## - Fold4.Rep5: mtry= 7
## + Fold4.Rep5: mtry=13
## - Fold4.Rep5: mtry=13
## + Fold5.Rep5: mtry= 2
## - Fold5.Rep5: mtry= 2
## + Fold5.Rep5: mtry= 7
## - Fold5.Rep5: mtry= 7
## + Fold5.Rep5: mtry=13
## - Fold5.Rep5: mtry=13
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 7 on full training set
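The three mtry values in the log (2, 7 and 13) are consistent with three evenly spaced candidates between 2 and the number of predictors, rounded down. With 14 variables and medv as the target, there are 13 predictors. The arithmetic below reproduces the grid; treating this as caret's default rule is an assumption based on the values seen in the log.

```r
# Number of predictors: 14 variables minus the target medv
p <- 13

# Three evenly spaced candidates from 2 to p, rounded down
mtry_grid <- floor(seq(2, p, length.out = 3))
mtry_grid
```

The middle value is 7 because the midpoint of 2 and 13 is 7.5, which rounds down.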
# Best cross-validated RMSE of each model across its tuning grid
rmse_models <- data.frame(
  model = c("lm", "knn", "rf"),
  RMSE  = c(min(lm_model$results$RMSE),
            min(knn_model$results$RMSE),
            min(rf_model$results$RMSE))
)
rmse_models
## model RMSE
## 1 lm 4.728522
## 2 knn 6.471621
## 3 rf 3.139809
rf_model
## Random Forest
##
## 407 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 326, 326, 325, 326, 325, 325, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 3.474057 0.8759803 2.351505
## 7 3.139809 0.8848565 2.142029
## 13 3.288765 0.8704464 2.220037
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 7.
p_medv <- predict(rf_model, newdata = test_data)
p_medv
## 8 10 12 16 17 19 23 25
## 17.86384 18.43599 20.59960 20.07259 21.05760 18.57372 16.69547 16.88583
## 28 31 33 40 41 43 44 46
## 15.31268 14.78848 15.21248 28.40669 35.04628 24.59973 24.36465 19.94655
## 49 54 76 79 84 85 91 102
## 18.06048 20.97901 22.79911 21.05017 23.64261 22.67881 22.84964 25.39752
## 104 105 106 113 125 126 130 135
## 20.07225 20.24468 18.21924 19.36375 18.16320 19.70678 16.10115 15.39009
## 138 141 145 151 154 155 158 171
## 18.77348 15.54499 14.92889 19.43213 16.06445 16.66335 32.18671 20.50567
## 174 181 184 185 187 189 201 217
## 23.16568 37.88763 29.90304 23.62329 39.75825 28.31180 34.56370 21.68226
## 226 238 242 252 254 255 258 262
## 40.52779 33.06231 21.17975 28.50388 39.77825 22.85665 44.22772 40.93354
## 272 280 283 290 303 307 322 325
## 25.07377 31.33112 44.88021 23.53066 23.46055 34.58801 23.55078 23.65016
## 334 341 346 348 353 354 359 367
## 22.74199 20.02289 20.12858 24.35466 21.14097 30.02515 20.99503 18.38193
## 368 373 376 382 392 394 396 398
## 21.61542 31.40763 25.35163 11.79328 15.60912 14.95341 13.94374 12.60148
## 402 406 407 411 413 428 430 440
## 11.21567 9.57179 16.52086 26.26092 13.44742 15.41916 11.76756 12.35799
## 442 462 467 472 473 476 486 495
## 13.79481 19.68410 14.98616 20.48169 20.15240 15.09049 21.91237 20.80043
## 496 500 501
## 19.64760 19.12090 19.93675
# RMSE of the random forest predictions on the held-out test data
test_rmse <- sqrt(mean((test_data$medv - p_medv)^2))
test_rmse
## [1] 3.846461
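The test RMSE above is the square root of the mean squared difference between observed and predicted medv. A small worked example may make the formula concrete. The helper names `rmse` and `mae` and the predicted values are made up for illustration.

```r
# Root mean squared error between observed and predicted values
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Mean absolute error, reported alongside RMSE in the caret output
mae <- function(observed, predicted) {
  mean(abs(observed - predicted))
}

# Made-up predictions for illustration only
observed  <- c(24.0, 21.6, 34.7, 33.4)
predicted <- c(25.0, 20.6, 32.7, 33.4)

rmse(observed, predicted)  # sqrt((1 + 1 + 4 + 0) / 4) = sqrt(1.5)
mae(observed, predicted)   # (1 + 1 + 2 + 0) / 4 = 1
```

RMSE penalises the single error of 2 more heavily than MAE does, which is why it comes out larger than the MAE here.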
Three models were compared:
Linear Regression
K-Nearest Neighbors (knn)
Random Forest
The Random Forest model had the lowest cross-validated RMSE of the three (3.139809, versus 4.728522 for linear regression and 6.471621 for knn), making it the best of these models for predicting medv in the Boston Housing data. Its RMSE on the held-out test data was 3.846461.