Introduction
The MASS package in R provides access to the Boston housing dataset, which comprises 506 observations across 14 variables. A comprehensive data dictionary detailing each variable is provided in the Data Dictionary section that follows.
The primary objective of this analysis is to predict the median value of owner-occupied homes in the Boston area. To achieve this, a variety of machine learning models are implemented, using the remaining 13 variables as predictors.
The analysis will explore the following modeling approaches:
• Linear Regression
• Subset Regression - Forward, Backward, and Stepwise Variable Selection
• LASSO Regression
• GLM - Generalized Linear Models
• CART (Classification and Regression Trees)
• k-Nearest Neighbors (k-NN)
• GAM - Generalized Additive Models
• Neural Networks - Scaled and Unscaled Data, 1 and 2 Hidden Layers
• Random Forests
• GBM (Gradient Boosted Machines)
Model performance will be evaluated using Root Mean Squared Error (RMSE), which measures the difference between predicted values and the actual values in the test dataset. Each model will be trained on the full training data and, where appropriate, with cross-validation. However, all performance comparisons will be based on the out-of-sample RMSE, as it provides the most reliable measure of predictive accuracy and ensures a fair comparison across models.
Data Dictionary
library(knitr)
library(kableExtra)

# Create the data dictionary
data_dict <- data.frame(
  Variable = c("crim", "zn", "indus", "chas", "nox", "rm", "age", "dis",
               "rad", "tax", "ptratio", "black", "lstat", "medv"),
  Class = c("double", "double", "double", "factor/binary", "double", "double",
            "double", "double", "integer", "integer", "double", "double",
            "double", "double"),
  Description = c(
    "per capita crime rate by town",
    "proportion of residential land zoned for lots over 25,000 sq.ft",
    "proportion of non-retail business acres per town",
    "Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)",
    "nitrogen oxides concentration (parts per 10 million)",
    "average number of rooms per dwelling",
    "proportion of owner-occupied units built prior to 1940",
    "weighted mean of distances to five Boston employment centres",
    "index of accessibility to radial highways",
    "full-value property-tax rate per $10,000",
    "pupil-teacher ratio by town",
    "1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town",
    "lower status of the population (percent)",
    "median value of owner-occupied homes in $1000s"
  )
)

# Render the table
kable(data_dict, caption = "Data Dictionary for Boston Housing Dataset") %>%
  kable_styling(full_width = FALSE,
                bootstrap_options = c("striped", "hover", "condensed"))
Variable | Class | Description |
---|---|---|
crim | double | per capita crime rate by town. |
zn | double | proportion of residential land zoned for lots over 25,000 sq.ft |
indus | double | proportion of non-retail business acres per town. |
chas | factor/binary | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) |
nox | double | nitrogen oxides concentration (parts per 10 million) |
rm | double | average number of rooms per dwelling |
age | double | proportion of owner-occupied units built prior to 1940 |
dis | double | weighted mean of distances to five Boston employment centres |
rad | integer | index of accessibility to radial highways |
tax | integer | full-value property-tax rate per $10,000 |
ptratio | double | pupil-teacher ratio by town |
black | double | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town |
lstat | double | lower status of the population (percent) |
medv | double | median value of owner-occupied homes in $1000s |
Data Exploration
Statistic | Split | Median Value (medv)
---|---|---
Mean | Original Split | 22.26
Mean | New Split | 22.26
Median | Original Split | 21.05
Median | New Split | 21.05
Standard Deviation | Original Split | 9.1
Standard Deviation | New Split | 9.1
The summary statistics for the target variable appear to be consistent across both data splits, indicating stability in the distribution of the output variable.
The above visualizations show the training and testing data splits. The samples appear to be representative of the overall dataset while also exhibiting some variation between splits. This balance suggests that the data is well-suited for building and evaluating predictive models.
The pairplot above illustrates the density, distribution, and correlation between all variables in the dataset.
While visually informative, the plot is quite dense and cluttered, making it difficult to extract meaningful insights. To address this, the visualizations below present variable-level distributions and correlations individually, offering a clearer and more interpretable view of the data.
(Figure: histograms of each predictor variable in the training data: crim, zn, indus, nox, rm, age, dis, rad, tax, ptratio, black, and lstat.)
The plots above display the distribution of each feature in the training portion of the dataset, excluding the target variable medv. These variables serve as the independent predictors that will be used to model and explain the median value of homes in the Boston area.
The correlation plots above illustrate the relationship between each of the 13 independent variables and the target variable, medv. Notably, there is strong evidence of a linear relationship between medv and both rm (average number of rooms per dwelling) and lstat (percentage of lower-status population), indicating that these features are likely to be influential predictors in the modeling process.
Obs | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
228 | 0.41238 | 0 | 6.20 | 0 | 0.5040 | 7.163 | 79.9 | 3.2157 | 8 | 307 | 17.4 | 372.08 | 6.36 | 31.6 |
405 | 41.52920 | 0 | 18.10 | 0 | 0.6930 | 5.531 | 85.4 | 1.6074 | 24 | 666 | 20.2 | 329.46 | 27.38 | 8.5 |
210 | 0.43571 | 0 | 10.59 | 1 | 0.4890 | 5.344 | 100.0 | 3.8750 | 4 | 277 | 18.6 | 396.90 | 23.09 | 20.0 |
333 | 0.03466 | 35 | 6.06 | 0 | 0.4379 | 6.031 | 23.3 | 6.6407 | 1 | 304 | 16.9 | 362.25 | 7.83 | 19.4 |
265 | 0.55007 | 20 | 3.97 | 0 | 0.6470 | 7.206 | 91.6 | 1.9301 | 5 | 264 | 13.0 | 387.89 | 8.10 | 36.5 |
439 | 13.67810 | 0 | 18.10 | 0 | 0.7400 | 5.935 | 87.9 | 1.8206 | 24 | 666 | 20.2 | 68.95 | 34.02 | 8.4 |
355 | 0.04301 | 80 | 1.91 | 0 | 0.4130 | 5.663 | 21.9 | 10.5857 | 4 | 334 | 22.0 | 382.80 | 8.05 | 18.2 |
459 | 7.75223 | 0 | 18.10 | 0 | 0.7130 | 6.301 | 83.7 | 2.7831 | 24 | 666 | 20.2 | 272.21 | 16.23 | 14.9 |
423 | 12.04820 | 0 | 18.10 | 0 | 0.6140 | 5.648 | 87.6 | 1.9512 | 24 | 666 | 20.2 | 291.55 | 14.10 | 20.8 |
325 | 0.34109 | 0 | 7.38 | 0 | 0.4930 | 6.415 | 40.1 | 4.7211 | 5 | 287 | 19.6 | 396.90 | 6.12 | 25.0 |
Variable | Class
---|---
crim | double |
zn | double |
indus | double |
chas | integer |
nox | double |
rm | double |
age | double |
dis | double |
rad | integer |
tax | double |
ptratio | double |
black | double |
lstat | double |
medv | double |
The table above outlines the structure of the dataset, indicating the data type assigned to each variable. For a more detailed explanation of each variable, please refer to the full data dictionary provided in the Data Dictionary section above.
Linear Regression Models
When building regression models, including all available variables does not always yield the most accurate or generalizable results. A subset selection approach helps identify more parsimonious models—those that explain the data effectively while reducing complexity and the risk of overfitting.
To achieve this, I will implement several subset selection methods, including:
- AIC (Akaike Information Criterion)
- BIC (Bayesian Information Criterion)
- Forward Selection
- Backward Elimination
- Stepwise Regression (Both Directions)
- LASSO (Least Absolute Shrinkage and Selection Operator)
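As a point of reference, the sketch below shows how a best-subset search scored by BIC might be set up. The split proportion and object names (Boston_train, Boston_test) are assumptions, not the report's exact code.

library(MASS)    # Boston data
library(leaps)   # exhaustive best-subset search

# Hypothetical 80/20 train/test split; the report's actual split objects are not shown here
set.seed(123)
train_idx    <- sample(nrow(Boston), 0.8 * nrow(Boston))
Boston_train <- Boston[train_idx, ]
Boston_test  <- Boston[-train_idx, ]

# Search candidate subsets of the 13 predictors and score them by BIC
subsets <- regsubsets(medv ~ ., data = Boston_train, nvmax = 13)
plot(subsets, scale = "bic")   # top row corresponds to the best subset by BIC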
Information Criterion-Based Subset Selection
The plot above displays the variable subsets corresponding to each BIC value, as shown on the y-axis. In this context, a lower BIC value indicates a better model, balancing model fit with complexity. The top row, therefore, represents the optimal subset of variables according to the BIC criterion.
The best-performing model includes the intercept and the following predictors: crim, zn, chas, nox, rm, dis, rad, tax, ptratio, black, and lstat.
## [1] 19.07689
By calculating the Mean Squared Error (MSE) of the model’s predictions on the training data (in-sample), we gain a preliminary understanding of how well the model fits the data it was trained on. While this provides useful insight, our primary interest lies in how the model performs on unseen data, which will be evaluated later using the Root Mean Squared Prediction Error (RMSPE).
For now, the model yields an in-sample MSE of 21.97 and an RMSE of 4.69, suggesting a reasonable fit within the training dataset.
## [1] 1.000 1785.449
## [1] 82.63866
## [1] 9.090581
The null model serves as a baseline by regressing a constant (intercept-only model) against the medv values, effectively predicting the mean of the output variable for all observations. This model is useful as a benchmark when evaluating more complex models, particularly when comparing AIC and BIC values.
The null model yields an RMSE of 9.09, which is substantially higher than the best BIC model's RMSE of 4.69. This large reduction in error suggests that the selected BIC model provides a far more informative and accurate representation of the data than the null model.
## [1] 14.000 1203.101
## [1] 18.33231
## [1] 4.281624
Stepwise Regression Selection
##
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + age + dis +
## rad + tax + ptratio + black + lstat, data = Boston_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.3495 -2.5575 -0.5492 1.7267 30.2345
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.705991 5.657861 5.074 6.04e-07 ***
## crim -0.116654 0.031627 -3.688 0.000258 ***
## zn 0.038114 0.014629 2.605 0.009526 **
## chas1 2.535404 0.869959 2.914 0.003769 **
## nox -16.722638 3.783516 -4.420 1.28e-05 ***
## rm 4.809080 0.481801 9.981 < 2e-16 ***
## age -0.021065 0.013984 -1.506 0.132772
## dis -1.374232 0.201271 -6.828 3.31e-11 ***
## rad 0.238543 0.070193 3.398 0.000747 ***
## tax -0.009923 0.003792 -2.617 0.009215 **
## ptratio -0.969673 0.136176 -7.121 5.18e-12 ***
## black 0.010516 0.002805 3.749 0.000205 ***
## lstat -0.386025 0.055591 -6.944 1.60e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.355 on 391 degrees of freedom
## Multiple R-squared: 0.7779, Adjusted R-squared: 0.771
## F-statistic: 114.1 on 12 and 391 DF, p-value: < 2.2e-16
## [1] 13.000 1201.646
## [1] 18.35707
Stepwise regression is a variable selection technique that iteratively evaluates the importance of each predictor, updating the model at each step based on statistical criteria such as AIC or BIC.
In this analysis, we employ backward selection, which begins with a full model containing all predictors. The algorithm then systematically removes the least significant variables one at a time, continuing this process until no further improvement can be made according to the selection criteria.
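A minimal sketch of how these stepwise searches are typically run with step(), assuming the full and null linear models have been fit on the training split (object names are illustrative):

# Backward elimination starting from the full model
full_model <- lm(medv ~ ., data = Boston_train)
null_model <- lm(medv ~ 1, data = Boston_train)

model_back <- step(full_model, direction = "backward", trace = 0)

# Forward and bidirectional variants use the same helper
model_fwd  <- step(null_model, scope = formula(full_model), direction = "forward", trace = 0)
model_both <- step(null_model, scope = formula(full_model), direction = "both", trace = 0)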
##
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + black + chas + nox +
## dis + crim + zn + rad + tax + age, data = Boston_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.3495 -2.5575 -0.5492 1.7267 30.2345
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.705991 5.657861 5.074 6.04e-07 ***
## lstat -0.386025 0.055591 -6.944 1.60e-11 ***
## rm 4.809080 0.481801 9.981 < 2e-16 ***
## ptratio -0.969673 0.136176 -7.121 5.18e-12 ***
## black 0.010516 0.002805 3.749 0.000205 ***
## chas1 2.535404 0.869959 2.914 0.003769 **
## nox -16.722638 3.783516 -4.420 1.28e-05 ***
## dis -1.374232 0.201271 -6.828 3.31e-11 ***
## crim -0.116654 0.031627 -3.688 0.000258 ***
## zn 0.038114 0.014629 2.605 0.009526 **
## rad 0.238543 0.070193 3.398 0.000747 ***
## tax -0.009923 0.003792 -2.617 0.009215 **
## age -0.021065 0.013984 -1.506 0.132772
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.355 on 391 degrees of freedom
## Multiple R-squared: 0.7779, Adjusted R-squared: 0.771
## F-statistic: 114.1 on 12 and 391 DF, p-value: < 2.2e-16
## [1] 13.000 1201.646
## [1] 18.35707
The plot above represents a forward selection stepwise regression model, which starts with an intercept-only model and incrementally adds variables. At each step, the algorithm selects the variable that provides the greatest improvement to the model based on a selection criterion (such as AIC or BIC). This process continues until the addition of further variables no longer yields significant gains in model performance, resulting in what is considered an “optimal” model.
##
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + black + chas + nox +
## dis + crim + zn + rad + tax + age, data = Boston_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.3495 -2.5575 -0.5492 1.7267 30.2345
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.705991 5.657861 5.074 6.04e-07 ***
## lstat -0.386025 0.055591 -6.944 1.60e-11 ***
## rm 4.809080 0.481801 9.981 < 2e-16 ***
## ptratio -0.969673 0.136176 -7.121 5.18e-12 ***
## black 0.010516 0.002805 3.749 0.000205 ***
## chas1 2.535404 0.869959 2.914 0.003769 **
## nox -16.722638 3.783516 -4.420 1.28e-05 ***
## dis -1.374232 0.201271 -6.828 3.31e-11 ***
## crim -0.116654 0.031627 -3.688 0.000258 ***
## zn 0.038114 0.014629 2.605 0.009526 **
## rad 0.238543 0.070193 3.398 0.000747 ***
## tax -0.009923 0.003792 -2.617 0.009215 **
## age -0.021065 0.013984 -1.506 0.132772
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.355 on 391 degrees of freedom
## Multiple R-squared: 0.7779, Adjusted R-squared: 0.771
## F-statistic: 114.1 on 12 and 391 DF, p-value: < 2.2e-16
## [1] 13.000 1201.646
## [1] 18.35707
The bidirectional (both directions) stepwise regression procedure begins with an intercept-only model and iteratively adds or removes variables based on their contribution to model performance. Unlike forward or backward selection alone, this approach allows the algorithm to both include new predictors and eliminate those deemed unnecessary at each step.
While this method can lead to more refined models, it typically involves longer computation times, especially when applied to larger datasets.
Best AIC Model
## [1] 14.000 1203.101
## [1] 1.000 1785.449
## [1] 13.000 1201.646
## [1] 13.000 1201.646
Both the forward selection and bidirectional stepwise selection procedures converged on the same optimal model under the AIC criterion, consisting of 12 predictors with an AIC value of 1201.65. For consistency and simplicity, the forward selection model will be used in comparisons with other algorithms moving forward.
LASSO Regression
LASSO (Least Absolute Shrinkage and Selection Operator) is a regularization technique that promotes model simplicity by penalizing the absolute size of regression coefficients. This results in some coefficients being reduced to exactly zero, effectively performing variable selection. LASSO is especially useful for building sparse, interpretable models, particularly in “wide” datasets—where the number of predictors is large relative to the number of observations. It offers a cost-effective and scalable approach to prevent overfitting while identifying the most influential variables.
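A hedged sketch of how the LASSO fit and its cross-validated lambda might be obtained with glmnet (object names assumed):

library(glmnet)

# glmnet expects a numeric design matrix and a response vector
x_train <- model.matrix(medv ~ ., data = Boston_train)[, -1]
y_train <- Boston_train$medv

set.seed(123)
lasso_cv <- cv.glmnet(x_train, y_train, alpha = 1)   # alpha = 1 -> LASSO penalty

coef(lasso_cv, s = "lambda.1se")   # sparse coefficient matrix (cf. the output below)
lasso_cv$lambda.min                # lambda minimizing cross-validated error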
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 14.177440438
## crim -0.013381578
## zn .
## indus .
## chas 1.565138249
## nox -0.014613711
## rm 4.237566863
## age .
## dis -0.081500956
## rad .
## tax .
## ptratio -0.739162698
## black 0.005955512
## lstat -0.513806573
## [1] 0.009170489
## [1] 0.345263
## [1] 29.181
Out-of-Sample Testing
To evaluate predictive performance, the final models developed using AIC, BIC, and LASSO were tested on the remaining 20% of the data that was not used during model training. Predictions were generated using the predict() function in R, and model performance was assessed using Root Mean Squared Error (RMSE) on the out-of-sample (OOS) data. This provides a practical measure of how well each model generalizes to unseen data.
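For illustration, the out-of-sample RMSE can be computed along these lines (model and split object names are assumptions):

# RMSE for a fitted lm/step model on the test split
pred_bic <- predict(model_bic, newdata = Boston_test)
rmse_bic <- sqrt(mean((Boston_test$medv - pred_bic)^2))

# For the LASSO fit, predictions come from the glmnet object and a test design matrix
x_test     <- model.matrix(medv ~ ., data = Boston_test)[, -1]
pred_lasso <- predict(lasso_cv, newx = x_test, s = "lambda.min")
rmse_lasso <- sqrt(mean((Boston_test$medv - pred_lasso)^2))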
While the AIC and BIC models slightly outperformed the LASSO model on the test set, the differences in RMSE were not substantial. It is important to note that cross-validation typically offers a more reliable estimate of model performance, and the actual predictive error may vary more significantly under repeated sampling.
- BIC Model RMSE: 5.06
- AIC Model RMSE: 5.02
- LASSO Model RMSE: 5.35
These results suggest that each model is capable of predicting the median home value (medv) within approximately $5,000 of the actual value on unseen data.
Generalized Linear Models (GLMs)
Generalized Linear Models (GLMs) extend traditional linear regression to accommodate a wider variety of response distributions and link functions. While the term is often used interchangeably with linear regression, GLMs also encompass models such as logistic regression, Poisson regression, and others. These models are particularly useful when the response variable is not normally distributed or when modeling relationships involving categorical or count data.
Model Creation
## (Intercept) crim zn chas1 nox rm
## 36.33423870 -0.12871551 0.04328270 2.64440545 -19.61469218 4.36755117
## dis rad tax ptratio lstat
## -1.33459832 0.23798529 -0.01101571 -0.95977081 -0.45280032
The variables selected using the BIC criterion were used to construct a GLM, which was then evaluated with the cv.glm() function. This approach allows for an assessment of the model's predictive accuracy through cross-validation, providing a more robust estimate of its generalization performance.
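A sketch of this step, assuming 10-fold cross-validation and the BIC-selected variable set shown earlier (object names illustrative):

library(boot)

# GLM on the BIC-selected predictors, scored with 10-fold cross-validation
glm_bic <- glm(medv ~ crim + zn + chas + nox + rm + dis + rad + tax +
                 ptratio + black + lstat, data = Boston_train)

set.seed(123)
cv_bic <- cv.glm(Boston_train, glm_bic, K = 10)
cv_bic$delta[1]   # cross-validated prediction error (MSE)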
## (Intercept) lstat rm ptratio black
## 28.705991315 -0.386024915 4.809080407 -0.969673344 0.010515945
## chas1 nox dis crim zn
## 2.535404273 -16.722638170 -1.374232201 -0.116653646 0.038114383
## rad tax age
## 0.238543273 -0.009922873 -0.021065460
The variables and coefficients above were selected based on the AIC criterion. These predictors were then used to fit a GLM, which was evaluated using the cv.glm() function to assess its predictive accuracy through cross-validation.
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 16.573038153
## crim -0.024001671
## zn .
## indus .
## chas 1.994355944
## nox -4.485651802
## rm 4.272373008
## age .
## dis -0.390100184
## rad .
## tax .
## ptratio -0.801611503
## black 0.006695178
## lstat -0.518555953
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 33.009244896
## crim -0.104719398
## zn 0.044527873
## indus 0.007458193
## chas 2.698073139
## nox -17.125645162
## rm 3.829379107
## age .
## dis -1.454866978
## rad 0.285276396
## tax -0.011295472
## ptratio -0.942664365
## black 0.009211899
## lstat -0.523075319
The variables above were selected using the LASSO method at the optimal lambda value. These selected predictors were then used to fit a GLM, which was evaluated using the cv.glm() function to assess its predictive accuracy through cross-validation.
Cross-Validated Performance
The BIC model produced a cross-validated prediction error of 23.67, while the AIC model yielded a higher error of 28.79. In contrast, the LASSO model demonstrated the strongest performance, with a cross-validated prediction error of 23.44 using the lambda.min criterion and 24.81 using the more conservative lambda.1se criterion.
These results indicate that the LASSO model provides the best predictive performance on cross-validated data among the three approaches. This is not surprising, as LASSO selects its regularization parameter (lambda) using cross-validation by design. In comparison, the AIC and BIC models exhibited higher prediction errors under cross-validation, underscoring LASSO's advantage in balancing model complexity and predictive power.
Regression Tree (CART)
CART (Classification and Regression Trees) are nonparametric models that use decision rules to predict either continuous (regression) or categorical (classification) outcomes. In regression settings, CART models segment the data into distinct regions based on predictor values, assigning predictions based on the mean value of the target variable within each terminal node (leaf). These interpretable rule-based structures are particularly useful for uncovering non-linear relationships and variable interactions that may not be captured in traditional linear models.
Creating the Tree
Based on the complexity parameter (CP) plot, the optimal tree depth is 5, and the tree has been pruned accordingly to avoid overfitting while maintaining interpretability.
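A sketch of how such a tree is typically grown, inspected via the CP plot, and pruned with rpart; the pruning rule below (minimum cross-validation error) and object names are assumptions:

library(rpart)
library(rpart.plot)

# Grow a full regression tree and inspect the complexity-parameter table
tree_full <- rpart(medv ~ ., data = Boston_train, method = "anova")
plotcp(tree_full)   # CP plot used to pick the tree size

# Prune at the cp value with the lowest cross-validated error
cp_opt     <- tree_full$cptable[which.min(tree_full$cptable[, "xerror"]), "CP"]
tree_prune <- prune(tree_full, cp = cp_opt)
rpart.plot(tree_prune)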
This model is a regression tree, as it predicts a continuous outcome (medv). For example, the terminal node on the far left contains 59 observations with an average medv value of 12.16. This implies that any future observation meeting the following conditions:

- rm < 6.79
- lstat > 14.4
- crim > 5.78

will be assigned a predicted median home value of 12.16.
Pruning
Interpretation
The regression tree above illustrates the sequence of binary splits applied to the data, designed to optimally partition the dataset based on the specified complexity parameter (cp). Each split is made along a single variable axis, and each terminal node represents the expected median home value (medv) for observations that follow that specific path through the tree. Each node also displays the number of observations (support) that fall into that segment.
For instance, the leftmost terminal node contains 130 observations and predicts a median home value of 14.53. This prediction is based on the following conditions:
- The house has fewer than 6.941 rooms (rm) on average
- The percentage of lower-status population (lstat) is greater than or equal to 14.78

All records meeting these criteria are assigned a predicted medv of 14.53.
Notably, the model relies heavily on lstat and rm as primary splitting variables, indicating that these features hold the most predictive value for housing prices within the dataset.
Model Accuracy
In-Sample
The regression tree achieved an in-sample Mean Squared Error (MSE) of 21.5, demonstrating comparable predictive accuracy to the regression-based models and positioning it as a strong contender for modeling the median home value.
Out-of-Sample
On the test data, the regression tree achieved an out-of-sample RMSE of 5.0, indicating predictive performance comparable to that of the previously developed linear models. This suggests that the tree-based approach generalizes well to unseen data while offering the added benefit of interpretability.
CART - Full Data
The Apparent Squared Error (ASE) of the fully-fitted regression tree on the training data is 4.19, representing the lowest in-sample error among all models evaluated thus far. However, this low error likely overstates the model’s true predictive capability, as it has not been validated using cross-validation or out-of-sample testing.
Given this, I would still favor the cross-validated LASSO model, which demonstrated strong and reliable generalization performance. Specifically, the LASSO model achieved a cross-validated error of 23.44, compared to an in-sample error of 29.18, indicating a more honest estimate of how it would perform on unseen data.
k-NN Analysis
k-Nearest Neighbors (k-NN) is a non-parametric method that relies on measuring distances (typically Euclidean distance) to determine the similarity between observations. For classification tasks, an observation is assigned a class based on the majority vote among its k closest neighbors.

For regression tasks, such as this one, predictions are made by computing the average outcome (e.g., medv) of the k nearest data points. This method captures local patterns in the data but can be sensitive to the choice of k and the presence of irrelevant features.
The graphic above illustrates the core concept behind k-NN analysis. As the value of k increases, the radius—or the number of nearest neighbors considered—increases as well. This means that a larger group of nearby observations is used to determine the predicted value of the output. In regression, this translates to averaging the output values of the k closest data points, allowing the model to smooth out local variation and potentially improve generalization, depending on the structure of the data.
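A brief sketch of k-NN regression using the FNN package, assuming the predictors have been min-max scaled as in the summary table below (object names and the choice of k are illustrative):

library(FNN)

# k-NN regression on scaled predictors; the response stays in its original units
x_train <- Boston_train_scaled[, setdiff(names(Boston_train_scaled), "medv")]
x_test  <- Boston_test_scaled[,  setdiff(names(Boston_test_scaled),  "medv")]

knn_fit  <- knn.reg(train = x_train, test = x_test, y = Boston_train_scaled$medv, k = 5)
mspe_knn <- mean((Boston_test_scaled$medv - knn_fit$pred)^2)
sqrt(mspe_knn)   # RMSE on the test split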
crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Min. :0.0000000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. : 5.00 | |
1st Qu.:0.0008524 | 1st Qu.:0.0000 | 1st Qu.:0.1734 | 1st Qu.:0.00000 | 1st Qu.:0.1253 | 1st Qu.:0.3815 | 1st Qu.:0.4307 | 1st Qu.:0.09732 | 1st Qu.:0.1304 | 1st Qu.:0.1702 | 1st Qu.:0.5106 | 1st Qu.:0.9456 | 1st Qu.:0.1501 | 1st Qu.:16.57 | |
Median :0.0028121 | Median :0.0000 | Median :0.3383 | Median :0.00000 | Median :0.3048 | Median :0.4492 | Median :0.7623 | Median :0.22020 | Median :0.1739 | Median :0.2715 | Median :0.6915 | Median :0.9872 | Median :0.2739 | Median :21.05 | |
Mean :0.0417906 | Mean :0.1082 | Mean :0.3911 | Mean :0.07178 | Mean :0.3444 | Mean :0.4695 | Mean :0.6710 | Mean :0.27569 | Mean :0.3652 | Mean :0.4124 | Mean :0.6230 | Mean :0.8980 | Mean :0.3062 | Mean :22.26 | |
3rd Qu.:0.0403351 | 3rd Qu.:0.1375 | 3rd Qu.:0.6466 | 3rd Qu.:0.00000 | 3rd Qu.:0.4843 | 3rd Qu.:0.5432 | 3rd Qu.:0.9363 | 3rd Qu.:0.41619 | 3rd Qu.:1.0000 | 3rd Qu.:0.9140 | 3rd Qu.:0.8085 | 3rd Qu.:0.9991 | 3rd Qu.:0.4289 | 3rd Qu.:25.00 | |
Max. :1.0000000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :50.00 |
pred | actual |
---|---|
25.90 | 21.6 |
19.18 | 16.5 |
20.88 | 15.0 |
21.62 | 21.7 |
19.12 | 18.2 |
17.64 | 21.0 |
21.42 | 18.9 |
23.94 | 26.6 |
23.94 | 25.3 |
21.86 | 19.7 |
33.44 | 35.4 |
26.76 | 24.7 |
22.80 | 19.4 |
22.28 | 20.9 |
22.48 | 24.1 |
22.46 | 20.8 |
24.96 | 21.4 |
20.78 | 27.5 |
14.20 | 18.6 |
19.64 | 20.4 |
The k-NN model achieved an RMSE of 4.98 and a Mean Squared Prediction Error (MSPE) of 24.77, indicating strong initial performance relative to previously evaluated models. These results suggest that the model is effective at capturing local structure in the data and offers competitive predictive accuracy.
k | MSPE.k | RMSE.k |
---|---|---|
1 | 16.69481 | 4.085929 |
2 | 14.20386 | 3.768801 |
3 | 18.07362 | 4.251308 |
4 | 18.58708 | 4.311274 |
5 | 19.31010 | 4.394326 |
6 | 19.96611 | 4.468346 |
7 | 21.60044 | 4.647627 |
8 | 22.46707 | 4.739944 |
9 | 22.99960 | 4.795790 |
10 | 23.42123 | 4.839549 |
11 | 24.04603 | 4.903676 |
12 | 23.38098 | 4.835389 |
13 | 23.73573 | 4.871933 |
14 | 23.81468 | 4.880029 |
15 | 23.57274 | 4.855177 |
16 | 23.74600 | 4.872987 |
17 | 23.99812 | 4.898787 |
18 | 24.84588 | 4.984564 |
19 | 25.64050 | 5.063644 |
20 | 26.91320 | 5.187793 |
21 | 27.21593 | 5.216889 |
22 | 27.46991 | 5.241174 |
23 | 27.86595 | 5.278821 |
24 | 28.23009 | 5.313200 |
25 | 28.14988 | 5.305646 |
26 | 28.72544 | 5.359612 |
27 | 29.23920 | 5.407328 |
28 | 29.71655 | 5.451289 |
29 | 30.07130 | 5.483730 |
30 | 30.33171 | 5.507423 |
## [1] 2
## [1] 29.53841
## [1] 5.434925
The k-NN approach is a relatively simple, low-effort method that has yielded one of the strongest predictive performances for estimating median home values in Boston on unseen data.
However, a key limitation of this method is its lack of interpretability. Unlike regression-based models, k-NN does not provide insight into the relationships between predictor variables and the outcome. Predictions are based solely on the proximity of observations—data points are either “close” or they are not—offering little transparency into the model’s decision-making process.
GAM (Generalized Additive Models)
Generalized Additive Models (GAMs) build upon the flexibility of Generalized Linear Models (GLMs) by allowing nonlinear relationships between the predictors and the response variable through the use of spline functions. This added flexibility enables GAMs to better capture complex, non-linear trends in the data, which may be missed by more rigid models.
In this analysis, we apply the s() function to variables where we believe a smoothed, nonlinear transformation may improve model fit. Decisions about which smoothing terms to retain will be guided by model performance and interpretability, as indicated by the output summaries and diagnostics.
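The formula in the summary below implies a fit along these lines with mgcv (a sketch; chas and rad enter linearly, and object names are assumed):

library(mgcv)

# GAM with smooth terms on the continuous predictors
gam_fit <- gam(medv ~ s(crim) + s(zn) + s(indus) + chas + s(nox) + s(rm) + s(age) +
                 s(dis) + rad + s(tax) + s(ptratio) + s(black) + s(lstat),
               data = Boston_train)
summary(gam_fit)   # produces the summary shown below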
##
## Family: gaussian
## Link function: identity
##
## Formula:
## medv ~ s(crim) + s(zn) + s(indus) + chas + s(nox) + s(rm) + s(age) +
## s(dis) + rad + s(tax) + s(ptratio) + s(black) + s(lstat)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.3232 1.3346 13.729 < 2e-16 ***
## chas2 1.7014 0.6537 2.603 0.00963 **
## rad 0.4058 0.1413 2.872 0.00432 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(crim) 3.162 3.923 12.234 < 2e-16 ***
## s(zn) 1.000 1.000 0.896 0.34461
## s(indus) 6.301 7.290 3.124 0.00277 **
## s(nox) 8.413 8.860 8.441 < 2e-16 ***
## s(rm) 5.622 6.814 28.211 < 2e-16 ***
## s(age) 1.000 1.000 2.908 0.08900 .
## s(dis) 1.000 1.000 9.716 0.00198 **
## s(tax) 7.897 8.609 5.100 4.09e-06 ***
## s(ptratio) 1.000 1.000 30.796 < 2e-16 ***
## s(black) 5.511 6.611 2.967 0.00639 **
## s(lstat) 6.532 7.684 17.776 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.893 Deviance explained = 90.6%
## GCV = 10.097 Scale est. = 8.8361 n = 404
The model above reports an adjusted R-squared of 0.893, indicating that it explains roughly 89% of the variance in the training data. This suggests strong in-sample explanatory power.
## [1] 7.732989
The GAM produced an out-of-sample RMSE of 7.73, which is notably higher than the RMSE values observed for our other modeling approaches. While GAMs offer powerful flexibility for capturing nonlinear relationships, especially in more complex datasets, this particular model does not generalize as effectively to unseen data in our current context.
Neural Networks
Neural networks, the backbone of modern deep learning, have become one of the most widely researched and applied techniques in advanced analytics. These models excel at capturing highly complex, nonlinear relationships in data that traditional models may overlook. By leveraging multiple layers and a high number of fitted parameters, neural networks can model intricate interactions across variables.
However, this increased complexity comes at a cost: neural networks are typically more computationally intensive and have a higher risk of overfitting, especially when applied to smaller datasets or without proper regularization.
Unscaled Data
We begin our neural network analysis using unscaled data. A multilayer neural network is trained to predict the target variable (medv). However, without scaling the input features, the model may assign inconsistent importance to variables due to differences in their scales. This can distort the optimization process and lead to biased or suboptimal predictions, as neural networks are highly sensitive to feature magnitude.
The neural network model above uses two hidden layers to generate predictions for the medv variable. However, because the input data was not scaled, the model is likely to exhibit poor predictive performance. Neural networks are highly sensitive to the scale of input features, and unscaled data can lead to disproportionate weighting, resulting in suboptimal learning and biased outputs.
## [1] 118025.1
## [1] 343.5478
## [1] 22.7
The results above clearly highlight the consequences of using unscaled data in a neural network model. In a dataset where the average median home value is approximately 19.85, the unscaled neural network produced predictions with an RMSE of 343.55, more than an order of magnitude larger than the values being predicted. This level of error underscores the importance of properly scaling input features, particularly when using models like neural networks that are sensitive to variable magnitude.
## [1] 176882.6
## [1] 22.51407
The in-sample performance of the unscaled neural network was also poor, further emphasizing the model’s sensitivity to input feature scales. To improve predictive accuracy, the analysis will be repeated using data that has been normalized to a 0–1 range for all variables. Scaling the data appropriately ensures that each feature contributes proportionally during model training, allowing the neural network to learn more effectively and generate more reliable predictions.
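A minimal sketch of the 0-1 scaling and a two-hidden-layer fit with the neuralnet package; the layer sizes and object names are illustrative, and all columns are assumed to be numeric:

library(neuralnet)

# Min-max scale every column to [0, 1] using the training ranges
mins <- apply(Boston_train, 2, min)
maxs <- apply(Boston_train, 2, max)
train_scaled <- as.data.frame(scale(Boston_train, center = mins, scale = maxs - mins))
test_scaled  <- as.data.frame(scale(Boston_test,  center = mins, scale = maxs - mins))

# Build the formula explicitly (older neuralnet versions do not expand "medv ~ .")
predictors <- setdiff(names(train_scaled), "medv")
nn_formula <- as.formula(paste("medv ~", paste(predictors, collapse = " + ")))

set.seed(123)
nn_fit <- neuralnet(nn_formula, data = train_scaled,
                    hidden = c(5, 3),      # two hidden layers; sizes are illustrative
                    linear.output = TRUE)  # regression output

# Rescale predictions back to the original medv units before computing RMSE
pred_scaled <- compute(nn_fit, test_scaled[, predictors])$net.result
pred_medv   <- pred_scaled * (maxs["medv"] - mins["medv"]) + mins["medv"]
sqrt(mean((Boston_test$medv - pred_medv)^2))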
Scaled Data
## [1] 8.964007
## [1] 2.993995
After scaling the input data, the neural network's out-of-sample RMSE improved dramatically to 2.99, reflecting a substantial increase in predictive accuracy. This performance aligns with expectations for a well-fit, complex model and highlights the critical importance of data preprocessing, particularly feature scaling, when working with neural networks.
## [1] 4.663368
## [1] 2.159483
## [1] 0.3890208
The in-sample performance of the scaled neural network model is also notably strong, achieving an RMSE of 2.16. This indicates that the model fits the training data well and reinforces its effectiveness following proper data scaling.
Random Forest Model
Random Forests are an ensemble learning method that build on bagged decision trees by introducing additional randomness during model construction. As Boehmke & Greenwell (2019) describe, they “build a large collection of de-correlated trees to further improve predictive performance.” The method leverages the “wisdom of crowds” by aggregating the predictions of many independent, deep (“bushy”) trees to generate more accurate and robust predictions.
In this analysis, I utilize the Ranger implementation of Random Forests—a highly efficient and fast-performing version that supports streamlined hyperparameter tuning, making it especially useful for large datasets and iterative model development.
Default Model
The untuned Random Forest model generated an RMSE of 3.38, demonstrating strong predictive performance relative to the other models evaluated thus far. Even without hyperparameter tuning, the model captures complex patterns in the data effectively, reinforcing the robustness of ensemble tree-based methods.
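A sketch of the default ranger fit and its out-of-sample scoring (object names assumed):

library(ranger)

# Default Random Forest; the out-of-bag error gives a quick RMSE estimate
set.seed(123)
rf_default <- ranger(medv ~ ., data = Boston_train)
sqrt(rf_default$prediction.error)   # OOB MSE converted to RMSE

# Test-set predictions for out-of-sample evaluation
rf_pred <- predict(rf_default, data = Boston_test)$predictions
sqrt(mean((Boston_test$medv - rf_pred)^2))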
## [1] 22.977587 18.796120 21.288930 21.573543 19.636732 20.367953 21.596133
## [8] 29.244253 24.622220 20.812197 34.441777 24.596170 21.166243 21.033957
## [15] 23.911940 21.489123 24.337153 23.648090 19.662523 19.764933 19.208377
## [22] 21.145687 19.467789 17.440864 14.375451 14.384671 21.993369 21.851105
## [29] 20.807597 27.714293 37.388690 29.518947 39.766763 35.828937 25.423557
## [36] 45.097457 22.280227 20.344050 40.589180 43.080730 31.046770 42.709393
## [43] 22.989687 24.701990 21.932080 25.727617 29.314634 41.585886 32.902583
## [50] 31.850023 24.589743 23.966690 28.120190 23.148507 26.968877 25.239340
## [57] 27.657497 20.838242 20.924778 19.708233 19.967590 24.278716 21.468450
## [64] 23.605420 19.113420 24.721587 30.900947 21.742247 20.014430 20.013888
## [71] 34.331756 24.881532 19.484843 36.180634 36.774172 28.088006 14.396172
## [78] 12.752035 10.066063 10.632094 12.284164 10.621657 21.074127 14.843710
## [85] 9.493562 14.397020 18.118581 13.846884 15.807177 19.390654 15.641527
## [92] 19.521015 16.031511 18.769735 25.511509 27.859096 20.664337 20.679821
## [99] 13.701754 13.994832 18.659870 22.050758
Hyperparameter-Tuned Model
In the section below, I construct a grid of hyperparameter settings to optimize the performance of the Random Forest model, using Root Mean Squared Error (RMSE) as the evaluation metric.
By tuning across this hyperparameter grid, I can systematically evaluate a variety of model configurations and identify the combination that yields the lowest out-of-sample error, resulting in a more accurate and robust predictive model.
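A sketch of the kind of grid search that produces the table below, with grid values mirroring its columns; the exact settings and object names are assumptions:

# Hyperparameter grid evaluated by out-of-bag RMSE
hyper_grid <- expand.grid(
  mtry            = c(3, 4, 5),
  min.node.size   = c(1, 3, 5),
  replace         = c(TRUE, FALSE),
  sample.fraction = c(0.63, 0.80),
  rmse            = NA
)

for (i in seq_len(nrow(hyper_grid))) {
  fit <- ranger(
    medv ~ ., data = Boston_train,
    num.trees       = 500,
    mtry            = hyper_grid$mtry[i],
    min.node.size   = hyper_grid$min.node.size[i],
    replace         = hyper_grid$replace[i],
    sample.fraction = hyper_grid$sample.fraction[i],
    seed            = 123
  )
  hyper_grid$rmse[i] <- sqrt(fit$prediction.error)
}

hyper_grid[order(hyper_grid$rmse), ][1:10, ]   # best configurations first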
mtry | min.node.size | replace | sample.fraction | rmse | perc_gain |
---|---|---|---|---|---|
4 | 1 | FALSE | 0.80 | 3.227122 | 3.226980 |
5 | 1 | FALSE | 0.80 | 3.228947 | 3.172262 |
4 | 3 | FALSE | 0.80 | 3.235286 | 2.982162 |
5 | 3 | FALSE | 0.80 | 3.236607 | 2.942559 |
5 | 1 | FALSE | 0.63 | 3.267195 | 2.025285 |
0 | 1 | FALSE | 0.80 | 3.268149 | 1.996696 |
3 | 1 | FALSE | 0.80 | 3.268149 | 1.996696 |
5 | 5 | FALSE | 0.80 | 3.271540 | 1.894983 |
5 | 3 | FALSE | 0.63 | 3.272998 | 1.851281 |
4 | 5 | FALSE | 0.80 | 3.276693 | 1.740482 |
The grid-tuning process above generated multiple Random Forest models, each evaluated and sorted in ascending order of RMSE.
The best-performing configuration achieved an RMSE of 3.23, representing roughly a 3.2% improvement over the default Random Forest model. Based on these results, I can now proceed to build the final optimized Random Forest model using the hyperparameter settings identified in the top-performing configuration.
Out of Sample (OOS) Performance
## [1] 3.796277
The final optimized Random Forest model achieved an out-of-sample RMSE of 3.80, making it one of the strongest performers among the models evaluated so far.
Boosting
Gradient Boosted Machines (GBM) are an ensemble learning technique that builds shallow decision trees sequentially, in contrast to the deep, decorrelated trees grown by Random Forests. GBM models improve iteratively by concentrating on the errors of the previous trees, making them highly effective for predictive tasks.
To develop a well-performing GBM model, I will apply a hyperparameter tuning procedure similar to the one used for the Random Forest models, optimizing for RMSE to identify the best combination of learning rate, tree depth, and other key parameters.
Default Model
The default Gradient Boosted Machine (GBM) model produced an in-sample RMSE of 3.47, indicating solid baseline performance prior to any hyperparameter tuning.
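A hedged sketch of a baseline gbm fit, its cross-validation-selected tree count, and its out-of-sample scoring; the shrinkage, depth, and object names are illustrative:

library(gbm)

# Baseline GBM with cross-validation to pick the number of trees
set.seed(123)
gbm_fit <- gbm(
  medv ~ ., data = Boston_train,
  distribution      = "gaussian",
  n.trees           = 5000,
  shrinkage         = 0.01,     # learning rate
  interaction.depth = 3,
  cv.folds          = 10
)

best_iter <- gbm.perf(gbm_fit, method = "cv")   # optimal tree count
pred_gbm  <- predict(gbm_fit, newdata = Boston_test, n.trees = best_iter)
sqrt(mean((Boston_test$medv - pred_gbm)^2))     # out-of-sample RMSE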
Out of Sample Performance - Default Model
## [1] 631
Prior to any hyperparameter tuning, the default GBM model achieved an out-of-sample RMSE of 3.42, demonstrating strong predictive performance. While it does not match the two-layer neural network's out-of-sample RMSE of 2.99, it is competitive with the other models even before any optimization.
Grid Tuning
learning_rate | RMSE | trees | time | Time |
---|---|---|---|---|
0.010 | 3.304392 | 4945 | NA | 9.22 |
0.005 | 3.335294 | 4998 | NA | 14.43 |
0.050 | 3.347348 | 875 | NA | 10.16 |
0.100 | 3.360957 | 549 | NA | 8.18 |
0.300 | 3.506383 | 151 | NA | 9.06 |
After completing grid-based hyperparameter tuning, the table above presents the top-performing GBM models, ranked by RMSE. The best model achieved an RMSE of 3.30, which is comparable to the out-of-sample performance of the tuned Random Forest model. This suggests that the tuned GBM is a strong contender in terms of predictive accuracy and highlights the value of tuning in maximizing model performance.
OOS Performance
## learning_rate RMSE trees time Time
## 4 0.01 3.304392 4945 NA 9.22
By identifying the row in the tuning grid with the lowest RMSE, I can extract the corresponding hyperparameter settings for the optimal model and use them to retrain a final GBM model with those configurations applied.
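For illustration, the selection and refit might look like this; the tuning-results object gbm_grid and the fixed interaction depth are assumptions:

# Extract the configuration with the lowest RMSE from the tuning results
best_row <- gbm_grid[which.min(gbm_grid$RMSE), ]

# Refit the GBM with the winning learning rate and tree count
set.seed(123)
gbm_tuned <- gbm(medv ~ ., data = Boston_train,
                 distribution      = "gaussian",
                 n.trees           = best_row$trees,
                 shrinkage         = best_row$learning_rate,
                 interaction.depth = 3)

gbm_pred <- predict(gbm_tuned, newdata = Boston_test, n.trees = best_row$trees)
sqrt(mean((Boston_test$medv - gbm_pred)^2))   # out-of-sample RMSE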
After applying the optimized hyperparameters, the tuned GBM model achieved an out-of-sample RMSE of 3.24, making it one of the top-performing models developed in this analysis. Its strong predictive accuracy highlights the effectiveness of boosting combined with thoughtful tuning.
Comparing Results
Model | Metric | Value
---|---|---
AIC Subset Selection | Predicted RMSE | 6.24
BIC Subset Selection | Predicted RMSE | 6.14
LASSO Regression | Predicted RMSE | 6.19
CART | Predicted RMSE | 6.31
k-Nearest Neighbors | Predicted RMSE | 6.18
Random Forests - Untuned | Predicted RMSE | 4.16
Random Forests - Tuned | Predicted RMSE | 3.80
Gradient Boosted Machines - Default | Predicted RMSE | 3.87
Gradient Boosted Machines - Tuned | Predicted RMSE | 3.82
Neural Networks - 2 Hidden Layers | Predicted RMSE | 2.99
Neural Networks - 1 Hidden Layer | Predicted RMSE | 17.88