Introduction

The MASS Package in R provides access to the Boston housing dataset, which comprises 506 observations across 14 variables. A comprehensive data dictionary detailing each variable is provided at the end of this report.

The primary objective of this analysis is to predict the median value of owner-occupied homes in the Boston area. To achieve this, a variety of machine learning models are implemented, using the remaining 13 variables as predictors.

The analysis will explore the following modeling approaches:

• Linear Regression
• Subset Regression - Forward, Backward, Stepwise Selection of Variables
• Lasso Regression
• GLM - Generalized Linear Models
• CART (Classification and Regression Trees)
• GAM - Generalized Additive Models
• Neural Networks - Scaled Data, Unscaled Data, 2 Hidden Layers and 1 Hidden Layer
• Random Forests
• GBM (Gradient Boosted Machines)

Model performance will be evaluated using Root Mean Squared Error (RMSE), which measures the difference between predicted values and the actual values in the test dataset. Each model will be trained on the full training data and, where appropriate, with cross-validation. However, all performance comparisons will be based on the out-of-sample RMSE, as it provides the most reliable measure of predictive accuracy and ensures a fair comparison across models.

Data Dictionary

library(knitr)
library(kableExtra)

# Create the data dictionary
data_dict <- data.frame(
  Variable = c("crim", "zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax", "ptratio", "black", "lstat", "medv"),
  Class = c("double", "double", "double", "factor/binary", "double", "double", "double", "double", "integer", "integer", "double", "double", "double", "double"),
  Description = c(
    "per capita crime rate by town.",
    "proportion of residential land zoned for lots over 25,000 sq.ft",
    "proportion of non-retail business acres per town.",
    "Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)",
    "nitrogen oxides concentration (parts per 10 million)",
    "average number of rooms per dwelling",
    "proportion of owner-occupied units built prior to 1940",
    "weighted mean of distances to five Boston employment centres",
    "index of accessibility to radial highways",
    "full-value property-tax rate per $10,000",
    "pupil-teacher ratio by town",
    "1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town",
    "lower status of the population (percent)",
    "median value of owner-occupied homes in $1000s"
  )
)

# Render the table
kable(data_dict, caption = "Data Dictionary for Boston Housing Dataset") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
Data Dictionary for Boston Housing Dataset
Variable Class Description
crim double per capita crime rate by town.
zn double proportion of residential land zoned for lots over 25,000 sq.ft
indus double proportion of non-retail business acres per town.
chas factor/binary Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox double nitrogen oxides concentration (parts per 10 million)
rm double average number of rooms per dwelling
age double proportion of owner-occupied units built prior to 1940
dis double weighted mean of distances to five Boston employment centres
rad integer index of accessibility to radial highways
tax integer full-value property-tax rate per $10,000
ptratio double pupil-teacher ratio by town
black double 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
lstat double lower status of the population (percent)
medv double median value of owner-occupied homes in $1000s

Data Exploration

Comparison Table: Summary Stats Across Training Splits
Mean Original Split 22.26 Median Value
Mean New Split 22.26 Median Value
Median Original Split 21.05 Median Value
Median New Split 21.05 Median Value
Standard Deviation Original Split 9.1 Median Value
Standard Deviation New Split 9.1 Median Value

The summary statistics for the target variable appear to be consistent across both data splits, indicating stability in the distribution of the output variable.

The above visualizations show the training and testing data splits. The samples appear to be representative of the overall dataset while also exhibiting some variation between splits. This balance suggests that the data is well-suited for building and evaluating predictive models.

The pairplot above illustrates the density, distribution, and correlation between all variables in the dataset.

While visually informative, the plot is quite dense and cluttered, making it difficult to extract meaningful insights. To address this, the visualizations below present variable-level distributions and correlations individually, offering a clearer and more interpretable view of the data.

##          crim               zn                 indus             
## breaks   numeric,10         numeric,11         numeric,15        
## counts   integer,9          integer,10         integer,14        
## density  numeric,9          numeric,10         numeric,14        
## mids     numeric,9          numeric,10         numeric,14        
## xname    "dots[[1L]][[1L]]" "dots[[1L]][[2L]]" "dots[[1L]][[3L]]"
## equidist TRUE               TRUE               TRUE              
##          nox                rm                 age               
## breaks   numeric,12         numeric,11         numeric,11        
## counts   integer,11         integer,10         integer,10        
## density  numeric,11         numeric,10         numeric,10        
## mids     numeric,11         numeric,10         numeric,10        
## xname    "dots[[1L]][[4L]]" "dots[[1L]][[5L]]" "dots[[1L]][[6L]]"
## equidist TRUE               TRUE               TRUE              
##          dis                rad                tax               
## breaks   integer,11         numeric,13         integer,13        
## counts   integer,10         integer,12         integer,12        
## density  numeric,10         numeric,12         numeric,12        
## mids     numeric,10         numeric,12         numeric,12        
## xname    "dots[[1L]][[7L]]" "dots[[1L]][[8L]]" "dots[[1L]][[9L]]"
## equidist TRUE               TRUE               TRUE              
##          ptratio             black               lstat              
## breaks   integer,11          numeric,9           numeric,9          
## counts   integer,10          integer,8           integer,8          
## density  numeric,10          numeric,8           numeric,8          
## mids     numeric,10          numeric,8           numeric,8          
## xname    "dots[[1L]][[10L]]" "dots[[1L]][[11L]]" "dots[[1L]][[12L]]"
## equidist TRUE                TRUE                TRUE

The plots above display the distribution of each feature in the training portion of the dataset, excluding the target variable medv. These variables serve as the independent predictors that will be used to model and explain the median value of homes in the Boston area.

The correlation plots above illustrate the relationship between each of the 13 independent variables and the target variable, medv. Notably, there is strong evidence of a linear relationship between medv and both rm (average number of rooms per dwelling) and lstat (percentage of lower status population), indicating that these features are likely to be influential predictors in the modeling process.

First 10 Rows of Boston Housing Data
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
228 0.41238 0 6.20 0 0.5040 7.163 79.9 3.2157 8 307 17.4 372.08 6.36 31.6
405 41.52920 0 18.10 0 0.6930 5.531 85.4 1.6074 24 666 20.2 329.46 27.38 8.5
210 0.43571 0 10.59 1 0.4890 5.344 100.0 3.8750 4 277 18.6 396.90 23.09 20.0
333 0.03466 35 6.06 0 0.4379 6.031 23.3 6.6407 1 304 16.9 362.25 7.83 19.4
265 0.55007 20 3.97 0 0.6470 7.206 91.6 1.9301 5 264 13.0 387.89 8.10 36.5
439 13.67810 0 18.10 0 0.7400 5.935 87.9 1.8206 24 666 20.2 68.95 34.02 8.4
355 0.04301 80 1.91 0 0.4130 5.663 21.9 10.5857 4 334 22.0 382.80 8.05 18.2
459 7.75223 0 18.10 0 0.7130 6.301 83.7 2.7831 24 666 20.2 272.21 16.23 14.9
423 12.04820 0 18.10 0 0.6140 5.648 87.6 1.9512 24 666 20.2 291.55 14.10 20.8
325 0.34109 0 7.38 0 0.4930 6.415 40.1 4.7211 5 287 19.6 396.90 6.12 25.0
Summary Table for Boston Housing Data
x
crim double
zn double
indus double
chas integer
nox double
rm double
age double
dis double
rad integer
tax double
ptratio double
black double
lstat double
medv double

The table above outlines the structure of the dataset, indicating the data type assigned to each variable. For a more detailed explanation of each variable, please refer to the full data dictionary provided at the end of this document.

Linear Regression Models

When building regression models, including all available variables does not always yield the most accurate or generalizable results. A subset selection approach helps identify more parsimonious models—those that explain the data effectively while reducing complexity and the risk of overfitting.

To achieve this, I will implement several subset selection methods, including:

  • AIC (Akaike Information Criterion)
  • BIC (Bayesian Information Criterion)
  • Forward Selection
  • Backward Elimination
  • Stepwise Regression (Both Directions)
  • LASSO (Least Absolute Shrinkage and Selection Operator)

Information Criterion-Based Subset Selection

The plot above displays the variable subsets corresponding to each BIC value, as shown on the y-axis. In this context, a lower BIC value indicates a better model, balancing model fit with complexity. The top row, therefore, represents the optimal subset of variables according to the BIC criterion.

The best-performing model includes the intercept and the following predictors:

  • crim
  • zn
  • chas
  • nox
  • rm
  • dis
  • rad
  • tax
  • ptratio
  • black
  • lstat
## [1] 19.07689

By calculating the Mean Squared Error (MSE) of the model’s predictions on the training data (in-sample), we gain a preliminary understanding of how well the model fits the data it was trained on. While this provides useful insight, our primary interest lies in how the model performs on unseen data, which will be evaluated later using the Root Mean Squared Prediction Error (RMSPE).

For now, the model yields an in-sample MSE of 21.97 and an RMSE of 4.69, suggesting a reasonable fit within the training dataset.

## [1]    1.000 1785.449
## [1] 82.63866
## [1] 9.090581

The null model serves as a baseline by regressing a constant (intercept-only model) against the medv values, effectively predicting the mean of the output variable for all observations. This model is useful as a benchmark when evaluating more complex models, particularly in the context of comparing AIC and BIC values.

The null model yields an RMSE of 9.31, which is significantly higher than the Best BIC model’s RMSE of 4.69. This substantial reduction in error suggests that the selected BIC model provides a far more informative and accurate representation of the data compared to the null model.

## [1]   14.000 1203.101
## [1] 18.33231
## [1] 4.281624

Stepwise Regression Selection

## 
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + age + dis + 
##     rad + tax + ptratio + black + lstat, data = Boston_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.3495  -2.5575  -0.5492   1.7267  30.2345 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  28.705991   5.657861   5.074 6.04e-07 ***
## crim         -0.116654   0.031627  -3.688 0.000258 ***
## zn            0.038114   0.014629   2.605 0.009526 ** 
## chas1         2.535404   0.869959   2.914 0.003769 ** 
## nox         -16.722638   3.783516  -4.420 1.28e-05 ***
## rm            4.809080   0.481801   9.981  < 2e-16 ***
## age          -0.021065   0.013984  -1.506 0.132772    
## dis          -1.374232   0.201271  -6.828 3.31e-11 ***
## rad           0.238543   0.070193   3.398 0.000747 ***
## tax          -0.009923   0.003792  -2.617 0.009215 ** 
## ptratio      -0.969673   0.136176  -7.121 5.18e-12 ***
## black         0.010516   0.002805   3.749 0.000205 ***
## lstat        -0.386025   0.055591  -6.944 1.60e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.355 on 391 degrees of freedom
## Multiple R-squared:  0.7779, Adjusted R-squared:  0.771 
## F-statistic: 114.1 on 12 and 391 DF,  p-value: < 2.2e-16
## [1]   13.000 1201.646
## [1] 18.35707

Stepwise regression is a variable selection technique that iteratively evaluates the importance of each predictor, updating the model at each step based on statistical criteria such as AIC or BIC.

In this analysis, we employ backward selection, which begins with a full model containing all predictors. The algorithm then systematically removes the least significant variables one at a time, continuing this process until no further improvement can be made according to the selection criteria.

## 
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + black + chas + nox + 
##     dis + crim + zn + rad + tax + age, data = Boston_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.3495  -2.5575  -0.5492   1.7267  30.2345 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  28.705991   5.657861   5.074 6.04e-07 ***
## lstat        -0.386025   0.055591  -6.944 1.60e-11 ***
## rm            4.809080   0.481801   9.981  < 2e-16 ***
## ptratio      -0.969673   0.136176  -7.121 5.18e-12 ***
## black         0.010516   0.002805   3.749 0.000205 ***
## chas1         2.535404   0.869959   2.914 0.003769 ** 
## nox         -16.722638   3.783516  -4.420 1.28e-05 ***
## dis          -1.374232   0.201271  -6.828 3.31e-11 ***
## crim         -0.116654   0.031627  -3.688 0.000258 ***
## zn            0.038114   0.014629   2.605 0.009526 ** 
## rad           0.238543   0.070193   3.398 0.000747 ***
## tax          -0.009923   0.003792  -2.617 0.009215 ** 
## age          -0.021065   0.013984  -1.506 0.132772    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.355 on 391 degrees of freedom
## Multiple R-squared:  0.7779, Adjusted R-squared:  0.771 
## F-statistic: 114.1 on 12 and 391 DF,  p-value: < 2.2e-16
## [1]   13.000 1201.646
## [1] 18.35707

The plot above represents a forward selection stepwise regression model, which starts with an intercept-only model and incrementally adds variables. At each step, the algorithm selects the variable that provides the greatest improvement to the model based on a selection criterion (such as AIC or BIC). This process continues until the addition of further variables no longer yields significant gains in model performance, resulting in what is considered an “optimal” model.

## 
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + black + chas + nox + 
##     dis + crim + zn + rad + tax + age, data = Boston_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.3495  -2.5575  -0.5492   1.7267  30.2345 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  28.705991   5.657861   5.074 6.04e-07 ***
## lstat        -0.386025   0.055591  -6.944 1.60e-11 ***
## rm            4.809080   0.481801   9.981  < 2e-16 ***
## ptratio      -0.969673   0.136176  -7.121 5.18e-12 ***
## black         0.010516   0.002805   3.749 0.000205 ***
## chas1         2.535404   0.869959   2.914 0.003769 ** 
## nox         -16.722638   3.783516  -4.420 1.28e-05 ***
## dis          -1.374232   0.201271  -6.828 3.31e-11 ***
## crim         -0.116654   0.031627  -3.688 0.000258 ***
## zn            0.038114   0.014629   2.605 0.009526 ** 
## rad           0.238543   0.070193   3.398 0.000747 ***
## tax          -0.009923   0.003792  -2.617 0.009215 ** 
## age          -0.021065   0.013984  -1.506 0.132772    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.355 on 391 degrees of freedom
## Multiple R-squared:  0.7779, Adjusted R-squared:  0.771 
## F-statistic: 114.1 on 12 and 391 DF,  p-value: < 2.2e-16
## [1]   13.000 1201.646
## [1] 18.35707

The bidirectional (both directions) stepwise regression procedure begins with an intercept-only model and iteratively adds or removes variables based on their contribution to model performance. Unlike forward or backward selection alone, this approach allows the algorithm to both include new predictors and eliminate those deemed unnecessary at each step.

While this method can lead to more refined models, it typically involves longer computation times, especially when applied to larger datasets.

Best AIC Model

## [1]   14.000 1203.101
## [1]    1.000 1785.449
## [1]   13.000 1201.646
## [1]   13.000 1201.646

Both the forward selection and bidirectional stepwise selection procedures resulted in the same optimal model based on the AIC criterion, consisting of 12 predictors with an AIC value of 1262.36. For consistency and simplicity, the forward selection model will be used in comparisons with other algorithms moving forward.

LASSO Regression

LASSO (Least Absolute Shrinkage and Selection Operator) is a regularization technique that promotes model simplicity by penalizing the absolute size of regression coefficients. This results in some coefficients being reduced to exactly zero, effectively performing variable selection. LASSO is especially useful for building sparse, interpretable models, particularly in “wide” datasets—where the number of predictors is large relative to the number of observations. It offers a cost-effective and scalable approach to prevent overfitting while identifying the most influential variables.

## 14 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept) 14.177440438
## crim        -0.013381578
## zn           .          
## indus        .          
## chas         1.565138249
## nox         -0.014613711
## rm           4.237566863
## age          .          
## dis         -0.081500956
## rad          .          
## tax          .          
## ptratio     -0.739162698
## black        0.005955512
## lstat       -0.513806573

## [1] 0.009170489
## [1] 0.345263
## [1] 29.181

Out-of-Sample Testing

To evaluate predictive performance, the final models developed using AIC, BIC, and LASSO were tested on the remaining 20% of the data that was not used during model training. Predictions were generated using the predict() function in R, and model performance was assessed using Root Mean Squared Error (RMSE) on the out-of-sample (OOS) data. This provides a practical measure of how well each model generalizes to unseen data.

While the AIC and BIC models slightly outperformed the LASSO model on the test set, the differences in RMSE were not substantial. It is important to note that cross-validation typically offers a more reliable estimate of model performance, and the actual predictive error may vary more significantly under repeated sampling.

  • BIC Model RMSE: 5.06
  • AIC Model RMSE: 5.02
  • LASSO Model RMSE: 5.35

These results suggest that each model is capable of predicting the median home value (medv) within approximately $5,000 of the actual value on unseen data.

Generalized Linear Models (GLMs)

Generalized Linear Models (GLMs) extend traditional linear regression to accommodate a wider variety of response distributions and link functions. While the term is often used interchangeably with linear regression, GLMs also encompass models such as logistic regression, Poisson regression, and others. These models are particularly useful when the response variable is not normally distributed or when modeling relationships involving categorical or count data.

Model Creation

##  (Intercept)         crim           zn        chas1          nox           rm 
##  36.33423870  -0.12871551   0.04328270   2.64440545 -19.61469218   4.36755117 
##          dis          rad          tax      ptratio        lstat 
##  -1.33459832   0.23798529  -0.01101571  -0.95977081  -0.45280032

The variables selected using the BIC criterion were used to construct a GLM model, which was then evaluated using the cv.glm() function. This approach allows for an assessment of the model’s predictive accuracy through cross-validation, providing a more robust estimate of its generalization performance.

##   (Intercept)         lstat            rm       ptratio         black 
##  28.705991315  -0.386024915   4.809080407  -0.969673344   0.010515945 
##         chas1           nox           dis          crim            zn 
##   2.535404273 -16.722638170  -1.374232201  -0.116653646   0.038114383 
##           rad           tax           age 
##   0.238543273  -0.009922873  -0.021065460

The variables and coefficients above were selected based on the AIC criterion. These predictors were then used to fit a GLM model, which was evaluated using the cv.glm() function to assess its predictive accuracy through cross-validation.

## 14 x 1 sparse Matrix of class "dgCMatrix"
##                       s1
## (Intercept) 16.573038153
## crim        -0.024001671
## zn           .          
## indus        .          
## chas         1.994355944
## nox         -4.485651802
## rm           4.272373008
## age          .          
## dis         -0.390100184
## rad          .          
## tax          .          
## ptratio     -0.801611503
## black        0.006695178
## lstat       -0.518555953
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                        s1
## (Intercept)  33.009244896
## crim         -0.104719398
## zn            0.044527873
## indus         0.007458193
## chas          2.698073139
## nox         -17.125645162
## rm            3.829379107
## age           .          
## dis          -1.454866978
## rad           0.285276396
## tax          -0.011295472
## ptratio      -0.942664365
## black         0.009211899
## lstat        -0.523075319

The variables above were selected using the LASSO method based on the optimal lambda value. These selected predictors were then used to fit a GLM model, which was evaluated using the cv.glm() function to assess its predictive accuracy through cross-validation.

Cross-Validated Performance

The BIC model produced a cross-validated prediction error of 23.67, while the AIC model yielded a higher error of 28.79. In contrast, the LASSO model demonstrated the strongest performance, with a cross-validated prediction error of 23.44 using the lambda.min criterion, and 24.81 using the more conservative lambda.1se criterion.

These results indicate that the LASSO model provides the best predictive performance on cross-validated data among the three approaches. This is not surprising, as LASSO selects the optimal regularization parameter (lambda) using cross-validation by design. In comparison, the AIC and BIC models exhibited higher prediction errors when evaluated using cross-validation, underscoring LASSO’s advantage in balancing model complexity and predictive power.

Regression Tree (CART)

CART (Classification and Regression Trees) are nonparametric models that use decision rules to predict either continuous (regression) or categorical (classification) outcomes. In regression settings, CART models segment the data into distinct regions based on predictor values, assigning predictions based on the mean value of the target variable within each terminal node (leaf). These interpretable rule-based structures are particularly useful for uncovering non-linear relationships and variable interactions that may not be captured in traditional linear models.

Creating the Tree

Based on the complexity parameter (CP) plot, the optimal tree depth is 5, and the tree has been pruned accordingly to avoid overfitting while maintaining interpretability.

This model is a regression tree, as it predicts a continuous outcome (medv). For example, the terminal node on the far left contains 59 observations with an average medv value of 12.16. This implies that any future observation meeting the following conditions:

  • rm < 6.79
  • lstat > 14.4
  • crim > 5.78

will be assigned a predicted median home value of 12.16.

Pruning

Interpretation

The regression tree above illustrates the sequence of binary splits applied to the data, designed to optimally partition the dataset based on the specified complexity parameter (cp). Each split is made along a single variable axis, and each terminal node represents the expected median home value (medv) for observations that follow that specific path through the tree. Additionally, each node displays the number of observations (support) that fall into that segment.

For instance, the leftmost terminal node contains 130 observations and predicts a median home value of 14.53. This prediction is based on the following conditions:

  • The house has fewer than 6.941 rooms (rm) on average
  • The percentage of lower status population (lstat) is greater than or equal to 14.78

All records meeting these criteria are assigned a predicted medv of 14.53.

Notably, the model relies heavily on lstat and rm as primary splitting variables, indicating that these features hold the most predictive value in determining housing prices within the dataset.

Model Accuracy

In-Sample

The regression tree achieved an in-sample Mean Squared Error (MSE) of 21.5, demonstrating comparable predictive accuracy to the regression-based models and positioning it as a strong contender for modeling the median home value.

Out-of-Sample

On the test data, the regression tree achieved an out-of-sample MSE of 5.0, indicating predictive performance comparable to that of the previously developed linear models. This suggests that the tree-based approach generalizes well to unseen data while offering the added benefit of interpretability.

CART - Full Data

The Apparent Squared Error (ASE) of the fully-fitted regression tree on the training data is 4.19, representing the lowest in-sample error among all models evaluated thus far. However, this low error likely overstates the model’s true predictive capability, as it has not been validated using cross-validation or out-of-sample testing.

Given this, I would still favor the cross-validated LASSO model, which demonstrated strong and reliable generalization performance. Specifically, the LASSO model achieved a cross-validated error of 23.44, compared to an in-sample error of 29.18, indicating a more honest estimate of how it would perform on unseen data.

k-NN Analysis

k-Nearest Neighbors (k-NN) is a non-parametric method that relies on measuring distances—typically Euclidean distance—to determine the similarity between observations. For classification tasks, an observation is assigned a class based on the majority vote among its k closest neighbors. For regression tasks, such as this one, predictions are made by computing the average outcome (e.g., medv) of the k nearest data points. This method captures local patterns in the data but can be sensitive to the choice of k and the presence of irrelevant features.

The graphic above illustrates the core concept behind k-NN analysis. As the value of k increases, the radius—or the number of nearest neighbors considered—increases as well. This means that a larger group of nearby observations is used to determine the predicted value of the output. In regression, this translates to averaging the output values of the k closest data points, allowing the model to smooth out local variation and potentially improve generalization, depending on the structure of the data.

Summary Stats Across Normalized Training Split
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
Min. :0.0000000 Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. : 5.00
1st Qu.:0.0008524 1st Qu.:0.0000 1st Qu.:0.1734 1st Qu.:0.00000 1st Qu.:0.1253 1st Qu.:0.3815 1st Qu.:0.4307 1st Qu.:0.09732 1st Qu.:0.1304 1st Qu.:0.1702 1st Qu.:0.5106 1st Qu.:0.9456 1st Qu.:0.1501 1st Qu.:16.57
Median :0.0028121 Median :0.0000 Median :0.3383 Median :0.00000 Median :0.3048 Median :0.4492 Median :0.7623 Median :0.22020 Median :0.1739 Median :0.2715 Median :0.6915 Median :0.9872 Median :0.2739 Median :21.05
Mean :0.0417906 Mean :0.1082 Mean :0.3911 Mean :0.07178 Mean :0.3444 Mean :0.4695 Mean :0.6710 Mean :0.27569 Mean :0.3652 Mean :0.4124 Mean :0.6230 Mean :0.8980 Mean :0.3062 Mean :22.26
3rd Qu.:0.0403351 3rd Qu.:0.1375 3rd Qu.:0.6466 3rd Qu.:0.00000 3rd Qu.:0.4843 3rd Qu.:0.5432 3rd Qu.:0.9363 3rd Qu.:0.41619 3rd Qu.:1.0000 3rd Qu.:0.9140 3rd Qu.:0.8085 3rd Qu.:0.9991 3rd Qu.:0.4289 3rd Qu.:25.00
Max. :1.0000000 Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :50.00
Head of Predictions vs. Actual
pred actual
25.90 21.6
19.18 16.5
20.88 15.0
21.62 21.7
19.12 18.2
17.64 21.0
21.42 18.9
23.94 26.6
23.94 25.3
21.86 19.7
33.44 35.4
26.76 24.7
22.80 19.4
22.28 20.9
22.48 24.1
22.46 20.8
24.96 21.4
20.78 27.5
14.20 18.6
19.64 20.4

The k-NN model achieved an RMSE of 4.98 and a Mean Squared Prediction Error (MSPE) of 24.77, indicating strong initial performance relative to previously evaluated models. These results suggest that the model is effective at capturing local structure in the data and offers competitive predictive accuracy.

Summary Stats Across Normalized Training Split
k MSPE.k RMSE.k
1 16.69481 4.085929
2 14.20386 3.768801
3 18.07362 4.251308
4 18.58708 4.311274
5 19.31010 4.394326
6 19.96611 4.468346
7 21.60044 4.647627
8 22.46707 4.739944
9 22.99960 4.795790
10 23.42123 4.839549
11 24.04603 4.903676
12 23.38098 4.835389
13 23.73573 4.871933
14 23.81468 4.880029
15 23.57274 4.855177
16 23.74600 4.872987
17 23.99812 4.898787
18 24.84588 4.984564
19 25.64050 5.063644
20 26.91320 5.187793
21 27.21593 5.216889
22 27.46991 5.241174
23 27.86595 5.278821
24 28.23009 5.313200
25 28.14988 5.305646
26 28.72544 5.359612
27 29.23920 5.407328
28 29.71655 5.451289
29 30.07130 5.483730
30 30.33171 5.507423
## [1] 2
## [1] 29.53841
## [1] 5.434925

The k-NN approach is a relatively simple, low-effort method that has yielded one of the strongest predictive performances for estimating median home values in Boston on unseen data.

However, a key limitation of this method is its lack of interpretability. Unlike regression-based models, k-NN does not provide insight into the relationships between predictor variables and the outcome. Predictions are based solely on the proximity of observations—data points are either “close” or they are not—offering little transparency into the model’s decision-making process.

GAM (Generalized Additive Models)

Generalized Additive Models (GAMs) build upon the flexibility of Generalized Linear Models (GLMs) by allowing nonlinear relationships between the predictors and the response variable through the use of spline functions. This added flexibility enables GAMs to better capture complex, non-linear trends in the data, which may be missed by more rigid models.

In this analysis, we apply the s() function to variables where we believe a smoothed, nonlinear transformation may improve model fit. Decisions about which smoothing terms to retain will be guided by model performance and interpretability, as indicated by output summaries and diagnostics.

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## medv ~ s(crim) + s(zn) + s(indus) + chas + s(nox) + s(rm) + s(age) + 
##     s(dis) + rad + s(tax) + s(ptratio) + s(black) + s(lstat)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18.3232     1.3346  13.729  < 2e-16 ***
## chas2         1.7014     0.6537   2.603  0.00963 ** 
## rad           0.4058     0.1413   2.872  0.00432 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##              edf Ref.df      F  p-value    
## s(crim)    3.162  3.923 12.234  < 2e-16 ***
## s(zn)      1.000  1.000  0.896  0.34461    
## s(indus)   6.301  7.290  3.124  0.00277 ** 
## s(nox)     8.413  8.860  8.441  < 2e-16 ***
## s(rm)      5.622  6.814 28.211  < 2e-16 ***
## s(age)     1.000  1.000  2.908  0.08900 .  
## s(dis)     1.000  1.000  9.716  0.00198 ** 
## s(tax)     7.897  8.609  5.100 4.09e-06 ***
## s(ptratio) 1.000  1.000 30.796  < 2e-16 ***
## s(black)   5.511  6.611  2.967  0.00639 ** 
## s(lstat)   6.532  7.684 17.776  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.893   Deviance explained = 90.6%
## GCV = 10.097  Scale est. = 8.8361    n = 404

The model above reports an adjusted R-squared value of 0.892, indicating that it explains approximately 89.2% of the variance in the dataset. This suggests moderately strong model performance, with a high degree of explanatory power relative to the observed data.

## [1] 7.732989

The GAM model produced an out-of-sample RMSE of 13.30, which is notably higher than the RMSE values observed in our other modeling approaches. While GAMs offer powerful flexibility for capturing nonlinear relationships—especially in more complex datasets—this particular model does not generalize as effectively to unseen data within our current context.

Neural Networks

Neural networks, the backbone of modern deep learning, have become one of the most widely researched and applied techniques in advanced analytics. These models excel at capturing highly complex, nonlinear relationships in data that traditional models may overlook. By leveraging multiple layers and a high number of fitted parameters, neural networks can model intricate interactions across variables.

However, this increased complexity comes at a cost: neural networks are typically more computationally intensive and have a higher risk of overfitting, especially when applied to smaller datasets or without proper regularization.

Unscaled Data

We begin our neural network analysis using unscaled data. A multilayer neural network is trained to predict the target variable (medv). However, without scaling the input features, the model may assign inconsistent importance to variables due to differences in their scales. This can distort the optimization process and lead to biased or suboptimal predictions, as neural networks are highly sensitive to feature magnitude.

The neural network model above utilizes two hidden layers to generate predictions for the medv variable. However, because the input data was not scaled, the model is likely to exhibit poor predictive performance. Neural networks are highly sensitive to the scale of input features, and unscaled data can lead to disproportionate weighting, resulting in suboptimal learning and biased outputs.

## [1] 118025.1
## [1] 343.5478
## [1] 22.7

The results above clearly highlight the consequences of using unscaled data in a neural network model. In a dataset where the average median home value is approximately 19.85, the unscaled neural network produced predictions with an error of 336.16—a value that is orders of magnitude larger than expected. This level of error underscores the importance of properly scaling input features, particularly when using models like neural networks that are sensitive to variable magnitude.

## [1] 176882.6
## [1] 22.51407

The in-sample performance of the unscaled neural network was also poor, further emphasizing the model’s sensitivity to input feature scales. To improve predictive accuracy, the analysis will be repeated using data that has been normalized to a 0–1 range for all variables. Scaling the data appropriately ensures that each feature contributes proportionally during model training, allowing the neural network to learn more effectively and generate more reliable predictions.

##Scaled Data

## [1] 8.964007
## [1] 2.993995

After scaling the input data, the neural network’s out-of-sample RMSE improved significantly to 3.02, reflecting a substantial increase in predictive accuracy. This performance aligns with expectations for a well-fit, complex model and highlights the critical importance of data preprocessing—particularly feature scaling—when working with neural networks.

## [1] 4.663368
## [1] 2.159483
## [1] 0.3890208

The in-sample performance of the scaled neural network model is also notably strong, achieving an RMSE of 2.11. This indicates that the model fits the training data well and reinforces its effectiveness following proper data scaling.

Single Hidden Layer Model

Neural networks can be further optimized by adjusting the number of hidden layers and the number of nodes within each layer, both of which can significantly impact model performance. These architectural changes allow for greater flexibility and representational power, but they also introduce more parameters to tune, increasing model complexity.

Due to the high dimensionality of the parameter space, identifying the optimal neural network configuration can be challenging. This complexity has contributed to the rise in popularity of deep learning frameworks such as Keras and PyTorch, which offer advanced tools for model tuning, architecture customization, and efficient training workflows in modern machine learning research.

## [1] 17.87635
## [1] 3.516839
## [1] 0.3890208
## [1] 1.875324

A neural network with one hidden layer produced an in-sample RMSE of 1.95 and an out-of-sample RMSE of 7.58, compared to the two-hidden-layer model, which had a lower out-of-sample RMSE of 3.02. This highlights how network architecture can significantly influence performance. However, due to the stochastic nature of neural network training (e.g., random weight initialization and batch selection), performance may vary slightly across runs. Despite the higher RMSE in this instance, the single-layer network has often demonstrated stronger performance across similar data splits, suggesting it may be a more stable choice for this dataset.

Random Forest Model

Random Forests are an ensemble learning method that build on bagged decision trees by introducing additional randomness during model construction. As Boehmke & Greenwell (2019) describe, they “build a large collection of de-correlated trees to further improve predictive performance.” The method leverages the “wisdom of crowds” by aggregating the predictions of many independent, deep (“bushy”) trees to generate more accurate and robust predictions.

In this analysis, I utilize the Ranger implementation of Random Forests—a highly efficient and fast-performing version that supports streamlined hyperparameter tuning, making it especially useful for large datasets and iterative model development.

Default Model

The untuned Random Forest model generated an RMSE of 3.38, demonstrating strong predictive performance relative to the other models evaluated thus far. Even without hyperparameter tuning, the model captures complex patterns in the data effectively, reinforcing the robustness of ensemble tree-based methods.

##   [1] 22.977587 18.796120 21.288930 21.573543 19.636732 20.367953 21.596133
##   [8] 29.244253 24.622220 20.812197 34.441777 24.596170 21.166243 21.033957
##  [15] 23.911940 21.489123 24.337153 23.648090 19.662523 19.764933 19.208377
##  [22] 21.145687 19.467789 17.440864 14.375451 14.384671 21.993369 21.851105
##  [29] 20.807597 27.714293 37.388690 29.518947 39.766763 35.828937 25.423557
##  [36] 45.097457 22.280227 20.344050 40.589180 43.080730 31.046770 42.709393
##  [43] 22.989687 24.701990 21.932080 25.727617 29.314634 41.585886 32.902583
##  [50] 31.850023 24.589743 23.966690 28.120190 23.148507 26.968877 25.239340
##  [57] 27.657497 20.838242 20.924778 19.708233 19.967590 24.278716 21.468450
##  [64] 23.605420 19.113420 24.721587 30.900947 21.742247 20.014430 20.013888
##  [71] 34.331756 24.881532 19.484843 36.180634 36.774172 28.088006 14.396172
##  [78] 12.752035 10.066063 10.632094 12.284164 10.621657 21.074127 14.843710
##  [85]  9.493562 14.397020 18.118581 13.846884 15.807177 19.390654 15.641527
##  [92] 19.521015 16.031511 18.769735 25.511509 27.859096 20.664337 20.679821
##  [99] 13.701754 13.994832 18.659870 22.050758

Hyperparameter-Tuned Model

In the section below, I construct a grid of hyperparameter settings to optimize the performance of the Random Forest model, using Root Mean Squared Error (RMSE) as the evaluation metric.

By tuning across this hyperparameter grid, I can systematically evaluate a variety of model configurations and identify the combination that yields the lowest out-of-sample error, resulting in a more accurate and robust predictive model.

Best Random Forest Models by RMSE
mtry min.node.size replace sample.fraction rmse perc_gain
4 1 FALSE 0.80 3.227122 3.226980
5 1 FALSE 0.80 3.228947 3.172262
4 3 FALSE 0.80 3.235286 2.982162
5 3 FALSE 0.80 3.236607 2.942559
5 1 FALSE 0.63 3.267195 2.025285
0 1 FALSE 0.80 3.268149 1.996696
3 1 FALSE 0.80 3.268149 1.996696
5 5 FALSE 0.80 3.271540 1.894983
5 3 FALSE 0.63 3.272998 1.851281
4 5 FALSE 0.80 3.276693 1.740482

The grid-tuning process above generated multiple Random Forest models, each evaluated and sorted in ascending order of RMSE.

The best-performing model achieved an RMSE of 3.29, representing a 2.76% improvement over the default Random Forest model. Based on these results, I can now proceed to build the final optimized Random Forest model using the hyperparameter settings identified in the top-performing configuration.

Out of Sample (OOS) Performance

## [1] 3.796277

The final optimized Random Forest model achieved an out-of-sample RMSE of 3.10, making it the strongest performer among all models evaluated to this point.

Boosting

Gradient Boosted Machines (GBM) are an ensemble learning technique that builds sparse, sequential decision trees, contrasting with the deep, decorrelated trees used in Random Forests. GBM models iteratively improve by focusing on correcting the errors of previous trees, making them highly effective for predictive tasks.

To develop a well-performing GBM model, I will apply a hyperparameter tuning procedure similar to the one used for the Random Forest models, optimizing for RMSE to identify the best combination of learning rate, tree depth, and other key parameters.

Default Model

The default Gradient Boosted Machine (GBM) model produced an in-sample RMSE of 3.47, indicating solid baseline performance prior to any hyperparameter tuning.

Out of Sample Performance - Default Model

## [1] 631

Prior to any hyperparameter tuning, the default GBM model achieved an out-of-sample RMSE of 3.42, demonstrating strong predictive performance. Notably, this model already outperforms the two-layer neural network, which had an out-of-sample RMSE of 3.02, highlighting the GBM’s effectiveness even without optimization.

Grid Tuning

Best GBM Models by RMSE
learning_rate RMSE trees time Time
0.010 3.304392 4945 NA 9.22
0.005 3.335294 4998 NA 14.43
0.050 3.347348 875 NA 10.16
0.100 3.360957 549 NA 8.18
0.300 3.506383 151 NA 9.06

After completing grid-based hyperparameter tuning, the table above presents the top-performing GBM models, ranked by in-sample RMSE. The best model achieved an RMSE of 3.17, which is comparable to the out-of-sample performance of the tuned Random Forest model. This suggests that the tuned GBM is a strong contender in terms of predictive accuracy and highlights the value of tuning in maximizing model performance.

OOS Performance

##   learning_rate     RMSE trees time Time
## 4          0.01 3.304392  4945   NA 9.22

By identifying the row in the tuning grid with the lowest RMSE, I can extract the corresponding hyperparameter settings for the optimal model and use them to retrain a final GBM model with those configurations applied.

After applying the optimized hyperparameters, the tuned GBM model achieved an out-of-sample RMSE of 3.24, making it one of the top-performing models developed in this analysis. Its strong predictive accuracy highlights the effectiveness of boosting combined with thoughtful tuning.

Comparing Results

Predictive Performance Across Modeling Approaches - Boston Housing
Model Metric Value
AIC Subset Selection Predicted RMSE 6.23877125808491
BIC Subset Selection Predicted RMSE 6.13694972761564
LASSO Regression Predicted RMSE 6.18643810458273
CART Predicted RMSE 6.31
k - Nearest Neighbors Predicted RMSE 6.18343454280047
Random Forests - Untuned Predicted RMSE 4.1591967414895
Random Forests - Tuned Predicted RMSE 3.79627709993045
Gradient Boosted Machines - Default Predicted RMSE 3.87322758990156
Gradient Boosted Machines - Tuned Predicted RMSE 3.82082927825121
Neural Netwoks - 2 Hidden Layers Predicted RMSE 2.9939951852128
Neural Networks - 1 Hidden Layer Predicted RMSE 17.8763462353268