Boston Housing Case

Introduction

The Boston housing market data is a built-in R dataset in the MASS package, originally presented in Harrison and Rubinfeld (1978). The data were collected from various sources, including the 1970 US Census, the FBI, and the Massachusetts Taxpayers Foundation.

In this project, my goal was to fit linear regression, regression tree, KNN, advanced tree (bagging, random forest, boosting), GAM, and neural network models to predict house prices and to compare their prediction performance.

Statistical analyses and visualization were performed using R 4.1.0.

Packages Required

Below is the list of packages used in the project.

  • Package magrittr provides the pipe operator, %>%, used throughout the project.
  • Package tidyverse loads dplyr for data manipulation and ggplot2 for creating plots, among others.
  • Package MASS is for loading the Boston data.
  • Package leaps is for variable selection.
  • Package boot is for conducting cross validation.
  • Package rpart is for fitting regression trees.
  • Package rpart.plot is for displaying regression tree diagrams.
  • Package FNN is for conducting k-nearest neighbors.
  • Package knitr is for creating HTML tables.
  • Package kableExtra is for styling HTML tables.
  • Package pander is for producing simple tables from summary() output.
  • Package ipred is for bagging.
  • Package randomForest is for random forests.
  • Package gbm is for boosting.
  • Package mgcv is for fitting generalized additive models.
  • Package neuralnet is for fitting neural network models.

Boston Housing Data

The Boston data is one of R's built-in datasets and was loaded from the MASS package. There are 506 records with 14 variables in the data.

library(MASS)
data(Boston) 
dim(Boston)
## [1] 506  14

Snapshot

Below is a brief snapshot of the Boston data set.
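The knitr and kableExtra packages listed above are presumably what produced this table; a minimal sketch of such a call (the exact chunk is not shown in the report):

library(knitr)
library(kableExtra)
library(magrittr)
# display the first 10 records as a styled HTML table
kable(head(Boston, 10), caption = "Boston Housing data (first 10 records)") %>%
  kable_styling(full_width = FALSE)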

Table 1: Boston Housing data (first 10 records)
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
0.02985 0.0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7
0.08829 12.5 7.87 0 0.524 6.012 66.6 5.5605 5 311 15.2 395.60 12.43 22.9
0.14455 12.5 7.87 0 0.524 6.172 96.1 5.9505 5 311 15.2 396.90 19.15 27.1
0.21124 12.5 7.87 0 0.524 5.631 100.0 6.0821 5 311 15.2 386.63 29.93 16.5
0.17004 12.5 7.87 0 0.524 6.004 85.9 6.5921 5 311 15.2 386.71 17.10 18.9

Data dictionary

The following table describes each variable.

Table 2: Data Dictionary
Variable Type Description
crim numeric per capita crime rate by town
zn numeric proportion of residential land zoned for lots over 25,000 sq.ft
indus numeric proportion of non-retail business acres per town
chas integer Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox numeric nitrogen oxides concentration (parts per 10 million)
rm numeric average number of rooms per dwelling
age numeric proportion of owner-occupied units built prior to 1940
dis numeric weighted mean of distances to five Boston employment centres
rad integer index of accessibility to radial highways
tax numeric full-value property-tax rate per $10,000
ptratio numeric pupil-teacher ratio by town
black numeric 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
lstat numeric lower status of the population (percent)
medv numeric median value of owner-occupied homes in $1,000s

Data Summary

A summary of all variables in the data is displayed in the table below. The response variable is medv, the median value of homes in $1,000s.
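This summary table is the kind of output pander produces from summary(); a minimal sketch, assuming the chunk looked roughly like this:

library(pander)
# wide summary output is split into several sub-tables ("Table continues below")
pander(summary(Boston))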

Table continues below
crim zn indus chas
Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
Table continues below
nox rm age dis
Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
Table continues below
rad tax ptratio black
Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
Median : 5.000 Median :330.0 Median :19.05 Median :391.44
Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
lstat medv
Min. : 1.73 Min. : 5.00
1st Qu.: 6.95 1st Qu.:17.02
Median :11.36 Median :21.20
Mean :12.65 Mean :22.53
3rd Qu.:16.95 3rd Qu.:25.00
Max. :37.97 Max. :50.00

MLR/Regression Tree with Split #1

Linear Regression Model

After setting a seed, 80% of the original data was randomly chosen as the training data, and the remaining 20% as the testing data.

seed1<-12462210
set.seed(seed1)
sample_index <- sample(nrow(Boston),nrow(Boston)*0.80)
Boston_train <- Boston[sample_index,]
Boston_test <- Boston[-sample_index,]

1. Fit the full model

The full model was fitted, and most regression coefficients were significant except those of indus and age. About 74% of the variation in the response can be explained by the full model.

model_full.train <- lm(medv~., data=Boston_train)
summary(model_full.train)

2. Variable Selection

Best subset selection using adjusted \(r^2\) is presented in the plot. The subset including 11 explanatory variables (excluding age and indus) was suggested as one of the best subsets, with the highest adjusted \(r^2\) value of 0.74.

subset_result <- regsubsets(medv~.,data=Boston_train, nbest=2, nvmax = 14)
summary(subset_result)
plot(subset_result, scale="adjr2")

Stepwise regression also selected the model including all variables except age and indus, which had the lowest AIC (1267.39) and the lowest BIC (1315.41).

model_null.train <- lm(medv~1, data=Boston_train)
#stepwise selection using AIC
model_stepwise.aic <- step(model_null.train, scope=list(lower=model_null.train, upper=model_full.train), direction='both')

#stepwise selection using BIC
model_stepwise.bic <- step(model_null.train, scope=list(lower=model_null.train, upper=model_full.train),  direction='both', k=log(dim(Boston_train)[1]))

3. Final Model

Based on the variable selection methods above, indus and age were excluded, and the remaining 11 explanatory variables were included in the best model. Its summary follows.

Boston.train.lm<- lm(medv~.-indus-age, data=Boston_train)
summary(Boston.train.lm)
## 
## Call:
## lm(formula = medv ~ . - indus - age, data = Boston_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1139  -2.8071  -0.4659   1.6879  25.4858 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.710292   5.691837   6.977 1.29e-11 ***
## crim         -0.103718   0.038937  -2.664 0.008047 ** 
## zn            0.050198   0.015186   3.306 0.001035 ** 
## chas          2.551570   0.919963   2.774 0.005810 ** 
## nox         -20.666326   3.994345  -5.174 3.67e-07 ***
## rm            3.639440   0.443075   8.214 3.14e-15 ***
## dis          -1.679234   0.216152  -7.769 7.01e-14 ***
## rad           0.317015   0.068695   4.615 5.34e-06 ***
## tax          -0.011438   0.003671  -3.116 0.001967 ** 
## ptratio      -0.963296   0.144396  -6.671 8.67e-11 ***
## black         0.009944   0.002941   3.381 0.000796 ***
## lstat        -0.541142   0.053364 -10.140  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.73 on 392 degrees of freedom
## Multiple R-squared:  0.7445, Adjusted R-squared:  0.7374 
## F-statistic: 103.9 on 11 and 392 DF,  p-value: < 2.2e-16

4. Model Fit Evaluation

Mean Square Error (MSE) and Average Squared Error (ASE) for evaluating in-sample performance

As shown below, the MSE of the model on the training data is 22.37, and

#MSE: Mean Square Error in Training data
SSE<-sum(Boston.train.lm$residuals^2)
n=length(Boston.train.lm$residuals)
k=length(Boston.train.lm$coefficients)-1 #exclude intercept
mse.train<-SSE/(n-k-1)
mse.train
## [1] 22.37307

the ASE of the model on the training data is 21.71.

ase.train <-mean(Boston.train.lm$residuals^2)
ase.train
## [1] 21.70852

Mean Squared Prediction Error (MSPE) for evaluating out-of-sample performance

The MSPE of the model on the testing data is 23.16, and the Root Mean Squared Prediction Error (RMSPE) is 4.81.

#MSPE : Mean Squared Prediction Error in Testing data
yhat.test <- predict(object = Boston.train.lm, newdata = Boston_test)
mspe.lm <- mean((Boston_test$medv-yhat.test)^2)
mspe.lm
## [1] 23.15865
rmspe.lm <-sqrt(mspe.lm) #RMSPE
rmspe.lm 
## [1] 4.812344

Regression Tree

A regression tree was fitted to the training data.

# with training data  
#rpart
Boston_rpart <- rpart(formula = medv ~., data=Boston_train)
#prp(Boston_rpart, digits = 4, extra=1)

The default complexity parameter (cp) is 0.01, and the plotcp() function was used to identify an appropriate cp value for the best tree.

#plotcp
plotcp(Boston_rpart)

In the cp plot, the leftmost cp value whose cross-validated error lies below the horizontal (1-SE) line produces the simplest tree with adequate accuracy. In our case, the suggested cp value was 0.021. The tree was pruned using this cp value, and the best tree had 7 terminal nodes.
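As an alternative to reading the value off the plot, the cp could be chosen programmatically from the cp table with the 1-SE rule; a sketch (not run in the report, so it may or may not return exactly 0.021):

# cp table columns: CP, nsplit, rel error, xerror (CV error), xstd (its SE)
cp_tab <- Boston_rpart$cptable
# 1-SE rule: simplest tree whose CV error is within one SE of the minimum
one_se  <- min(cp_tab[, "xerror"]) + cp_tab[which.min(cp_tab[, "xerror"]), "xstd"]
best_cp <- cp_tab[which(cp_tab[, "xerror"] <= one_se)[1], "CP"]
best_cp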

#prune
Boston_rpart2<- prune(Boston_rpart, cp=0.021) # terminal node=7
prp(Boston_rpart2, digits = 4, extra=1)

ASE for in-sample prediction

For evaluating in-sample prediction of the regression tree, ASE is calculated.

#Training data prediction
Boston_train_pred_tree=predict(Boston_rpart2)

#ASE Average Sum of Squared Error
ase.tree <- mean((Boston_train$medv - Boston_train_pred_tree)^2) 
ase.tree
## [1] 17.53666

MSPE for out-of-sample prediction

For evaluating out-of-sample prediction of the regression tree, MSPE is calculated.

#Testing data prediction
Boston_test_pred_tree = predict(Boston_rpart2, Boston_test)
#MSPE
mspe.tree <-mean((Boston_test$medv - Boston_test_pred_tree)^2) 
mspe.tree
## [1] 17.47077

It is notable that both the ASE and the MSPE from the regression tree are much smaller than those from the linear regression model. With our first split, the regression tree performed better than the linear regression model based on its smaller out-of-sample prediction error.

MLR/Regression Tree with Split #2

Linear Regression Model

After setting a different seed, another 80% of the original data was randomly chosen as the training data, and the remaining 20% as the testing data.

seed2<-1226421
set.seed(seed2)
sample_index <- sample(nrow(Boston),nrow(Boston)*0.80)
Boston_train <- Boston[sample_index,]
Boston_test <- Boston[-sample_index,]

1. Fit the full model

The full model was fitted with the training data, and with this split, the coefficients of crim, indus, and age were not significant. About 72% of the variation in the response can be explained by this model.

model_full.train <- lm(medv~., data=Boston_train)
summary(model_full.train)

2. Variable Selection

Best subset selection using adjusted \(r^2\) is presented in the plot. The subset that included 11 explanatory variables while excluding age and indus was again proposed as a best subset, with the highest adjusted \(r^2\) value of 0.72.

#Best subset using adjusted r^2
subset_result <- regsubsets(medv~.,data=Boston_train, nbest=2, nvmax = 14)
summary(subset_result)
plot(subset_result, scale="adjr2")

Stepwise regression again selected the model including all variables except age and indus, which had the lowest AIC (1288.17) and the lowest BIC (1331.33).

model_null.train <- lm(medv~1, data=Boston_train)
#stepwise selection using AIC
model_stepwise.aic <- step(model_null.train, scope=list(lower=model_null.train, upper=model_full.train), direction='both')

#stepwise selection using BIC
model_stepwise.bic <- step(model_null.train, scope=list(lower=model_null.train, upper=model_full.train),  direction='both', k=log(dim(Boston_train)[1]))

3. Final Model

Using the same variable selection methods with the second train/test split, the same model was selected. In other words, the best model excluded indus and age and included the remaining 11 explanatory variables. Below is the summary of the fitted model.

Boston.train.lm2<- lm(medv~.-indus-age, data=Boston_train)
summary(Boston.train.lm2)
## 
## Call:
## lm(formula = medv ~ . - indus - age, data = Boston_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2817  -2.9221  -0.5911   1.8376  26.5143 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  29.830895   6.187762   4.821 2.05e-06 ***
## crim         -0.097310   0.060241  -1.615 0.107038    
## zn            0.041423   0.015299   2.708 0.007073 ** 
## chas          2.871187   0.991816   2.895 0.004005 ** 
## nox         -15.922379   4.073084  -3.909 0.000109 ***
## rm            4.261760   0.483720   8.810  < 2e-16 ***
## dis          -1.352809   0.208566  -6.486 2.66e-10 ***
## rad           0.284777   0.074477   3.824 0.000153 ***
## tax          -0.009780   0.003787  -2.582 0.010182 *  
## ptratio      -0.901311   0.146842  -6.138 2.05e-09 ***
## black         0.011750   0.003357   3.501 0.000518 ***
## lstat        -0.529872   0.053685  -9.870  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.853 on 392 degrees of freedom
## Multiple R-squared:  0.7316, Adjusted R-squared:  0.7241 
## F-statistic: 97.16 on 11 and 392 DF,  p-value: < 2.2e-16

4. Model Fit Evaluation

In-sample performance: Mean Square Error (MSE) and Average Squared Error (ASE)

With the second training data, the MSE was 22.37, and

#MSE: Mean Square Error in Training data
SSE<-sum(Boston.train.lm$residuals^2)
n=length(Boston.train.lm$residuals)
k=length(Boston.train.lm$coefficients)-1 #exclude intercept
mse.train2<-SSE/(n-k-1)
mse.train2
## [1] 22.37307

the ASE was 21.71. Both in-sample performance values were exactly the same as those from the regression model fitted to the first training data.

ase.train2 <-mean(Boston.train.lm$residuals^2)
ase.train2
## [1] 21.70852

Mean Squared Prediction Error (MSPE) for evaluating out-of-sample performance

However, the MSPE with the second testing data was smaller than that from the first testing data: the MSPE was 17.61, and the Root Mean Squared Prediction Error (RMSPE) was 4.20.

#MSPE : Mean Squared Prediction Error in Testing data
yhat.test <- predict(object = Boston.train.lm, newdata = Boston_test)
mspe.lm2 <- mean((Boston_test$medv-yhat.test)^2)
mspe.lm2
## [1] 17.60598
rmspe.lm2 <-sqrt(mspe.lm2) #RMSPE
rmspe.lm2 
## [1] 4.195948

We observed that the MSPE (out-of-sample performance) changes because of the randomness of the training/testing data split.

Regression Tree

A regression tree was fitted using the second training data.

# with training data  
#rpart
Boston_rpart <- rpart(formula = medv ~., data=Boston_train)
#prp(Boston_rpart, digits = 4, extra=1)

The default complexity parameter (cp) is 0.01, and the plotcp() function was used to identify an appropriate cp value for the best tree.

#plotcp
plotcp(Boston_rpart)

As before, the leftmost cp value whose cross-validated error lies below the horizontal line produces the simplest tree with adequate accuracy. In this case, the suggested cp value was 0.02. The tree was pruned using this cp value, and the best tree had 6 terminal nodes.

#prune
Boston_rpart2<- prune(Boston_rpart, cp=0.02) 
prp(Boston_rpart2, digits = 4, extra=1)

ASE for in-sample prediction

For evaluating in-sample prediction of the regression tree, ASE was calculated.

#Training data prediction
Boston_train_pred_tree=predict(Boston_rpart2)

#ASE Average Sum of Squared Error
ase.tree2 <- mean((Boston_train$medv - Boston_train_pred_tree)^2) 
ase.tree2
## [1] 19.14203

MSPE for out-of-sample prediction

For evaluating out-of-sample prediction of the regression tree, MSPE is calculated.

#Testing data prediction
Boston_test_pred_tree = predict(Boston_rpart2, Boston_test)
#MSPE
mspe.tree2 <-mean((Boston_test$medv - Boston_test_pred_tree)^2) 
mspe.tree2
## [1] 24.3084

Both the ASE and the MSPE from the regression tree with the second split were larger than those with the first split.

The regression tree performed better with the first split, but the linear model performed better with the second split.

One cannot conclude that one method is better than the other, since out-of-sample performance changes with the randomness of the training/testing data split.

Cross Validation

Final Model with Entire Data

Summary of Linear Regression

The best linear regression model was fitted on the entire Boston data, and its summary is displayed below.

# Build regression model on full data set#
Boston.lm<- lm(medv~.-indus-age, data=Boston)
summary(Boston.lm)
## 
## Call:
## lm(formula = medv ~ . - indus - age, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5984  -2.7386  -0.5046   1.7273  26.2373 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.341145   5.067492   7.171 2.73e-12 ***
## crim         -0.108413   0.032779  -3.307 0.001010 ** 
## zn            0.045845   0.013523   3.390 0.000754 ***
## chas          2.718716   0.854240   3.183 0.001551 ** 
## nox         -17.376023   3.535243  -4.915 1.21e-06 ***
## rm            3.801579   0.406316   9.356  < 2e-16 ***
## dis          -1.492711   0.185731  -8.037 6.84e-15 ***
## rad           0.299608   0.063402   4.726 3.00e-06 ***
## tax          -0.011778   0.003372  -3.493 0.000521 ***
## ptratio      -0.946525   0.129066  -7.334 9.24e-13 ***
## black         0.009291   0.002674   3.475 0.000557 ***
## lstat        -0.522553   0.047424 -11.019  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7348 
## F-statistic: 128.2 on 11 and 494 DF,  p-value: < 2.2e-16

The Average Squared Error (ASE) is shown below.

Boston.ase <- mean((Boston.lm$residuals)^2)
Boston.ase 
## [1] 21.89993

Regression Tree

A regression tree was fitted on the entire Boston data.

# Regression Tree
##rpart
Boston.rpart <- rpart(formula = medv ~., data=Boston)
#prp(Boston.rpart, digits = 4, extra=1) 

The default complexity parameter (cp) is 0.01, and the plotcp() function was used to identify an appropriate cp value for the best tree. The suggested cp value was 0.021; the tree was pruned using this value, and the best tree had 7 terminal nodes.

##plotcp
plotcp(Boston.rpart)

##prune
Boston.rpart2<- prune(Boston.rpart, cp=0.021) 
prp(Boston.rpart2, digits = 4, extra=1) 

The ASE of the regression tree is calculated below.

#Data prediction
Boston.pred.tree=predict(Boston.rpart2)
#ASE Average Sum of Squared Error
Boston.ase.tree<-mean((Boston$medv - Boston.pred.tree)^2) 
Boston.ase.tree
## [1] 17.58282

With the entire Boston data, the in-sample performance of the regression tree was better than that of the linear model.

Cross Validation with Entire Data

5-fold cross validation

5-fold cross validation of the best linear regression model was performed. The average out-of-sample performance across the 5 folds was calculated below.

Boston.glm <- glm(medv ~ .-indus-age, data = Boston) 
# k=5, 5-fold cross validation
cv5 <- cv.glm(data = Boston, glmfit = Boston.glm, K = 5)$delta[2]
cv5
## [1] 23.34135

Leave-one-out cross validation

Furthermore, leave-one-out cross validation for the final linear regression model was performed, and the corresponding CV estimate was 23.51.

## Leave-one-out Cross Validation ##
n <- dim(Boston)[1] #sample size
cv <- cv.glm(data = Boston, glmfit = Boston.glm, K = n)$delta[2]
cv
## [1] 23.51161
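As a side note, cv.glm() sets K equal to the number of observations by default, so the same leave-one-out result can be obtained without specifying K:

# default K = n, i.e. leave-one-out cross validation
cv.loocv <- cv.glm(data = Boston, glmfit = Boston.glm)$delta[2]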

Summary I

A multiple linear regression model and a regression tree were employed to predict the median housing value with 2 sets of randomly selected 80% training/20% testing data. The entire data set was also used for cross validation of the linear model. Table 3 displays the ASE and MSPE for comparison.

Table 3: Result Summary I
Random Seed (80%/20%)   Method (best)     ASE (in-sample)   MSPE (out-of-sample)   5-fold CV
12462210                Linear Model      21.71             23.16
12462210                Regression Tree   17.54             17.47
1226421                 Linear Model      21.71             17.61
1226421                 Regression Tree   19.14             24.31
Full Data               Linear Model      21.90                                    23.34
Full Data               Regression Tree   17.58

In this experiment with 2 different random splits, the out-of-sample prediction from the regression tree was better than that of the linear model in the first case, whereas in the second case the linear model's out-of-sample performance was better than the regression tree's. Based on the MSPE values from the two testing data sets, one cannot conclude that one method is superior to the other for predicting housing price, because the MSPE depends heavily on the randomness of the training/testing split. Since the CV score averages out-of-sample performance across 5 folds, it is a more stable estimate than the MSPE obtained from any single testing set.

k-Nearest Neighbors

Data Scaling

Our first training/testing data split was used for the KNN method.

seed1<-12462210
set.seed(seed1)
sample_index <- sample(nrow(Boston),nrow(Boston)*0.80)
Boston_train <- Boston[sample_index,]
Boston_test <- Boston[-sample_index,]

Since KNN uses a distance measure, all predictors need to be standardized.

First of all, standardized training and testing data sets were initialized.

train.norm <- Boston_train
test.norm <- Boston_test

All predictors in the training data were min-max normalized so that each ranges from zero to one. The testing data were then normalized using the scaling parameters (minimum and maximum) from the training data.

p <- dim(Boston)[2]-1  #number of total predictors=13
cols <- colnames(train.norm[, -(p+1)]) #exclude response variable 
for (j in cols) {
  train.norm[[j]] <- (train.norm[[j]] - min(Boston_train[[j]])) / (max(Boston_train[[j]]) - min(Boston_train[[j]]))
  test.norm[[j]] <- (test.norm[[j]] - min(Boston_train[[j]])) / (max(Boston_train[[j]]) - min(Boston_train[[j]]))
}

Summary Statistics - scaled training data

Below are summary statistics for rm and lstat in the scaled training data.
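These summaries can be reproduced directly from the scaled training data, e.g.:

# summary statistics of the two scaled predictors
summary(train.norm$rm)
summary(train.norm$lstat)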

rm: average number of rooms

Min. 1st Qu. Median Mean 3rd Qu. Max.
0 0.4458 0.5074 0.5245 0.5881 1

lstat: lower status of the population

Min. 1st Qu. Median Mean 3rd Qu. Max.
0 0.1475 0.2774 0.3055 0.4333 1

Summary Statistics - scaled testing data

Below are summary statistics for rm and lstat in the scaled testing data. Because the training minimum and maximum were used for scaling, the scaled testing values can fall slightly outside [0, 1].

rm - Average number of rooms:

Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1106 0.4426 0.5034 0.5114 0.5724 0.8638

lstat - Lower status of the population:

Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.005419 0.1375 0.2486 0.3087 0.4155 1.028

KNN using k=5

KNN with scaled training data

k-Nearest Neighbors was conducted with the scaled training data using tuning parameter k = 5.

Boston.knn.reg2 <- knn.reg(train = train.norm[, 1:p], 
                           test = train.norm[, 1:p], 
                           y = train.norm$medv, 
                           k = 5) 
# compile the actual and predicted values and view the first 6 records
Boston.results2 <- data.frame(cbind(pred = Boston.knn.reg2$pred, actual = Boston_train$medv))
head(Boston.results2)

For evaluating in-sample prediction, the average squared error (ASE) for the training data was calculated.

ASE.knn5 <- sum((Boston.results2$actual-Boston.results2$pred)^2)/length(Boston.results2$actual)
ASE.knn5 
## [1] 15.57555

KNN with scaled testing data

k-Nearest Neighbors predictions were then obtained for the scaled testing data using k = 5.

Boston.knn.reg <- knn.reg(train = train.norm[, 1:p], 
                          test = test.norm[, 1:p], 
                          y = train.norm$medv, 
                          k = 5)
# compile the actual and predicted values 
Boston.results <- data.frame(cbind(predicted_medv = Boston.knn.reg$pred, actual_medv = Boston_test$medv))

The predicted and actual medv values for the testing data have been compiled in Boston.results, and the first 10 records are shown below.

head(Boston.results, 10)
##    predicted_medv actual_medv
## 1           25.66        21.6
## 2           29.46        36.2
## 3           20.08        18.9
## 4           21.56        21.7
## 5           18.02        18.2
## 6           17.60        17.5
## 7           14.04        14.5
## 8           14.14        13.9
## 9           13.54        13.2
## 10          20.28        20.0

For evaluating out-of-sample prediction, the mean squared prediction error (MSPE) is calculated.

MSPE.knn5 <- sum((Boston.results$actual_medv-Boston.results$predicted_medv)^2)/length(Boston.results$actual_medv)
MSPE.knn5 
## [1] 20.85205

The RMSPE is calculated as well.

# calculate RMSE (root mean (average) sum squared (prediction) errors)
RMSE.knn5 <- sqrt(MSPE.knn5) 
RMSE.knn5
## [1] 4.566404

The in-sample prediction error of KNN with k = 5 was smaller than that of the linear model and the regression tree. However, its out-of-sample prediction error was not as good as that of the regression tree.

KNN using Optimal K

Choosing Optimal Tuning Parameter

First of all, the scaled training data was randomly split into 80% training and 20% validation data.

seed1<-12462210
set.seed(seed1)
#sample_index <- sample(nrow(Boston),nrow(Boston)*0.80)
#Boston_train <- Boston[sample_index,]
#Boston_test <- Boston[-sample_index,]
sample_index2 <- sample(nrow(Boston_train),nrow(Boston_train)*0.80)
train2.norm <- train.norm[sample_index2,]
valid.norm <- train.norm[-sample_index2,]

A data frame with two columns (k and RMSE) was initialized.

RMSE.df <- data.frame(k = seq(1, 30, 1), RMSE.k = rep(0, 30))
# compute knn for different k on validation set
for (i in 1:30) {
  knn.reg.pred <- knn.reg(train = train2.norm[, c(1:p)], test = valid.norm[, c(1:p)], 
                          y = train2.norm$medv, k = i)
  RMSE.df[i, 2] <- sqrt(sum((valid.norm$medv-knn.reg.pred$pred)^2)/length(valid.norm$medv))
}
RMSE.df
##     k   RMSE.k
## 1   1 4.050621
## 2   2 4.161489
## 3   3 4.523736
## 4   4 4.952652
## 5   5 5.165888
## 6   6 5.371910
## 7   7 5.689382
## 8   8 5.911334
## 9   9 5.985068
## 10 10 5.965849
## 11 11 5.993370
## 12 12 6.131698
## 13 13 6.227074
## 14 14 6.238029
## 15 15 6.156573
## 16 16 6.193976
## 17 17 6.310006
## 18 18 6.372106
## 19 19 6.400128
## 20 20 6.521678
## 21 21 6.545490
## 22 22 6.606264
## 23 23 6.654130
## 24 24 6.673833
## 25 25 6.713152
## 26 26 6.735020
## 27 27 6.790043
## 28 28 6.758015
## 29 29 6.831986
## 30 30 6.888424
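The validation RMSE as a function of k can also be visualized; a quick sketch (plot not shown in the report):

# validation RMSE versus number of neighbors k
plot(RMSE.df$k, RMSE.df$RMSE.k, type = "b",
     xlab = "k (number of neighbors)", ylab = "Validation RMSE")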

The optimal tuning parameter was chosen as the k with the smallest validation RMSE.

best_k <- which(RMSE.df[,2] == min(RMSE.df[,2]))
best_k
## [1] 1

KNN with scaled training data

k-Nearest Neighbors was conducted with the scaled training data using the optimal tuning parameter.

Boston.knn.reg2 <- knn.reg(train = train.norm[, 1:p], 
                           test = train.norm[, 1:p], 
                           y = train.norm$medv, 
                           k = best_k) 
# compile the actual and predicted values and view the first 6 records
Boston.results2 <- data.frame(cbind(pred = Boston.knn.reg2$pred, actual = Boston_train$medv))
head(Boston.results2)

For evaluating in-sample prediction, the average squared error (ASE) for the training data was calculated. Note that with k = 1, each training observation is its own nearest neighbor, so the in-sample ASE is exactly 0.

ASE.knn <- sum((Boston.results2$actual-Boston.results2$pred)^2)/length(Boston.results2$actual)
ASE.knn 
## [1] 0

KNN with scaled testing data

k-Nearest Neighbors predictions were then obtained for the scaled testing data using the optimal tuning parameter.

Boston.knn.reg <- knn.reg(train = train.norm[, 1:p], 
                          test = test.norm[, 1:p], 
                          y = train.norm$medv, 
                          k = best_k)
# compile the actual and predicted values 
Boston.results <- data.frame(cbind(predicted_medv = Boston.knn.reg$pred, actual_medv = Boston_test$medv))

The predicted and actual medv values for the testing data have been compiled in Boston.results, and the first 10 records are shown below.

head(Boston.results, 10)
##    predicted_medv actual_medv
## 1            22.0        21.6
## 2            33.4        36.2
## 3            18.9        18.9
## 4            22.9        21.7
## 5            21.0        18.2
## 6            19.6        17.5
## 7            13.6        14.5
## 8            14.8        13.9
## 9            13.5        13.2
## 10           18.9        20.0

For evaluating out-of-sample prediction, the mean squared prediction error (MSPE) and the RMSPE are calculated.

MSPE.knn <- sum((Boston.results$actual_medv-Boston.results$predicted_medv)^2)/length(Boston.results$actual_medv)
MSPE.knn 
## [1] 19.43706
# calculate RMSE (root mean (average) sum squared (prediction) errors)
RMSE.knn <- sqrt(MSPE.knn) 
RMSE.knn
## [1] 4.408748

KNN with the optimal k was better than KNN with k = 5 in terms of both ASE and MSPE (though the ASE of 0 is an artifact of the k = 1 fit reproducing the training data).

Advanced Trees

Bagging, random forests, and boosting are ensemble methods: each produces multiple trees from the training data and aggregates them to yield a single consensus prediction.

For exploring these methods, our first training/testing data split was used.

seed1<-12462210
set.seed(seed1)
sample_index <- sample(nrow(Boston),nrow(Boston)*0.80)
Boston_train <- Boston[sample_index,]
Boston_test <- Boston[-sample_index,]

Bagging

Bootstrap aggregation, or bagging, is a procedure for reducing variance, and it typically improves accuracy over the prediction from a single tree. A bagging model was fit on the training data using 100 bootstrap samples.

boston_bag<- bagging(formula = medv~., 
                     data = Boston_train, 
                     nbagg=100)  
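Since bagging is the special case of a random forest that uses all p predictors at every split, an equivalent fit could also be obtained with randomForest; a sketch (not run in the report):

# bagging via randomForest: mtry = p, so all 13 predictors are considered at each split
p_bag <- ncol(Boston_train) - 1
boston_bag_rf <- randomForest(medv ~ ., data = Boston_train,
                              mtry = p_bag, ntree = 100)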

In-sample prediction was calculated below.

#prediction on Training data: ASE
boston_train_pred_bag <-predict(boston_bag, newdata=Boston_train) 
ase.bag.train <- mean((Boston_train$medv - boston_train_pred_bag)^2)
ase.bag.train
## [1] 11.844

Out-of-sample prediction was calculated below.

#prediction on the Testing data: MSPE
boston_test_pred_bag<- predict(boston_bag, newdata = Boston_test)
mspe.bag.test <-mean((Boston_test$medv-boston_test_pred_bag)^2)
mspe.bag.test
## [1] 12.73231

Compared with the out-of-sample prediction of a single tree (MSPE = 17.47), the MSPE of bagging was much smaller.

Random forest

Random forest is an extension of bagging that provides a significant improvement in prediction by de-correlating the trees: at each split, a random sample of m predictors is chosen from the full set of p predictors.
For regression, the default m is p/3 and the default ntree is 500. Using these default arguments, a random forest was fitted on the training data.

boston_rf<- randomForest(medv~., data = Boston_train, importance=TRUE)
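For reference, the defaults described above can also be written out explicitly; this sketch is intended to be equivalent to the call used here:

# m = floor(p/3) = 4 predictors tried at each split, 500 trees
boston_rf_explicit <- randomForest(medv ~ ., data = Boston_train,
                                   mtry = floor((ncol(Boston_train) - 1) / 3),
                                   ntree = 500, importance = TRUE)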

Two measures of variable importance were reported below.

boston_rf$importance 
##            %IncMSE IncNodePurity
## crim     8.2254602     1782.7139
## zn       0.7660977      191.9403
## indus    6.8020431     2223.1505
## chas     0.5397121      179.0061
## nox     10.7521564     2138.3738
## rm      34.7849590    10196.8060
## age      3.5685721      940.1232
## dis      7.7939942     2211.2383
## rad      1.2048911      274.3684
## tax      3.7968963     1143.6186
## ptratio  6.8755315     2106.0582
## black    1.6949790      708.3000
## lstat   58.4364956     9377.5200

The results show that lstat (lower status of the population) and rm (average number of rooms) were the two most important variables.

The in-sample average squared error (ASE) was computed below.

#prediction on Training data: ASE
boston_train_pred_rf<-predict(boston_rf, newdata=Boston_train)
ase.rf.train <- mean((Boston_train$medv - boston_train_pred_rf)^2)
ase.rf.train
## [1] 2.220531

Out-of-sample mean squared prediction error (MSPE) was computed as well.

#prediction on the Testing data: MSPE
boston_test_pred_rf <- predict(boston_rf, newdata =Boston_test)
mspe.rf.test <- mean((Boston_test$medv-boston_test_pred_rf)^2) 
mspe.rf.test
## [1] 8.289498
var(Boston_test$medv)
## [1] 82.96266

Both the in-sample and out-of-sample prediction errors were much smaller than those from bagging; in other words, the random forest yielded better predictions than bagging.

Boosting

Boosting does not involve bootstrap sampling, and the trees are grown sequentially. Each tree is grown using information from previously grown trees.

There are many R packages that implement boosting, including mboost, xgboost, catboost, and lightgbm.

For this project, the gbm package was used.

A boosting model was fitted on the training data using default arguments (n.trees = 100, shrinkage = 0.1, interaction.depth = 1), and the relative influence of each variable was displayed.

boston_boost<- gbm(formula = medv~., 
                   data = Boston_train, 
                   distribution = "gaussian", 
                   n.trees = 100, 
                   shrinkage = 0.1, 
                   interaction.depth = 1)
summary(boston_boost)

##             var    rel.inf
## lstat     lstat 47.7886373
## rm           rm 36.2647742
## nox         nox  4.1088949
## crim       crim  3.2205830
## dis         dis  3.1854185
## ptratio ptratio  1.7958082
## chas       chas  1.2385286
## tax         tax  1.0799906
## age         age  0.6824800
## rad         rad  0.2484347
## black     black  0.2098398
## indus     indus  0.1766101
## zn           zn  0.0000000

Based on relative influence, lstat (lower status of the population) and rm (average number of rooms) were the two most important variables, consistent with the findings from the random forest.

In-sample prediction was obtained.

#prediction on Training data: ASE
boston_boost_pred.train <- predict(boston_boost, newdata=Boston_train,  n.trees = 100)
ase.boost.train <- mean((Boston_train$medv-boston_boost_pred.train)^2)
ase.boost.train
## [1] 11.79631

Out-of-sample prediction was obtained as well.

#prediction on the Testing data: MSPE
boston_boost_pred.test<- predict(boston_boost, newdata=Boston_test, n.trees = 100)
mspe.boost.test <- mean((Boston_test$medv-boston_boost_pred.test)^2) 
mspe.boost.test 
## [1] 11.71237
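The number of trees was fixed at 100 above. As a possible refinement (a sketch only; this tuned model was not run for the report), gbm can choose the number of trees by cross validation:

# let gbm pick the number of trees via 5-fold cross validation
boston_boost_cv <- gbm(formula = medv ~ ., data = Boston_train,
                       distribution = "gaussian", n.trees = 1000,
                       shrinkage = 0.1, interaction.depth = 1, cv.folds = 5)
best_ntrees <- gbm.perf(boston_boost_cv, method = "cv")  # CV-optimal tree count
boston_boost_pred.cv <- predict(boston_boost_cv, newdata = Boston_test,
                                n.trees = best_ntrees)
mean((Boston_test$medv - boston_boost_pred.cv)^2)       # MSPE with tuned n.trees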

Based on out-of-sample prediction performance, the random forest was superior to boosting and bagging for predicting house price.

GAM

A generalized additive model (GAM) was fit to the training data using smoothing splines. A Gaussian distribution and identity link were assumed by default. Since chas and rad are qualitative variables, they were included in the model as is, without smoothing splines.

seed1<-12462210
set.seed(seed1)
sample_index <- sample(nrow(Boston),nrow(Boston)*0.80)
Boston_train <- Boston[sample_index,]
Boston_test <- Boston[-sample_index,]
#create GAM model
Boston_gam0 <- gam(medv ~ s(crim)+s(zn)+s(indus)+chas+s(nox)
                  +s(rm)+ s(age) +s(dis)+rad+s(tax)+s(ptratio)
                  +s(black) +s(lstat),data=Boston_train)

Below is the summary of the GAM fit.

summary(Boston_gam0)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## medv ~ s(crim) + s(zn) + s(indus) + chas + s(nox) + s(rm) + s(age) + 
##     s(dis) + rad + s(tax) + s(ptratio) + s(black) + s(lstat)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.1987     1.2681  15.140   <2e-16 ***
## chas          1.2116     0.6876   1.762   0.0789 .  
## rad           0.3300     0.1304   2.530   0.0118 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##              edf Ref.df      F  p-value    
## s(crim)    4.457  5.455  7.443 1.00e-06 ***
## s(zn)      1.000  1.000  0.577 0.448041    
## s(indus)   7.921  8.624  3.488 0.000515 ***
## s(nox)     8.991  8.999 12.769  < 2e-16 ***
## s(rm)      7.750  8.590 22.965  < 2e-16 ***
## s(age)     1.000  1.000  0.054 0.815790    
## s(dis)     8.786  8.983  7.914  < 2e-16 ***
## s(tax)     3.065  3.721  6.330 0.000136 ***
## s(ptratio) 1.296  1.530 19.349 2.12e-06 ***
## s(black)   1.000  1.000  2.133 0.145065    
## s(lstat)   5.407  6.583 23.091  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.881   Deviance explained = 89.7%
## GCV = 11.696  Scale est. = 10.142    n = 404

The summary of the initial GAM clearly shows that chas and rad were entered linearly. For smooth terms with an effective degrees of freedom (edf) close to 1, we can say the relationship with the response variable medv is essentially linear. Since the edf values of s(zn), s(age), and s(black) are 1, those variables are suitable to be entered linearly.

Partial plots are shown below. They clearly show that the variables zn, age, and black have a linear relationship with medv, while the rest have a non-linear relationship with medv.

plot(Boston_gam0, shade= TRUE, seWithMean=TRUE,  pages=1)

#par(ask=F)
#plot(Boston_gam0, shade= TRUE, seWithMean=TRUE)

As an example of a non-linear term, s(rm) has an extremely small p-value (< 0.0001), which indicates that the variable rm should be included in the model. Moreover, given its edf of 7.75 along with its partial plot, rm clearly needs to be entered non-linearly.

In contrast, zn has an edf of 1, which, together with its partial plot, indicates a linear relationship with medv. However, we fail to reject the null hypothesis \(H_{0}: s(zn)=0\) given the large p-value (0.448), so zn is not a significant term.

Final GAM model

I refit the GAM as shown below, with the variables zn, age, and black entered as linear terms.

#model 2 - removing s() from functions which are linear
Boston_gam <- gam(medv ~ s(crim)+ zn +s(indus)+chas+s(nox)
                  +s(rm)+ age+s(dis)+rad+s(tax)+s(ptratio)
                  + black +s(lstat),data=Boston_train)

summary(Boston_gam)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## medv ~ s(crim) + zn + s(indus) + chas + s(nox) + s(rm) + age + 
##     s(dis) + rad + s(tax) + s(ptratio) + black + s(lstat)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17.675548   1.727483  10.232   <2e-16 ***
## zn           0.012305   0.016201   0.760   0.4480    
## chas         1.211626   0.687637   1.762   0.0789 .  
## age          0.002893   0.012408   0.233   0.8158    
## rad          0.330045   0.130450   2.530   0.0118 *  
## black        0.003322   0.002274   1.460   0.1451    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##              edf Ref.df      F  p-value    
## s(crim)    4.457  5.455  7.443 1.00e-06 ***
## s(indus)   7.921  8.624  3.488 0.000516 ***
## s(nox)     8.991  8.999 12.769  < 2e-16 ***
## s(rm)      7.750  8.590 22.965  < 2e-16 ***
## s(dis)     8.786  8.983  7.914  < 2e-16 ***
## s(tax)     3.065  3.721  6.330 0.000136 ***
## s(ptratio) 1.296  1.531 19.348 2.12e-06 ***
## s(lstat)   5.407  6.583 23.092  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.881   Deviance explained = 89.7%
## GCV = 11.696  Scale est. = 10.142    n = 404

The formula of my final GAM model is

\(medv = \beta_{0} + \beta_{1}\,zn + \beta_{2}\,chas + \beta_{3}\,age + \beta_{4}\,rad + \beta_{5}\,black + s_{1}(crim) + s_{2}(indus) + s_{3}(nox) + s_{4}(rm) + s_{5}(dis) + s_{6}(tax) + s_{7}(ptratio) + s_{8}(lstat) + \varepsilon\)

where \(\varepsilon \sim {\sf Norm}(0, \sigma^{2})\); that is, five linear terms plus eight smooth terms \(s_{j}(\cdot)\).
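As a quick check on whether the simplification costs anything, the initial and final fits could be compared directly; a sketch (not run in the report):

# compare the full-smooth and simplified GAM fits
AIC(Boston_gam0, Boston_gam)
anova(Boston_gam0, Boston_gam, test = "F")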

In-sample ASE was computed below.

#In-sample ASE
pi <- predict(Boston_gam, newdata=Boston_train)
Boston_gam.ase.train <- mean((pi - Boston_train$medv)^2)
Boston_gam.ase.train
## [1] 8.794778

The out-of-sample MSPE was calculated as well.

#out of sample performance: MSPE
pi.out <- predict(Boston_gam,newdata=Boston_test)
Boston_gam.mspe.test <- mean((pi.out - Boston_test$medv)^2)
Boston_gam.mspe.test 
## [1] 11.61164

Neural Networks

Using the same seed as before (seed1), 80% of the original data was randomly chosen as the training data, and the remaining 20% as the testing data.

seed1<-12462210
set.seed(seed1)
sample_index <- sample(nrow(Boston),nrow(Boston)*0.80)
Boston_train <- Boston[sample_index,]
Boston_test <- Boston[-sample_index,]
n <- dim(Boston)[1]   # sample size
p <- dim(Boston)[2]-1 # number of predictors

With Unscaled Data

A neural network model was fit to the unscaled 80% training data using neuralnet(). Two hidden layers were used, with 4 hidden nodes in the first layer and 2 in the second.

f <- as.formula("medv ~ .")
nn <- neuralnet(f,data=Boston_train, hidden=c(4,2), linear.output=T)
summary(nn)
##                     Length Class      Mode    
## call                   5   -none-     call    
## response             404   -none-     numeric 
## covariate           5252   -none-     numeric 
## model.list             2   -none-     list    
## err.fct                1   -none-     function
## act.fct                1   -none-     function
## linear.output          1   -none-     logical 
## data                  14   data.frame list    
## exclude                0   -none-     NULL    
## net.result             1   -none-     list    
## weights                1   -none-     list    
## generalized.weights    1   -none-     list    
## startweights           1   -none-     list    
## result.matrix         72   -none-     numeric

In-sample ASE was computed below.

#predicted value (yhat) of training data 
pr_nn_train <- compute(nn, Boston_train[,1:p]) #original scale

## ASE of training set ##
ASE_nn <- sum((Boston_train$medv - pr_nn_train$net.result)^2)/nrow(Boston_train)
ASE_nn
## [1] 84.97673

The out-of-sample MSPE was calculated as well.

#### MSPE of the above neural network model for Testing data####
#predicted value (yhat) of testing data 
pr_nn_test <- compute(nn, Boston_test[,1:p]) #original scale

## MSPE of testing set ##
MSPE_nn <- sum((Boston_test$medv - pr_nn_test$net.result)^2)/nrow(Boston_test)
MSPE_nn
## [1] 82.22876

With Scaled Data

Both the predictors (X) and the response variable (Y) were scaled to the [0, 1] range.

## initialize standardized training and testing data ##
train.norm <- Boston_train
test.norm <- Boston_test
maxs<- apply(Boston_train, 2, max)
mins<- apply(Boston_train, 2, min)

train.norm <- as.data.frame(scale(Boston_train, center = mins, scale = maxs - mins))
test.norm <- as.data.frame(scale(Boston_test, center = mins, scale = maxs - mins))

The neural network was fit with the scaled data for computational stability. As above, 4 hidden nodes were specified for the first layer and 2 for the second.

f <- as.formula("medv ~ .")
nn <- neuralnet(f, data=train.norm, hidden=c(4,2), linear.output=T)
summary(nn)
##                     Length Class      Mode    
## call                   5   -none-     call    
## response             404   -none-     numeric 
## covariate           5252   -none-     numeric 
## model.list             2   -none-     list    
## err.fct                1   -none-     function
## act.fct                1   -none-     function
## linear.output          1   -none-     logical 
## data                  14   data.frame list    
## exclude                0   -none-     NULL    
## net.result             1   -none-     list    
## weights                1   -none-     list    
## generalized.weights    1   -none-     list    
## startweights           1   -none-     list    
## result.matrix         72   -none-     numeric

The neural network diagram for the fitted model was created.

plot(nn, rep="best")

In-sample ASE was computed below.

#predicted value (yhat) of scaled training data 
pr_nn_train <- compute(nn, train.norm[,1:p]) #medv in standardized scale [0,1]


#neural network predicted value of testing data at original scale 
pr_nn_train_org <- pr_nn_train$net.result*(max(Boston_train$medv)-min(Boston_train$medv))+min(Boston_train$medv)

## ASE of training set ##
ASE_nn2 <- sum((Boston_train$medv - pr_nn_train_org)^2)/nrow(Boston_train)
ASE_nn2
## [1] 5.694193

The out-of-sample MSPE was calculated as well.

#predicted value (yhat) of scaled testing data 
pr_nn_test <- compute(nn, test.norm[,1:p])  #medv in standardized scale [0,1]


#neural network predicted value of testing data at original scale 
pr_nn_test_org <- pr_nn_test$net.result*(max(Boston_train$medv)-min(Boston_train$medv))+min(Boston_train$medv)

## MSPE of testing set ##
MSPE_nn2 <- sum((Boston_test$medv - pr_nn_test_org)^2)/nrow(Boston_test)
MSPE_nn2  
## [1] 10.61933

Summary II

A multiple linear regression model, a regression tree, k-nearest neighbors, random forest, boosting, GAM, and neural networks were employed to predict the median housing value. Table 4 displays the ASE and MSPE for comparison.
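Since every error metric in the table already exists as an object created earlier, the table could also be assembled programmatically; a sketch using those objects (the report's table was compiled by hand):

results <- data.frame(
  Method = c("Linear Model", "Regression Tree", "k-NN with k=5", "k-NN with optimal k",
             "Random Forests", "Boosting", "GAM",
             "Neural networks (unscaled)", "Neural networks (scaled)"),
  ASE  = c(ase.train, ase.tree, ASE.knn5, ASE.knn, ase.rf.train,
           ase.boost.train, Boston_gam.ase.train, ASE_nn, ASE_nn2),
  MSPE = c(mspe.lm, mspe.tree, MSPE.knn5, MSPE.knn, mspe.rf.test,
           mspe.boost.test, Boston_gam.mspe.test, MSPE_nn, MSPE_nn2))
knitr::kable(results, digits = 2, caption = "Result Summary II")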

Table 4: Result Summary II (random seed 12462210, 80%/20% split)
Method (best)                ASE (in-sample)   MSPE (out-of-sample)
Linear Model                 21.71             23.16
Regression Tree              17.54             17.47
k-NN with k=5                15.58             20.85
k-NN with optimal k          0                 19.44
Random Forests               2.22              8.29
Boosting                     11.80             11.71
GAM                          8.79              11.61
Neural networks (unscaled)   84.98             82.23
Neural networks (scaled)     5.69              10.62

The prediction performance of random forest, boosting, the generalized additive model (GAM), and neural networks improved substantially compared to conventional linear regression, regression tree, and KNN, in terms of both in-sample and out-of-sample performance. GAM can capture non-linear relationships that standard linear regression misses. The neural network with scaled data performed much better than the one with unscaled data.

Interestingly, the random forest yielded the smallest MSPE (and a very small ASE), outperforming boosting, GAM, and neural networks in predicting the median housing price (medv). The out-of-sample prediction performance (MSPE) of GAM and boosting was quite similar.

If interpretation is not important and only prediction performance matters to the stakeholders, I would recommend either the random forest or the neural network. However, if both interpretation and prediction performance are important to the stakeholders, I would recommend the regression tree.