The Boston Housing dataset contains information on median housing values in the suburbs of Boston, Massachusetts.
The dataset is often used in regression analysis and is available in the MASS package in R.
Objective: predict or interpret the median price.
For modeling purposes, this is a supervised learning problem. Regression is used because the median price is a continuous variable.
Rows: 506 Columns: 14
Output variable (Y): medv. Predictors (X): crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat (13 in total).
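As a quick check of the dimensions and variable names listed above, the data can be loaded from MASS. This is a minimal sketch; it assumes only that the MASS package is installed.

```r
library(MASS)  # provides the Boston data frame

# Dimensions: rows are suburbs, columns are the 13 predictors plus medv
dim(Boston)

# Predictor names are every column except the output variable medv
predictors <- setdiff(names(Boston), "medv")
predictors
length(predictors)
```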
Three variables have a high correlation with medv: 1) lstat 2) rm 3) ptratio
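The ranking of predictors by correlation with medv can be sketched as follows. The use of Pearson correlation on the full dataset is an assumption, since the report does not state how the correlations were computed.

```r
library(MASS)  # provides the Boston data frame

# Pearson correlation of every predictor with medv (assumed method)
cor_with_medv <- cor(Boston)[, "medv"]
cor_with_medv <- cor_with_medv[names(cor_with_medv) != "medv"]

# Rank by absolute correlation and keep the three strongest
top3 <- sort(abs(cor_with_medv), decreasing = TRUE)[1:3]
names(top3)

# Signed values show the direction of each relationship
cor_with_medv[names(top3)]
```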
Looking at the chart above, medv and rm are positively correlated.
Looking at the chart above, medv and lstat are negatively correlated.
Since this is a categorical variable, the dots are showing up in discrete segments.
To assess which model performs best, each is evaluated on its ASE (average squared error on the training set) and MSPE (mean squared prediction error on the held-out test set).
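The two metrics can be sketched as below. The 80/20 split, the seed, and the helper name `mean_sq_error` are assumptions made for illustration; the report only shows that the training set has 404 rows. ASE is taken here as the plain mean of squared residuals, although some conventions divide by the residual degrees of freedom instead.

```r
library(MASS)  # provides the Boston data frame

set.seed(1)  # assumed seed; the original split is not given
train_idx <- sample(nrow(Boston), floor(0.8 * nrow(Boston)))
Boston_train <- Boston[train_idx, ]   # 404 rows
Boston_test  <- Boston[-train_idx, ]  # 102 rows

# Hypothetical helper: mean squared difference between observed and predicted
mean_sq_error <- function(actual, predicted) mean((actual - predicted)^2)

# Baseline that always predicts the training mean of medv
baseline <- mean(Boston_train$medv)
ase_baseline  <- mean_sq_error(Boston_train$medv, baseline)  # in-sample
mspe_baseline <- mean_sq_error(Boston_test$medv, baseline)   # out-of-sample
c(ASE = ase_baseline, MSPE = mspe_baseline)
```

Any model worth keeping should beat this baseline on both numbers.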
The best regression model returns an adjusted R-squared of 0.74, meaning the predictors explain about 74% of the variance in the output variable after adjusting for the number of predictors. For linear regression, the ASE is 22.97 and the MSPE is 21.44.
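A linear regression fit and its two error metrics might look like the sketch below. It uses all 13 predictors, which is an assumption: the report refers to a "best" regression model without saying which variables were kept. The split and seed are also assumed.

```r
library(MASS)  # provides the Boston data frame

set.seed(1)  # assumed seed; the original split is not given
train_idx <- sample(nrow(Boston), floor(0.8 * nrow(Boston)))
Boston_train <- Boston[train_idx, ]
Boston_test  <- Boston[-train_idx, ]

# Full model with every predictor (assumed; no variable selection applied)
fit_lm <- lm(medv ~ ., data = Boston_train)
adj_r2 <- summary(fit_lm)$adj.r.squared

# In-sample and out-of-sample squared error
ase_lm  <- mean(residuals(fit_lm)^2)
mspe_lm <- mean((Boston_test$medv - predict(fit_lm, newdata = Boston_test))^2)
c(adj_r2 = adj_r2, ASE = ase_lm, MSPE = mspe_lm)
```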
## [1] 22.43255
For the regression tree, pruning was carried out at a tree size of 13.
An ASE of 11.64 and an MSPE of 22.43 were calculated. Compared with the linear regression model, the in-sample error improved, but the test-set error did not. Therefore, we cannot conclude that this model performs better than linear regression.
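Growing and pruning a regression tree can be sketched with rpart as below. Choosing the complexity parameter by minimum cross-validated error is an assumption, as are the initial `cp`, the seed, and the split; the report states only the resulting tree size.

```r
library(MASS)   # provides the Boston data frame
library(rpart)  # recursive partitioning trees

set.seed(1)  # assumed seed; also fixes rpart's internal cross-validation
train_idx <- sample(nrow(Boston), floor(0.8 * nrow(Boston)))
Boston_train <- Boston[train_idx, ]
Boston_test  <- Boston[-train_idx, ]

# Grow a deliberately large tree, then prune it back
fit_tree <- rpart(medv ~ ., data = Boston_train, method = "anova",
                  control = rpart.control(cp = 0.001))

# Pick the cp value with the lowest cross-validated error (assumed rule)
cp_table <- fit_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned_tree <- prune(fit_tree, cp = best_cp)

# Tree size counted as the number of terminal nodes
n_leaves <- sum(pruned_tree$frame$var == "<leaf>")

ase_tree  <- mean((Boston_train$medv - predict(pruned_tree, newdata = Boston_train))^2)
mspe_tree <- mean((Boston_test$medv - predict(pruned_tree, newdata = Boston_test))^2)
c(leaves = n_leaves, ASE = ase_tree, MSPE = mspe_tree)
```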
For K-NN, the optimal K was found to be 3.
The training set was scaled so that Euclidean distances between observations are not dominated by predictors measured on larger scales.
An ASE of 18.96 and an MSPE of 13.46 were calculated for the best K-NN model.
This out-of-sample performance is better than that of either previous model.
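The scaling step and the K-NN prediction can be illustrated in base R. The helper `knn_predict` is hypothetical and written here only for illustration; the report does not say which K-NN implementation was used. The split and seed are assumed as well.

```r
library(MASS)  # provides the Boston data frame

set.seed(1)  # assumed seed; the original split is not given
train_idx <- sample(nrow(Boston), floor(0.8 * nrow(Boston)))
Boston_train <- Boston[train_idx, ]
Boston_test  <- Boston[-train_idx, ]

predictors <- setdiff(names(Boston), "medv")

# Scale both sets with the training means and standard deviations
center <- colMeans(Boston_train[, predictors])
spread <- apply(Boston_train[, predictors], 2, sd)
X_train <- scale(Boston_train[, predictors], center = center, scale = spread)
X_test  <- scale(Boston_test[, predictors],  center = center, scale = spread)

# Hypothetical helper: average the response of the k nearest reference rows
knn_predict <- function(X_ref, y_ref, X_new, k) {
  apply(X_new, 1, function(x) {
    dists <- sqrt(colSums((t(X_ref) - x)^2))  # Euclidean distance to each row
    mean(y_ref[order(dists)[seq_len(k)]])
  })
}

pred_test <- knn_predict(X_train, Boston_train$medv, X_test, k = 3)
mspe_knn  <- mean((Boston_test$medv - pred_test)^2)
mspe_knn
```

Scaling the test set with the training statistics keeps information from the test set out of the fitting step.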
##
## Call:
## randomForest(formula = medv ~ ., data = Boston_train, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 11.33083
## % Var explained: 87.06
##            %IncMSE IncNodePurity
## crim     9.1130967     2129.2599
## zn       0.8856469      292.8516
## indus    6.4980308     2091.3968
## chas     1.1551706      332.9708
## nox      8.8591184     2339.1907
## rm      32.5877659     9623.1322
## age      4.1893387     1004.3436
## dis      6.7268829     1903.9135
## rad      1.0833036      263.2989
## tax      3.6947355     1030.8962
## ptratio  7.3896075     2381.1642
## black    1.8620520      668.2100
## lstat   68.0677167    10900.9711
## [1] 2.196255
## [1] 7.803492
Predictor importance was assessed in this Random Forest model; it indicates how much each predictor contributes to explaining the output variable. For example, lstat has the highest %IncMSE (about 68): the out-of-bag prediction error increases the most when the values of lstat are randomly permuted, which highlights the importance of lstat in pricing.
For the best Random Forest model, we calculate an ASE of 2.20 and an MSPE of 7.8035.
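The Random Forest fit and its importance table can be sketched as follows. The model call mirrors the one shown in the output; the seed and split are assumptions, and the block is guarded so it does nothing when the randomForest package is not installed.

```r
library(MASS)  # provides the Boston data frame

set.seed(1)  # assumed seed; the original split is not given
train_idx <- sample(nrow(Boston), floor(0.8 * nrow(Boston)))
Boston_train <- Boston[train_idx, ]
Boston_test  <- Boston[-train_idx, ]

if (requireNamespace("randomForest", quietly = TRUE)) {
  fit_rf <- randomForest::randomForest(medv ~ ., data = Boston_train,
                                       importance = TRUE)

  # %IncMSE and IncNodePurity for every predictor
  print(randomForest::importance(fit_rf))

  # Out-of-bag MSE after the last tree, printed as "Mean of squared residuals"
  oob_mse <- tail(fit_rf$mse, 1)

  # In-sample error needs newdata; predict() without it returns OOB predictions
  ase_rf  <- mean((Boston_train$medv - predict(fit_rf, newdata = Boston_train))^2)
  mspe_rf <- mean((Boston_test$medv - predict(fit_rf, newdata = Boston_test))^2)
  print(c(OOB = oob_mse, ASE = ase_rf, MSPE = mspe_rf))
}
```

The in-sample ASE is typically far below the out-of-bag error, because each training row is predicted partly by trees that saw it during fitting.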
For Boosting, we set n.trees = 10000 and shrinkage = 0.01.
We get an ASE of 0.02 and an MSPE of 7.8049. An extremely low ASE is often indicative of over-fitting; since Boosting is generally prone to over-fitting, we focus on the MSPE value in this case.
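A boosted model with these two settings might be fit as in the sketch below. It assumes the gbm package, a Gaussian loss, and `interaction.depth = 4`; none of these is stated in the report, and the seed and split are assumed too. The block is guarded so it does nothing when gbm is not installed.

```r
library(MASS)  # provides the Boston data frame

set.seed(1)  # assumed seed; the original split is not given
train_idx <- sample(nrow(Boston), floor(0.8 * nrow(Boston)))
Boston_train <- Boston[train_idx, ]
Boston_test  <- Boston[-train_idx, ]

if (requireNamespace("gbm", quietly = TRUE)) {
  fit_gbm <- gbm::gbm(medv ~ ., data = Boston_train,
                      distribution = "gaussian",  # squared-error loss (assumed)
                      n.trees = 10000,
                      shrinkage = 0.01,
                      interaction.depth = 4)      # assumed tree depth

  pred_train <- predict(fit_gbm, newdata = Boston_train, n.trees = 10000)
  pred_test  <- predict(fit_gbm, newdata = Boston_test,  n.trees = 10000)

  ase_gbm  <- mean((Boston_train$medv - pred_train)^2)
  mspe_gbm <- mean((Boston_test$medv - pred_test)^2)
  print(c(ASE = ase_gbm, MSPE = mspe_gbm))
}
```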
##
## Family: gaussian
## Link function: identity
##
## Formula:
## medv ~ s(crim) + s(zn) + s(indus) + chas + s(nox) + s(rm) + s(age) +
## s(dis) + rad + s(tax) + s(ptratio) + s(black) + s(lstat)
##
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  19.3455     1.2705  15.227   <2e-16 ***
## chas          1.2978     0.7289   1.780   0.0759 .
## rad           0.2942     0.1291   2.279   0.0233 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
##              edf Ref.df      F  p-value
## s(crim)    5.991  7.052  5.786 2.53e-06 ***
## s(zn)      1.000  1.000  1.905  0.16839
## s(indus)   6.343  7.390  2.927  0.00473 **
## s(nox)     8.955  8.996 11.031  < 2e-16 ***
## s(rm)      6.577  7.731 22.933  < 2e-16 ***
## s(age)     1.000  1.000  1.073  0.30100
## s(dis)     8.697  8.967  5.998  < 2e-16 ***
## s(tax)     3.360  4.053  8.883 9.69e-07 ***
## s(ptratio) 1.000  1.000 24.974 1.38e-06 ***
## s(black)   5.284  6.350  1.422  0.20445
## s(lstat)   5.727  6.919 18.626  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.882 Deviance explained = 89.9%
## GCV = 12.028 Scale est. = 10.333 n = 404
## [1] 10.33338
## [1] 10.33346
## [1] 8.473513
The final GAM was fit with linear terms where applicable (chas and rad enter as parametric terms) and smooth terms for the remaining predictors.
An ASE of 10.33 and an MSPE of 8.47 were calculated in this case.
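The GAM can be refit with mgcv using the formula shown in the summary output. The split and seed below are assumptions, so the numbers will differ from those reported. The reported ASE of 10.33 coincides with the "Scale est." in the summary, which suggests it may have been computed with a degrees-of-freedom adjustment; the sketch shows both the plain mean of squared residuals and the scale estimate for comparison.

```r
library(MASS)  # provides the Boston data frame
library(mgcv)  # generalized additive models

set.seed(1)  # assumed seed; the original split is not given
train_idx <- sample(nrow(Boston), floor(0.8 * nrow(Boston)))
Boston_train <- Boston[train_idx, ]
Boston_test  <- Boston[-train_idx, ]

# chas and rad enter linearly; every other predictor gets a smooth term
fit_gam <- gam(medv ~ s(crim) + s(zn) + s(indus) + chas + s(nox) + s(rm) +
                 s(age) + s(dis) + rad + s(tax) + s(ptratio) + s(black) +
                 s(lstat),
               data = Boston_train)
summary(fit_gam)

ase_gam   <- mean(residuals(fit_gam)^2)  # plain in-sample mean squared error
scale_est <- fit_gam$sig2                # scale estimate shown in the summary
mspe_gam  <- mean((Boston_test$medv - predict(fit_gam, newdata = Boston_test))^2)
c(ASE = ase_gam, scale_est = scale_est, MSPE = mspe_gam)
```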
The best neural network model has 2 hidden layers.
An ASE of 5.75 and an MSPE of 10.61 were calculated.

Based on the calculated MSPE scores, Random Forest turns out to be the best model (7.8035), only marginally ahead of Boosting (7.8049).
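The comparison can be made explicit by collecting the MSPE values reported in this document and sorting them. The vector below only restates those reported numbers; nothing is refit.

```r
# MSPE values as reported for each model in this document
mspe <- c(
  "Linear regression" = 21.44,
  "Regression tree"   = 22.43,
  "K-NN"              = 13.46,
  "Random Forest"     = 7.8035,
  "Boosting"          = 7.8049,
  "GAM"               = 8.47,
  "Neural network"    = 10.61
)

# Lower MSPE means better out-of-sample performance
ranking <- sort(mspe)
ranking
best_model <- names(ranking)[1]
best_model
```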
It is critical to consider the goal of the analysis: a seller may only be interested in predicting price, whereas a construction company may be interested in the architecture or location features that help increase house value. Depending on the goal, the selected model may vary.