Introduction

The Boston Housing dataset contains information on median housing values in the suburbs of Boston, Massachusetts.

The dataset is often used in regression analysis and is available in the MASS package in R.

Goal:

Predict or interpret the median house price (medv).

For modeling purposes: this is a supervised learning problem, and regression will be used since the median price is a continuous variable.

Dataset details:

Rows: 506 Columns: 14

Output variable (Y): medv. Predictors (X): crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat (13 in total).
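
As a minimal sketch, the dataset can be loaded directly from MASS and its dimensions verified:

```r
library(MASS)   # provides the Boston dataset

data(Boston)
dim(Boston)     # 506 rows, 14 columns
names(Boston)   # medv plus the 13 predictors listed above
```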

Exploratory Data Analysis

Three variables have a high correlation with medv: 1) lstat 2) rm 3) ptratio
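
A minimal sketch of how these correlations can be checked (the variable ranking is from the report; the code below simply recomputes the correlation of each predictor with medv):

```r
library(MASS)
data(Boston)

# Correlation of every predictor with medv, ranked by absolute value
cors <- cor(Boston)[, "medv"]
sort(abs(cors[names(cors) != "medv"]), decreasing = TRUE)
```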

The scatter plot of medv against rm shows a positive correlation: median value rises as the average number of rooms increases.

The scatter plot of medv against lstat shows a negative correlation: median value falls as the percentage of lower-status population increases.

Since chas is a categorical (dummy) variable, its points appear in discrete segments rather than a continuous cloud.
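
A sketch of the corresponding plots (plotting choices here are assumptions; the original charts are not reproduced):

```r
library(MASS)
data(Boston)

# Scatter plots of medv against the two strongest continuous predictors
par(mfrow = c(1, 2))
plot(Boston$rm,    Boston$medv, xlab = "rm",    ylab = "medv")
plot(Boston$lstat, Boston$medv, xlab = "lstat", ylab = "medv")
```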

Regression Models

To assess which model performs best, each will be evaluated on its ASE (average squared error on the training set) and MSPE (mean squared prediction error on the test set).
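
A minimal sketch of the evaluation setup, assuming an 80/20 train/test split (the GAM summary below reports n = 404 training rows, consistent with this; the seed, the split method, and the name Boston_test are assumptions, while Boston_train matches the random forest call printed later):

```r
library(MASS)
data(Boston)

set.seed(1)                                      # assumed seed
idx <- sample(nrow(Boston), floor(0.8 * nrow(Boston)))
Boston_train <- Boston[idx, ]
Boston_test  <- Boston[-idx, ]

# Squared-error helper used for both ASE (train) and MSPE (test)
sq_err <- function(y, yhat) mean((y - yhat)^2)
```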

Linear Regression

The best linear regression model returns an adjusted R-squared of 0.74, indicating that the predictors explain a substantial share of the variation in medv. For linear regression, an ASE of 22.97 and an MSPE of 21.44 were calculated.
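
A hedged sketch of the baseline fit (the report's "best" model may involve variable selection, e.g. stepwise AIC, which is not shown here):

```r
# Full linear model as a baseline
lm_fit <- lm(medv ~ ., data = Boston_train)
summary(lm_fit)$adj.r.squared

mean(resid(lm_fit)^2)                                        # in-sample ASE
mean((Boston_test$medv - predict(lm_fit, Boston_test))^2)    # out-of-sample MSPE
```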

Regression Tree

## [1] 22.43255

For the regression tree, the tree was pruned to a size of 13 terminal nodes.

An ASE of 11.64 and an MSPE of 22.43 were calculated. Compared with the previous model, the in-sample error improved, but the test-set error did not. Therefore we cannot conclude that this model performs better than linear regression.
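
A sketch of one common way to fit and prune such a tree with rpart (the complexity-parameter grid and cross-validation pruning rule are assumptions; the report pruned to 13 terminal nodes):

```r
library(rpart)

# Grow a large tree, then prune back using the cross-validated error in the CP table
tree_fit <- rpart(medv ~ ., data = Boston_train, cp = 0.001)
best_cp  <- tree_fit$cptable[which.min(tree_fit$cptable[, "xerror"]), "CP"]
tree_pruned <- prune(tree_fit, cp = best_cp)

mean((Boston_train$medv - predict(tree_pruned, Boston_train))^2)  # ASE
mean((Boston_test$medv  - predict(tree_pruned, Boston_test))^2)   # MSPE
```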

K-Nearest Neighbor

For K-NN, the optimal K was found to be 3.

The predictors were scaled so that Euclidean distances between observations are not dominated by variables with larger ranges.

An ASE of 18.96 and an MSPE of 13.46 were calculated for the best K-NN model.

Its out-of-sample performance is better than that of any previous model.
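
A minimal sketch using FNN::knn.reg (the package choice is an assumption; k = 3 and the scaling step follow the report):

```r
library(FNN)

# Standardize predictors with training-set statistics; medv is column 14
X_train <- scale(Boston_train[, -14])
X_test  <- scale(Boston_test[, -14],
                 center = attr(X_train, "scaled:center"),
                 scale  = attr(X_train, "scaled:scale"))

knn_fit <- knn.reg(train = X_train, test = X_test,
                   y = Boston_train$medv, k = 3)
mean((Boston_test$medv - knn_fit$pred)^2)  # MSPE
```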

Random Forest

## 
## Call:
##  randomForest(formula = medv ~ ., data = Boston_train, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 11.33083
##                     % Var explained: 87.06
##            %IncMSE IncNodePurity
## crim     9.1130967     2129.2599
## zn       0.8856469      292.8516
## indus    6.4980308     2091.3968
## chas     1.1551706      332.9708
## nox      8.8591184     2339.1907
## rm      32.5877659     9623.1322
## age      4.1893387     1004.3436
## dis      6.7268829     1903.9135
## rad      1.0833036      263.2989
## tax      3.6947355     1030.8962
## ptratio  7.3896075     2381.1642
## black    1.8620520      668.2100
## lstat   68.0677167    10900.9711
## [1] 2.196255
## [1] 7.803492

Variable importance was assessed for this Random Forest model; %IncMSE measures how much the model's prediction error increases when a predictor's values are randomly permuted. For example, permuting lstat increases the MSE by about 68%, highlighting the importance of lstat in pricing.

For the best Random Forest model, an ASE of 2.19 and an MSPE of 7.8035 were calculated.
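
The call shown in the output above can be reproduced as follows (the seed is an assumption; the model otherwise matches the printed call):

```r
library(randomForest)

set.seed(1)  # assumed seed
rf_fit <- randomForest(medv ~ ., data = Boston_train, importance = TRUE)

importance(rf_fit)  # %IncMSE and IncNodePurity, as printed above
mean((Boston_train$medv - predict(rf_fit, Boston_train))^2)  # ASE
mean((Boston_test$medv  - predict(rf_fit, Boston_test))^2)   # MSPE
```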

Boosting Model

For Boosting, we set n.trees = 10000 and shrinkage = 0.01.

We get an ASE of 0.02 and an MSPE of 7.8049. An extremely low ASE is often indicative of over-fitting, and since boosting is generally prone to over-fitting the training set, we focus on the MSPE in this case.
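
A hedged sketch with gbm, using the stated hyperparameters (the gaussian distribution and default interaction depth are assumptions):

```r
library(gbm)

set.seed(1)  # assumed seed
boost_fit <- gbm(medv ~ ., data = Boston_train, distribution = "gaussian",
                 n.trees = 10000, shrinkage = 0.01)

pred <- predict(boost_fit, Boston_test, n.trees = 10000)
mean((Boston_test$medv - pred)^2)  # MSPE
```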

Generalized Additive Model

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## medv ~ s(crim) + s(zn) + s(indus) + chas + s(nox) + s(rm) + s(age) + 
##     s(dis) + rad + s(tax) + s(ptratio) + s(black) + s(lstat)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.3455     1.2705  15.227   <2e-16 ***
## chas          1.2978     0.7289   1.780   0.0759 .  
## rad           0.2942     0.1291   2.279   0.0233 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##              edf Ref.df      F  p-value    
## s(crim)    5.991  7.052  5.786 2.53e-06 ***
## s(zn)      1.000  1.000  1.905  0.16839    
## s(indus)   6.343  7.390  2.927  0.00473 ** 
## s(nox)     8.955  8.996 11.031  < 2e-16 ***
## s(rm)      6.577  7.731 22.933  < 2e-16 ***
## s(age)     1.000  1.000  1.073  0.30100    
## s(dis)     8.697  8.967  5.998  < 2e-16 ***
## s(tax)     3.360  4.053  8.883 9.69e-07 ***
## s(ptratio) 1.000  1.000 24.974 1.38e-06 ***
## s(black)   5.284  6.350  1.422  0.20445    
## s(lstat)   5.727  6.919 18.626  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.882   Deviance explained = 89.9%
## GCV = 12.028  Scale est. = 10.333    n = 404
## [1] 10.33338
## [1] 10.33346
## [1] 8.473513

The final GAM was fit with smooth terms s() for the continuous predictors and linear terms for chas and rad, as shown in the formula above.

An ASE of 10.33 and an MSPE of 8.47 were calculated in this case.
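
The fit can be reproduced from the formula in the summary above, assuming the mgcv package:

```r
library(mgcv)

gam_fit <- gam(medv ~ s(crim) + s(zn) + s(indus) + chas + s(nox) + s(rm) +
                 s(age) + s(dis) + rad + s(tax) + s(ptratio) + s(black) + s(lstat),
               data = Boston_train)

mean((Boston_train$medv - fitted(gam_fit))^2)                 # ASE
mean((Boston_test$medv  - predict(gam_fit, Boston_test))^2)   # MSPE
```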

Neural Network

The best Neural Network model has 2 hidden layers.

An ASE of 5.75 and an MSPE of 10.61 were calculated.
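
A minimal sketch with the neuralnet package (the package, hidden-layer sizes, and min-max scaling are assumptions; the report states only that two hidden layers were used):

```r
library(neuralnet)

# Min-max scale all columns using training-set ranges
mins <- apply(Boston_train, 2, min)
maxs <- apply(Boston_train, 2, max)
train_s <- as.data.frame(scale(Boston_train, center = mins, scale = maxs - mins))
test_s  <- as.data.frame(scale(Boston_test,  center = mins, scale = maxs - mins))

# Two hidden layers; c(5, 3) is an assumed architecture
nn_fit <- neuralnet(medv ~ ., data = train_s, hidden = c(5, 3), linear.output = TRUE)

# Predict on the scaled test set, then undo the scaling before computing MSPE
pred_s <- compute(nn_fit, test_s[, -14])$net.result
pred   <- pred_s * (maxs["medv"] - mins["medv"]) + mins["medv"]
mean((Boston_test$medv - pred)^2)  # MSPE
```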

Conclusion and Recommendation

Based on the calculated MSPE scores, Random Forest turns out to be the best model (MSPE 7.8035), narrowly ahead of Boosting (7.8049) and GAM (8.47).

It is critical to evaluate the end goal: a seller may be interested only in predicting price, while a construction company may be interested in the architecture and location features that increase house value. Depending on the goal, the selected model may vary.