Boston Housing:

If you have ever studied data science or machine learning, this dataset needs no introduction. Statisticians and ML practitioners use it extensively, either to learn algorithms or to explain them. What makes this dataset interesting is that the response variable is not trivial to predict: many algorithms struggle to predict house prices accurately.

Here I will try to analyze how well linear regression does versus decision trees…

1.0 Attribute Information:

The Boston Housing dataset consists of the prices of houses in various places in Boston. Alongside price, the dataset also provides information such as the per-capita crime rate (CRIM), the proportion of non-retail business acres in the town (INDUS), the proportion of owner-occupied units built before 1940 (AGE), and many other attributes listed below. Because we are going to use R, we can import the data directly from the MASS package. In this story, we will use several R libraries, as listed below.

  • Number of Instances: 506

  • Number of Attributes: 13 continuous attributes (including “class” attribute “MEDV”), 1 binary-valued attribute.

  • Attribute Information:

    1. CRIM per capita crime rate by town
    2. ZN proportion of residential land zoned for lots over 25,000 sq.ft.
    3. INDUS proportion of non-retail business acres per town
    4. CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    5. NOX nitric oxides concentration (parts per 10 million)
    6. RM average number of rooms per dwelling
    7. AGE proportion of owner-occupied units built prior to 1940
    8. DIS weighted distances to five Boston employment centres
    9. RAD index of accessibility to radial highways
    10. TAX full-value property-tax rate per $10,000
    11. PTRATIO pupil-teacher ratio by town
    12. B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    13. LSTAT % lower status of the population
    14. MEDV Median value of owner-occupied homes in $1000’s
  • Missing Attribute Values: None.

As our goal is to develop a model that has the capacity of predicting the value of houses, we will split the dataset into features and the target variable.

2.0 Initial Set Up

2.1 Package Loading

Required Packages

  • Tidyverse (dplyr, ggplot2, …) - data reading, manipulation and visualisation
  • plotly - interactive visualisation
  • kableExtra - styling data tables within Markdown
  • gridExtra - graphical arrangement
  • forecast - time series and forecasting
  • ggplot2 - graphical representation
  • stringr - string manipulation
  • corrplot - correlograms
  • knitr - dynamic report generation
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
library(MASS) #this data is in MASS package
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(DataExplorer)
library(ggpubr)
## Loading required package: ggplot2
## Loading required package: magrittr
## 
## Attaching package: 'ggpubr'
## The following object is masked from 'package:plyr':
## 
##     mutate
library(leaps)
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 3.0-2
library("kableExtra")
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(broom)
library(boot)
library(rpart)
library(rpart.plot)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library("gbm")
## Loaded gbm 2.1.5
library("mgcv")
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.8-28. For overview type 'help("mgcv-package")'.
library(neuralnet)
## 
## Attaching package: 'neuralnet'
## The following object is masked from 'package:dplyr':
## 
##     compute
library(mgcv)

2.2 Data Loading

data(Boston) # load the Boston data frame from the MASS package
dim(Boston)
## [1] 506  14
nam<-names(Boston)

The data has 506 rows and 14 columns. The attribute names are: crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv.

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Observation: most columns are quantitative; house prices are stored in the medv variable.

3.0 Exploratory Data Analysis

3.1 Initial Analysis

Snapshot of data

kable(head(Boston,10))  %>% kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>% scroll_box(width = "100%", height = "250px")
crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
0.02985 0.0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7
0.08829 12.5 7.87 0 0.524 6.012 66.6 5.5605 5 311 15.2 395.60 12.43 22.9
0.14455 12.5 7.87 0 0.524 6.172 96.1 5.9505 5 311 15.2 396.90 19.15 27.1
0.21124 12.5 7.87 0 0.524 5.631 100.0 6.0821 5 311 15.2 386.63 29.93 16.5
0.17004 12.5 7.87 0 0.524 6.004 85.9 6.5921 5 311 15.2 386.71 17.10 18.9

Data Quality Check

Data quality is another very important step in EDA; it is imperative to have good data quality for an optimal analysis.

metadata<-t(introduce(Boston))
colnames(metadata)<-"Values"
metadata
##                      Values
## rows                    506
## columns                  14
## discrete_columns          0
## continuous_columns       14
## all_missing_columns       0
## total_missing_values      0
## complete_rows           506
## total_observations     7084
## memory_usage          56216
plot_intro(Boston)

The good news is that this dataset has no missing or abnormal values, and the structure looks coherent.

3.2 Deep Dive

Analysis based on Visualization

Density plots are used to observe the distribution of a variable in a dataset. An advantage of density plots over histograms is that they are better at showing the shape of the distribution, because they are not affected by the number of bins.

# Basic density plot with mean line and marginal rug
crim<-ggdensity(Boston, x = "crim", 
          fill = "#0073C2FF", color = "#0073C2FF",
          add = "mean", rug = TRUE)
zn<-ggdensity(Boston, x = "zn", 
          fill = "#0073C2FF", color = "#0073C2FF",
          add = "mean", rug = TRUE)
indus<-ggdensity(Boston, x = "indus", 
          fill = "#0073C2FF", color = "#0073C2FF",
          add = "mean", rug = TRUE)
chas<-ggdensity(Boston, x = "chas", 
          fill = "#0073C2FF", color = "#0073C2FF",
          add = "mean", rug = TRUE)
nox<-ggdensity(Boston, x = "nox", 
          fill = "#0073C2FF", color = "#0073C2FF",
          add = "mean", rug = TRUE)
rm<-ggdensity(Boston, x = "rm", 
          fill = "#0073C2FF", color = "#0073C2FF",
          add = "mean", rug = TRUE)
age<-ggdensity(Boston, x = "age", 
          fill = "#0073C2FF", color = "#0073C2FF",
          add = "mean", rug = TRUE)
dis<-ggdensity(Boston, x = "dis", 
          fill = "#0073C2FF", color = "#0073C2FF",
          add = "mean", rug = TRUE)
rad<-ggdensity(Boston, x = "rad", 
          fill = "#0073C2FF", color = "#0073C2FF",
          add = "mean", rug = TRUE)
tax<-ggdensity(Boston, x = "tax", 
          fill = "#0073C2FF", color = "#0073C2FF",
          add = "mean", rug = TRUE)
ptratio<-ggdensity(Boston, x = "ptratio", 
          fill = "#0073C2FF", color = "#0073C2FF",
          add = "mean", rug = TRUE)
black<-ggdensity(Boston, x = "black", 
          fill = "#0073C2FF", color = "#0073C2FF",
          add = "mean", rug = TRUE)
lstat<-ggdensity(Boston, x = "lstat", 
          fill = "#0073C2FF", color = "#0073C2FF",
          add = "mean", rug = TRUE)
medv<-ggdensity(Boston, x = "medv", 
          fill = "#0073C2FF", color = "#0073C2FF",
          add = "mean", rug = TRUE)
  

figure1 <- ggarrange(crim, zn, indus,chas,
                    labels = c("1", "2", "3","4"),
                    ncol = 2, nrow = 2)
figure2 <- ggarrange(nox,rm,age,dis,
                    labels = c("5", "6", "7","8"),
                    ncol = 2, nrow = 2)
figure3 <- ggarrange(rad,tax,ptratio,black,
                    labels = c("9", "10", "11","12"),
                    ncol = 2, nrow = 2)
figure4 <- ggarrange(lstat,medv,
                    labels = c("13","14"),
                    ncol = 2, nrow = 1)
figure1;figure2;figure3;figure4
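For reference, the same set of density plots can be produced more compactly; a sketch (assumed to be equivalent to the block above) that loops over the column names and arranges the plots in batches of four:

# build one density plot per column, then arrange them in groups of four
dens_plots <- lapply(names(Boston), function(v)
  ggdensity(Boston, x = v, fill = "#0073C2FF", color = "#0073C2FF",
            add = "mean", rug = TRUE))
names(dens_plots) <- names(Boston)
ggarrange(plotlist = dens_plots[1:4], labels = as.character(1:4), ncol = 2, nrow = 2)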

Observation

  • crim, zn, chas and black are highly skewed
  • indus, rad and tax are bimodal
  • age and dis are skewed
  • nox, ptratio and lstat are also slightly skewed
  • rm and medv are unimodal and well shaped

Correlation analysis: a correlation matrix helps us understand the associations between the variables in one snapshot.

plot_correlation(na.omit(Boston), maxcat = 2L)

Observation:

The response variable is highly correlated with lstat and moderately correlated with ptratio, indus, nox and tax. Strong correlations are also present between several of the predictors.
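These associations can also be checked numerically; a quick sketch (not part of the original code) that sorts the correlations of each variable with medv:

# correlation of every column with the response, sorted from most negative to most positive
round(sort(cor(Boston)[, "medv"]), 2)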

4.0 Linear Regression

4.1 Fit a Model

Here we fit a model using linear regression. First we need to split the data into training and testing samples.

Data split

A 70:30 train-test split of the data is made.

set.seed(1338364)
sample_index <- sample(nrow(Boston),nrow(Boston)*0.70)
Boston_train <- Boston[sample_index,]
Boston_test <- Boston[-sample_index,]

4.1.1 Fit a Full Model

#full model
model_full <- lm(medv~., data=Boston_train)
modelFullSummary<-summary(model_full)
modelFullSummary
## 
## Call:
## lm(formula = medv ~ ., data = Boston_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.2939  -2.6967  -0.6975   1.5993  25.4414 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.903283   5.945785   6.375 5.97e-10 ***
## crim         -0.119639   0.037998  -3.149 0.001786 ** 
## zn            0.048724   0.015804   3.083 0.002216 ** 
## indus        -0.022443   0.074667  -0.301 0.763919    
## chas          2.331514   1.002660   2.325 0.020642 *  
## nox         -17.836726   4.758368  -3.748 0.000209 ***
## rm            3.588560   0.496942   7.221 3.40e-12 ***
## age          -0.001905   0.017139  -0.111 0.911561    
## dis          -1.523971   0.240859  -6.327 7.87e-10 ***
## rad           0.353825   0.078491   4.508 9.02e-06 ***
## tax          -0.015400   0.004471  -3.445 0.000643 ***
## ptratio      -0.819006   0.161242  -5.079 6.26e-07 ***
## black         0.005917   0.003115   1.899 0.058361 .  
## lstat        -0.500630   0.062292  -8.037 1.54e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.727 on 340 degrees of freedom
## Multiple R-squared:  0.738,  Adjusted R-squared:  0.728 
## F-statistic: 73.66 on 13 and 340 DF,  p-value: < 2.2e-16

Observation

Except for age and indus, all variables are significant according to their p-values.

4.1.2 Best Subsets

subset_reg<-regsubsets(medv~.,
                       Boston_train,
                       nbest=1, #number of best models to report per size
                       nvmax=14) #maximum number of variables to consider
summary(subset_reg)
## Subset selection object
## Call: regsubsets.formula(medv ~ ., Boston_train, nbest = 1, nvmax = 14)
## 13 Variables  (and intercept)
##         Forced in Forced out
## crim        FALSE      FALSE
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           crim zn  indus chas nox rm  age dis rad tax ptratio black lstat
## 1  ( 1 )  " "  " " " "   " "  " " " " " " " " " " " " " "     " "   "*"  
## 2  ( 1 )  " "  " " " "   " "  " " "*" " " " " " " " " " "     " "   "*"  
## 3  ( 1 )  " "  " " " "   " "  " " "*" " " " " " " " " "*"     " "   "*"  
## 4  ( 1 )  " "  " " " "   " "  " " "*" " " "*" " " " " "*"     " "   "*"  
## 5  ( 1 )  " "  " " " "   " "  "*" "*" " " "*" " " " " "*"     " "   "*"  
## 6  ( 1 )  " "  " " " "   "*"  "*" "*" " " "*" " " " " "*"     " "   "*"  
## 7  ( 1 )  " "  " " " "   " "  "*" "*" " " "*" "*" "*" "*"     " "   "*"  
## 8  ( 1 )  "*"  " " " "   " "  "*" "*" " " "*" "*" "*" "*"     " "   "*"  
## 9  ( 1 )  "*"  "*" " "   " "  "*" "*" " " "*" "*" "*" "*"     " "   "*"  
## 10  ( 1 ) "*"  "*" " "   "*"  "*" "*" " " "*" "*" "*" "*"     " "   "*"  
## 11  ( 1 ) "*"  "*" " "   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"  
## 12  ( 1 ) "*"  "*" "*"   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"  
## 13  ( 1 ) "*"  "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"
plot(subset_reg,scale="bic")

Observation

Here we observe that indus and age are insignificant variables, which matches what the p-values suggested. Hence, we can drop the indus and age variables.

4.1.3 Stepwise Approach

Note: because this is a prediction problem, we use the AIC criterion to choose a model; BIC, being more parsimonious, generally chooses the smallest possible model.
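The code that produced model_step_f, model_step_b and model_step_s is not shown above; a minimal sketch of how they could be fitted with step() under the AIC criterion (an assumed reconstruction, not the author's original code):

# assumed reconstruction of the stepwise fits (AIC criterion, the step() default)
full_form  <- medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
              tax + ptratio + black + lstat
null_model <- lm(medv ~ 1, data = Boston_train)                           # intercept-only model
model_step_f <- step(null_model, scope = full_form,
                     direction = "forward", trace = FALSE)                # forward selection
model_step_b <- step(model_full, direction = "backward", trace = FALSE)   # backward elimination
model_step_s <- step(model_full, direction = "both", trace = FALSE)       # stepwise, both directions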

Forward summary

summary(model_step_f)
## 
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas + 
##     zn + crim + rad + tax + black, data = Boston_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.2966  -2.6949  -0.7083   1.6644  25.3922 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.939342   5.920279   6.408 4.88e-10 ***
## lstat        -0.504292   0.057740  -8.734  < 2e-16 ***
## rm            3.597435   0.481771   7.467 6.88e-13 ***
## ptratio      -0.822901   0.160319  -5.133 4.80e-07 ***
## dis          -1.498428   0.219136  -6.838 3.71e-11 ***
## nox         -18.321202   4.402658  -4.161 4.01e-05 ***
## chas          2.306098   0.996616   2.314  0.02126 *  
## zn            0.049475   0.015500   3.192  0.00154 ** 
## crim         -0.119024   0.037841  -3.145  0.00180 ** 
## rad           0.360853   0.075028   4.810 2.27e-06 ***
## tax          -0.016028   0.003976  -4.031 6.84e-05 ***
## black         0.005902   0.003098   1.905  0.05763 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.714 on 342 degrees of freedom
## Multiple R-squared:  0.7379, Adjusted R-squared:  0.7295 
## F-statistic: 87.53 on 11 and 342 DF,  p-value: < 2.2e-16

Backward summary

summary(model_step_b)
## 
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad + 
##     tax + ptratio + black + lstat, data = Boston_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.2966  -2.6949  -0.7083   1.6644  25.3922 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.939342   5.920279   6.408 4.88e-10 ***
## crim         -0.119024   0.037841  -3.145  0.00180 ** 
## zn            0.049475   0.015500   3.192  0.00154 ** 
## chas          2.306098   0.996616   2.314  0.02126 *  
## nox         -18.321202   4.402658  -4.161 4.01e-05 ***
## rm            3.597435   0.481771   7.467 6.88e-13 ***
## dis          -1.498428   0.219136  -6.838 3.71e-11 ***
## rad           0.360853   0.075028   4.810 2.27e-06 ***
## tax          -0.016028   0.003976  -4.031 6.84e-05 ***
## ptratio      -0.822901   0.160319  -5.133 4.80e-07 ***
## black         0.005902   0.003098   1.905  0.05763 .  
## lstat        -0.504292   0.057740  -8.734  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.714 on 342 degrees of freedom
## Multiple R-squared:  0.7379, Adjusted R-squared:  0.7295 
## F-statistic: 87.53 on 11 and 342 DF,  p-value: < 2.2e-16

Stepwise Summary

summary(model_step_s)
## 
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas + 
##     zn + crim + rad + tax + black, data = Boston_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.2966  -2.6949  -0.7083   1.6644  25.3922 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.939342   5.920279   6.408 4.88e-10 ***
## lstat        -0.504292   0.057740  -8.734  < 2e-16 ***
## rm            3.597435   0.481771   7.467 6.88e-13 ***
## ptratio      -0.822901   0.160319  -5.133 4.80e-07 ***
## dis          -1.498428   0.219136  -6.838 3.71e-11 ***
## nox         -18.321202   4.402658  -4.161 4.01e-05 ***
## chas          2.306098   0.996616   2.314  0.02126 *  
## zn            0.049475   0.015500   3.192  0.00154 ** 
## crim         -0.119024   0.037841  -3.145  0.00180 ** 
## rad           0.360853   0.075028   4.810 2.27e-06 ***
## tax          -0.016028   0.003976  -4.031 6.84e-05 ***
## black         0.005902   0.003098   1.905  0.05763 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.714 on 342 degrees of freedom
## Multiple R-squared:  0.7379, Adjusted R-squared:  0.7295 
## F-statistic: 87.53 on 11 and 342 DF,  p-value: < 2.2e-16

Observation

Forward, backward, stepwise and best subset selection, and even the p-value approach, all select the same model. Hence we drop the insignificant variables (indus and age). Apart from these approaches, let us also observe the outcome of a more advanced approach: the lasso.

4.1.4 Regularization

In short, ridge regression and lasso are regression techniques optimized for prediction, rather than inference.

Normal regression gives you unbiased regression coefficients (maximum likelihood estimates “as observed in the data-set”).

Ridge and lasso regression allow you to regularize (“shrink”) coefficients. This means that the estimated coefficients are pushed towards 0, to make them work better on new data-sets (“optimized for prediction”). This allows you to use complex models and avoid over-fitting at the same time.
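Concretely, for a response y and predictors x, the two methods minimize the penalized least-squares criteria (standard textbook definitions, added here for reference):

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|, \qquad \hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2.$$

Larger values of the tuning parameter lambda impose a heavier penalty and therefore more shrinkage (and, for the lasso, more coefficients set exactly to zero).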

Let's check out the lasso on this dataset.

lasso_fit = glmnet(x = as.matrix(Boston_train[, -c(which(colnames(Boston)=='medv'))]), 
                   y = Boston_train$medv, 
                   alpha = 1) #alpha = 1 is the lasso penalty; alpha = 0 gives the ridge penalty

#lasso_fit
#lambda = 0.5
coef(lasso_fit,s=0.5)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept) 15.107777246
## crim        -0.014014855
## zn           .          
## indus        .          
## chas         1.207204963
## nox          .          
## rm           4.024667476
## age          .          
## dis         -0.020290778
## rad          .          
## tax         -0.001408246
## ptratio     -0.628831857
## black        0.002073485
## lstat       -0.510448603
#lambda = 1
coef(lasso_fit,s=1)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept) 15.84925307
## crim         .         
## zn           .         
## indus        .         
## chas         .         
## nox          .         
## rm           3.65905943
## age          .         
## dis          .         
## rad          .         
## tax         -0.00057969
## ptratio     -0.53085900
## black        .         
## lstat       -0.50311093

Choosing a couple of shrinkage values arbitrarily, we observe that as the value of "s" (the lasso penalty parameter lambda) increases, the coefficients of the predictors shrink and more of them are set exactly to zero.
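The full shrinkage behaviour across all values of lambda can also be visualised as coefficient paths; a one-line sketch (not in the original code):

# coefficient paths of the lasso fit as a function of log(lambda)
plot(lasso_fit, xvar = "lambda", label = TRUE)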

Now let's use cross-validation to select the optimum value of this parameter; here we use 5-fold cross-validation.

#use 5-fold cross validation to pick lambda
cv_lasso_fit = cv.glmnet(x = as.matrix(Boston_train[, -c(which(colnames(Boston)=='medv'))]), 
                         y = Boston_train$medv, 
                         alpha = 1, 
                         nfolds = 5)# no of folds in CV
plot(cv_lasso_fit)

lambdamin<-cv_lasso_fit$lambda.min
lambda1se<-cv_lasso_fit$lambda.1se

Observation

Here it is observed that the minimum MSE occurs at lambda = 0.0091247, and the largest lambda within one standard error of the minimum (lambda.1se) is 0.3770339.

Let's have a look at the coefficients for these two values of lambda.

coef(lasso_fit,lambdamin)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)  37.268685172
## crim         -0.116102863
## zn            0.047242676
## indus        -0.024349259
## chas          2.323414949
## nox         -17.453294673
## rm            3.603834918
## age          -0.001400228
## dis          -1.490139998
## rad           0.336587993
## tax          -0.014644048
## ptratio      -0.813365987
## black         0.005809118
## lstat        -0.502146356
coef(lasso_fit,lambda1se)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) 18.7438896869
## crim        -0.0243845454
## zn           .           
## indus       -0.0068557915
## chas         1.5655812406
## nox         -3.9000947690
## rm           4.0622259043
## age          .           
## dis         -0.2844690457
## rad          .           
## tax         -0.0009749202
## ptratio     -0.6831615800
## black        0.0027886007
## lstat       -0.5137573171

Inference

At lambda.min the indus and age coefficients are shrunk close to zero, so it suggests essentially the same model as the mechanisms above. We will use lambda.1se below, as it shrinks the coefficients much more than lambda.min.

data_train.lasso.X = as.matrix(Boston_train[, -c(which(colnames(Boston_train)=='medv'))])

pred.lasso.train<- predict(lasso_fit, newx=data_train.lasso.X, s=lambda1se, type = "response")
mean(pred.lasso.train-Boston_train$medv)^2
## [1] 1.812385e-27
data_test.lasso.X = as.matrix(Boston_test[, -c(which(colnames(Boston_test)=='medv'))])

pred.lasso.test<- predict(lasso_fit, newx=data_test.lasso.X, s=lambda1se, type = "response")
mean(pred.lasso.test-Boston_test$medv)^2
## [1] 0.05510227
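As an aside, the two quantities just printed square the mean residual rather than averaging the squared residuals; the usual MSE/MSPE for this lasso fit would be computed as in the short sketch below (values not re-run here):

# conventional mean squared errors for the lambda.1se lasso fit
mse.lasso.train <- mean((pred.lasso.train - Boston_train$medv)^2)
mspe.lasso.test <- mean((pred.lasso.test - Boston_test$medv)^2)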

Note that the values printed above are the square of the mean residual, not the mean of the squared residuals (see the corrected sketch above), which is why they appear so small; they should not be compared directly with the MSE values reported for the other models. Key takeaways:

*The lasso at lambda.min suggests the same predictors as best subsets, forward, backward and stepwise selection. *It shrinks the coefficient values, with the amount of shrinkage governed by the lambda chosen from the cross-validation plot. *At lambda.1se the coefficients are shrunk further and several are set exactly to zero.

Final model:

As with most approaches we got the same result: drop the indus and age predictors. The lasso at lambda.1se (one standard error) went a step further, shrinking the coefficients more aggressively and setting zn, age and rad exactly to 0 (with indus shrunk nearly to 0). In this scenario we are not much bothered by multicollinearity, and we have a good number of rows relative to the number of predictors, so we do not need to rely on lasso regularization for variable selection.

So which model do we select? Here we keep the approach simple and choose the model based on the other variable-selection methods employed above (p-value, best subsets, forward, backward, stepwise).

We select the model with the lowest AIC, since we are primarily interested in prediction. The full model has a slightly higher AIC than the model suggested by the selection mechanisms (best subsets, forward, backward, stepwise).

#model suggested by the p-value, best subsets, forward, backward and stepwise approaches
#drop indus & age
model_sel <- lm(medv ~ lstat + rm + ptratio + black + dis + nox + chas + zn +   rad + tax + crim, data=Boston_train)
model_selSummary<-summary(model_sel)

#lasso 1se model
model_sel_lso <- lm(medv ~ lstat + rm + ptratio + black + dis +  chas, data=Boston_train)
model_sel_lso_Summary<-summary(model_sel_lso)

modelVec<-c("Full model","Model_PvBeFwBkStLmin")

OpDF<-cbind(modelVec,
      rbind(glance(model_full),
            glance(model_sel)
            #,glance(model_sel_lso)
            )
      )
OpDF$MSE<-OpDF$sigma^2
rel_cols<-c("modelVec","sigma","MSE","r.squared","adj.r.squared","AIC","BIC")

kable(head(OpDF[,rel_cols]))  %>% kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>% scroll_box(width = "100%", height = "170px")
modelVec sigma MSE r.squared adj.r.squared AIC BIC
Full model 4.727104 22.34551 0.7379818 0.7279635 2120.070 2178.109
Model_PvBeFwBkStLmin 4.713948 22.22131 0.7379055 0.7294755 2116.173 2166.474

4.2 Residual Diagnostics

Full model: All predictors

lm(formula = medv ~ ., data = Boston_train)

par(mfrow = c(2, 2))  # Split the plotting panel into a 2 x 2 grid
plot(model_full)# Plot the model information

Selected model: all predictors except indus and age

lm(formula = medv ~ lstat + rm + ptratio + black + dis + nox + chas + zn + rad + tax + crim, data = Boston_train)

par(mfrow = c(2, 2))  # Split the plotting panel into a 2 x 2 grid
plot(model_sel)

Lasso Model: Predictors suggested by 1se of Lasso

lm(formula = medv ~ lstat + rm + ptratio + black + dis + chas, data = Boston_train)

par(mfrow = c(2, 2))  # Split the plotting panel into a 2 x 2 grid
plot(model_sel_lso)

Inference

Most plots look quite similar. We do observe some pattern in the residual plots; ideally there should be only random scatter in both the residuals and the scaled residuals, so the model may warrant further investigation.

Normality is met to a reasonable extent, although marginal right skewness is observed.

4.3 Model Selection:

So, which model should we select from these three? Before that, what should our selection criteria be?

Alert! When comparing models with different numbers of predictors, one should use adjusted R2, AIC or BIC. MSE generally decreases and R2 tends to increase as more predictors are added, so they are not good criteria for model comparison.
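For reference, with n observations, k estimated parameters, p predictors and log-likelihood L, the criteria used here are (standard definitions):

$$\mathrm{AIC} = -2\log \hat{L} + 2k, \qquad \mathrm{BIC} = -2\log \hat{L} + k\log n, \qquad R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}.$$

Both AIC and BIC penalize the number of parameters, which is why they can improve when an unhelpful predictor is dropped even though the raw R2 falls slightly.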

kable(head(OpDF[,rel_cols]))  %>% kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>% scroll_box(width = "100%", height = "170px")
modelVec sigma MSE r.squared adj.r.squared AIC BIC
Full model 4.727104 22.34551 0.7379818 0.7279635 2120.070 2178.109
Model_PvBeFwBkStLmin 4.713948 22.22131 0.7379055 0.7294755 2116.173 2166.474

Looking at the adjusted R2 and AIC values, we decide to select the model Model_PvBeFwBkStLmin.

This model was suggested by the stepwise, best subset, p-value, forward and backward approaches.

When comparing models with different sets of predictors, AIC, adjusted R2 and BIC are better criteria than SSE, MSE and R2, because the latter tend to improve mechanically as more and more predictors are added.

5.0 Validation

out-of-sample performance

Let's test the out-of-sample performance: using the final linear model built on the 70% training portion of the data, we predict on the remaining 30% testing data with the predict() function and report the out-of-sample MSE.

pred.test.BH<- predict(model_sel,Boston_test[,-14], type="response")
tstmse<-mean((Boston_test[,14]-pred.test.BH)^2)
tstmse
## [1] 23.83127

Observation

The test MSE (mean squared error on the testing sample) is 23.8312659, which is quite close to the training-sample MSE reported above.

In general a model is expected to perform somewhat worse on test data than on training data, though occasionally it performs better simply by chance.

This value can fluctuate depending on the initial sampling of the data, so a more reliable indicator of model performance is the cross-validation error.

Let's see what the CV error of the model is…

Cross validation

Let's also use cross-validation with the cv.glm() function on the original, full dataset (leave-one-out, 10-fold and 3-fold). Does CV yield a similar answer to the validation-set approach above?

model.glm1 = glm(medv~lstat + rm + ptratio + black + dis + nox + 
    chas + zn + rad + tax + crim, data = Boston)
#loocv 
cv.glm(data = Boston, glmfit = model.glm1)$delta[2]
## [1] 23.51161
#10 FOLD CV
cv.glm(data = Boston, glmfit = model.glm1, K = 10)$delta[2]
## [1] 23.66909
#3 FOLD CV
cv.glm(data = Boston, glmfit = model.glm1, K = 3)$delta[2]
## [1] 23.34635

Observation: here the validation-set estimate (23.83) is slightly larger than the cross-validation estimates, and the CV error varies only marginally with the number of folds.

6.0 Decision Tree

Let's fit a regression tree (CART) on the same data and repeat the error-estimation steps above to check how well trees perform on this data.

boston.largetree <- rpart(formula = medv ~ ., data = Boston_train, cp = 0.001)
#Try plot it yourself to see its structure.

prp(boston.largetree)

#The plotcp() function gives the relationship between 10-fold cross-validation error in the training set and size of tree.

plotcp(boston.largetree)

printcp(boston.largetree)
## 
## Regression tree:
## rpart(formula = medv ~ ., data = Boston_train, cp = 0.001)
## 
## Variables actually used in tree construction:
## [1] age     black   crim    dis     lstat   nox     ptratio rm      tax    
## 
## Root node error: 28996/354 = 81.91
## 
## n= 354 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.4482914      0   1.00000 1.00611 0.099044
## 2  0.1814837      1   0.55171 0.63171 0.069864
## 3  0.0679966      2   0.37022 0.40852 0.052836
## 4  0.0422795      3   0.30223 0.35115 0.050599
## 5  0.0307986      4   0.25995 0.34984 0.050635
## 6  0.0278359      5   0.22915 0.35681 0.051885
## 7  0.0219337      6   0.20131 0.32852 0.048862
## 8  0.0078703      7   0.17938 0.27900 0.039767
## 9  0.0075106      8   0.17151 0.25793 0.033691
## 10 0.0074710      9   0.16400 0.25793 0.033691
## 11 0.0074043     10   0.15653 0.25673 0.033695
## 12 0.0045159     11   0.14912 0.24817 0.033790
## 13 0.0041459     12   0.14461 0.23765 0.032853
## 14 0.0033568     13   0.14046 0.23659 0.033191
## 15 0.0030870     14   0.13711 0.23486 0.033060
## 16 0.0024137     15   0.13402 0.23160 0.032927
## 17 0.0021253     17   0.12919 0.23143 0.032927
## 18 0.0021244     18   0.12707 0.23048 0.032959
## 19 0.0013879     19   0.12494 0.22922 0.032891
## 20 0.0012861     21   0.12217 0.23271 0.032879
## 21 0.0012598     22   0.12088 0.23105 0.032686
## 22 0.0010354     23   0.11962 0.23271 0.032686
## 23 0.0010000     24   0.11858 0.23218 0.032691

Observation: xerror does not reduce much beyond nsplit = 10, hence we choose cp = 0.0052405 (corresponding to about 10 splits) to prune the large tree. This is clearer in the plot, where the 10-fold cross-validation error flattens after roughly 10 splits.
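A programmatic alternative to reading the pruning parameter off the plot (the 0.0052405 above appears to have been chosen by eye) is to take it from the cptable, either at the minimum xerror or via the usual 1-SE rule; a sketch:

# pick cp from the cross-validation results stored in the cptable
cpt <- boston.largetree$cptable
cp.min <- cpt[which.min(cpt[, "xerror"]), "CP"]                  # cp with the smallest xerror
thresh <- min(cpt[, "xerror"]) + cpt[which.min(cpt[, "xerror"]), "xstd"]
cp.1se <- max(cpt[cpt[, "xerror"] <= thresh, "CP"])              # largest cp within one SE of the minimum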

Tree prune:

#prune the large tree at the chosen cp
boston.trn.dt.p<-prune(boston.largetree,cp=0.0052405)
#plot the pruned tree
prp(boston.trn.dt.p,digits = 5, extra = 1)

The pruned tree is fitted on the 354 observations in the training data.

MSE /MSPE

In-sample & Out-of-sample prediction

#In-sample prediction
boston.train.pred.tree = predict(boston.trn.dt.p)
#Out-of-sample prediction
boston.test.pred.tree = predict(boston.trn.dt.p,Boston_test)

(MSE.tree<- sum((Boston_train$medv-boston.train.pred.tree)^2)/nrow(Boston_train))
## [1] 12.21472
(MSPE.tree <- sum((Boston_test$medv-boston.test.pred.tree)^2)/nrow(Boston_test))
## [1] 17.89879
#In-sample prediction from the unpruned (large) tree
boston.train.pred.tree.full = predict(boston.largetree)
#Out-of-sample prediction from the unpruned (large) tree
boston.test.pred.tree.full = predict(boston.largetree,Boston_test)

#Note: as written, the two error calculations below reuse the pruned-tree predictions,
#so they repeat the values above; the unpruned-tree errors would use the ".full" predictions instead.
(MSE.tree<- sum((Boston_train$medv-boston.train.pred.tree)^2)/nrow(Boston_train))
## [1] 12.21472
(MSPE.tree <- sum((Boston_test$medv-boston.test.pred.tree)^2)/nrow(Boston_test))
## [1] 17.89879

Observation

The in-sample MSE is 12.2147196, which is less than the in-sample MSE of the linear regression model. The out-of-sample MSE (MSPE) is 17.8987945, which is higher than the tree's in-sample MSE but lower than the MSPE of the linear regression model.

CV Decision Trees

set.seed(450)
cv.error <- NULL
k <- 10

pbar <- create_progress_bar('text')
pbar$init(k)
## 
  |                                                                       
  |                                                                 |   0%
for(i in 1:k){
    index <- sample(1:nrow(Boston),round(0.9*nrow(Boston)))
    train.cv <- Boston[index,]
    test.cv <- Boston[-index,]
    
    boston.ct.cv<- rpart(formula = medv ~ ., data = train.cv, cp = 0.0052405)
    
    pt<-predict(boston.ct.cv, test.cv)
    cv.error[i]<-mean((test.cv$medv-pt)^2)
    
    pbar$step()
}
## 
  |=================================================================| 100%
cv_ct<-mean(cv.error)

The cross-validation error for the regression tree (CART) is 20.8169881.

Plot of predicted vs actual values

par(mfrow=c(1,2))
plot(boston.train.pred.tree,Boston_train$medv,col=6, lwd=0, xlab = "predicted values", ylab = "Actual Values", main="Training data or OOBE")
abline(c(0,1),col=9)

plot(boston.test.pred.tree,Boston_test$medv,col=6, lwd=0, xlab = "predicted values", ylab = "Actual Values",main="Testing data")
abline(c(0,1),col=9)


7.0 Bagging

#
boston.bag<- randomForest(medv~., data = Boston_train,mtry=ncol(Boston_train)-1,ntree=500, importance=TRUE)
boston.bag
## 
## Call:
##  randomForest(formula = medv ~ ., data = Boston_train, mtry = ncol(Boston_train) -      1, ntree = 500, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 13
## 
##           Mean of squared residuals: 11.11915
##                     % Var explained: 86.43

Out of bag error

hist(boston.bag$mse,col=3, lwd=2)

plot(boston.bag$mse,type='l', col=2, lwd=2, xlab = "ntree", ylab = "OOB Error") 

MSE on Training Sample

boston.bag.pred.TR<- predict(boston.bag)
train.mse.bag<-mean((Boston_train$medv-boston.bag.pred.TR)^2)
train.mse.bag
## [1] 11.11915

Hence, the "Mean of squared residuals" (MSR) reported by randomForest is the OOB error on the training data.
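To make that distinction explicit: predicting without newdata returns out-of-bag predictions, whereas passing the training data back in returns in-bag fitted values, which are overly optimistic. A small check (not in the original):

oob.pred   <- predict(boston.bag)                          # out-of-bag predictions
inbag.pred <- predict(boston.bag, newdata = Boston_train)  # in-bag fitted values
mean((Boston_train$medv - oob.pred)^2)                     # matches the reported mean of squared residuals
mean((Boston_train$medv - inbag.pred)^2)                   # much smaller, hence optimistic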

MSPE on testing Sample

boston.bag.pred<- predict(boston.bag, Boston_test)
test.mse.bag<-mean((Boston_test$medv-boston.bag.pred)^2)
test.mse.bag
## [1] 10.1621

Here we observe that the training (OOB) MSE is 11.1191491 and the test MSE is 10.1620955, which is far lower than the previous models.

CV Bagging

set.seed(450)
cv.error <- NULL
k <- 10

pbar <- create_progress_bar('text')
pbar$init(k)
## 
  |                                                                       
  |                                                                 |   0%
for(i in 1:k){
    index <- sample(1:nrow(Boston),round(0.9*nrow(Boston)))
    train.cv <- Boston[index,]
    test.cv <- Boston[-index,]
    
    boston.bg.cv<- randomForest(medv~., data = train.cv,mtry=ncol(train.cv)-1,ntree=500, importance=TRUE)
    
    pt<-predict(boston.bg.cv, test.cv)
     cv.error[i]<-mean((test.cv$medv-pt)^2)
    
    pbar$step()
}
## 
  |=================================================================| 100%
cv_bg<-mean(cv.error)

The cross-validation error for bagging is 10.1267984.

Plot of predicted vs actual values

par(mfrow=c(1,2))
plot(boston.bag.pred.TR,Boston_train$medv,col=4, lwd=0, xlab = "predicted values", ylab = "Actual Values", main="Training data or OOBE")
abline(c(0,1),col=2)

plot(boston.bag.pred,Boston_test$medv,col=4, lwd=0, xlab = "predicted values", ylab = "Actual Values",main="Testing data")
abline(c(0,1),col=2)

8.0 Random Forest

We use m = sqrt(p) variables at each split.

#library(randomForest)
boston.rf<- randomForest(medv~., data = Boston_train,mtry=round(sqrt(ncol(Boston_train)),0),ntree=500, importance=TRUE)
boston.rf
## 
## Call:
##  randomForest(formula = medv ~ ., data = Boston_train, mtry = round(sqrt(ncol(Boston_train)),      0), ntree = 500, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 11.95426
##                     % Var explained: 85.41

OOB Error: the "Mean of squared residuals" (MSR) is the MSE of the out-of-bag predictions (recall OOB from bagging). The fitted randomForest object saves the OOB error for every ntree value from 1 to 500, so we can plot how the OOB error changes with the number of trees.

hist(boston.rf$mse,col=3, lwd=2)

plot(boston.rf$mse,type='l', col=2, lwd=2, xlab = "ntree", ylab = "OOB Error") 

#why this error keeps falling?

MSE on Training Sample

boston.rf.pred.TR<- predict(boston.rf)
train.mse.rf<-mean((Boston_train$medv-boston.rf.pred.TR)^2)
train.mse.rf
## [1] 11.95426

Hence, the "Mean of squared residuals" (MSR) reported by randomForest is the OOB error on the training data.

MSPE on testing Sample

boston.rf.pred<- predict(boston.rf, Boston_test)
test.mse.rf<-mean((Boston_test$medv-boston.rf.pred)^2)
test.mse.rf
## [1] 10.50503

Here we observe that the training (OOB) MSE is 11.9542639 and the test MSE is 10.5050258, which is far lower than the earlier models.

CV Random Forest

set.seed(450)
cv.error <- NULL
k <- 10

pbar <- create_progress_bar('text')
pbar$init(k)
## 
  |                                                                       
  |                                                                 |   0%
for(i in 1:k){
    index <- sample(1:nrow(Boston),round(0.9*nrow(Boston)))
    train.cv <- Boston[index,]
    test.cv <- Boston[-index,]
    
    boston.rf.cv<- randomForest(medv~., data = train.cv,mtry=round(sqrt(ncol(train.cv)),0),ntree=500, importance=TRUE)
    
    pt<-predict(boston.rf.cv, test.cv)
     cv.error[i]<-mean((test.cv$medv-pt)^2)
    
    pbar$step()
}
## 
  |=================================================================| 100%
cv_rf<-mean(cv.error)

The cross validation error for RF is 6.8847982

Plot of predicted vs actual values

par(mfrow=c(1,2))
plot(boston.rf.pred.TR,Boston_train$medv,col=3, lwd=0, xlab = "predicted values", ylab = "Actual Values", main="Training data or OOBE")
abline(c(0,1),col=2)

plot(boston.rf.pred,Boston_test$medv,col=3, lwd=0, xlab = "predicted values", ylab = "Actual Values",main="Testing data")
abline(c(0,1),col=2)

Variable Importance

varImpPlot(boston.rf,col=2)

The variable importance plot tells us that lstat and rm are far more important than the rest of the variables; it gives a comparative view of variable significance.
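The numeric importance measures behind this plot can also be printed directly; a one-line sketch:

# %IncMSE and IncNodePurity for each predictor
round(importance(boston.rf), 2)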

9.0 Boosting

#
boston.boost<- gbm(medv~., data = Boston_train, distribution = "gaussian", n.trees = 1000, shrinkage = 0.01, interaction.depth = 8)
summary(boston.boost)

##             var     rel.inf
## lstat     lstat 42.36190988
## rm           rm 31.19495549
## dis         dis  8.04331205
## nox         nox  3.46895424
## crim       crim  3.35487848
## age         age  2.84156588
## tax         tax  2.25568607
## ptratio ptratio  2.18396603
## black     black  2.13093999
## indus     indus  0.98962284
## chas       chas  0.67877728
## rad         rad  0.43202787
## zn           zn  0.06340388

Prediction on Training sample

boston.boost.pred.train<- predict(boston.boost, n.trees = 1000)
mse.train.Bst<-mean((Boston_train$medv-boston.boost.pred.train)^2)
mse.train.Bst
## [1] 3.022727

Prediction on testing sample

boston.boost.pred.test<- predict(boston.boost, Boston_test, n.trees = 1000)
mse.test.Bst<-mean((Boston_test$medv-boston.boost.pred.test)^2)
mse.test.Bst
## [1] 8.374799
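As a side note (not part of the original analysis), gbm can also estimate a suitable number of trees itself, for example through its built-in cross-validation; a sketch under that assumption:

# let gbm choose the number of trees via 5-fold cross-validation
boston.boost.tuned <- gbm(medv ~ ., data = Boston_train, distribution = "gaussian",
                          n.trees = 1000, shrinkage = 0.01, interaction.depth = 8,
                          cv.folds = 5)
best.iter <- gbm.perf(boston.boost.tuned, method = "cv")   # CV-optimal number of trees
pred.tuned <- predict(boston.boost.tuned, Boston_test, n.trees = best.iter)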

CV Boosting

set.seed(450)
cv.error <- NULL
k <- 10

pbar <- create_progress_bar('text')
pbar$init(k)
## 
  |                                                                       
  |                                                                 |   0%
for(i in 1:k){
    index <- sample(1:nrow(Boston),round(0.9*nrow(Boston)))
    train.cv <- Boston[index,]
    test.cv <- Boston[-index,]
    
    boston.boost.cv<- gbm(medv~., data = train.cv, distribution = "gaussian", n.trees = 1000, shrinkage = 0.01, interaction.depth = 8)
    
    pt<-predict(boston.boost.cv, test.cv, n.trees = 1000)
     cv.error[i]<-mean((test.cv$medv-pt)^2)
    
    pbar$step()
}
## 
  |=================================================================| 100%
cv_boost<-mean(cv.error)

The cv error estimate for Boosting is 9.0850293

Plot of predicted vs actual values for Boosting

par(mfrow=c(1,2))
plot(boston.boost.pred.train,Boston_train$medv,col=3, lwd=0, xlab = "predicted values", ylab = "Actual Values", main="Training data")
abline(c(0,1),col=2)

plot(boston.boost.pred.test,Boston_test$medv,col=3, lwd=0, xlab = "predicted values", ylab = "Actual Values",main="Testing data")
abline(c(0,1),col=2)

10.0 GAM: Generalized additive Model

In order to fit a GAM, it is important to understand the distribution of the predictors; this helps us anticipate non-linearity in their effects.

In the Boston dataset, the columns chas and rad are stored as numeric but take discrete values, so a non-linear smoothing function is inappropriate for them and they enter the model as parametric (linear) terms.

Boston.gam <-gam(medv ~ s(crim)+s(zn)+s(indus)+chas+s(nox)+s(rm)+s(age)+s(dis)+rad+s(tax)+s(ptratio)+s(black)+s(lstat),data=Boston_train)
summary(Boston.gam)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## medv ~ s(crim) + s(zn) + s(indus) + chas + s(nox) + s(rm) + s(age) + 
##     s(dis) + rad + s(tax) + s(ptratio) + s(black) + s(lstat)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18.5580     1.2854  14.438  < 2e-16 ***
## chas          1.4301     0.7295   1.961  0.05085 .  
## rad           0.3788     0.1292   2.932  0.00363 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##              edf Ref.df      F  p-value    
## s(crim)    2.325  2.908 17.746 1.07e-09 ***
## s(zn)      1.000  1.000  0.394  0.53060    
## s(indus)   7.378  8.267  3.038  0.00246 ** 
## s(nox)     8.979  8.999 10.528 1.38e-14 ***
## s(rm)      8.120  8.782 16.416  < 2e-16 ***
## s(age)     1.000  1.000  0.044  0.83378    
## s(dis)     8.645  8.960  7.368 1.03e-09 ***
## s(tax)     3.478  4.201 10.975 2.06e-08 ***
## s(ptratio) 1.000  1.000 19.049 1.74e-05 ***
## s(black)   1.855  2.288  1.811  0.14941    
## s(lstat)   5.500  6.676 19.364  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.882   Deviance explained = 89.9%
## GCV = 11.357  Scale est. = 9.6801    n = 354

Inference: here we further notice that zn, age and ptratio have effective degrees of freedom equal to 1 (and black close to 1), indicating approximately linear effects. This is confirmed by the marginal plots of these predictors.

plot(Boston.gam,pages=1)

Boston.gam_adv <-gam(medv ~ s(crim)+zn+s(indus)+chas+s(nox)+s(rm)+age+s(dis)+rad+s(tax)+
ptratio+black+s(lstat), data=Boston_train)
bostonGamsummary<-summary(Boston.gam_adv)
bostonGamsummary
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## medv ~ s(crim) + zn + s(indus) + chas + s(nox) + s(rm) + age + 
##     s(dis) + rad + s(tax) + ptratio + black + s(lstat)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.697146   3.494724   8.498 8.98e-16 ***
## zn           0.007444   0.015322   0.486  0.62743    
## chas         1.444269   0.732815   1.971  0.04965 *  
## age         -0.004205   0.014576  -0.288  0.77317    
## rad          0.375052   0.130985   2.863  0.00449 ** 
## ptratio     -0.639718   0.148242  -4.315 2.16e-05 ***
## black        0.002761   0.002302   1.200  0.23126    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df     F  p-value    
## s(crim)  2.387  2.985 17.28 3.90e-10 ***
## s(indus) 7.585  8.418  3.05  0.00222 ** 
## s(nox)   8.977  8.999 10.56 1.26e-14 ***
## s(rm)    8.075  8.762 15.99  < 2e-16 ***
## s(dis)   8.657  8.963  7.45 7.68e-10 ***
## s(tax)   3.460  4.183 10.71 3.21e-08 ***
## s(lstat) 5.811  6.994 18.30  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.881   Deviance explained = 89.8%
## GCV = 11.435  Scale est. = 9.7567    n = 354

The generalized cross-validation (GCV) score for this model is 11.435.

Model Criteria

AIC(Boston.gam_adv)
## [1] 1860.724
BIC(Boston.gam_adv)
## [1] 2065.616
Boston.gam_adv$deviance
## [1] 2946.968

Prediction Accuracy: In-sample

#in-sample mse using df 
(Boston.gam.mse.train <- Boston.gam_adv$dev/Boston.gam_adv$df.residual) 
## [1] 9.756659
#or
Boston.gam_adv$dev/(nrow(Boston_train)-sum(influence(Boston.gam_adv))) #mdf
## [1] 9.756659
#Average Sum of Squared Error
(Boston.gam.mse.train <- Boston.gam_adv$dev/nrow(Boston_train))
## [1] 8.324768
#using the predict() function
pi <- predict(Boston.gam_adv,Boston_train)
mean((pi - Boston_train$medv)^2)
## [1] 8.324768

Out of sample

Boston.gam.predict.test <-predict(Boston.gam_adv,Boston_test) #Boston.gam built on training data
Boston_gam_mse_test <-mean((Boston.gam.predict.test-Boston_test[, "medv"])^2) ## out of sample
Boston_gam_mse_test
## [1] 12.0779
par(mfrow=c(1,2))
plot(Boston_train$medv,pi,col='blue',main='Real vs predicted GAM',pch=16,cex=0.6)
abline(lm(pi~Boston_train$medv))
legend('bottomright',legend='Train',pch=18,col='blue', bty='n')

plot(Boston_test$medv,Boston.gam.predict.test,col='blue',main='Real vs predicted GAM',pch=16,cex=0.6)
abline(lm(Boston.gam.predict.test~Boston_test$medv))
legend('bottomright',legend='Test',pch=18,col='blue', bty='n')

Inference: it is observed that the GCV score is 11.435.

11.0 Neural Network

set.seed(1338364)
maxs <- apply(Boston, 2, max) 
mins <- apply(Boston, 2, min)

scaled <- as.data.frame(scale(Boston, center = mins, scale = maxs - mins))
index <- sample(1:nrow(Boston),round(0.75*nrow(Boston)))

train_ <- scaled[index,]
test_ <- scaled[-index,]
n <- names(train_)
f <- as.formula(paste("medv ~", paste(n[!n %in% "medv"], collapse = " + ")))
nn <- neuralnet(f,data=train_,hidden=c(5,3),linear.output=T)
plot(nn)

In sample Predictions

pr.nn <- compute(nn,train_[,1:13])
pr.nn_tr <- pr.nn$net.result*(max(Boston$medv)-min(Boston$medv))+min(Boston$medv)#prediction
train.r <- (train_$medv)*(max(Boston$medv)-min(Boston$medv))+min(Boston$medv)#actual value to compare
# MSE of training set
MSE.nn_train <- sum((train.r - pr.nn_tr)^2)/nrow(train_)
MSE.nn_train
## [1] 3.841102

Out of sample Predictions

pr.nn <- compute(nn,test_[,1:13])

pr.nn_ <- pr.nn$net.result*(max(Boston$medv)-min(Boston$medv))+min(Boston$medv)
test.r <- (test_$medv)*(max(Boston$medv)-min(Boston$medv))+min(Boston$medv)
# MSE of testing set
MSE.nn <- sum((test.r - pr.nn_)^2)/nrow(test_)
MSE.nn
## [1] 10.26944
par(mfrow=c(1,2))

plot(train_$medv,pr.nn_tr,col='red',main='Real vs predicted NN',pch=16,cex=0.6)
abline(lm(pr.nn_tr~train_$medv))
legend('bottomright',legend='Train',pch=18,col='red', bty='n')

plot(test_$medv,pr.nn_,col='red',main='Real vs predicted NN',pch=16,cex=0.6)
abline(lm(pr.nn_~test_$medv))
legend('bottomright',legend='Test',pch=18,col='red', bty='n')

Inference: we can see that the training MSE has decreased substantially, but on the other hand the out-of-sample error did not reduce by much.

Let's try a different initialization.

set.seed(500)
maxs <- apply(Boston, 2, max) 
mins <- apply(Boston, 2, min)

scaled <- as.data.frame(scale(Boston, center = mins, scale = maxs - mins))
index <- sample(1:nrow(Boston),round(0.75*nrow(Boston)))

train_ <- scaled[index,]
test_ <- scaled[-index,]
n <- names(train_)
f <- as.formula(paste("medv ~", paste(n[!n %in% "medv"], collapse = " + ")))
nn <- neuralnet(f,data=train_,hidden=c(5,3),linear.output=T)
plot(nn)

In sample Predictions

pr.nn <- compute(nn,train_[,1:13])
pr.nn_tr <- pr.nn$net.result*(max(Boston$medv)-min(Boston$medv))+min(Boston$medv)#prediction
train.r <- (train_$medv)*(max(Boston$medv)-min(Boston$medv))+min(Boston$medv)#actual value to compare
# MSE of training set
MSE.nn_train <- sum((train.r - pr.nn_tr)^2)/nrow(train_)
MSE.nn_train
## [1] 4.284057

Out of sample Predictions

pr.nn <- compute(nn,test_[,1:13])

pr.nn_ <- pr.nn$net.result*(max(Boston$medv)-min(Boston$medv))+min(Boston$medv)
test.r <- (test_$medv)*(max(Boston$medv)-min(Boston$medv))+min(Boston$medv)
# MSE of testing set
MSE.nn <- sum((test.r - pr.nn_)^2)/nrow(test_)
MSE.nn
## [1] 16.45955
par(mfrow=c(1,2))

plot(train_$medv,pr.nn_tr,col='purple',main='Real vs predicted NN',pch=16,cex=0.6)
abline(lm(pr.nn_tr~train_$medv))
legend('bottomright',legend='Train',pch=18,col='purple', bty='n')

plot(test_$medv,pr.nn_,col='purple',main='Real vs predicted NN',pch=16,cex=0.6)
abline(lm(pr.nn_~test_$medv))
legend('bottomright',legend='Test',pch=18,col='purple', bty='n')

This illustrates the importance of a good starting point (weight initialization) for neural networks to converge to a good minimum…
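One common way to reduce this sensitivity (not done in the original) is to train several random restarts and keep the best one, for example via neuralnet's rep argument; a sketch with an arbitrary seed:

set.seed(123)  # arbitrary seed for the sketch
nn.multi <- neuralnet(f, data = train_, hidden = c(5, 3), linear.output = TRUE, rep = 5)
best.rep <- which.min(nn.multi$result.matrix["error", ])   # repetition with the lowest training error
pr.best  <- compute(nn.multi, test_[, 1:13], rep = best.rep)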

CV NN

set.seed(450)
cv.error <- NULL
k <- 10

pbar <- create_progress_bar('text')
pbar$init(k)
## 
  |                                                                       
  |                                                                 |   0%
for(i in 1:k){
    index <- sample(1:nrow(Boston),round(0.9*nrow(Boston)))
    train.cv <- scaled[index,]
    test.cv <- scaled[-index,]
    
    nn <- neuralnet(f,data=train.cv,hidden=c(5,2),linear.output=T)
    
    pr.nn <- compute(nn,test.cv[,1:13])
    pr.nn <- pr.nn$net.result*(max(Boston$medv)-min(Boston$medv))+min(Boston$medv)
    
    test.cv.r <- (test.cv$medv)*(max(Boston$medv)-min(Boston$medv))+min(Boston$medv)
    
    cv.error[i] <- sum((test.cv.r - pr.nn)^2)/nrow(test.cv)
    
    pbar$step()
}
## 
  |=================================================================| 100%
cvnn<-mean(cv.error)

The cross-validation error for the neural network is 7.6412918.

12.0 Insights

Key takeaways:

*Cross-validation is more stable and reliable than a single out-of-sample validation split.

*LOOCV: resampling does not change the LOOCV error value.

*The MSPE can sometimes be smaller than the in-sample error (MSE).

*CART: resampling changes the training set, which changes the splits, and hence the MSE and MSPE values also change.

*We observe that linear regression and decision trees each outperform the other at times, and this mostly depends on the sampled data. Looking closely, the in-sample MSE is lower for the decision tree than for linear regression, while the out-of-sample MSE shows no consistent pattern (sometimes higher, sometimes lower than linear regression). Hence neither linear regression nor decision trees is uniformly better than the other. We already know that linear regression works very well when the data has a linear structure; when the relationship is non-linear, a linear model cannot capture it, and decision trees do a better job of capturing the non-linearity by dividing the space into smaller sub-spaces depending on the questions asked.

*Advanced trees: we observe that the ensemble tree methods predict prices much better than the conventional methods. In this project Boosting performed best among Bagging, Random Forest and Boosting, which is the expected outcome because boosting explicitly focuses on the errors of the previous iteration. No wonder these methods (Random Forest and Boosting) are state of the art and extensively used in industry today; compared with the conventional methods, the ensemble tree methods clearly outperform.

*GAMs perform better when there is non-linearity in the continuous predictors.

*Neural network performance depends strongly on the initial values; a good starting point is needed for them to converge to a good (near-global) minimum.

Conclusion: Looking at the prediction performance of the various algorithms on this dataset, it can be concluded that gradient boosting performs best. Random Forest also performed well, although not as well as gradient boosting.