If you have ever studied data science or machine learning, this dataset needs no introduction: statisticians and practitioners use it extensively to learn and to explain ML algorithms. What makes it interesting is that the response variable is not trivial to predict, and many algorithms struggle to estimate house prices accurately.
Here I will analyze how well linear regression does compared with decision trees and their more advanced relatives.
The Boston Housing dataset contains the prices of houses in various areas of Boston. Alongside price, it provides information such as the per-capita crime rate (CRIM), the proportion of non-retail business acres per town (INDUS), the proportion of owner-occupied units built before 1940 (AGE), and many other attributes. Because we are working in R, we can load it directly from the MASS package. Throughout this story we will use several R libraries, loaded below.
Number of Instances: 506
Number of Attributes: 13 continuous attributes (including “class” attribute “MEDV”), 1 binary-valued attribute.
Attribute Information:
Missing Attribute Values: None.
As our goal is to develop a model that has the capacity of predicting the value of houses, we will split the dataset into features and the target variable.
Required Packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
library(MASS) #this data is in MASS package
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(DataExplorer)
library(ggpubr)
## Loading required package: ggplot2
## Loading required package: magrittr
##
## Attaching package: 'ggpubr'
## The following object is masked from 'package:plyr':
##
## mutate
library(leaps)
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 3.0-2
library("kableExtra")
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(broom)
library(boot)
library(rpart)
library(rpart.plot)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library("gbm")
## Loaded gbm 2.1.5
library("mgcv")
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## This is mgcv 1.8-28. For overview type 'help("mgcv-package")'.
library(neuralnet)
##
## Attaching package: 'neuralnet'
## The following object is masked from 'package:dplyr':
##
## compute
library(mgcv)
data(Boston) # load the Boston data from the MASS package
dim(Boston)
## [1] 506 14
nam<-names(Boston)
The data has 506 rows and 14 columns. The attribute names are: crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv.
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Observation: most columns are quantitative; house prices are stored in the medv variable.
Snapshot of data
kable(head(Boston,10)) %>% kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>% scroll_box(width = "100%", height = "250px")
| crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
| 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
| 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |
| 0.02985 | 0.0 | 2.18 | 0 | 0.458 | 6.430 | 58.7 | 6.0622 | 3 | 222 | 18.7 | 394.12 | 5.21 | 28.7 |
| 0.08829 | 12.5 | 7.87 | 0 | 0.524 | 6.012 | 66.6 | 5.5605 | 5 | 311 | 15.2 | 395.60 | 12.43 | 22.9 |
| 0.14455 | 12.5 | 7.87 | 0 | 0.524 | 6.172 | 96.1 | 5.9505 | 5 | 311 | 15.2 | 396.90 | 19.15 | 27.1 |
| 0.21124 | 12.5 | 7.87 | 0 | 0.524 | 5.631 | 100.0 | 6.0821 | 5 | 311 | 15.2 | 386.63 | 29.93 | 16.5 |
| 0.17004 | 12.5 | 7.87 | 0 | 0.524 | 6.004 | 85.9 | 6.5921 | 5 | 311 | 15.2 | 386.71 | 17.10 | 18.9 |
Data Quality Check
Data quality is another very important step in EDA; it is imperative to have good data quality for an optimum analysis.
metadata<-t(introduce(Boston))
colnames(metadata)<-"Values"
metadata
## Values
## rows 506
## columns 14
## discrete_columns 0
## continuous_columns 14
## all_missing_columns 0
## total_missing_values 0
## complete_rows 506
## total_observations 7084
## memory_usage 56216
plot_intro(Boston)
The good news is that this dataset has no missing or abnormal values, and the structure looks coherent.
Analysis based on Visualization
Density plots are used to observe the distribution of a variable in a dataset. An advantage of density plots over histograms is that they are better at conveying the distribution shape because they are not affected by the number of bins.
# Basic density plot with mean line and marginal rug
crim<-ggdensity(Boston, x = "crim",
fill = "#0073C2FF", color = "#0073C2FF",
add = "mean", rug = TRUE)
zn<-ggdensity(Boston, x = "zn",
fill = "#0073C2FF", color = "#0073C2FF",
add = "mean", rug = TRUE)
indus<-ggdensity(Boston, x = "indus",
fill = "#0073C2FF", color = "#0073C2FF",
add = "mean", rug = TRUE)
chas<-ggdensity(Boston, x = "chas",
fill = "#0073C2FF", color = "#0073C2FF",
add = "mean", rug = TRUE)
nox<-ggdensity(Boston, x = "nox",
fill = "#0073C2FF", color = "#0073C2FF",
add = "mean", rug = TRUE)
rm<-ggdensity(Boston, x = "rm",
fill = "#0073C2FF", color = "#0073C2FF",
add = "mean", rug = TRUE)
age<-ggdensity(Boston, x = "age",
fill = "#0073C2FF", color = "#0073C2FF",
add = "mean", rug = TRUE)
dis<-ggdensity(Boston, x = "dis",
fill = "#0073C2FF", color = "#0073C2FF",
add = "mean", rug = TRUE)
rad<-ggdensity(Boston, x = "rad",
fill = "#0073C2FF", color = "#0073C2FF",
add = "mean", rug = TRUE)
tax<-ggdensity(Boston, x = "tax",
fill = "#0073C2FF", color = "#0073C2FF",
add = "mean", rug = TRUE)
ptratio<-ggdensity(Boston, x = "ptratio",
fill = "#0073C2FF", color = "#0073C2FF",
add = "mean", rug = TRUE)
black<-ggdensity(Boston, x = "black",
fill = "#0073C2FF", color = "#0073C2FF",
add = "mean", rug = TRUE)
lstat<-ggdensity(Boston, x = "lstat",
fill = "#0073C2FF", color = "#0073C2FF",
add = "mean", rug = TRUE)
medv<-ggdensity(Boston, x = "medv",
fill = "#0073C2FF", color = "#0073C2FF",
add = "mean", rug = TRUE)
figure1 <- ggarrange(crim, zn, indus,chas,
labels = c("1", "2", "3","4"),
ncol = 2, nrow = 2)
figure2 <- ggarrange(nox,rm,age,dis,
labels = c("5", "6", "7","8"),
ncol = 2, nrow = 2)
figure3 <- ggarrange(rad,tax,ptratio,black,
labels = c("9", "10", "11","12"),
ncol = 2, nrow = 2)
figure4 <- ggarrange(lstat,medv,
labels = c("11","12"),
ncol = 2, nrow = 1)
figure1;figure2;figure3;figure4
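As an aside, the DataExplorer package loaded above can draw a comparable overview in a single call; a convenient alternative to assembling each panel by hand:
# One-call alternative: density plots for all continuous columns
plot_density(Boston)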
Observation
Correlation analysis
A correlation matrix helps us understand the associations between the variables in one snapshot.
plot_correlation(na.omit(Boston), maxcat = 2L)
Observation:
Response variable is highly correlated with lstat & moderately correlated with ptratio, indus, nox, tax. Strong correlation is also present between various predictors.
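As a quick numeric complement to the heatmap (a small sketch using only base R on the already-loaded Boston data), we can sort the correlations of every column with medv:
# Correlation of each column with the response medv, sorted (supports the heatmap reading)
round(sort(cor(Boston)[, "medv"]), 2)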
Now let us fit a model using linear regression. First we need to split the data into training and testing samples.
Data split
A 7:3 train/test split of the data is made.
set.seed(1338364)
sample_index <- sample(nrow(Boston),nrow(Boston)*0.70)
Boston_train <- Boston[sample_index,]
Boston_test <- Boston[-sample_index,]
#full model with all 13 predictors
model_full <- lm(medv~., data=Boston_train)
modelFullSummary<-summary(model_full)
modelFullSummary
##
## Call:
## lm(formula = medv ~ ., data = Boston_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.2939 -2.6967 -0.6975 1.5993 25.4414
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.903283 5.945785 6.375 5.97e-10 ***
## crim -0.119639 0.037998 -3.149 0.001786 **
## zn 0.048724 0.015804 3.083 0.002216 **
## indus -0.022443 0.074667 -0.301 0.763919
## chas 2.331514 1.002660 2.325 0.020642 *
## nox -17.836726 4.758368 -3.748 0.000209 ***
## rm 3.588560 0.496942 7.221 3.40e-12 ***
## age -0.001905 0.017139 -0.111 0.911561
## dis -1.523971 0.240859 -6.327 7.87e-10 ***
## rad 0.353825 0.078491 4.508 9.02e-06 ***
## tax -0.015400 0.004471 -3.445 0.000643 ***
## ptratio -0.819006 0.161242 -5.079 6.26e-07 ***
## black 0.005917 0.003115 1.899 0.058361 .
## lstat -0.500630 0.062292 -8.037 1.54e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.727 on 340 degrees of freedom
## Multiple R-squared: 0.738, Adjusted R-squared: 0.728
## F-statistic: 73.66 on 13 and 340 DF, p-value: < 2.2e-16
Observation
Except for age and indus, all variables are significant according to their p-values.
subset_reg<-regsubsets(medv~.,
Boston_train,
nbest=1,#ranks in output
nvmax=14) #max variables to be considered
summary(subset_reg)
## Subset selection object
## Call: regsubsets.formula(medv ~ ., Boston_train, nbest = 1, nvmax = 14)
## 13 Variables (and intercept)
## Forced in Forced out
## crim FALSE FALSE
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " " " "*"
## 3 ( 1 ) " " " " " " " " " " "*" " " " " " " " " "*" " " "*"
## 4 ( 1 ) " " " " " " " " " " "*" " " "*" " " " " "*" " " "*"
## 5 ( 1 ) " " " " " " " " "*" "*" " " "*" " " " " "*" " " "*"
## 6 ( 1 ) " " " " " " "*" "*" "*" " " "*" " " " " "*" " " "*"
## 7 ( 1 ) " " " " " " " " "*" "*" " " "*" "*" "*" "*" " " "*"
## 8 ( 1 ) "*" " " " " " " "*" "*" " " "*" "*" "*" "*" " " "*"
## 9 ( 1 ) "*" "*" " " " " "*" "*" " " "*" "*" "*" "*" " " "*"
## 10 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" "*" "*" "*" " " "*"
## 11 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
plot(subset_reg,scale="bic")
Observation
Here we observe that indus and age are insignificant, which matches the p-value results above. Hence we can drop the indus and age variables.
Note: because this is a prediction problem, AIC is the natural criterion for choosing a model; BIC, being more parsimonious, generally selects the smallest possible model.
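The forward, backward and stepwise fits summarised below (model_step_f, model_step_b, model_step_s) are not shown being created above; here is a minimal sketch of how they could be obtained with step() under the AIC criterion (an assumption about the original code, not a reproduction of it):
# Sketch (assumption): forward, backward and both-direction selection via AIC with step()
nullmodel <- lm(medv ~ 1, data = Boston_train)
fullmodel <- lm(medv ~ ., data = Boston_train)
model_step_f <- step(nullmodel, scope = list(lower = nullmodel, upper = fullmodel),
                     direction = "forward", trace = 0)   # forward selection
model_step_b <- step(fullmodel, direction = "backward", trace = 0)  # backward elimination
model_step_s <- step(nullmodel, scope = list(lower = nullmodel, upper = fullmodel),
                     direction = "both", trace = 0)      # stepwise (both directions)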
Forward summary
summary(model_step_f)
##
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas +
## zn + crim + rad + tax + black, data = Boston_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.2966 -2.6949 -0.7083 1.6644 25.3922
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.939342 5.920279 6.408 4.88e-10 ***
## lstat -0.504292 0.057740 -8.734 < 2e-16 ***
## rm 3.597435 0.481771 7.467 6.88e-13 ***
## ptratio -0.822901 0.160319 -5.133 4.80e-07 ***
## dis -1.498428 0.219136 -6.838 3.71e-11 ***
## nox -18.321202 4.402658 -4.161 4.01e-05 ***
## chas 2.306098 0.996616 2.314 0.02126 *
## zn 0.049475 0.015500 3.192 0.00154 **
## crim -0.119024 0.037841 -3.145 0.00180 **
## rad 0.360853 0.075028 4.810 2.27e-06 ***
## tax -0.016028 0.003976 -4.031 6.84e-05 ***
## black 0.005902 0.003098 1.905 0.05763 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.714 on 342 degrees of freedom
## Multiple R-squared: 0.7379, Adjusted R-squared: 0.7295
## F-statistic: 87.53 on 11 and 342 DF, p-value: < 2.2e-16
Backward summary
summary(model_step_b)
##
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
## tax + ptratio + black + lstat, data = Boston_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.2966 -2.6949 -0.7083 1.6644 25.3922
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.939342 5.920279 6.408 4.88e-10 ***
## crim -0.119024 0.037841 -3.145 0.00180 **
## zn 0.049475 0.015500 3.192 0.00154 **
## chas 2.306098 0.996616 2.314 0.02126 *
## nox -18.321202 4.402658 -4.161 4.01e-05 ***
## rm 3.597435 0.481771 7.467 6.88e-13 ***
## dis -1.498428 0.219136 -6.838 3.71e-11 ***
## rad 0.360853 0.075028 4.810 2.27e-06 ***
## tax -0.016028 0.003976 -4.031 6.84e-05 ***
## ptratio -0.822901 0.160319 -5.133 4.80e-07 ***
## black 0.005902 0.003098 1.905 0.05763 .
## lstat -0.504292 0.057740 -8.734 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.714 on 342 degrees of freedom
## Multiple R-squared: 0.7379, Adjusted R-squared: 0.7295
## F-statistic: 87.53 on 11 and 342 DF, p-value: < 2.2e-16
Stepwise Summary
summary(model_step_s)
##
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas +
## zn + crim + rad + tax + black, data = Boston_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.2966 -2.6949 -0.7083 1.6644 25.3922
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.939342 5.920279 6.408 4.88e-10 ***
## lstat -0.504292 0.057740 -8.734 < 2e-16 ***
## rm 3.597435 0.481771 7.467 6.88e-13 ***
## ptratio -0.822901 0.160319 -5.133 4.80e-07 ***
## dis -1.498428 0.219136 -6.838 3.71e-11 ***
## nox -18.321202 4.402658 -4.161 4.01e-05 ***
## chas 2.306098 0.996616 2.314 0.02126 *
## zn 0.049475 0.015500 3.192 0.00154 **
## crim -0.119024 0.037841 -3.145 0.00180 **
## rad 0.360853 0.075028 4.810 2.27e-06 ***
## tax -0.016028 0.003976 -4.031 6.84e-05 ***
## black 0.005902 0.003098 1.905 0.05763 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.714 on 342 degrees of freedom
## Multiple R-squared: 0.7379, Adjusted R-squared: 0.7295
## F-statistic: 87.53 on 11 and 342 DF, p-value: < 2.2e-16
Observation
Forward, backward, stepwise and best-subset selection, and even the p-value approach, all select the same model. Hence we drop the insignificant variables (indus and age). Apart from these approaches, let us also look at the outcome of a more advanced approach: the lasso.
In short, ridge regression and lasso are regression techniques optimized for prediction, rather than inference.
Normal regression gives you unbiased regression coefficients (maximum likelihood estimates “as observed in the data-set”).
Ridge and lasso regression allow you to regularize (“shrink”) coefficients. This means that the estimated coefficients are pushed towards 0, to make them work better on new data-sets (“optimized for prediction”). This allows you to use complex models and avoid over-fitting at the same time.
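For reference, the lasso estimate minimises a penalised residual sum of squares (glmnet uses the same objective up to a 1/(2n) scaling of the RSS term); ridge replaces the absolute-value penalty with a squared one:
$$\hat\beta^{\text{lasso}} = \arg\min_{\beta_0,\,\beta}\; \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^{2} + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert$$
Larger values of $\lambda$ shrink the coefficients more and set more of them exactly to zero.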
Let's try out the lasso on this dataset.
lasso_fit = glmnet(x = as.matrix(Boston_train[, -c(which(colnames(Boston)=='medv'))]),
y = Boston_train$medv,
alpha = 1) # alpha = 1 gives the lasso penalty; alpha = 0 gives the ridge penalty
#lasso_fit
#lambda = 0.5
coef(lasso_fit,s=0.5)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 15.107777246
## crim -0.014014855
## zn .
## indus .
## chas 1.207204963
## nox .
## rm 4.024667476
## age .
## dis -0.020290778
## rad .
## tax -0.001408246
## ptratio -0.628831857
## black 0.002073485
## lstat -0.510448603
#lambda = 1
coef(lasso_fit,s=1)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 15.84925307
## crim .
## zn .
## indus .
## chas .
## nox .
## rm 3.65905943
## age .
## dis .
## rad .
## tax -0.00057969
## ptratio -0.53085900
## black .
## lstat -0.50311093
Trying a couple of values of the shrinkage parameter by hand, we observe that as the value of s (i.e. lambda) increases, the lasso shrinks the predictor coefficients further and sets more of them to zero.
Now let's use cross-validation to select the optimum value of this parameter; here we use 5-fold cross-validation.
#use 5-fold cross validation to pick lambda
cv_lasso_fit = cv.glmnet(x = as.matrix(Boston_train[, -c(which(colnames(Boston)=='medv'))]),
y = Boston_train$medv,
alpha = 1,
nfolds = 5)# no of folds in CV
plot(cv_lasso_fit)
lambdamin<-cv_lasso_fit$lambda.min
lambda1se<-cv_lasso_fit$lambda.1se
Observation
Here it is observed that the minimum cross-validated MSE occurs at lambda = 0.0091247, and the largest lambda within one standard error of the minimum is 0.3770339.
Let's have a look at the coefficients for these two values of lambda.
coef(lasso_fit,lambdamin)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 37.268685172
## crim -0.116102863
## zn 0.047242676
## indus -0.024349259
## chas 2.323414949
## nox -17.453294673
## rm 3.603834918
## age -0.001400228
## dis -1.490139998
## rad 0.336587993
## tax -0.014644048
## ptratio -0.813365987
## black 0.005809118
## lstat -0.502146356
coef(lasso_fit,lambda1se)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 18.7438896869
## crim -0.0243845454
## zn .
## indus -0.0068557915
## chas 1.5655812406
## nox -3.9000947690
## rm 4.0622259043
## age .
## dis -0.2844690457
## rad .
## tax -0.0009749202
## ptratio -0.6831615800
## black 0.0027886007
## lstat -0.5137573171
Inference
At lambda.min the coefficients of indus and age are shrunk close to zero, consistent with the model suggested by the mechanisms above. lambda.1se shrinks the coefficients much more than lambda.min, setting zn, age and rad exactly to zero; we will use lambda.1se below.
data_train.lasso.X = as.matrix(Boston_train[, -c(which(colnames(Boston_train)=='medv'))])
pred.lasso.train<- predict(lasso_fit, newx=data_train.lasso.X, s=lambda1se, type = "response")
mean(pred.lasso.train-Boston_train$medv)^2 # caution: this is (mean residual)^2, not the mean squared error
## [1] 1.812385e-27
data_test.lasso.X = as.matrix(Boston_test[, -c(which(colnames(Boston_test)=='medv'))])
pred.lasso.test<- predict(lasso_fit, newx=data_test.lasso.X, s=lambda1se, type = "response")
mean(pred.lasso.test-Boston_test$medv)^2 # same caveat: (mean residual)^2, not the MSPE
## [1] 0.05510227
Key takeaways:
*At lambda.min the lasso suggests essentially the same predictors as best subsets, forward, backward and stepwise selection.
*It shrinks the coefficients further as lambda increases; the optimum lambda is chosen from the cross-validation plot above.
*Caution: the quantities printed above square the mean residual rather than averaging the squared residuals, so they understate the true MSE/MSPE; the near-zero in-sample value simply reflects an average residual close to zero.
Final model:
With most approaches we got the same result: drop the indus and age predictors. The lasso at lambda.1se takes a different view, shrinking the coefficients further and setting zn, age and rad exactly to zero. In this scenario we are not much bothered by multicollinearity, and we have a good number of observations relative to predictors, so we do not need to rely on lasso regularization for variable selection.
So which model do we select? Here we keep our approach simple and choose based on the other variable-selection methods employed above (p-value, best subsets, forward, backward, stepwise).
We select the model with the lowest AIC, since we are mainly interested in predictive accuracy. The full model has a slightly higher AIC than the model suggested by the selection mechanisms (best subsets, forward, backward, stepwise).
#pvalue, best subsets, forward, backward, stepwise (both)
#drop indus & age
model_sel <- lm(medv ~ lstat + rm + ptratio + black + dis + nox + chas + zn + rad + tax + crim, data=Boston_train)
model_selSummary<-summary(model_sel)
#lasso 1se model
model_sel_lso <- lm(medv ~ lstat + rm + ptratio + black + dis + chas, data=Boston_train)
model_sel_lso_Summary<-summary(model_sel_lso)
modelVec<-c("Full model","Model_PvBeFwBkStLmin")
OpDF<-cbind(modelVec,
rbind(glance(model_full),
glance(model_sel)
#,glance(model_sel_lso)
)
)
OpDF$MSE<-OpDF$sigma^2
rel_cols<-c("modelVec","sigma","MSE","r.squared","adj.r.squared","AIC","BIC")
kable(head(OpDF[,rel_cols])) %>% kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>% scroll_box(width = "100%", height = "170px")
| modelVec | sigma | MSE | r.squared | adj.r.squared | AIC | BIC |
|---|---|---|---|---|---|---|
| Full model | 4.727104 | 22.34551 | 0.7379818 | 0.7279635 | 2120.070 | 2178.109 |
| Model_PvBeFwBkStLmin | 4.713948 | 22.22131 | 0.7379055 | 0.7294755 | 2116.173 | 2166.474 |
Full model: All predictors
lm(formula = medv ~ ., data = Boston_train)
par(mfrow = c(2, 2)) # Split the plotting panel into a 2 x 2 grid
plot(model_full)# Plot the model information
Selected model: all predictors except indus and age
lm(formula = medv ~ lstat + rm + ptratio + black + dis + nox + chas + zn + rad + tax + crim, data = Boston_train)
par(mfrow = c(2, 2)) # Split the plotting panel into a 2 x 2 grid
plot(model_sel)
Lasso Model: Predictors suggested by 1se of Lasso
lm(formula = medv ~ lstat + rm + ptratio + black + dis + chas, data = Boston_train)
par(mfrow = c(2, 2)) # Split the plotting panel into a 2 x 2 grid
plot(model_sel_lso)
Inference
Most plots look quite similar. We do observe some pattern in the residual plots; ideally there should just be a random scatter in the residuals and scaled residuals, so the model may warrant further investigation.
Normality is met to a reasonable extent, although marginal right skewness is observed.
So which model should we select from these three? But before that, what should our selection criteria be?
Alert! When comparing models with different predictors one should use adjusted R2, AIC or BIC. MSE generally decreases, and R2 tends to increase, as the number of predictors grows, so they are not good criteria for model comparison.
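For reference, both criteria penalise model size explicitly (k is the number of estimated parameters, n the sample size, and $\hat{L}$ the maximised likelihood):
$$\text{AIC} = 2k - 2\ln\hat{L}, \qquad \text{BIC} = k\ln(n) - 2\ln\hat{L}$$
For a Gaussian linear model $-2\ln\hat{L} = n\ln(\text{SSE}/n)$ plus a constant, so adding predictors lowers the SSE term but is charged 2 (AIC) or $\ln n$ (BIC) per extra parameter.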
kable(head(OpDF[,rel_cols])) %>% kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>% scroll_box(width = "100%", height = "170px")
| modelVec | sigma | MSE | r.squared | adj.r.squared | AIC | BIC |
|---|---|---|---|---|---|---|
| Full model | 4.727104 | 22.34551 | 0.7379818 | 0.7279635 | 2120.070 | 2178.109 |
| Model_PvBeFwBkStLmin | 4.713948 | 22.22131 | 0.7379055 | 0.7294755 | 2116.173 | 2166.474 |
Looking at the adjusted R2 and AIC values, we select the model Model_PvBeFwBkStLmin.
This model was suggested by the stepwise, best-subset, p-value, forward and backward approaches.
When comparing models with different sets of predictors, AIC, adjusted R2 and BIC are better criteria than SSE, MSE and R2, because the latter tend to improve mechanically as more and more predictors are added.
Out-of-sample performance
Let's test the out-of-sample performance: using the final linear model built on the 70% training sample, we predict on the remaining 30% test sample with predict() and report the out-of-sample MSE.
pred.test.BH<- predict(model_sel,Boston_test[,-14], type="response")
tstmse<-mean((Boston_test[,14]-pred.test.BH)^2)
tstmse
## [1] 23.83127
Observation
The test MSE (mean squared error on the testing sample) is 23.8312659, only slightly above the in-sample MSE of about 22.2 reported for the selected model.
In general a model is expected to do worse on test data than on training data, though occasionally it does better purely by chance.
This value can fluctuate with the initial sampling of the data, so a more reliable indicator of model performance is the cross-validation error.
Let's see what the CV error of the model is.
Cross validation
Let's use cross-validation with cv.glm() on the original, full dataset (leave-one-out, 10-fold and 3-fold) and check whether it gives an answer similar to the validation-set result above.
model.glm1 = glm(medv~lstat + rm + ptratio + black + dis + nox +
chas + zn + rad + tax + crim, data = Boston)
#loocv
cv.glm(data = Boston, glmfit = model.glm1)$delta[2]
## [1] 23.51161
#10 FOLD CV
cv.glm(data = Boston, glmfit = model.glm1, K = 10)$delta[2]
## [1] 23.66909
#3 FOLD CV
cv.glm(data = Boston, glmfit = model.glm1, K = 3)$delta[2]
## [1] 23.34635
Observation: the validation-set approach gave a slightly larger MSE (23.83) than cross-validation, and the three cross-validation estimates (3-fold, 10-fold and leave-one-out) are all close to one another at roughly 23.3-23.7.
Let's now fit a regression tree (CART) on the same data and repeat the prediction-error steps above to see how well trees perform on it.
boston.largetree <- rpart(formula = medv ~ ., data = Boston_train, cp = 0.001)
#Try plot it yourself to see its structure.
prp(boston.largetree)
#The plotcp() function gives the relationship between 10-fold cross-validation error in the training set and size of tree.
plotcp(boston.largetree)
printcp(boston.largetree)
##
## Regression tree:
## rpart(formula = medv ~ ., data = Boston_train, cp = 0.001)
##
## Variables actually used in tree construction:
## [1] age black crim dis lstat nox ptratio rm tax
##
## Root node error: 28996/354 = 81.91
##
## n= 354
##
## CP nsplit rel error xerror xstd
## 1 0.4482914 0 1.00000 1.00611 0.099044
## 2 0.1814837 1 0.55171 0.63171 0.069864
## 3 0.0679966 2 0.37022 0.40852 0.052836
## 4 0.0422795 3 0.30223 0.35115 0.050599
## 5 0.0307986 4 0.25995 0.34984 0.050635
## 6 0.0278359 5 0.22915 0.35681 0.051885
## 7 0.0219337 6 0.20131 0.32852 0.048862
## 8 0.0078703 7 0.17938 0.27900 0.039767
## 9 0.0075106 8 0.17151 0.25793 0.033691
## 10 0.0074710 9 0.16400 0.25793 0.033691
## 11 0.0074043 10 0.15653 0.25673 0.033695
## 12 0.0045159 11 0.14912 0.24817 0.033790
## 13 0.0041459 12 0.14461 0.23765 0.032853
## 14 0.0033568 13 0.14046 0.23659 0.033191
## 15 0.0030870 14 0.13711 0.23486 0.033060
## 16 0.0024137 15 0.13402 0.23160 0.032927
## 17 0.0021253 17 0.12919 0.23143 0.032927
## 18 0.0021244 18 0.12707 0.23048 0.032959
## 19 0.0013879 19 0.12494 0.22922 0.032891
## 20 0.0012861 21 0.12217 0.23271 0.032879
## 21 0.0012598 22 0.12088 0.23105 0.032686
## 22 0.0010354 23 0.11962 0.23271 0.032686
## 23 0.0010000 24 0.11858 0.23218 0.032691
Observation: the cross-validation error (xerror) does not decrease appreciably beyond about 10 splits, so we choose cp = 0.0052405 (between the CP values at 10 and 11 splits) to prune the large tree. This is clearer in the plotcp output, where the 10-fold cross-validation error flattens out after roughly 10 splits.
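The cp value above was read off the table by eye; here is a small sketch of selecting it programmatically from the tree's cptable with the 1-SE rule (an illustration of the idea, not the original choice):
# Sketch: pick cp using the 1-SE rule from the cross-validation columns of cptable
cp_tab <- boston.largetree$cptable
best   <- which.min(cp_tab[, "xerror"])                         # row with the lowest CV error
thresh <- cp_tab[best, "xerror"] + cp_tab[best, "xstd"]         # 1-SE threshold
cp_1se <- cp_tab[which(cp_tab[, "xerror"] <= thresh)[1], "CP"]  # simplest tree within 1 SE
cp_1se
# prune(boston.largetree, cp = cp_1se) would then give the 1-SE tree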
Tree prune:
#create a tree
boston.trn.dt.p<-prune(boston.largetree,cp=0.0052405)
#plot a tree
prp(boston.trn.dt.p,digits = 5, extra = 1)
MSE / MSPE: in-sample and out-of-sample prediction (354 training observations)
#In-sample prediction
boston.train.pred.tree = predict(boston.trn.dt.p)
#Out-of-sample prediction
boston.test.pred.tree = predict(boston.trn.dt.p,Boston_test)
(MSE.tree<- sum((Boston_train$medv-boston.train.pred.tree)^2)/nrow(Boston_train))
## [1] 12.21472
(MSPE.tree <- sum((Boston_test$medv-boston.test.pred.tree)^2)/nrow(Boston_test))
## [1] 17.89879
#In-sample prediction with the full (unpruned) tree
boston.train.pred.tree.full = predict(boston.largetree)
#Out-of-sample prediction with the full (unpruned) tree
boston.test.pred.tree.full = predict(boston.largetree,Boston_test)
#Note: the two lines below reuse the pruned-tree predictions, so they simply repeat the
#values above; to evaluate the unpruned tree, substitute the ".full" predictions instead.
(MSE.tree<- sum((Boston_train$medv-boston.train.pred.tree)^2)/nrow(Boston_train))
## [1] 12.21472
(MSPE.tree <- sum((Boston_test$medv-boston.test.pred.tree)^2)/nrow(Boston_test))
## [1] 17.89879
Observation
The in-sample MSE is 12.2147196, which is lower than the in-sample MSE from the linear regression model. The out-of-sample MSE (MSPE) is 17.8987945 which, although higher than the in-sample MSE, is lower than the MSPE from the linear regression model.
CV Decision Trees
set.seed(450)
cv.error <- NULL
k <- 10
pbar <- create_progress_bar('text')
pbar$init(k)
for(i in 1:k){
index <- sample(1:nrow(Boston),round(0.9*nrow(Boston)))
train.cv <- Boston[index,]
test.cv <- Boston[-index,]
boston.ct.cv<- rpart(formula = medv ~ ., data = train.cv, cp = 0.0052405)
pt<-predict(boston.ct.cv, test.cv) # rpart prediction on the held-out fold
cv.error[i]<-mean((test.cv$medv-pt)^2)
pbar$step()
}
cv_ct<-mean(cv.error)
The cross-validation error for the pruned regression tree (CART) is 20.8169881.
Plot of predicted vs actual values
par(mfrow=c(1,2))
plot(boston.train.pred.tree,Boston_train$medv,col=6, lwd=0, xlab = "predicted values", ylab = "Actual Values", main="Training data or OOBE")
abline(c(0,1),col=9)
plot(boston.test.pred.tree,Boston_test$medv,col=6, lwd=0, xlab = "predicted values", ylab = "Actual Values",main="Testing data")
abline(c(0,1),col=9)
Bagging
Next we fit a bagged ensemble of regression trees; bagging is equivalent to a random forest in which mtry is set to the full number of predictors (here ncol(Boston_train) - 1 = 13).
boston.bag<- randomForest(medv~., data = Boston_train,mtry=ncol(Boston_train)-1,ntree=500, importance=TRUE)
boston.bag
##
## Call:
## randomForest(formula = medv ~ ., data = Boston_train, mtry = ncol(Boston_train) - 1, ntree = 500, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 13
##
## Mean of squared residuals: 11.11915
## % Var explained: 86.43
Out of bag error
hist(boston.bag$mse,col=3, lwd=2)
plot(boston.bag$mse,type='l', col=2, lwd=2, xlab = "ntree", ylab = "OOB Error")
MSE on Training Sample
boston.bag.pred.TR<- predict(boston.bag)
train.mse.bag<-mean((Boston_train$medv-boston.bag.pred.TR)^2)
train.mse.bag
## [1] 11.11915
Hence, the "mean of squared residuals" reported above is actually the out-of-bag (OOB) error on the training data.
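This is because predict() on a randomForest object without a newdata argument returns the out-of-bag predictions; a short sketch contrasting them with ordinary resubstitution predictions:
# OOB predictions: each tree predicts only the rows it did not see during bagging
oob_pred <- predict(boston.bag)
# Resubstitution predictions: every tree predicts every training row (much more optimistic)
fit_pred <- predict(boston.bag, newdata = Boston_train)
mean((Boston_train$medv - oob_pred)^2)  # matches the reported mean of squared residuals
mean((Boston_train$medv - fit_pred)^2)  # considerably smaller, reflecting in-bag overfitting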
MSPE on testing Sample
boston.bag.pred<- predict(boston.bag, Boston_test)
test.mse.bag<-mean((Boston_test$medv-boston.bag.pred)^2)
test.mse.bag
## [1] 10.1621
Here we observe that the training (OOB) MSE drops to 11.1191491 and the test MSE to 10.1620955, far lower than for the previous models.
CV Bagging
set.seed(450)
cv.error <- NULL
k <- 10
pbar <- create_progress_bar('text')
pbar$init(k)
for(i in 1:k){
index <- sample(1:nrow(Boston),round(0.9*nrow(Boston)))
train.cv <- Boston[index,]
test.cv <- Boston[-index,]
boston.bg.cv<- randomForest(medv~., data = train.cv,mtry=ncol(train.cv)-1,ntree=500, importance=TRUE)
pt<-predict(boston.bg.cv, test.cv) # bagged-forest prediction on the held-out fold
cv.error[i]<-mean((test.cv$medv-pt)^2)
pbar$step()
}
cv_bg<-mean(cv.error)
The cross-validation error for bagging is 10.1267984.
Plot of predicted vs actual values
par(mfrow=c(1,2))
plot(boston.bag.pred.TR,Boston_train$medv,col=4, lwd=0, xlab = "predicted values", ylab = "Actual Values", main="Training data or OOBE")
abline(c(0,1),col=2)
plot(boston.bag.pred,Boston_test$medv,col=4, lwd=0, xlab = "predicted values", ylab = "Actual Values",main="Testing data")
abline(c(0,1),col=2)
Random Forest: m = sqrt(p) variables tried at each split
#library(randomForest)
boston.rf<- randomForest(medv~., data = Boston_train,mtry=round(sqrt(ncol(Boston_train)),0),ntree=500, importance=TRUE)
boston.rf
##
## Call:
## randomForest(formula = medv ~ ., data = Boston_train, mtry = round(sqrt(ncol(Boston_train)), 0), ntree = 500, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 11.95426
## % Var explained: 85.41
OOB error: the reported mean of squared residuals (MSR) is the MSE of the out-of-bag predictions (recall OOB from bagging). The fitted randomForest object stores the OOB error for every ntree value from 1 to 500, so we can plot how the OOB error changes as trees are added.
hist(boston.rf$mse,col=3, lwd=2)
plot(boston.rf$mse,type='l', col=2, lwd=2, xlab = "ntree", ylab = "OOB Error")
#why does this error keep falling as more trees are added?
MSE on Training Sample
boston.rf.pred.TR<- predict(boston.rf)
train.mse.rf<-mean((Boston_train$medv-boston.rf.pred.TR)^2)
train.mse.rf
## [1] 11.95426
Hence, the reported mean of squared residuals is again the OOB error on the training data.
MSPE on testing Sample
boston.rf.pred<- predict(boston.rf, Boston_test)
test.mse.rf<-mean((Boston_test$medv-boston.rf.pred)^2)
test.mse.rf
## [1] 10.50503
Here we observe that the training (OOB) MSE is 11.9542639 and the test MSE is 10.5050258, far lower than the linear regression and single-tree results and comparable to bagging.
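mtry = 4 here comes from round(sqrt(13)); a short sketch of checking other mtry values by their final OOB error (an illustration, not part of the original analysis; it refits the forest several times):
# Sketch: compare the OOB error for a few candidate mtry values
oob_err <- sapply(2:8, function(m) {
  rf <- randomForest(medv ~ ., data = Boston_train, mtry = m, ntree = 500)
  rf$mse[500]  # OOB MSE after all 500 trees
})
names(oob_err) <- 2:8
oob_err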
CV Random Forest
set.seed(450)
cv.error <- NULL
k <- 10
pbar <- create_progress_bar('text')
pbar$init(k)
for(i in 1:k){
index <- sample(1:nrow(Boston),round(0.9*nrow(Boston)))
train.cv <- Boston[index,]
test.cv <- Boston[-index,]
boston.rf.cv<- randomForest(medv~., data = train.cv,mtry=round(sqrt(ncol(train.cv)),0),ntree=500, importance=TRUE)
pt<-predict(boston.rf.cv, test.cv) # random-forest prediction on the held-out fold
cv.error[i]<-mean((test.cv$medv-pt)^2)
pbar$step()
}
cv_rf<-mean(cv.error)
The cross validation error for RF is 6.8847982
Plot of predicted vs actual values
par(mfrow=c(1,2))
plot(boston.rf.pred.TR,Boston_train$medv,col=3, lwd=0, xlab = "predicted values", ylab = "Actual Values", main="Training data or OOBE")
abline(c(0,1),col=2)
plot(boston.rf.pred,Boston_test$medv,col=3, lwd=0, xlab = "predicted values", ylab = "Actual Values",main="Testing data")
abline(c(0,1),col=2)
Variable Importance
varImpPlot(boston.rf,col=2)
The variable-importance plot tells us that lstat and rm are far more important than the rest; it gives a comparative view of the variables' significance.
Boosting
Finally we fit a gradient boosting machine (gbm) with a Gaussian loss, 1000 trees, shrinkage 0.01 and interaction depth 8.
boston.boost<- gbm(medv~., data = Boston_train, distribution = "gaussian", n.trees = 1000, shrinkage = 0.01, interaction.depth = 8)
summary(boston.boost)
## var rel.inf
## lstat lstat 42.36190988
## rm rm 31.19495549
## dis dis 8.04331205
## nox nox 3.46895424
## crim crim 3.35487848
## age age 2.84156588
## tax tax 2.25568607
## ptratio ptratio 2.18396603
## black black 2.13093999
## indus indus 0.98962284
## chas chas 0.67877728
## rad rad 0.43202787
## zn zn 0.06340388
Prediction on Training sample
boston.boost.pred.train<- predict(boston.boost, n.trees = 1000)
mse.train.Bst<-mean((Boston_train$medv-boston.boost.pred.train)^2)
mse.train.Bst
## [1] 3.022727
Prediction on testing sample
boston.boost.pred.test<- predict(boston.boost, Boston_test, n.trees = 1000)
mse.test.Bst<-mean((Boston_test$medv-boston.boost.pred.test)^2)
mse.test.Bst
## [1] 8.374799
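The number of boosting iterations was fixed at 1000 above; here is a sketch of letting gbm choose it by its built-in cross-validation instead (uses the cv.folds argument and gbm.perf(), with the same tuning parameters as before):
# Sketch: pick the number of trees by 5-fold CV inside gbm, then predict with that many
boston.boost.cv.fit <- gbm(medv ~ ., data = Boston_train, distribution = "gaussian",
                           n.trees = 1000, shrinkage = 0.01, interaction.depth = 8,
                           cv.folds = 5)
best.iter <- gbm.perf(boston.boost.cv.fit, method = "cv")  # CV-optimal iteration count
pred.cv   <- predict(boston.boost.cv.fit, Boston_test, n.trees = best.iter)
mean((Boston_test$medv - pred.cv)^2)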
CV Boosting
set.seed(450)
cv.error <- NULL
k <- 10
pbar <- create_progress_bar('text')
pbar$init(k)
for(i in 1:k){
index <- sample(1:nrow(Boston),round(0.9*nrow(Boston)))
train.cv <- Boston[index,]
test.cv <- Boston[-index,]
boston.boost.cv<- gbm(medv~., data = train.cv, distribution = "gaussian", n.trees = 1000, shrinkage = 0.01, interaction.depth = 8)
pt<-predict(boston.boost.cv, test.cv, n.trees = 1000)
cv.error[i]<-mean((test.cv$medv-pt)^2)
pbar$step()
}
cv_boost<-mean(cv.error)
The cv error estimate for Boosting is 9.0850293
Plot of predicted vs actual values for Boosting
par(mfrow=c(1,2))
plot(boston.boost.pred.train,Boston_train$medv,col=3, lwd=0, xlab = "predicted values", ylab = "Actual Values", main="Training data")
abline(c(0,1),col=2)
plot(boston.boost.pred.test,Boston_test$medv,col=3, lwd=0, xlab = "predicted values", ylab = "Actual Values",main="Testing data")
abline(c(0,1),col=2)
GAM (Generalized Additive Model)
In order to fit a GAM, it is important to understand the distribution of the predictors, since this helps anticipate which effects may be nonlinear.
In the Boston dataset, the columns chas and rad are stored as numeric but take discrete values, so applying a nonlinear smoothing function to them is inappropriate; they enter the model as linear (parametric) terms.
Boston.gam <-gam(medv ~ s(crim)+s(zn)+s(indus)+chas+s(nox)+s(rm)+s(age)+s(dis)+rad+s(tax)+s(ptratio)+s(black)+s(lstat),data=Boston_train)
summary(Boston.gam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## medv ~ s(crim) + s(zn) + s(indus) + chas + s(nox) + s(rm) + s(age) +
## s(dis) + rad + s(tax) + s(ptratio) + s(black) + s(lstat)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.5580 1.2854 14.438 < 2e-16 ***
## chas 1.4301 0.7295 1.961 0.05085 .
## rad 0.3788 0.1292 2.932 0.00363 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(crim) 2.325 2.908 17.746 1.07e-09 ***
## s(zn) 1.000 1.000 0.394 0.53060
## s(indus) 7.378 8.267 3.038 0.00246 **
## s(nox) 8.979 8.999 10.528 1.38e-14 ***
## s(rm) 8.120 8.782 16.416 < 2e-16 ***
## s(age) 1.000 1.000 0.044 0.83378
## s(dis) 8.645 8.960 7.368 1.03e-09 ***
## s(tax) 3.478 4.201 10.975 2.06e-08 ***
## s(ptratio) 1.000 1.000 19.049 1.74e-05 ***
## s(black) 1.855 2.288 1.811 0.14941
## s(lstat) 5.500 6.676 19.364 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.882 Deviance explained = 89.9%
## GCV = 11.357 Scale est. = 9.6801 n = 354
Inference: we further notice that zn, age and ptratio have effectively linear effects, evident from their effective degrees of freedom being 1 (black is close to linear as well). This is confirmed by the marginal plots of the same predictors.
plot(Boston.gam,pages=1)
Boston.gam_adv <-gam(medv ~ s(crim)+zn+s(indus)+chas+s(nox)+s(rm)+age+s(dis)+rad+s(tax)+
ptratio+black+s(lstat), data=Boston_train)
bostonGamsummary<-summary(Boston.gam_adv)
bostonGamsummary
##
## Family: gaussian
## Link function: identity
##
## Formula:
## medv ~ s(crim) + zn + s(indus) + chas + s(nox) + s(rm) + age +
## s(dis) + rad + s(tax) + ptratio + black + s(lstat)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.697146 3.494724 8.498 8.98e-16 ***
## zn 0.007444 0.015322 0.486 0.62743
## chas 1.444269 0.732815 1.971 0.04965 *
## age -0.004205 0.014576 -0.288 0.77317
## rad 0.375052 0.130985 2.863 0.00449 **
## ptratio -0.639718 0.148242 -4.315 2.16e-05 ***
## black 0.002761 0.002302 1.200 0.23126
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(crim) 2.387 2.985 17.28 3.90e-10 ***
## s(indus) 7.585 8.418 3.05 0.00222 **
## s(nox) 8.977 8.999 10.56 1.26e-14 ***
## s(rm) 8.075 8.762 15.99 < 2e-16 ***
## s(dis) 8.657 8.963 7.45 7.68e-10 ***
## s(tax) 3.460 4.183 10.71 3.21e-08 ***
## s(lstat) 5.811 6.994 18.30 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.881 Deviance explained = 89.8%
## GCV = 11.435 Scale est. = 9.7567 n = 354
The GCV (generalized cross-validation) score of this refined model is 11.435.
Model Criteria
AIC(Boston.gam_adv)
## [1] 1860.724
BIC(Boston.gam_adv)
## [1] 2065.616
Boston.gam_adv$deviance
## [1] 2946.968
Prediction accuracy: in-sample
#in-sample mse using df
(Boston.gam.mse.train <- Boston.gam_adv$dev/Boston.gam_adv$df.residual)
## [1] 9.756659
#or
Boston.gam_adv$dev/(nrow(Boston_train)-sum(influence(Boston.gam_adv))) #mdf
## [1] 9.756659
#Average Sum of Squared Error
(Boston.gam.mse.train <- Boston.gam_adv$dev/nrow(Boston_train))
## [1] 8.324768
#using the predict() function
pi <- predict(Boston.gam_adv,Boston_train)
mean((pi - Boston_train$medv)^2)
## [1] 8.324768
Out of sample
Boston.gam.predict.test <-predict(Boston.gam_adv,Boston_test) #Boston.gam built on training data
Boston_gam_mse_test <-mean((Boston.gam.predict.test-Boston_test[, "medv"])^2) ## out of sample
Boston_gam_mse_test
## [1] 12.0779
par(mfrow=c(1,2))
plot(Boston_train$medv,pi,col='blue',main='Real vs predicted GAM',pch=16,cex=0.6)
abline(lm(pi~Boston_train$medv))
legend('bottomright',legend='Train',pch=18,col='blue', bty='n')
plot(Boston_test$medv,Boston.gam.predict.test,col='blue',main='Real vs predicted GAM',pch=16,cex=0.6)
abline(lm(Boston.gam.predict.test~Boston_test$medv))
legend('bottomright',legend='Test',pch=18,col='blue', bty='n')
Inference: the GCV score is 11.435 and the out-of-sample MSE is 12.08.
Neural Network
set.seed(1338364)
# min-max scale all columns to [0, 1] before fitting the neural network
maxs <- apply(Boston, 2, max)
mins <- apply(Boston, 2, min)
scaled <- as.data.frame(scale(Boston, center = mins, scale = maxs - mins))
index <- sample(1:nrow(Boston),round(0.75*nrow(Boston)))
train_ <- scaled[index,]
test_ <- scaled[-index,]
n <- names(train_)
f <- as.formula(paste("medv ~", paste(n[!n %in% "medv"], collapse = " + ")))
nn <- neuralnet(f,data=train_,hidden=c(5,3),linear.output=T)
plot(nn)
In sample Predictions
pr.nn <- compute(nn,train_[,1:13])
pr.nn_tr <- pr.nn$net.result*(max(Boston$medv)-min(Boston$medv))+min(Boston$medv)#prediction
train.r <- (train_$medv)*(max(Boston$medv)-min(Boston$medv))+min(Boston$medv)#actual value to compare
# MSE of training set
MSE.nn_train <- sum((train.r - pr.nn_tr)^2)/nrow(train_)
MSE.nn_train
## [1] 3.841102
Out of sample Predictions
pr.nn <- compute(nn,test_[,1:13])
pr.nn_ <- pr.nn$net.result*(max(Boston$medv)-min(Boston$medv))+min(Boston$medv)
test.r <- (test_$medv)*(max(Boston$medv)-min(Boston$medv))+min(Boston$medv)
# MSE of testing set
MSE.nn <- sum((test.r - pr.nn_)^2)/nrow(test_)
MSE.nn
## [1] 10.26944
par(mfrow=c(1,2))
plot(train_$medv,pr.nn_tr,col='red',main='Real vs predicted NN',pch=16,cex=0.6)
abline(lm(pr.nn_tr~train_$medv))
legend('bottomright',legend='Train',pch=18,col='red', bty='n')
plot(test_$medv,pr.nn_,col='red',main='Real vs predicted NN',pch=16,cex=0.6)
abline(lm(pr.nn_~test_$medv))
legend('bottomright',legend='Test',pch=18,col='red', bty='n')
Inference: we can see that the training MSE has decreased substantially, but the out-of-sample error did not improve by nearly as much.
Let's try a different initialization (a different random seed).
set.seed(500)
maxs <- apply(Boston, 2, max)
mins <- apply(Boston, 2, min)
scaled <- as.data.frame(scale(Boston, center = mins, scale = maxs - mins))
index <- sample(1:nrow(Boston),round(0.75*nrow(Boston)))
train_ <- scaled[index,]
test_ <- scaled[-index,]
n <- names(train_)
f <- as.formula(paste("medv ~", paste(n[!n %in% "medv"], collapse = " + ")))
nn <- neuralnet(f,data=train_,hidden=c(5,3),linear.output=T)
plot(nn)
In sample Predictions
pr.nn <- compute(nn,train_[,1:13])
pr.nn_tr <- pr.nn$net.result*(max(Boston$medv)-min(Boston$medv))+min(Boston$medv)#prediction
train.r <- (train_$medv)*(max(Boston$medv)-min(Boston$medv))+min(Boston$medv)#actual value to compare
# MSE of training set
MSE.nn_train <- sum((train.r - pr.nn_tr)^2)/nrow(train_)
MSE.nn_train
## [1] 4.284057
Out of sample Predictions
pr.nn <- compute(nn,test_[,1:13])
pr.nn_ <- pr.nn$net.result*(max(Boston$medv)-min(Boston$medv))+min(Boston$medv)
test.r <- (test_$medv)*(max(Boston$medv)-min(Boston$medv))+min(Boston$medv)
# MSE of testing set
MSE.nn <- sum((test.r - pr.nn_)^2)/nrow(test_)
MSE.nn
## [1] 16.45955
par(mfrow=c(1,2))
plot(train_$medv,pr.nn_tr,col='purple',main='Real vs predicted NN',pch=16,cex=0.6)
abline(lm(pr.nn_tr~train_$medv))
legend('bottomright',legend='Train',pch=18,col='purple', bty='n')
plot(test_$medv,pr.nn_,col='purple',main='Real vs predicted NN',pch=16,cex=0.6)
abline(lm(pr.nn_~test_$medv))
legend('bottomright',legend='Test',pch=18,col='purple', bty='n')
This illustrates how much neural networks depend on a good starting point (the random initial weights).
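One common remedy is to train several networks from different random starting weights and keep the best one; a sketch using neuralnet's rep argument (an illustration on the current scaled split, not part of the original analysis):
# Sketch: 5 repetitions from different random initial weights; keep the lowest training error
nn_multi <- neuralnet(f, data = train_, hidden = c(5, 3), linear.output = TRUE, rep = 5)
best_rep <- which.min(nn_multi$result.matrix["error", ])       # best repetition by training error
pr_best  <- compute(nn_multi, test_[, 1:13], rep = best_rep)$net.result
# rescale the errors back to the original medv units before computing the test MSE
mean(((test_$medv - pr_best) * (max(Boston$medv) - min(Boston$medv)))^2)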
CV NN
set.seed(450)
cv.error <- NULL
k <- 10
pbar <- create_progress_bar('text')
pbar$init(k)
for(i in 1:k){
index <- sample(1:nrow(Boston),round(0.9*nrow(Boston)))
train.cv <- scaled[index,]
test.cv <- scaled[-index,]
nn <- neuralnet(f,data=train.cv,hidden=c(5,2),linear.output=T)
pr.nn <- compute(nn,test.cv[,1:13])
pr.nn <- pr.nn$net.result*(max(Boston$medv)-min(Boston$medv))+min(Boston$medv)
test.cv.r <- (test.cv$medv)*(max(Boston$medv)-min(Boston$medv))+min(Boston$medv)
cv.error[i] <- sum((test.cv.r - pr.nn)^2)/nrow(test.cv)
pbar$step()
}
cvnn<-mean(cv.error)
The cross-validation error for the neural network is 7.6412918.
Key takeaways:
*Cross-validation is more stable and reliable than a single out-of-sample validation split.
*LOOCV: resampling does not change the LOOCV error value.
*MSPE can sometimes be smaller than the in-sample error (MSE).
*CART: resampling changes the training set, which changes the splits and therefore the MSE and MSPE values.
*Neither linear regression nor decision trees dominate the other here; which does better depends largely on the sampled data. The in-sample MSE tends to be lower for the tree than for linear regression, while the out-of-sample MSE shows no clear pattern. Linear regression works well when the relationship is roughly linear; when it is not, a decision tree can capture the nonlinearity by partitioning the feature space into smaller sub-regions.
*Advanced trees: the ensemble methods predict prices much better than the conventional methods. In this project boosting performed best, ahead of bagging and random forests, which is expected because boosting concentrates on the errors of the previous iteration. It is no surprise that random forests and boosting are widely used in practice.
*GAMs perform better when there is nonlinearity in the continuous predictors.
*Neural network performance depends strongly on the initial weights; a poor starting point can prevent convergence to a good minimum.
Conclusion: looking at the prediction performance of the various algorithms on this dataset, gradient boosting performs the best. The random forest also performed well, although not quite as well as gradient boosting.