BOSTON HOUSING DATA

Introduction

This project aims to identify the factors that affect residential property values in the city of Boston. Factors such as per capita income, environmental factors, educational facilities, property size, etc. are considered in order to determine the most significant predictors. We build multiple linear regression models using subset selection (forward, backward and exhaustive) and LASSO, and compare their performance with a linear regression model containing all the variables. We use the following metrics to compare the models: R-squared, adjusted R-squared, AIC, BIC, model Mean Squared Error (MSE) and test-set Mean Squared Prediction Error (MSPE).

Packages Required

The following packages are required for the project:

library(corrr)
library(gridExtra)
library(ggplot2)
library(tidyverse)
library(dplyr)
library(DT)
library(MASS)
library(leaps)
library(glmnet)
library(PerformanceAnalytics)

Data Exploration

Checking for Data structure

Our data contains 506 observations of 14 variables. The data types are as follows:

glimpse(Boston)

## Observations: 506
## Variables: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...

Data Summary

A quick summary of the distribution of every variable in the data:

summary(Boston)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Checking for Null Values

There are no missing (NA) values in any column of the data:

colSums(is.na(Boston))

##    crim      zn   indus    chas     nox      rm     age     dis     rad 
##       0       0       0       0       0       0       0       0       0 
##     tax ptratio   black   lstat    medv 
##       0       0       0       0       0

Data sneak peek

A quick glance into the data:

Boston %>% datatable(caption = "Boston Housing")

Correlation Matrices

Correlation of target variable with predictor variables:

rm (0.695) and lstat (-0.738) are the predictors most strongly correlated with the target variable medv
black, dis, rm, chas and zn are positively correlated with medv
crim, indus, nox, age, rad, tax, ptratio and lstat are negatively correlated with medv

Boston %>% correlate() %>% focus(medv)

## # A tibble: 13 x 2
##    rowname   medv
##    <chr>    <dbl>
##  1 crim    -0.388
##  2 zn       0.360
##  3 indus   -0.484
##  4 chas     0.175
##  5 nox     -0.427
##  6 rm       0.695
##  7 age     -0.377
##  8 dis      0.250
##  9 rad     -0.382
## 10 tax     -0.469
## 11 ptratio -0.508
## 12 black    0.333
## 13 lstat   -0.738
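
As a cross-check that does not depend on corrr, the same column of correlations can be read directly from base R's cor(). This is a minimal sketch, assuming the Boston data frame from MASS; the name medv_cor is introduced here only for illustration.

```r
library(MASS)  # provides the Boston data frame

# Pearson correlation of every variable with medv, then drop medv itself
medv_cor <- cor(Boston)[, "medv"]
medv_cor <- medv_cor[names(medv_cor) != "medv"]

# Order by absolute strength so the strongest predictors come first
medv_cor <- medv_cor[order(abs(medv_cor), decreasing = TRUE)]
print(round(medv_cor, 3))
```

Sorting by absolute value puts lstat and rm at the top, which is the ordering the bullet points above rely on.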

Correlation among predictor variables:

Plotting the pairwise correlations between the predictors, we see the following:
The strongest positive correlations are between rad and tax, and between indus and nox; the strongest negative correlations are between dis and age, and between dis and nox.

chart.Correlation(Boston[,-14], histogram=TRUE, pch=19)

Distributions

Predictor vars vs Target var

We plot the target variable medv against each of the other variables. The scatter plots show that rm and lstat have a curved, roughly parabolic relationship with medv rather than a straight-line one.

Boston %>%
  gather(-medv, key = "var", value = "value") %>%
  filter(var != "chas") %>%
  ggplot(aes(x = value, y = medv)) +
  geom_point() +
  stat_smooth() +
  facet_wrap(~ var, scales = "free") +
  theme_bw()

Boxplots for Predictors

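Boxplots of the predictors are shown below. Several predictors have extreme values far beyond their quartiles (for example crim, zn and black in the summary above), and these appear as points beyond the whiskers. No observations are removed.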

Boston %>%
  gather(-medv, key = "var", value = "value") %>%
  filter(var != "chas") %>%
  ggplot(aes(x = '',y = value)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1) +
  facet_wrap(~ var, scales = "free") +
  theme_bw()

Histograms for Predictors

The histograms of predictors give the following insights:

rad and tax each show two separate peaks with no data in between
rm is approximately normally distributed
Most of the other distributions are skewed

Boston %>%
  gather(-medv, key = "var", value = "value") %>%
  filter(var != "chas") %>%
  ggplot(aes(x = value)) +
  geom_histogram() +
  facet_wrap(~ var, scales = "free") +
  theme_bw()

Splitting the Data

We split the data in an 80:20 ratio into training and test sets. We use the training data for model fitting and the test data for validation.

set.seed(12420352)
index <- sample(nrow(Boston),nrow(Boston)*0.80)
Boston.train <- Boston[index,]
Boston.test <- Boston[-index,]
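
The split can be sanity-checked with a few lines of base R. The sketch below is an assumption-laden variant of the chunk above: it makes the training size explicit with floor() (giving the 404 training rows implied by the 390 residual degrees of freedom reported for the full model) and confirms the two parts do not overlap. Note that R 3.6.0 changed the default sampling algorithm, so the same seed may select different rows on different R versions.

```r
library(MASS)  # provides the Boston data frame

set.seed(12420352)
n_train <- floor(nrow(Boston) * 0.80)      # 404 of the 506 rows
index <- sample(nrow(Boston), n_train)
Boston.train <- Boston[index, ]
Boston.test <- Boston[-index, ]

# The two parts should be disjoint and together cover every row
print(c(train = nrow(Boston.train), test = nrow(Boston.test)))
overlap <- intersect(rownames(Boston.train), rownames(Boston.test))
print(length(overlap))
```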

Linear Regression using all predictors

We build a linear regression model using all the variables present in the data.
We notice that indus and age have high p-values (0.61 and 0.86) and are not statistically significant.
The estimated coefficients are as follows:

model1 <- lm(medv~ ., data = Boston.train)
sum.model1 <- summary(model1)
sum.model1

## 
## Call:
## lm(formula = medv ~ ., data = Boston.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.3673  -2.7380  -0.5821   1.6192  24.5081 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  41.864329   6.019967   6.954 1.50e-11 ***
## crim         -0.133280   0.039488  -3.375 0.000812 ***
## zn            0.054777   0.015498   3.534 0.000458 ***
## indus         0.037333   0.074162   0.503 0.614975    
## chas          3.430143   1.022925   3.353 0.000877 ***
## nox         -17.948596   4.412690  -4.067 5.75e-05 ***
## rm            3.154796   0.480915   6.560 1.71e-10 ***
## age           0.002563   0.014992   0.171 0.864349    
## dis          -1.602194   0.229928  -6.968 1.37e-11 ***
## rad           0.356819   0.076813   4.645 4.65e-06 ***
## tax          -0.013827   0.004391  -3.149 0.001764 ** 
## ptratio      -1.003724   0.153330  -6.546 1.86e-10 ***
## black         0.010420   0.003061   3.404 0.000733 ***
## lstat        -0.564569   0.057875  -9.755  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.855 on 390 degrees of freedom
## Multiple R-squared:  0.7359, Adjusted R-squared:  0.7271 
## F-statistic: 83.59 on 13 and 390 DF,  p-value: < 2.2e-16

Model Stats:

Checking the model stats, using MSE, R-squared, adjusted R-squared, Test MSPE, AIC and BIC as metrics:

model1.mse <- (sum.model1$sigma)^2
model1.rsq <- sum.model1$r.squared
model1.arsq <- sum.model1$adj.r.squared
test.pred.model1 <- predict(model1, newdata=Boston.test) 
model1.mpse <- mean((Boston.test$medv-test.pred.model1)^2)
model1.aic <- AIC(model1)
model1.bic <- BIC(model1)

stats.model1 <- c("full", model1.mse, model1.rsq, model1.arsq, model1.mpse, model1.aic, model1.bic)

comparison_table <- c("model type", "MSE", "R-Squared", "Adjusted R-Squared", "Test MSPE", "AIC", "BIC")
data.frame(cbind(comparison_table, stats.model1))

##     comparison_table      stats.model1
## 1         model type              full
## 2                MSE    23.57194554198
## 3          R-Squared 0.735883231832148
## 4 Adjusted R-Squared 0.727079339559886
## 5          Test MSPE  19.7293018987321
## 6                AIC   2438.9171384413
## 7                BIC  2498.93836161072

Subset Selection

We will use subset selection techniques for variable selection. The three methods employed are:
* Forward Variable Selection
* Backward Variable Selection
* Exhaustive Variable Selection

Forward Variable Selection

We start with the forward selection method, where we begin with no predictors and keep adding the most influential variable at each step.
lstat, rm and ptratio are the first three variables to enter the model.
The following table shows the variables included in the model at each step, along with the BIC, R-squared, adjusted R-squared and Cp values associated with the model.

model2 <- regsubsets(medv~ ., data = Boston.train, nvmax = 13, method="forward")
sum.model2 <- summary(model2)

model2.subsets <- cbind(sum.model2$which, sum.model2$bic, sum.model2$rsq, sum.model2$adjr2,sum.model2$cp)
model2.subsets <- as.data.frame(model2.subsets) 
colnames(model2.subsets)[15:18] <- c("BIC","rsq","adjr2","cp")
model2.subsets

##    (Intercept) crim zn indus chas nox rm age dis rad tax ptratio black
## 1            1    0  0     0    0   0  0   0   0   0   0       0     0
## 2            1    0  0     0    0   0  1   0   0   0   0       0     0
## 3            1    0  0     0    0   0  1   0   0   0   0       1     0
## 4            1    0  0     0    1   0  1   0   0   0   0       1     0
## 5            1    0  0     0    1   0  1   0   1   0   0       1     0
## 6            1    0  0     0    1   1  1   0   1   0   0       1     0
## 7            1    0  1     0    1   1  1   0   1   0   0       1     0
## 8            1    0  1     0    1   1  1   0   1   0   0       1     1
## 9            1    0  1     0    1   1  1   0   1   1   0       1     1
## 10           1    1  1     0    1   1  1   0   1   1   0       1     1
## 11           1    1  1     0    1   1  1   0   1   1   1       1     1
## 12           1    1  1     1    1   1  1   0   1   1   1       1     1
## 13           1    1  1     1    1   1  1   1   1   1   1       1     1
##    lstat       BIC       rsq     adjr2        cp
## 1      1 -311.5594 0.5510738 0.5499570 262.89329
## 2      1 -375.0524 0.6220194 0.6201342 160.13353
## 3      1 -412.6604 0.6606952 0.6581504 105.02407
## 4      1 -424.8811 0.6756594 0.6724078  84.92776
## 5      1 -432.9888 0.6867910 0.6828562  70.49059
## 6      1 -451.2668 0.7050596 0.7006020  45.51482
## 7      1 -454.2284 0.7115310 0.7064318  37.95897
## 8      1 -457.1622 0.7178410 0.7121264  30.64152
## 9      1 -455.7460 0.7210252 0.7146527  27.93961
## 10     1 -460.5552 0.7283914 0.7214802  19.06264
## 11     1 -465.5634 0.7356932 0.7282764  10.28067
## 12     1 -459.8224 0.7358634 0.7277569  12.02923
## 13     1 -453.8512 0.7358832 0.7270793  14.00000
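
The greedy procedure behind method="forward" can be sketched in base R. This is an illustration rather than the leaps implementation: forward_select is a hypothetical helper written for this example, and it runs on the full Boston data (not the training split) so that it does not depend on the random seed, which means the order of later variables may differ from the table above.

```r
library(MASS)  # provides the Boston data frame

# Greedy forward selection: at each step add the predictor that gives the
# lowest residual sum of squares when joined to the variables already chosen
forward_select <- function(data, response) {
  remaining <- setdiff(names(data), response)
  selected <- character(0)
  path <- list()
  while (length(remaining) > 0) {
    rss <- sapply(remaining, function(v) {
      f <- reformulate(c(selected, v), response = response)
      sum(residuals(lm(f, data = data))^2)
    })
    best <- names(which.min(rss))
    selected <- c(selected, best)
    remaining <- setdiff(remaining, best)
    path[[length(selected)]] <- selected   # model of size length(selected)
  }
  path
}

path <- forward_select(Boston, "medv")
print(path[[1]])   # the single best predictor
print(path[[3]])   # the first three variables to enter
```

Because a variable is never removed once added, the size-k model always contains the size-(k-1) model. That is what distinguishes forward selection from the exhaustive search used later.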

Plotting Model metrics

For the 13 models of increasing size, we plot the model metrics to find the best model. R-squared keeps increasing as variables are added and hence will always favor the model with the largest number of variables.
The model with 11 variables gives the highest adjusted R-squared value and the lowest Cp and BIC values.

# Plots of R-squared, adjusted R-squared, BIC and Cp against model size
# Each data frame has named columns so that aes() refers to columns, not global objects
rsq <- data.frame(size = 1:13, rsq = round(sum.model2$rsq,5))
model2.rsq.plot <- ggplot(data = rsq, aes(y = rsq, x = size)) + 
  geom_point() + geom_line() + 
  geom_text(aes(label=rsq), size=3, vjust=-0.5) +
  scale_x_continuous(breaks=1:13)

adjr2 <- data.frame(size = 1:13, adjr2 = round(sum.model2$adjr2,4))
model2.adjrsq.plot <- ggplot(data = adjr2, aes(y = adjr2, x = size)) + 
  geom_point() + geom_line() + 
  geom_text(aes(label=adjr2), size=3, vjust=-0.5) +
  scale_x_continuous(breaks=1:13)

bic <- data.frame(size = 1:13, bic = round(sum.model2$bic,4))
model2.bic.plot <- ggplot(data = bic, aes(y = bic, x = size)) + 
  geom_point() + geom_line() + 
  geom_text(aes(label=bic), size=3, vjust=-0.5) +
  scale_x_continuous(breaks=1:13)

cp <- data.frame(size = 1:13, cp = round(sum.model2$cp,4))
model2.cp.plot <- ggplot(data = cp, aes(y = cp, x = size)) + 
  geom_point() + geom_line() + 
  geom_text(aes(label=cp), size=3, vjust=-0.5) +
  scale_x_continuous(breaks=1:13)

grid.arrange(model2.rsq.plot,model2.adjrsq.plot,model2.bic.plot,model2.cp.plot, ncol=2)
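
The reason adjusted R-squared peaks at 11 variables while R-squared keeps rising can be checked with a little arithmetic. The sketch below takes the R-squared values of the three largest models from the table above (n = 404 training rows) and applies the usual formula, adjusted R-squared = 1 - (1 - R-squared)(n - 1)/(n - p - 1); the variable names are introduced here for illustration.

```r
n   <- 404                                   # training rows
p   <- c(11, 12, 13)                         # number of predictors in each model
rsq <- c(0.7356932, 0.7358634, 0.7358832)    # R-squared values from the table above

# Penalize R-squared for the number of predictors used
adjr2 <- 1 - (1 - rsq) * (n - 1) / (n - p - 1)
names(adjr2) <- p
print(round(adjr2, 4))

# R-squared is largest for the biggest model, adjusted R-squared is not
print(p[which.max(rsq)])
print(p[which.max(adjr2)])
```

Adding indus and age raises R-squared only in the fourth decimal place, which is not enough to offset the extra degrees of freedom they consume.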

Selecting best subset

Confirming our findings from the plots, we find that the best model is the one with all variables except age and indus:

which.max(sum.model2$rsq)

## [1] 13

which.max(sum.model2$adjr2)

## [1] 11

which.min(sum.model2$cp)

## [1] 11

which.min(sum.model2$bic)

## [1] 11

coef(model2,11)

##  (Intercept)         crim           zn         chas          nox 
##  41.41683298  -0.13427887   0.05374373   3.49302070 -17.05650513 
##           rm          dis          rad          tax      ptratio 
##   3.15619111  -1.63366083   0.34521921  -0.01280323  -0.98839845 
##        black        lstat 
##   0.01043348  -0.55927297

Backward Variable Selection

Next we use the backward selection method, where we start from the full model and keep removing the least influential variable at each step.
lstat, rm and ptratio are the last three variables to remain in the model.
The following table shows the variables included in models of each size, along with the BIC, R-squared, adjusted R-squared and Cp values associated with the model.

model3 <- regsubsets(medv~ ., data = Boston.train, nvmax = 13, method="backward")
sum.model3 <- summary(model3)

model3.subsets <- cbind(sum.model3$which, sum.model3$bic, sum.model3$rsq, sum.model3$adjr2,sum.model3$cp)
model3.subsets <- as.data.frame(model3.subsets) 
colnames(model3.subsets)[15:18] <- c("BIC","rsq","adjr2","cp")
model3.subsets

##    (Intercept) crim zn indus chas nox rm age dis rad tax ptratio black
## 1            1    0  0     0    0   0  0   0   0   0   0       0     0
## 2            1    0  0     0    0   0  1   0   0   0   0       0     0
## 3            1    0  0     0    0   0  1   0   0   0   0       1     0
## 4            1    0  0     0    0   0  1   0   1   0   0       1     0
## 5            1    0  0     0    0   1  1   0   1   0   0       1     0
## 6            1    0  0     0    1   1  1   0   1   0   0       1     0
## 7            1    0  0     0    1   1  1   0   1   0   0       1     1
## 8            1    0  0     0    1   1  1   0   1   1   0       1     1
## 9            1    1  0     0    1   1  1   0   1   1   0       1     1
## 10           1    1  1     0    1   1  1   0   1   1   0       1     1
## 11           1    1  1     0    1   1  1   0   1   1   1       1     1
## 12           1    1  1     1    1   1  1   0   1   1   1       1     1
## 13           1    1  1     1    1   1  1   1   1   1   1       1     1
##    lstat       BIC       rsq     adjr2        cp
## 1      1 -311.5594 0.5510738 0.5499570 262.89329
## 2      1 -375.0524 0.6220194 0.6201342 160.13353
## 3      1 -412.6604 0.6606952 0.6581504 105.02407
## 4      1 -424.8374 0.6756242 0.6723724  84.97961
## 5      1 -441.2403 0.6931232 0.6892680  61.14028
## 6      1 -451.2668 0.7050596 0.7006020  45.51482
## 7      1 -453.2593 0.7108382 0.7057267  38.98203
## 8      1 -454.6626 0.7160898 0.7103397  33.22737
## 9      1 -457.6475 0.7223352 0.7159926  26.00530
## 10     1 -460.5552 0.7283914 0.7214802  19.06264
## 11     1 -465.5634 0.7356932 0.7282764  10.28067
## 12     1 -459.8224 0.7358634 0.7277569  12.02923
## 13     1 -453.8512 0.7358832 0.7270793  14.00000
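
Backward elimination is the mirror image of forward selection, and can be sketched in the same way. As before, backward_path is a hypothetical helper written for this example rather than the leaps implementation, and it runs on the full Boston data so that it does not depend on the random split.

```r
library(MASS)  # provides the Boston data frame

# Backward elimination: start with every predictor and repeatedly drop the one
# whose removal increases the residual sum of squares the least
backward_path <- function(data, response) {
  current <- setdiff(names(data), response)
  path <- list(current)
  while (length(current) > 1) {
    rss <- sapply(current, function(v) {
      f <- reformulate(setdiff(current, v), response = response)
      sum(residuals(lm(f, data = data))^2)
    })
    current <- setdiff(current, names(which.min(rss)))
    path[[length(path) + 1]] <- current
  }
  path
}

path <- backward_path(Boston, "medv")
sizes <- sapply(path, length)
print(sizes)                      # 13, 12, ..., 1
print(path[[length(path)]])       # the last surviving predictor
```

Forward and backward paths need not agree at every size. In the tables above, for example, the 4-variable models differ: forward selection includes chas, while backward selection includes dis.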

Plotting Model metrics

# Plots of R-squared, adjusted R-squared, BIC and Cp against model size
# Each data frame has named columns so that aes() refers to columns, not global objects
rsq <- data.frame(size = 1:13, rsq = round(sum.model3$rsq,5))
model3.rsq.plot <- ggplot(data = rsq, aes(y = rsq, x = size)) + 
  geom_point() + geom_line() + 
  geom_text(aes(label=rsq), size=3, vjust=-0.5) +
  scale_x_continuous(breaks=1:13)

adjr2 <- data.frame(size = 1:13, adjr2 = round(sum.model3$adjr2,4))
model3.adjrsq.plot <- ggplot(data = adjr2, aes(y = adjr2, x = size)) + 
  geom_point() + geom_line() + 
  geom_text(aes(label=adjr2), size=3, vjust=-0.5) +
  scale_x_continuous(breaks=1:13)

bic <- data.frame(size = 1:13, bic = round(sum.model3$bic,4))
model3.bic.plot <- ggplot(data = bic, aes(y = bic, x = size)) + 
  geom_point() + geom_line() + 
  geom_text(aes(label=bic), size=3, vjust=-0.5) +
  scale_x_continuous(breaks=1:13)

cp <- data.frame(size = 1:13, cp = round(sum.model3$cp,4))
model3.cp.plot <- ggplot(data = cp, aes(y = cp, x = size)) + 
  geom_point() + geom_line() + 
  geom_text(aes(label=cp), size=3, vjust=-0.5) +
  scale_x_continuous(breaks=1:13)

grid.arrange(model3.rsq.plot,model3.adjrsq.plot,model3.bic.plot,model3.cp.plot, ncol=2)

Selecting best subset

We find the best model is the model with all variables except age and indus

which.max(sum.model3$rsq)

## [1] 13

which.max(sum.model3$adjr2)

## [1] 11

which.min(sum.model3$cp)

## [1] 11

which.min(sum.model3$bic)

## [1] 11

coef(model3,11)

##  (Intercept)         crim           zn         chas          nox 
##  41.41683298  -0.13427887   0.05374373   3.49302070 -17.05650513 
##           rm          dis          rad          tax      ptratio 
##   3.15619111  -1.63366083   0.34521921  -0.01280323  -0.98839845 
##        black        lstat 
##   0.01043348  -0.55927297

Exhaustive Subset Selection

The last subset selection method is exhaustive search. Here we find the best subset of variables for each model size.
lstat, rm and ptratio again make up the best three-variable model.
The following table shows the variables included in models of each size, along with the BIC, R-squared, adjusted R-squared and Cp values associated with the model.

model4 <- regsubsets(medv~ ., data = Boston.train, nvmax = 13)
sum.model4 <- summary(model4)

model4.subsets <- cbind(sum.model4$which, sum.model4$bic, sum.model4$rsq, sum.model4$adjr2,sum.model4$cp)
model4.subsets <- as.data.frame(model4.subsets) 
colnames(model4.subsets)[15:18] <- c("BIC","rsq","adjr2","cp")
model4.subsets

##    (Intercept) crim zn indus chas nox rm age dis rad tax ptratio black
## 1            1    0  0     0    0   0  0   0   0   0   0       0     0
## 2            1    0  0     0    0   0  1   0   0   0   0       0     0
## 3            1    0  0     0    0   0  1   0   0   0   0       1     0
## 4            1    0  0     0    1   0  1   0   0   0   0       1     0
## 5            1    0  0     0    0   1  1   0   1   0   0       1     0
## 6            1    0  0     0    1   1  1   0   1   0   0       1     0
## 7            1    0  1     0    1   1  1   0   1   0   0       1     0
## 8            1    0  1     0    1   1  1   0   1   0   0       1     1
## 9            1    1  0     0    1   1  1   0   1   1   0       1     1
## 10           1    1  1     0    1   1  1   0   1   1   0       1     1
## 11           1    1  1     0    1   1  1   0   1   1   1       1     1
## 12           1    1  1     1    1   1  1   0   1   1   1       1     1
## 13           1    1  1     1    1   1  1   1   1   1   1       1     1
##    lstat       BIC       rsq     adjr2        cp
## 1      1 -311.5594 0.5510738 0.5499570 262.89329
## 2      1 -375.0524 0.6220194 0.6201342 160.13353
## 3      1 -412.6604 0.6606952 0.6581504 105.02407
## 4      1 -424.8811 0.6756594 0.6724078  84.92776
## 5      1 -441.2403 0.6931232 0.6892680  61.14028
## 6      1 -451.2668 0.7050596 0.7006020  45.51482
## 7      1 -454.2284 0.7115310 0.7064318  37.95897
## 8      1 -457.1622 0.7178410 0.7121264  30.64152
## 9      1 -457.6475 0.7223352 0.7159926  26.00530
## 10     1 -460.5552 0.7283914 0.7214802  19.06264
## 11     1 -465.5634 0.7356932 0.7282764  10.28067
## 12     1 -459.8224 0.7358634 0.7277569  12.02923
## 13     1 -453.8512 0.7358832 0.7270793  14.00000
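
What "exhaustive" means can be illustrated by brute force for small model sizes. The sketch below fits every subset of a given size and keeps the one with the lowest residual sum of squares. It is an illustration only: best_subset is a hypothetical helper, it uses the full Boston data, and regsubsets itself is understood to use a branch-and-bound search rather than fitting all 2^13 = 8192 subsets one by one.

```r
library(MASS)  # provides the Boston data frame

predictors <- setdiff(names(Boston), "medv")

# Fit every subset of size k and return the one with the lowest RSS
best_subset <- function(k) {
  subsets <- combn(predictors, k, simplify = FALSE)
  rss <- sapply(subsets, function(vars) {
    f <- reformulate(vars, response = "medv")
    sum(residuals(lm(f, data = Boston))^2)
  })
  list(n_fits = length(subsets), vars = subsets[[which.min(rss)]])
}

best1 <- best_subset(1)
best2 <- best_subset(2)
print(best1$vars)
print(best2$vars)
print(best2$n_fits)   # choose(13, 2) models were fitted
```

Unlike the stepwise methods, the best model of size k here is not forced to contain the best model of size k-1, which is why the exhaustive table above differs from the forward table at size 5.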

Plotting Model metrics

# Plots of R-squared, adjusted R-squared, BIC and Cp against model size
# Each data frame has named columns so that aes() refers to columns, not global objects
rsq <- data.frame(size = 1:13, rsq = round(sum.model4$rsq,5))
model4.rsq.plot <- ggplot(data = rsq, aes(y = rsq, x = size)) + 
  geom_point() + geom_line() + 
  geom_text(aes(label=rsq), size=3, vjust=-0.5) +
  scale_x_continuous(breaks=1:13)

adjr2 <- data.frame(size = 1:13, adjr2 = round(sum.model4$adjr2,4))
model4.adjrsq.plot <- ggplot(data = adjr2, aes(y = adjr2, x = size)) + 
  geom_point() + geom_line() + 
  geom_text(aes(label=adjr2), size=3, vjust=-0.5) +
  scale_x_continuous(breaks=1:13)

bic <- data.frame(size = 1:13, bic = round(sum.model4$bic,4))
model4.bic.plot <- ggplot(data = bic, aes(y = bic, x = size)) + 
  geom_point() + geom_line() + 
  geom_text(aes(label=bic), size=3, vjust=-0.5) +
  scale_x_continuous(breaks=1:13)

cp <- data.frame(size = 1:13, cp = round(sum.model4$cp,4))
model4.cp.plot <- ggplot(data = cp, aes(y = cp, x = size)) + 
  geom_point() + geom_line() + 
  geom_text(aes(label=cp), size=3, vjust=-0.5) +
  scale_x_continuous(breaks=1:13)

grid.arrange(model4.rsq.plot,model4.adjrsq.plot,model4.bic.plot,model4.cp.plot, ncol=2)

Selecting best subset

Again we find the best model is the model with all variables except age and indus, hence we use that model as our selected model

which.max(sum.model4$rsq)

## [1] 13

which.max(sum.model4$adjr2)

## [1] 11

which.min(sum.model4$cp)

## [1] 11

which.min(sum.model4$bic)

## [1] 11

coef(model4,11)

##  (Intercept)         crim           zn         chas          nox 
##  41.41683298  -0.13427887   0.05374373   3.49302070 -17.05650513 
##           rm          dis          rad          tax      ptratio 
##   3.15619111  -1.63366083   0.34521921  -0.01280323  -0.98839845 
##        black        lstat 
##   0.01043348  -0.55927297

SELECTED MODEL = 11, -AGE -INDUS

From our subset selection techniques, we select the model without indus and age as our best model. Summary of the model:

model.ss <- lm(medv ~ . -indus -age, data=Boston.train)
sum.model.ss <- summary(model.ss)
sum.model.ss

## 
## Call:
## lm(formula = medv ~ . - indus - age, data = Boston.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.4298  -2.7600  -0.5466   1.6243  24.6067 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  41.416833   5.928789   6.986 1.22e-11 ***
## crim         -0.134279   0.039352  -3.412 0.000711 ***
## zn            0.053744   0.015242   3.526 0.000472 ***
## chas          3.493021   1.013727   3.446 0.000631 ***
## nox         -17.056505   3.994774  -4.270 2.46e-05 ***
## rm            3.156191   0.466475   6.766 4.83e-11 ***
## dis          -1.633661   0.218068  -7.492 4.56e-13 ***
## rad           0.345219   0.073382   4.704 3.53e-06 ***
## tax          -0.012803   0.003891  -3.291 0.001089 ** 
## ptratio      -0.988398   0.150104  -6.585 1.47e-10 ***
## black         0.010433   0.003043   3.429 0.000671 ***
## lstat        -0.559273   0.054487 -10.264  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.844 on 392 degrees of freedom
## Multiple R-squared:  0.7357, Adjusted R-squared:  0.7283 
## F-statistic: 99.19 on 11 and 392 DF,  p-value: < 2.2e-16

Getting the model stats:

model.ss.mse <- (sum.model.ss$sigma)^2
model.ss.rsq <- sum.model.ss$r.squared
model.ss.arsq <- sum.model.ss$adj.r.squared
test.pred.model.ss <- predict(model.ss, newdata=Boston.test) 
model.ss.mpse <- mean((Boston.test$medv-test.pred.model.ss)^2)
modelss.aic <- AIC(model.ss)
modelss.bic <- BIC(model.ss)

#ROW#
stats.model.ss <- c("model.SS", model.ss.mse, model.ss.rsq, model.ss.arsq, model.ss.mpse, modelss.aic, modelss.bic)

data.frame(cbind(comparison_table, stats.model.ss))

##     comparison_table    stats.model.ss
## 1         model type          model.SS
## 2                MSE  23.4685578277093
## 3          R-Squared 0.735693156684753
## 4 Adjusted R-Squared 0.728276383020295
## 5          Test MSPE  19.6614756361586
## 6                AIC   2435.2077778495
## 7                BIC  2487.22617126299

LASSO Variable Selection

Now we use the LASSO variable selection technique, which shrinks the coefficient estimates of non-significant variables to exactly zero.
Here lambda is the penalty parameter that drives the variable selection: the higher the lambda, the fewer variables remain in the model with nonzero coefficients.
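
How a penalty can set coefficients to exactly zero is easiest to see in the special case of orthonormal predictors, where the LASSO solution reduces (up to a scaling convention for lambda) to soft-thresholding the unpenalized coefficients. The sketch below illustrates that idea only; soft_threshold and the coefficient values are hypothetical, and the Boston predictors are correlated, so glmnet's actual solution is not computed this way.

```r
# Soft-thresholding: shrink every coefficient towards zero by lambda, and set
# to exactly zero any coefficient whose absolute value is below lambda
soft_threshold <- function(beta, lambda) {
  sign(beta) * pmax(abs(beta) - lambda, 0)
}

beta_ols <- c(a = 3, b = -1, c = 0.5)   # hypothetical unpenalized coefficients

for (lambda in c(0, 0.75, 2)) {
  shrunk <- soft_threshold(beta_ols, lambda)
  cat("lambda =", lambda, " nonzero coefficients =", sum(shrunk != 0), "\n")
}
```

As lambda grows, the smallest coefficients drop out first and the survivors are shrunk, which is the pattern seen in the coefficient path plot of the fitted model.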

STANDARDIZE COVARIATES

We standardize the predictors before fitting, so that the LASSO penalty treats all coefficients on a common scale:

Boston.X.std <- scale(dplyr::select(Boston, -medv))
X.train<- as.matrix(Boston.X.std)[index,]
X.test<-  as.matrix(Boston.X.std)[-index,]
Y.train<- Boston[index, "medv"]
Y.test<- Boston[-index, "medv"]
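
What scale() does to the predictors can be verified directly. The sketch below is a base-R check, assuming only the Boston data frame from MASS; the names X and X.std are introduced here for illustration. After standardization every column has mean 0 and standard deviation 1.

```r
library(MASS)  # provides the Boston data frame

X <- as.matrix(Boston[, names(Boston) != "medv"])
X.std <- scale(X)   # subtract column means, divide by column standard deviations

col_means <- colMeans(X.std)
col_sds <- apply(X.std, 2, sd)
print(round(col_means, 10))
print(round(col_sds, 10))

# scale() stores the constants it used as attributes of the result
print(head(attr(X.std, "scaled:center"), 3))
print(head(attr(X.std, "scaled:scale"), 3))
```

Note that glmnet also standardizes predictors internally by default (standardize = TRUE), so doing it up front mainly affects the scale on which the reported coefficients are expressed.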

FIT MODEL

We fit the LASSO model to our data. From the plot below, we see that as the value of lambda keeps on increasing, the coefficients for the variables tend to 0.

lasso.fit<- glmnet(x=X.train, y=Y.train, alpha = 1)
plot(lasso.fit, xvar = "lambda", label=TRUE)

CV TO GET OPTIMAL LAMBDA

Using cross-validation, we now find an appropriate value of lambda from the plot of cross-validation error versus lambda.
We consider two values: lambda.min, which gives the lowest cross-validation error, and lambda.1se, the largest lambda whose error is within one standard error of that minimum. We then examine the models corresponding to both. The larger lambda.1se applies a stronger penalty, so fewer variables are selected.

For the model with lambda.min, the coefficients of age and indus are shrunk to zero. For the model with lambda.1se, the coefficients of indus, age, rad and tax are shrunk to zero.

cv.lasso<- cv.glmnet(x=X.train, y=Y.train, alpha = 1, nfolds = 10)
plot(cv.lasso)

names(cv.lasso)

##  [1] "lambda"     "cvm"        "cvsd"       "cvup"       "cvlo"      
##  [6] "nzero"      "name"       "glmnet.fit" "lambda.min" "lambda.1se"

#Lambda with minimum error
cv.lasso$lambda.min

## [1] 0.02847133

#Largest lambda with CV error within 1 standard error of the minimum
cv.lasso$lambda.1se

## [1] 0.3198253
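
The two lambda values can be reproduced from the cvm and cvsd components listed by names(cv.lasso). The sketch below applies the one-standard-error rule to a small set of hypothetical cross-validation results (the numbers are made up for illustration and are not from this analysis).

```r
# Hypothetical cross-validation results, ordered from largest to smallest lambda
lambda <- c(1.00, 0.50, 0.25, 0.10, 0.05)
cvm    <- c(40.0, 30.0, 26.0, 25.0, 25.5)   # mean CV error for each lambda
cvsd   <- c( 3.0,  2.5,  2.0,  2.0,  2.0)   # standard error of that mean

# lambda.min: the lambda with the lowest mean CV error
i_min      <- which.min(cvm)
lambda_min <- lambda[i_min]

# lambda.1se: the largest lambda whose mean error is within one standard
# error of the minimum
threshold  <- cvm[i_min] + cvsd[i_min]
lambda_1se <- max(lambda[cvm <= threshold])

print(c(lambda.min = lambda_min, lambda.1se = lambda_1se))
```

Choosing lambda.1se trades a small, statistically indistinguishable increase in cross-validation error for a simpler model, which matches the behavior reported above where rad and tax are additionally dropped.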

#Coefficients for Lambda min
coef(lasso.fit, s=cv.lasso$lambda.min)

## 14 x 1 sparse Matrix of class "dgCMatrix"
##                      1
## (Intercept) 22.4388805
## crim        -1.0571455
## zn           1.1597953
## indus        .        
## chas         0.8851066
## nox         -1.8571240
## rm           2.2585490
## age          .        
## dis         -3.2309659
## rad          2.5828855
## tax         -1.8169441
## ptratio     -2.0912414
## black        0.9226658
## lstat       -4.0046371

#Coefficients for lambda 1se
coef(lasso.fit, s=cv.lasso$lambda.1se)

## 14 x 1 sparse Matrix of class "dgCMatrix"
##                      1
## (Intercept) 22.4923614
## crim        -0.2688068
## zn           0.3094614
## indus        .        
## chas         0.7857114
## nox         -0.5591879
## rm           2.5469712
## age          .        
## dis         -1.1976543
## rad          .        
## tax          .        
## ptratio     -1.6591923
## black        0.6574595
## lstat       -4.0502353

Model Stats

Computing various model performance metrics:

#TRAIN DATA PREDICTION
pred.lasso.train.min <- predict(lasso.fit, newx = X.train, s=cv.lasso$lambda.min)
pred.lasso.train.1se <- predict(lasso.fit, newx = X.train, s=cv.lasso$lambda.1se)

#TEST DATA PREDICTION
pred.lasso.test.min<- predict(lasso.fit, newx = X.test, s=cv.lasso$lambda.min)
pred.lasso.test.1se<- predict(lasso.fit, newx = X.test, s=cv.lasso$lambda.1se)

#MSE
lasso.min.mse <- sum((Y.train-pred.lasso.train.min)^2)/(404-14)
lasso.1se.mse <- sum((Y.train-pred.lasso.train.1se)^2)/(404-11)

#MSPE
lasso.min.mpse <- mean((Y.test-pred.lasso.test.min)^2)
lasso.1se.mpse <- mean((Y.test-pred.lasso.test.1se)^2)

#R_squared
sst <- sum((Y.train - mean(Y.train))^2)
sse_min <- sum((Y.train-pred.lasso.train.min)^2)
sse_1se <- sum((Y.train-pred.lasso.train.1se)^2)

rsq_min <- 1 - sse_min / sst
rsq_1se <- 1 - sse_1se / sst

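#adj_R_squared
#adj r squared = 1 - ((n-1)/(n-p-1))(1-r_squared)
#p is set to 12 (lambda.min) and 10 (lambda.1se) below: the number of nonzero coefficients shown above, counting the intercept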

adj_rsq_min <- 1 - (dim(X.train)[1]-1)*(1-rsq_min)/(dim(X.train)[1]-12-1)
adj_rsq_1se <- 1 - (dim(X.train)[1]-1)*(1-rsq_1se)/(dim(X.train)[1]-10-1)

stats.model.lasso.min <- c("model.lasso.min", lasso.min.mse, rsq_min, adj_rsq_min, lasso.min.mpse)
stats.model.lasso.1se <- c("model.lasso.1se", lasso.1se.mse, rsq_1se, adj_rsq_1se, lasso.1se.mpse)

comparison_table <- c("model type", "MSE", "R-Squared", "Adjusted R-Squared", "Test MSPE")
data.frame(cbind(comparison_table, stats.model.lasso.min, stats.model.lasso.1se))

##     comparison_table stats.model.lasso.min stats.model.lasso.1se
## 1         model type       model.lasso.min       model.lasso.1se
## 2                MSE      23.6285729645258      26.3963732376923
## 3          R-Squared     0.735248738094418     0.701961239023836
## 4 Adjusted R-Squared     0.727123379672764     0.694377555538438
## 5          Test MSPE      19.2715641026489      18.8801839579017

Comparing models from subset selection and LASSO with the full model

Comparing the performance of the 4 models obtained so far:

MSE: The MSEs of all models are comparable at around 23.5, except the LASSO.1se model, which has an MSE of 26.40
R-Squared: The full model performs best in this category, as expected, and the LASSO.1se model performs worst, again as expected
Adjusted R-squared: A better metric for comparing models with different numbers of variables; the subset selection model performs best here
Test MSPE: The LASSO.1se model performs best here with an MSPE of 18.88. The other models also do well, with values between 19.27 and 19.73

We select the subset selection model as our best model: Full model - age - indus

data.frame(cbind(comparison_table, c("full", model1.mse, model1.rsq, model1.arsq, model1.mpse), c("model.SS", model.ss.mse, model.ss.rsq, model.ss.arsq, model.ss.mpse), stats.model.lasso.min, stats.model.lasso.1se))

##     comparison_table                V2                V3
## 1         model type              full          model.SS
## 2                MSE    23.57194554198  23.4685578277093
## 3          R-Squared 0.735883231832148 0.735693156684753
## 4 Adjusted R-Squared 0.727079339559886 0.728276383020295
## 5          Test MSPE  19.7293018987321  19.6614756361586
##   stats.model.lasso.min stats.model.lasso.1se
## 1       model.lasso.min       model.lasso.1se
## 2      23.6285729645258      26.3963732376923
## 3     0.735248738094418     0.701961239023836
## 4     0.727123379672764     0.694377555538438
## 5      19.2715641026489      18.8801839579017
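
Because c() mixes a label with numbers, the tables above store every value as text. A sketch of an alternative layout is shown below: one row per model with numeric columns, so the metrics can be rounded and ranked directly. The values are copied (rounded) from the table above, and the data frame name comparison and its column names are introduced here for illustration.

```r
# One row per model, numeric metric columns (values rounded from the table above)
comparison <- data.frame(
  model     = c("full", "model.SS", "model.lasso.min", "model.lasso.1se"),
  mse       = c(23.5719, 23.4686, 23.6286, 26.3964),
  rsq       = c(0.735883, 0.735693, 0.735249, 0.701961),
  adj_rsq   = c(0.727079, 0.728276, 0.727123, 0.694378),
  test_mspe = c(19.7293, 19.6615, 19.2716, 18.8802)
)

print(comparison)

# Which model wins on each criterion
best_adj_rsq <- comparison$model[which.max(comparison$adj_rsq)]
best_mspe    <- comparison$model[which.min(comparison$test_mspe)]
print(c(adj_rsq = best_adj_rsq, test_mspe = best_mspe))
```

Laid out this way, it is easy to see that the ranking depends on the criterion: the subset selection model leads on adjusted R-squared, while the LASSO.1se model leads on test MSPE.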

Residual Analysis plots

We do a quick residual analysis of the selected subset model and observe the following:

The variance of the residuals is not completely constant, so the assumption of constant variance is not fully satisfied
The Q-Q plot shows that the residuals are not completely normal and are a little skewed to the right
No autocorrelation is observed in the residuals
No outliers are observed

par(mfrow=c(2,2))
plot(model.ss)
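
The visual impressions from the diagnostic plots can be backed by simple numeric checks. The sketch below is a rough, base-R-only supplement and not part of the original analysis: it refits the selected model on its own split, runs a Shapiro-Wilk test for normality of the residuals, and uses the correlation between absolute residuals and fitted values as a crude indicator of non-constant variance. The names fit, res, sw and spread_cor are introduced here for illustration.

```r
library(MASS)  # provides the Boston data frame

set.seed(12420352)
index <- sample(nrow(Boston), floor(nrow(Boston) * 0.80))
fit <- lm(medv ~ . - indus - age, data = Boston[index, ])
res <- residuals(fit)

# Normality: a small p-value is evidence against normally distributed residuals
sw <- shapiro.test(res)
print(sw)

# Constant variance (crude check): if the spread of the residuals changes with
# the fitted value, |residual| and fitted value will be correlated
spread_cor <- cor(abs(res), fitted(fit))
print(spread_cor)
```

These checks do not replace the plots, but they give a number to report alongside statements such as "not completely normal".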

Boston Dataset

Hemang Goswami
