Executive Summary

The Boston housing dataset contains the median value of owner-occupied homes along with 13 other variables that could potentially be related to housing prices, covering factors such as socio-economic conditions, environmental conditions, and educational facilities. There are 506 observations on 14 variables, including the median home price (medv). Of the 13 predictors, 12 are numerical and 1 is categorical. The aim of this project is to build a linear regression model to estimate the median price of owner-occupied homes in Boston.

The summary statistics of the dataset were first examined to understand the range of values of each variable. The correlation matrix was then inspected, and we notice that housing value has a strong positive correlation with rm (the average number of rooms). This is expected, as a more spacious house with more rooms would command a higher valuation. Also, medv has a strong negative correlation with lstat, meaning that a house in an area with lower socio-economic status tends to have a lower value.

Multiple linear regression was then performed on all variables, without any standardization, to generate benchmark results. These results were later compared with those from the other regression techniques to assess prediction accuracy.

The variable selection methods used were best subset selection, forward selection, backward elimination, stepwise selection, and LASSO regression. Their results were compared to arrive at the best-fitting model for predicting median housing values in Boston. The models produced similar results with no considerable differences between them: LASSO suggested a 10-variable model, whereas all the other techniques suggested 11-variable models.

Finally, residual diagnostics were performed on the best fit model and the assumptions of linear regression modeling were checked.

Data Summary

library(MASS)   # provides the Boston dataset

data(Boston)
#random 80/20 train-test split
index <- sample(nrow(Boston), nrow(Boston)*0.80)
train <- Boston[index,]
test <- Boston[-index,]

Let’s take a look at the summary statistics and the distribution of observations in each variable. There are no missing values in the data, but the summary statistics suggest that some variables may contain outliers.

#summary statistics
summary(train)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08354   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25372   Median :  0.00   Median : 8.56   Median :0.00000  
##  Mean   : 3.68856   Mean   : 11.38   Mean   :11.14   Mean   :0.06931  
##  3rd Qu.: 3.54343   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox              rm             age              dis        
##  Min.   :0.385   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.449   1st Qu.:5.883   1st Qu.: 43.62   1st Qu.: 2.106  
##  Median :0.532   Median :6.189   Median : 79.05   Median : 3.158  
##  Mean   :0.554   Mean   :6.273   Mean   : 68.74   Mean   : 3.819  
##  3rd Qu.:0.624   3rd Qu.:6.597   3rd Qu.: 94.35   3rd Qu.: 5.287  
##  Max.   :0.871   Max.   :8.725   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:280.5   1st Qu.:17.23   1st Qu.:376.09  
##  Median : 5.000   Median :330.0   Median :19.10   Median :391.44  
##  Mean   : 9.483   Mean   :406.7   Mean   :18.46   Mean   :358.84  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.25  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat             medv      
##  Min.   : 1.730   Min.   : 5.00  
##  1st Qu.: 7.135   1st Qu.:17.10  
##  Median :11.570   Median :21.20  
##  Mean   :12.820   Mean   :22.63  
##  3rd Qu.:17.225   3rd Qu.:25.02  
##  Max.   :37.970   Max.   :50.00

Now, let’s look at the spread of the values of each variable in the dataset. This is done by plotting boxplots of all the variables.

boxplot(train)

From the boxplot, it is evident that there are outliers in the variables ‘crim’, ‘zn’ and ‘black’. The highest variability is observed in the full-value property tax rate (‘tax’). We are interested in questions such as ‘does a higher tax rate imply a higher median price?’, and we calculate pairwise correlations to answer them.
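
Because the variables are on very different scales, the raw boxplot is dominated by the variables with the largest values. An optional sketch with standardized columns makes the spreads easier to compare:

#optional: standardize the columns so the boxes are on a comparable scale
boxplot(scale(train), las = 2)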

library(ggcorrplot)   # correlation heatmap

corr <- round(cor(train), 1)
ggcorrplot(corr, hc.order = TRUE, type = "lower", lab = TRUE,
   outline.col = "white",
   ggtheme = ggplot2::theme_gray,
   colors = c("#6D9EC1", "white", "#E46726"))

A clear negative correlation is observed between ‘tax’ and ‘medv’, so higher tax rates do not correspond to higher median prices. The feature with the least correlation to medv is proximity to the Charles River (chas). In some cases, we observe correlations that round to zero.
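
These observations can be checked directly by ranking each variable’s correlation with medv; a quick sketch:

#rank each variable's correlation with medv (unrounded)
sort(cor(train)[, "medv"])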

Linear Regression

Let’s perform linear regression on the dataset. The variable of interest is ‘medv’, and our main objective is to predict its value from the other independent variables in the dataset. We first fit a multiple linear regression using all the variables.

fit <- lm(medv~.,data = train)
sum.fit <- summary(fit)

From the summary of the model, we observe that lstat is the most significant predictor, as it has the lowest p-value, followed by rm. The coefficient estimate for lstat is negative, indicating that the response variable medv decreases as the percentage of people with lower socio-economic status increases. We also observe that the variables age and indus are not significant in determining the median value of homes in Boston.
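
To see this ordering directly, the coefficient table can be sorted by p-value; a minimal sketch using the summary object above:

#coefficient table ordered by p-value (most significant first)
coefs <- coef(sum.fit)
round(coefs[order(coefs[, "Pr(>|t|)"]), ], 4)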

Model Stats

Checking the model statistics, using MSE, R-squared, adjusted R-squared, test MSPE (mean squared prediction error on the test set), AIC and BIC as metrics:

fit.mse <- (sum.fit$sigma)^2
fit.rsq <- sum.fit$r.squared
fit.arsq <- sum.fit$adj.r.squared
test.pred.fit <- predict(fit, newdata=test) 
fit.mpse <- mean((test$medv-test.pred.fit)^2)
fit.aic <- AIC(fit)
fit.bic <- BIC(fit)

stats.fit <- c("FULL", fit.mse, fit.rsq, fit.arsq, fit.mpse, fit.aic, fit.bic)

comparison_table <- c("model type", "MSE", "R-Squared", "Adjusted R-Squared", "Test MSPE", "AIC", "BIC")
data.frame(cbind(comparison_table, stats.fit))
##     comparison_table         stats.fit
## 1         model type              FULL
## 2                MSE  21.6469957573516
## 3          R-Squared 0.754758430353452
## 4 Adjusted R-Squared 0.746583711365233
## 5          Test MSPE  26.5009972929696
## 6                AIC  2404.50014575205
## 7                BIC  2464.52136892146

We will now apply different techniques of variable selection to decide the best fit model.

Best subset selection

Here we apply the best subset selection approach to the Boston data, aiming to predict the median house value from the available predictors. We decide on the number of variables required in the model using BIC, R-squared and adjusted R-squared.

library(leaps)   # regsubsets() for best subset selection

regs_model <- regsubsets(medv~., train, nvmax = 13)
regs_summ <- summary(regs_model)
cbind(regs_summ$which, bic=regs_summ$bic, adj.r2=regs_summ$adjr2)
##    (Intercept) crim zn indus chas nox rm age dis rad tax ptratio black
## 1            1    0  0     0    0   0  0   0   0   0   0       0     0
## 2            1    0  0     0    0   0  1   0   0   0   0       0     0
## 3            1    0  0     0    0   0  1   0   0   0   0       1     0
## 4            1    0  0     0    0   0  1   0   1   0   0       1     0
## 5            1    0  0     0    0   1  1   0   1   0   0       1     0
## 6            1    0  0     0    1   1  1   0   1   0   0       1     0
## 7            1    0  0     0    1   1  1   0   1   0   0       1     1
## 8            1    0  1     0    1   1  1   0   1   0   0       1     1
## 9            1    1  0     0    1   1  1   0   1   1   0       1     1
## 10           1    1  1     0    1   1  1   0   1   1   1       1     0
## 11           1    1  1     0    1   1  1   0   1   1   1       1     1
## 12           1    1  1     0    1   1  1   1   1   1   1       1     1
## 13           1    1  1     1    1   1  1   1   1   1   1       1     1
##    lstat       bic    adj.r2
## 1      1 -311.5275 0.5499215
## 2      1 -406.4929 0.6485755
## 3      1 -456.8011 0.6935325
## 4      1 -468.6637 0.7060538
## 5      1 -487.8519 0.7231279
## 6      1 -492.3544 0.7295542
## 7      1 -494.4685 0.7342633
## 8      1 -494.4398 0.7375002
## 9      1 -492.9798 0.7397757
## 10     1 -494.5191 0.7439378
## 11     1 -495.7083 0.7478134
## 12     1 -489.7991 0.7472261
## 13     1 -483.8069 0.7465837

The BIC and adjusted R-squared values in the table above suggest that the model with 11 variables is the best. We therefore arrive at medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + black + lstat, with 11 predictors, as our best model.
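
For completeness, a short sketch (reusing regs_model and regs_summ from above) confirms which size each criterion picks and extracts the corresponding coefficients:

#which model size does each criterion choose, and what are its coefficients?
which.min(regs_summ$bic)     # size minimizing BIC
which.max(regs_summ$adjr2)   # size maximizing adjusted R-squared
coef(regs_model, 11)         # coefficients of the 11-variable model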

Model Stats

Let us compare this model’s stats with those obtained from the full model in the previous section.

model.ss <- lm(medv ~ . -indus -age, data=train)
sum.model.ss <- summary(model.ss)
sum.model.ss
## 
## Call:
## lm(formula = medv ~ . - indus - age, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.6654  -2.8045  -0.5478   2.0366  25.9836 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.486454   5.613851   6.856 2.77e-11 ***
## crim         -0.103586   0.035834  -2.891  0.00406 ** 
## zn            0.046075   0.014929   3.086  0.00217 ** 
## chas          2.735509   0.939021   2.913  0.00378 ** 
## nox         -16.380386   3.929948  -4.168 3.78e-05 ***
## rm            3.649630   0.442425   8.249 2.45e-15 ***
## dis          -1.511341   0.202103  -7.478 4.99e-13 ***
## rad           0.271856   0.068768   3.953 9.15e-05 ***
## tax          -0.011179   0.003755  -2.977  0.00309 ** 
## ptratio      -0.997898   0.144439  -6.909 1.99e-11 ***
## black         0.008107   0.003056   2.653  0.00830 ** 
## lstat        -0.526805   0.050377 -10.457  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.641 on 392 degrees of freedom
## Multiple R-squared:  0.7547, Adjusted R-squared:  0.7478 
## F-statistic: 109.6 on 11 and 392 DF,  p-value: < 2.2e-16
model.ss.mse <- (sum.model.ss$sigma)^2
model.ss.rsq <- sum.model.ss$r.squared
model.ss.arsq <- sum.model.ss$adj.r.squared
test.pred.model.ss <- predict(model.ss, newdata=test) 
model.ss.mpse <- mean((test$medv-test.pred.model.ss)^2)
modelss.aic <- AIC(model.ss)
modelss.bic <- BIC(model.ss)

stats.model.ss <- c("best_subset", model.ss.mse, model.ss.rsq, model.ss.arsq, model.ss.mpse, modelss.aic, modelss.bic)

data.frame(cbind(comparison_table, stats.fit, stats.model.ss))
##     comparison_table         stats.fit    stats.model.ss
## 1         model type              FULL       best_subset
## 2                MSE  21.6469957573516  21.5419588675389
## 3          R-Squared 0.754758430353452  0.75469686001213
## 4 Adjusted R-Squared 0.746583711365233 0.747813353532878
## 5          Test MSPE  26.5009972929696  26.4299999958324
## 6                AIC  2404.50014575205  2400.60156125011
## 7                BIC  2464.52136892146   2452.6199546636

Forward selection

Forward stepwise selection is a computationally efficient alternative to best subset selection. While best subset selection considers all 2^p possible models containing subsets of the p predictors, forward stepwise considers a much smaller set of models (roughly 1 + p(p+1)/2).
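
With p = 13 predictors the difference is substantial; a quick arithmetic sketch:

#number of candidate models examined when p = 13 (illustrative arithmetic only)
p <- 13
2^p                   # best subset: every possible subset (8192 models)
1 + p * (p + 1) / 2   # forward stepwise: null model plus p + (p-1) + ... + 1 fits (92 models)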

The best model selected by the forward selection method is given below.

nullmodel <- lm(medv~1, data = train)
fullmodel <- lm(medv~., data = train)

model.step.f<- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel), direction='forward', trace=FALSE)
summary(model.step.f)
## 
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas + 
##     black + zn + crim + rad + tax, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.6654  -2.8045  -0.5478   2.0366  25.9836 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.486454   5.613851   6.856 2.77e-11 ***
## lstat        -0.526805   0.050377 -10.457  < 2e-16 ***
## rm            3.649630   0.442425   8.249 2.45e-15 ***
## ptratio      -0.997898   0.144439  -6.909 1.99e-11 ***
## dis          -1.511341   0.202103  -7.478 4.99e-13 ***
## nox         -16.380386   3.929948  -4.168 3.78e-05 ***
## chas          2.735509   0.939021   2.913  0.00378 ** 
## black         0.008107   0.003056   2.653  0.00830 ** 
## zn            0.046075   0.014929   3.086  0.00217 ** 
## crim         -0.103586   0.035834  -2.891  0.00406 ** 
## rad           0.271856   0.068768   3.953 9.15e-05 ***
## tax          -0.011179   0.003755  -2.977  0.00309 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.641 on 392 degrees of freedom
## Multiple R-squared:  0.7547, Adjusted R-squared:  0.7478 
## F-statistic: 109.6 on 11 and 392 DF,  p-value: < 2.2e-16

We observe that the model is the same as the one selected by the best subset selection method; hence, the model stats also remain the same.

##     comparison_table         stats.fit    stats.model.ss     stats.model.f
## 1         model type              FULL       best_subset forward_selection
## 2                MSE  21.6469957573516  21.5419588675389  21.5419588675389
## 3          R-Squared 0.754758430353452  0.75469686001213  0.75469686001213
## 4 Adjusted R-Squared 0.746583711365233 0.747813353532878 0.747813353532879
## 5          Test MSPE  26.5009972929696  26.4299999958324  26.4299999958324
## 6                AIC  2404.50014575205  2400.60156125011  2400.60156125011
## 7                BIC  2464.52136892146   2452.6199546636   2452.6199546636

Backward elimination

Like forward stepwise selection, backward stepwise selection provides an efficient alternative to best subset selection. However, unlike forward stepwise selection, it begins with the full least squares model containing all p predictors and then iteratively removes the least useful predictor, one at a time.

The best model selected by the backward elimination method is given below.

#backward elimination
model.step.b<- step(fullmodel,direction='backward',trace=FALSE)
summary(model.step.b)
## 
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad + 
##     tax + ptratio + black + lstat, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.6654  -2.8045  -0.5478   2.0366  25.9836 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.486454   5.613851   6.856 2.77e-11 ***
## crim         -0.103586   0.035834  -2.891  0.00406 ** 
## zn            0.046075   0.014929   3.086  0.00217 ** 
## chas          2.735509   0.939021   2.913  0.00378 ** 
## nox         -16.380386   3.929948  -4.168 3.78e-05 ***
## rm            3.649630   0.442425   8.249 2.45e-15 ***
## dis          -1.511341   0.202103  -7.478 4.99e-13 ***
## rad           0.271856   0.068768   3.953 9.15e-05 ***
## tax          -0.011179   0.003755  -2.977  0.00309 ** 
## ptratio      -0.997898   0.144439  -6.909 1.99e-11 ***
## black         0.008107   0.003056   2.653  0.00830 ** 
## lstat        -0.526805   0.050377 -10.457  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.641 on 392 degrees of freedom
## Multiple R-squared:  0.7547, Adjusted R-squared:  0.7478 
## F-statistic: 109.6 on 11 and 392 DF,  p-value: < 2.2e-16

Stepwise selection

Stepwise selection has the flexibility to both add and remove variables in the process of selecting the best regression model. The best model selected by the stepwise selection method is given below.

#stepwise(mixed)
model.step.s<- step(nullmodel, scope=list(lower=nullmodel, upper=fullmodel), direction='both',trace=FALSE)
summary(model.step.s)
## 
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas + 
##     black + zn + crim + rad + tax, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.6654  -2.8045  -0.5478   2.0366  25.9836 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.486454   5.613851   6.856 2.77e-11 ***
## lstat        -0.526805   0.050377 -10.457  < 2e-16 ***
## rm            3.649630   0.442425   8.249 2.45e-15 ***
## ptratio      -0.997898   0.144439  -6.909 1.99e-11 ***
## dis          -1.511341   0.202103  -7.478 4.99e-13 ***
## nox         -16.380386   3.929948  -4.168 3.78e-05 ***
## chas          2.735509   0.939021   2.913  0.00378 ** 
## black         0.008107   0.003056   2.653  0.00830 ** 
## zn            0.046075   0.014929   3.086  0.00217 ** 
## crim         -0.103586   0.035834  -2.891  0.00406 ** 
## rad           0.271856   0.068768   3.953 9.15e-05 ***
## tax          -0.011179   0.003755  -2.977  0.00309 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.641 on 392 degrees of freedom
## Multiple R-squared:  0.7547, Adjusted R-squared:  0.7478 
## F-statistic: 109.6 on 11 and 392 DF,  p-value: < 2.2e-16

Hence, we observe that using stepwise regression in any direction results in the same best model as before, in which medv is regressed on all variables except indus and age.

Lasso regression

We will now use LASSO regression to build the model. Because the LASSO penalty is scale-sensitive, the covariates are standardized first.

#standardize covariates
Boston.X.std<- scale(dplyr::select(Boston, -medv))
X.train<- as.matrix(Boston.X.std)[index,]
X.test<-  as.matrix(Boston.X.std)[-index,]
Y.train<- Boston[index, "medv"]
Y.test<- Boston[-index, "medv"]

library(glmnet)   # glmnet() and cv.glmnet() for the lasso

#fit lasso model (alpha = 1 gives the lasso penalty)
lasso.fit <- glmnet(x=X.train, y=Y.train, family = "gaussian", alpha = 1)
plot(lasso.fit, xvar = "lambda")

Cross-validation to obtain the optimal lambda

Using cross-validation, we find an appropriate lambda from the plot of cross-validation error versus lambda. We consider both lambda.min, the value that gives the lowest cross-validation error, and lambda.1se, the largest lambda whose error is within one standard error of that minimum, and we build models based on both. At the larger lambda, fewer variables are selected.

cv.lasso<- cv.glmnet(x=X.train, y=Y.train, family = "gaussian", alpha = 1, nfolds = 10)
plot(cv.lasso)

cv.lasso$lambda.min
## [1] 0.01345128
cv.lasso$lambda.1se
## [1] 0.2405962

We find lambda.min to be 0.0135 and lambda.1se to be 0.241.

We fit the model using both lambda.1se and lambda.min. We observe that lambda.1se retains fewer predictors (10, with age, rad and tax shrunk to zero) than lambda.min, which keeps all 13.
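
A quick check of how many predictors survive at each lambda (a sketch using the fitted objects above):

#count the non-zero coefficients (excluding the intercept) at each lambda
sum(as.vector(coef(lasso.fit, s = cv.lasso$lambda.1se))[-1] != 0)
sum(as.vector(coef(lasso.fit, s = cv.lasso$lambda.min))[-1] != 0)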

Coefficients for lambda.1se

coef(lasso.fit, s=cv.lasso$lambda.1se)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                      1
## (Intercept) 22.7688737
## crim        -0.2984214
## zn           0.2243296
## indus       -0.1666811
## chas         0.6019586
## nox         -0.8193901
## rm           2.8285363
## age          .        
## dis         -1.5187074
## rad          .        
## tax          .        
## ptratio     -1.9292898
## black        0.5784803
## lstat       -3.6979675

Coefficients for lambda.min

coef(lasso.fit, s=cv.lasso$lambda.min)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept) 22.78140868
## crim        -0.84940440
## zn           0.99725880
## indus       -0.08037496
## chas         0.69929510
## nox         -1.79191071
## rm           2.59897984
## age         -0.09602244
## dis         -3.12359093
## rad          2.11532641
## tax         -1.64535068
## ptratio     -2.13760198
## black        0.73231750
## lstat       -3.72307947

We select lambda.1se to obtain a less complex model.
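
For reference, the LASSO column in the comparison table below can be obtained along these lines (a sketch; object names such as lasso.1se.mspe are illustrative):

#test-set performance of the lasso at lambda.1se
pred.lasso.1se <- predict(lasso.fit, newx = X.test, s = cv.lasso$lambda.1se)
lasso.1se.mspe <- mean((Y.test - pred.lasso.1se)^2)
lasso.1se.mspe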

Comparing model performances

  • MSE: The full model and the best subset model have a better (lower) MSE than LASSO.
  • R-squared: The full model performs best on this metric, as expected, and the LASSO (lambda.1se) model performs the worst.
  • Adjusted R-squared: A better metric for comparing models with different numbers of variables; the best subset model performs best here.
  • Test MSPE: The best subset model performs best here, with the lowest MSPE of 26.43.

We therefore select the best subset model as our final model: medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + black + lstat

##     comparison_table         stats.fit    stats.model.ss
## 1         model type              FULL       best_subset
## 2                MSE  21.6469957573516  21.5419588675389
## 3          R-Squared 0.754758430353452  0.75469686001213
## 4 Adjusted R-Squared 0.746583711365233 0.747813353532878
## 5          Test MSPE  26.5009972929696  26.4299999958324
## 6                AIC  2404.50014575205  2400.60156125011
## 7                BIC  2464.52136892146   2452.6199546636
##   stats.model.lasso.1se
## 1                 LASSO
## 2      23.2090350619489
## 3     0.735039287433406
## 4     0.728297284569116
## 5       29.147140630975
## 6                    NA
## 7                    NA

Residual diagnostics

Residual diagnostics need to be performed to check the validity of the assumptions made in building the regression model; we draw residual plots for this purpose. From the residual plots, it can be inferred that i) the variance of the residuals is not constant, although a transformation of the response could address this; ii) the Q-Q plot suggests approximately normal residuals but with a skewed tail; iii) there is no apparent autocorrelation; and iv) there are no highly influential outliers.

par(mfrow=c(2,2))
plot(model.ss)
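
As noted above, a transformation of the response is one way to address the non-constant variance. A minimal sketch of such a refit (assuming the same predictors as the best subset model; this is illustrative, not part of the selected model) is:

#hypothetical refit with a log-transformed response to stabilize the variance
model.log <- lm(log(medv) ~ . - indus - age, data = train)
par(mfrow=c(2,2))
plot(model.log)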