Overview

In this report I will apply the regression models that I have implemented to accurately predict the housing prices in Boston suburbs. The dataset for this experiment is accessed from the UCI Machine Learning repository via https://archive.ics.uci.edu/ml/datasets/Housing. The report is organized in such a way as to demonstrate the entire process right from getting and cleaning the data, to exploratory analysis of the dataset to understand the distribution and importance of various features in influencing the algorithm, to coming with a hypothesis, training ML models, evaluation of the models, etc.

Introduction

The dataset consists of 506 observations of 14 attributes. The median value of house price in $1000s, denoted by MEDV, is the outcome or the dependent variable in our model. Below is a brief description of each feature and the outcome in our dataset:

CRIM – per capita crime rate by town ZN – proportion of residential land zoned for lots over 25,000 sq.ft INDUS – proportion of non-retail business acres per town CHAS – Charles River dummy variable (1 if tract bounds river; else 0) NOX – nitric oxides concentration (parts per 10 million) RM – average number of rooms per dwelling AGE – proportion of owner-occupied units built prior to 1940 DIS – weighted distances to five Boston employment centres RAD – index of accessibility to radial highways TAX – full-value property-tax rate per $10,000 PTRATIO – pupil-teacher ratio by town B – 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town LSTAT – % lower status of the population MEDV – Median value of owner-occupied homes in $1000’s

Package Required

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## -- Attaching packages -------------------------------------------------- tidyverse 1.2.1 --

## v tibble  2.1.3     v purrr   0.3.2
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## -- Conflicts ----------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(corrplot)

## corrplot 0.84 loaded

library(rpart)
library (caTools)
library (sp)
library (raster)

## 
## Attaching package: 'raster'

## The following object is masked from 'package:tidyr':
## 
##     extract

## The following object is masked from 'package:dplyr':
## 
##     select

library(usdm)
library(lmtest)

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library (sandwich)
library(broom)
options (scipen=99999)

Getting and Cleaning the Data

myData  <- read.csv ("E:/4th sem/BA_Assignment_Group4/Raw/Boston.csv",header=TRUE)              
df<-myData

Looking the data

dim(df)

## [1] 506  15

head(df)

##   X    crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7

Removing the first column

df<- df[,-1]

Structure of the data

str(df)

## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : int  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

summary(df)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

IDENTIFY MISSING VALUES

sum(is.na(df))

## [1] 0

IDENTIFY OUTLIERS & TREAT

par(mfrow = c(1,2))
boxplot(df)

abline(h = min(df), col = "Blue")
abline(h = max(df), col = "Yellow")

abline(h = quantile(df$medv, c(0.25, 0.75)), col = "Red")
abline(h = quantile(df$black, c(0.25, 0.75)), col = "Red")

#### OUTLIERS TREATMENT, CAPPING OUTLIERS    

caps <- quantile(df$medv, probs=c(.05, .95), na.rm = T)
df$medv <- ifelse (df$medv < caps[1],caps[1],df$medv)
df$medv <- ifelse (df$medv > caps[2],caps[2],df$medv)  


caps <- quantile(df$crim, probs=c(.05, .95), na.rm = T)
df$crim <- ifelse (df$crim < caps[1],caps[1],df$crim)
df$crim <- ifelse (df$crim > caps[2],caps[2],df$crim)  


caps <- quantile(df$indus, probs=c(.05, .95), na.rm = T)
df$indus <- ifelse (df$indus < caps[1],caps[1],df$indus)
df$indus <- ifelse (df$indus > caps[2],caps[2],df$indus)  

caps <- quantile(df$dis, probs=c(.05, .95), na.rm = T)
df$dis <- ifelse (df$dis < caps[1],caps[1],df$dis)
df$dis <- ifelse (df$dis > caps[2],caps[2],df$dis)  

caps <- quantile(df$rad, probs=c(.05, .95), na.rm = T)
df$rad <- ifelse (df$rad < caps[1],caps[1],df$rad)
df$rad <- ifelse (df$rad > caps[2],caps[2],df$rad)  

caps <- quantile(df$tax, probs=c(.05, .95), na.rm = T)
df$tax <- ifelse (df$tax < caps[1],caps[1],df$tax)
df$tax <- ifelse (df$tax > caps[2],caps[2],df$tax)  

caps <- quantile(df$zn, probs=c(.05, .95), na.rm = T)
df$zn <- ifelse (df$zn < caps[1],caps[1],df$zn)
df$zn <- ifelse (df$zn > caps[2],caps[2],df$zn)  

caps <- quantile(df$black, probs=c(.05, .95), na.rm = T)
df$black <- ifelse (df$black < caps[1],caps[1],df$black)
df$black <- ifelse (df$black > caps[2],caps[2],df$black)  

boxplot(df)

Some outliers were present in medv, crim , indus, zn, dis, rad, tax and black. They are treated by replacing the extreme values by 95 and 5 percentile values.

Data Exploration

#checking correlation between variables

par(mfrow=c(1,1))
corrplot(cor(df), method = "number", type = "upper", diag = FALSE)

There are no missing values.

From correlation matrix, some of the observations made are as follows:

Median value of owner-occupied homes (in 1000$) increases as average number of rooms per dwelling increases and it decreases if percent of lower status population in the area increases
nox or nitrogen oxides concentration (ppm) increases with increase in proportion of non-retail business acres per town and proportion of owner-occupied units built prior to 1940.
rad and tax have a strong positive correlation of 0.92 which implies that as accessibility of radial highways increases, the full value property-tax rate per $10,000 also increases.
crim is strongly associated with variables rad and tax which implies as accessibility to radial highways increases, per capita crime rate increases.
indus has strong positive correlation with nox, which supports the notion that nitrogen oxides concentration is high in industrial areas

Data Visualization

df %>%
  gather(key, val, -medv) %>%
  ggplot(aes(x = val, y = medv)) +
  geom_point() +
  stat_smooth(method = "lm", se = TRUE, col = "blue") +
  facet_wrap(~key, scales = "free") +
  theme_gray() +
  ggtitle("Scatter plot of dependent variables vs Median Value (medv)")

table(df$chas)

## 
##   0   1 
## 471  35

#Observations
1. Proportion of owner occupied units built prior to 1940 (age) and proportion of blacks by town (black) is heavily skewed to left, while per capita crime rate in town (crim) and weighted mean of distances to five Boston employment centres (dis) is heavily skewed to right. 2. rm is normally distributed with mean of approximately 6. 3. Most of the properties are situated close to the five Boston employment centres (dis skewed to right) 4. There is a high proportion of owner occupied units built prior to 1940 (age skewed to left) and blacks in town (black skewed to right) 5. From scatter plots, it is seen that lstat and rm show strong correlation with medv. 6. 93% of the properties are away from Charles river. The properties bordering the river seems to have higher median prices.

Building Linear Regression Model

model1 <- lm(medv ~ ., data = df)
model1.sum <- summary(model1)
model1.sum

## 
## Call:
## lm(formula = medv ~ ., data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.2319  -2.5572  -0.4885   1.5944  20.5989 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.799995   4.600436   8.434 3.74e-16 ***
## crim         -0.190658   0.089359  -2.134 0.033367 *  
## zn            0.033755   0.012335   2.736 0.006434 ** 
## indus        -0.023110   0.058157  -0.397 0.691259    
## chas          2.039352   0.752394   2.710 0.006953 ** 
## nox         -17.873037   3.416867  -5.231 2.50e-07 ***
## rm            3.453089   0.362500   9.526  < 2e-16 ***
## age          -0.012814   0.011593  -1.105 0.269575    
## dis          -1.591317   0.195170  -8.154 2.96e-15 ***
## rad           0.304452   0.068996   4.413 1.26e-05 ***
## tax          -0.011082   0.003523  -3.146 0.001756 ** 
## ptratio      -0.945110   0.113561  -8.323 8.56e-16 ***
## black         0.009780   0.002648   3.694 0.000246 ***
## lstat        -0.438845   0.045729  -9.597  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.122 on 492 degrees of freedom
## Multiple R-squared:  0.7597, Adjusted R-squared:  0.7534 
## F-statistic: 119.6 on 13 and 492 DF,  p-value: < 2.2e-16

#Looking at model summary, we see that variables crim, indus and age are insignificant

#Building model without variables indus and age, and convering medv into log

medv_n<- log(df$medv)
df<- cbind(df,medv_n)
model2 <- lm(medv_n ~ .-indus -age -medv, data = df)
model2.sum <- summary(model2)
model2.sum

## 
## Call:
## lm(formula = medv_n ~ . - indus - age - medv, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.60839 -0.10172 -0.00987  0.07779  0.76885 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.1016772  0.1844688  22.235  < 2e-16 ***
## crim        -0.0198838  0.0035893  -5.540 4.93e-08 ***
## zn           0.0010235  0.0004865   2.104  0.03590 *  
## chas         0.0816225  0.0299808   2.722  0.00671 ** 
## nox         -0.8390664  0.1282158  -6.544 1.50e-10 ***
## rm           0.0908910  0.0142014   6.400 3.61e-10 ***
## dis         -0.0572489  0.0073013  -7.841 2.78e-14 ***
## rad          0.0162700  0.0027002   6.026 3.30e-09 ***
## tax         -0.0005146  0.0001269  -4.054 5.85e-05 ***
## ptratio     -0.0396055  0.0045315  -8.740  < 2e-16 ***
## black        0.0004556  0.0001061   4.294 2.12e-05 ***
## lstat       -0.0229497  0.0017263 -13.294  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1657 on 494 degrees of freedom
## Multiple R-squared:  0.7979, Adjusted R-squared:  0.7934 
## F-statistic: 177.3 on 11 and 494 DF,  p-value: < 2.2e-16

Multicolinerity test

#Checking vif

vif(df[,-c(3,7,14,15)])

##    Variables       VIF
## 1       crim  5.165720
## 2         zn  2.180946
## 3       chas  1.067136
## 4        nox  4.062275
## 5         rm  1.832256
## 6        dis  3.713727
## 7        rad 10.086903
## 8        tax  8.217313
## 9    ptratio  1.771196
## 10     black  1.368923
## 11     lstat  2.796591

Multicolinearity between tax and rad as VIF > 10 so we will remove tax because rad is more significant

#Building model without variables indus, age and tax

model3 <- lm(medv_n ~ .-indus -age -tax -medv, data = df)
model3.sum <- summary(model2)
model3.sum

## 
## Call:
## lm(formula = medv_n ~ . - indus - age - medv, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.60839 -0.10172 -0.00987  0.07779  0.76885 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.1016772  0.1844688  22.235  < 2e-16 ***
## crim        -0.0198838  0.0035893  -5.540 4.93e-08 ***
## zn           0.0010235  0.0004865   2.104  0.03590 *  
## chas         0.0816225  0.0299808   2.722  0.00671 ** 
## nox         -0.8390664  0.1282158  -6.544 1.50e-10 ***
## rm           0.0908910  0.0142014   6.400 3.61e-10 ***
## dis         -0.0572489  0.0073013  -7.841 2.78e-14 ***
## rad          0.0162700  0.0027002   6.026 3.30e-09 ***
## tax         -0.0005146  0.0001269  -4.054 5.85e-05 ***
## ptratio     -0.0396055  0.0045315  -8.740  < 2e-16 ***
## black        0.0004556  0.0001061   4.294 2.12e-05 ***
## lstat       -0.0229497  0.0017263 -13.294  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1657 on 494 degrees of freedom
## Multiple R-squared:  0.7979, Adjusted R-squared:  0.7934 
## F-statistic: 177.3 on 11 and 494 DF,  p-value: < 2.2e-16

#Again Checking the VIF

vif(df[,-c(3,7,10,14,15)])

##    Variables      VIF
## 1       crim 5.151328
## 2         zn 2.085299
## 3       chas 1.057882
## 4        nox 3.870185
## 5         rm 1.802072
## 6        dis 3.648244
## 7        rad 5.053853
## 8    ptratio 1.734311
## 9      black 1.365118
## 10     lstat 2.790153

No multicolinearity as VIF < 10

Heteroscedasticity Test

par(mfrow=c(2,2))
plot(model3)

The top-left is the chart of residuals vs fitted values, while in the bottom-left one, it is standardised residuals on Y axis. If there is absolutely no heteroscedastity, you should see a completely random, equal distribution of points throughout the range of X axis and a flat red line.

White Test

bptest (df$medv_n ~ . -indus -age -tax -medv, data=df)

## 
##  studentized Breusch-Pagan test
## 
## data:  df$medv_n ~ . - indus - age - tax - medv
## BP = 65.679, df = 10, p-value = 3.004e-10

These test have a p-value less that a significance level of 0.05, therefore we can reject the null hypothesis that the variance of the residuals is constant and infer that heteroscedasticity is indeed present.

Newey West Covariance Matrix Estimation

#Newey & West (1994) compute this type of estimator
NeweyWest(model2)

##               (Intercept)          crim            zn          chas
## (Intercept)  2.472207e-01 -2.014728e-04  1.218395e-04  5.945737e-03
## crim        -2.014728e-04  2.717476e-05 -4.952847e-07  2.746371e-05
## zn           1.218395e-04 -4.952847e-07  3.106551e-07  3.483521e-06
## chas         5.945737e-03  2.746371e-05  3.483521e-06  1.757547e-03
## nox         -7.593736e-02 -2.490702e-04 -2.026979e-05 -2.992160e-03
## rm          -2.021117e-02  3.343812e-05 -1.378503e-05 -5.776250e-04
## dis         -5.396112e-03 -3.691817e-06 -4.505236e-06 -1.349409e-04
## rad          1.107975e-03 -7.305302e-06  7.296953e-07  2.566204e-05
## tax         -1.304137e-05  9.798847e-08 -5.550768e-09  6.036050e-07
## ptratio     -1.384055e-03  2.278064e-06 -4.116985e-08 -8.123242e-06
## black       -5.033149e-05  2.497244e-07 -3.947017e-09 -8.444690e-08
## lstat       -1.413577e-03 -7.076829e-07 -8.570409e-07 -4.972699e-05
##                       nox            rm           dis           rad
## (Intercept) -7.593736e-02 -2.021117e-02 -5.396112e-03  1.107975e-03
## crim        -2.490702e-04  3.343812e-05 -3.691817e-06 -7.305302e-06
## zn          -2.026979e-05 -1.378503e-05 -4.505236e-06  7.296953e-07
## chas        -2.992160e-03 -5.776250e-04 -1.349409e-04  2.566204e-05
## nox          4.642186e-02  4.074315e-03  1.951391e-03 -2.110057e-04
## rm           4.074315e-03  1.992746e-03  4.397922e-04 -1.063479e-04
## dis          1.951391e-03  4.397922e-04  1.837258e-04 -2.344133e-05
## rad         -2.110057e-04 -1.063479e-04 -2.344133e-05  1.192430e-05
## tax         -1.105169e-06  1.308947e-06  3.799659e-07 -2.228867e-07
## ptratio      5.310505e-04  7.867700e-05  1.280890e-05 -5.310961e-06
## black        2.136162e-05  2.039138e-06  6.260831e-07 -1.140361e-07
## lstat        1.826049e-04  1.540193e-04  3.535628e-05 -7.358895e-06
##                       tax       ptratio         black         lstat
## (Intercept) -1.304137e-05 -1.384055e-03 -5.033149e-05 -1.413577e-03
## crim         9.798847e-08  2.278064e-06  2.497244e-07 -7.076829e-07
## zn          -5.550768e-09 -4.116985e-08 -3.947017e-09 -8.570409e-07
## chas         6.036050e-07 -8.123242e-06 -8.444690e-08 -4.972699e-05
## nox         -1.105169e-06  5.310505e-04  2.136162e-05  1.826049e-04
## rm           1.308947e-06  7.867700e-05  2.039138e-06  1.540193e-04
## dis          3.799659e-07  1.280890e-05  6.260831e-07  3.535628e-05
## rad         -2.228867e-07 -5.310961e-06 -1.140361e-07 -7.358895e-06
## tax          1.481489e-08 -1.077994e-07  2.629797e-09  5.613449e-08
## ptratio     -1.077994e-07  2.988973e-05  2.272649e-07 -1.420095e-07
## black        2.629797e-09  2.272649e-07  4.916147e-08  5.832321e-08
## lstat        5.613449e-08 -1.420095e-07  5.832321e-08  1.840695e-05

#The Newey & West (1987) estimator requires specification of the lag  and suppression of prewhitening       
NeweyWest(model2, lag = 4, prewhite = FALSE)

##               (Intercept)          crim            zn          chas
## (Intercept)  1.839433e-01 -1.373423e-04  8.265688e-05  2.417586e-03
## crim        -1.373423e-04  2.618724e-05 -4.071213e-07  1.910099e-05
## zn           8.265688e-05 -4.071213e-07  2.496499e-07  2.166001e-06
## chas         2.417586e-03  1.910099e-05  2.166001e-06  1.934723e-03
## nox         -5.432961e-02 -1.925773e-04 -1.331698e-05 -2.177983e-03
## rm          -1.477209e-02  2.420552e-05 -9.557232e-06 -2.610757e-04
## dis         -3.813771e-03 -6.927629e-07 -3.300631e-06 -9.633201e-05
## rad          7.492744e-04 -7.784275e-06  4.936570e-07  1.668569e-05
## tax         -1.094528e-05  6.230308e-08 -5.021864e-09  1.084793e-06
## ptratio     -1.103870e-03  1.783188e-06  7.847281e-08  1.794108e-05
## black       -4.220202e-05  1.698268e-07 -3.201793e-09  4.989522e-07
## lstat       -1.016353e-03 -2.692420e-07 -5.523680e-07 -3.421846e-05
##                       nox            rm           dis           rad
## (Intercept) -5.432961e-02 -1.477209e-02 -3.813771e-03  7.492744e-04
## crim        -1.925773e-04  2.420552e-05 -6.927629e-07 -7.784275e-06
## zn          -1.331698e-05 -9.557232e-06 -3.300631e-06  4.936570e-07
## chas        -2.177983e-03 -2.610757e-04 -9.633201e-05  1.668569e-05
## nox          3.676294e-02  2.634655e-03  1.465688e-03 -1.370064e-04
## rm           2.634655e-03  1.450083e-03  2.988117e-04 -7.085393e-05
## dis          1.465688e-03  2.988117e-04  1.311580e-04 -1.585719e-05
## rad         -1.370064e-04 -7.085393e-05 -1.585719e-05  9.953219e-06
## tax         -9.781616e-07  9.868642e-07  2.076475e-07 -2.002872e-07
## ptratio      3.790018e-04  6.386596e-05  1.146667e-05 -3.824353e-06
## black        1.603048e-05  1.877798e-06  5.173355e-07 -8.115430e-08
## lstat        7.158574e-05  1.129415e-04  2.305259e-05 -4.675232e-06
##                       tax       ptratio         black         lstat
## (Intercept) -1.094528e-05 -1.103870e-03 -4.220202e-05 -1.016353e-03
## crim         6.230308e-08  1.783188e-06  1.698268e-07 -2.692420e-07
## zn          -5.021864e-09  7.847281e-08 -3.201793e-09 -5.523680e-07
## chas         1.084793e-06  1.794108e-05  4.989522e-07 -3.421846e-05
## nox         -9.781616e-07  3.790018e-04  1.603048e-05  7.158574e-05
## rm           9.868642e-07  6.386596e-05  1.877798e-06  1.129415e-04
## dis          2.076475e-07  1.146667e-05  5.173355e-07  2.305259e-05
## rad         -2.002872e-07 -3.824353e-06 -8.115430e-08 -4.675232e-06
## tax          1.385921e-08 -1.992146e-08  2.735151e-09 -1.891525e-09
## ptratio     -1.992146e-08  2.253698e-05  1.839535e-07 -6.067635e-08
## black        2.735151e-09  1.839535e-07  3.957314e-08  8.682211e-08
## lstat       -1.891525e-09 -6.067635e-08  8.682211e-08  1.525464e-05

#bwNeweyWest() can also be passed to kernHAC(), e.g.for the quadratic spectral kernel
kernHAC(model2, bw = bwNeweyWest)

##               (Intercept)          crim            zn          chas
## (Intercept)  2.534461e-01 -1.977417e-04  1.213002e-04  5.688605e-03
## crim        -1.977417e-04  2.894024e-05 -5.009421e-07  3.729870e-05
## zn           1.213002e-04 -5.009421e-07  3.259741e-07  3.266569e-06
## chas         5.688605e-03  3.729870e-05  3.266569e-06  1.898595e-03
## nox         -7.829547e-02 -2.941205e-04 -2.004883e-05 -3.092251e-03
## rm          -2.054541e-02  3.606467e-05 -1.384499e-05 -5.526248e-04
## dis         -5.553067e-03 -5.203350e-06 -4.638883e-06 -1.337992e-04
## rad          1.109686e-03 -7.476705e-06  7.199800e-07  2.316008e-05
## tax         -1.289919e-05  1.122660e-07 -4.672457e-09  5.991687e-07
## ptratio     -1.400177e-03  2.157357e-06  1.425548e-08 -9.270018e-06
## black       -5.386745e-05  2.596249e-07 -3.875833e-09  2.403154e-07
## lstat       -1.466396e-03 -4.760912e-07 -8.738428e-07 -4.543039e-05
##                       nox            rm           dis           rad
## (Intercept) -7.829547e-02 -2.054541e-02 -5.553067e-03  1.109686e-03
## crim        -2.941205e-04  3.606467e-05 -5.203350e-06 -7.476705e-06
## zn          -2.004883e-05 -1.384499e-05 -4.638883e-06  7.199800e-07
## chas        -3.092251e-03 -5.526248e-04 -1.337992e-04  2.316008e-05
## nox          4.891818e-02  4.126445e-03  2.040090e-03 -2.109591e-04
## rm           4.126445e-03  2.016382e-03  4.499742e-04 -1.064904e-04
## dis          2.040090e-03  4.499742e-04  1.920124e-04 -2.336643e-05
## rad         -2.109591e-04 -1.064904e-04 -2.336643e-05  1.203816e-05
## tax         -1.345769e-06  1.294066e-06  3.729080e-07 -2.252009e-07
## ptratio      5.304773e-04  7.838874e-05  1.198927e-05 -5.191294e-06
## black        2.234120e-05  2.223894e-06  6.523950e-07 -1.175325e-07
## lstat        1.929992e-04  1.585342e-04  3.702794e-05 -7.522085e-06
##                       tax       ptratio         black         lstat
## (Intercept) -1.289919e-05 -1.400177e-03 -5.386745e-05 -1.466396e-03
## crim         1.122660e-07  2.157357e-06  2.596249e-07 -4.760912e-07
## zn          -4.672457e-09  1.425548e-08 -3.875833e-09 -8.738428e-07
## chas         5.991687e-07 -9.270018e-06  2.403154e-07 -4.543039e-05
## nox         -1.345769e-06  5.304773e-04  2.234120e-05  1.929992e-04
## rm           1.294066e-06  7.838874e-05  2.223894e-06  1.585342e-04
## dis          3.729080e-07  1.198927e-05  6.523950e-07  3.702794e-05
## rad         -2.252009e-07 -5.191294e-06 -1.175325e-07 -7.522085e-06
## tax          1.511750e-08 -1.210214e-07  3.172883e-09  5.781504e-08
## ptratio     -1.210214e-07  3.057456e-05  2.549519e-07  8.819678e-08
## black        3.172883e-09  2.549519e-07  5.186156e-08  6.053389e-08
## lstat        5.781504e-08  8.819678e-08  6.053389e-08  1.895825e-05

# Newey West Adjusted Regression Estimation
ht<-coeftest(model3, df = Inf, vcov = NeweyWest(model3, lag = 4, prewhite = FALSE)) 
ht

## 
## z test of coefficients:
## 
##                Estimate  Std. Error z value  Pr(>|z|)    
## (Intercept)  4.01360943  0.43549104  9.2163 < 2.2e-16 ***
## crim        -0.01911578  0.00522037 -3.6618 0.0002505 ***
## zn           0.00061046  0.00051527  1.1847 0.2361233    
## chas         0.09294084  0.04268523  2.1774 0.0294542 *  
## nox         -0.95209639  0.19764981 -4.8171 1.457e-06 ***
## rm           0.09828034  0.03886071  2.5290 0.0114375 *  
## dis         -0.05331846  0.01171117 -4.5528 5.294e-06 ***
## rad          0.00853763  0.00265633  3.2141 0.0013087 ** 
## ptratio     -0.04225658  0.00485076 -8.7113 < 2.2e-16 ***
## black        0.00047828  0.00019916  2.4015 0.0163264 *  
## lstat       -0.02328544  0.00397583 -5.8567 4.720e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As we can see zn is insignificant so #Building model without variables indus, age, tax and zn

model4 <- lm(medv_n ~ .-indus -age -zn -tax -medv, data = df)
model4.sum <- summary(model4)
model4.sum

## 
## Call:
## lm(formula = medv_n ~ . - indus - age - zn - tax - medv, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58062 -0.10397 -0.00716  0.09115  0.77994 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.0279688  0.1857837  21.681  < 2e-16 ***
## crim        -0.0184431  0.0036028  -5.119 4.40e-07 ***
## chas         0.0923608  0.0303271   3.045  0.00245 ** 
## nox         -0.9619543  0.1269205  -7.579 1.72e-13 ***
## rm           0.1002435  0.0142258   7.047 6.17e-12 ***
## dis         -0.0490981  0.0065499  -7.496 3.05e-13 ***
## rad          0.0085632  0.0019419   4.410 1.27e-05 ***
## ptratio     -0.0439714  0.0043492 -10.110  < 2e-16 ***
## black        0.0004792  0.0001077   4.450 1.06e-05 ***
## lstat       -0.0233657  0.0017509 -13.345  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1683 on 496 degrees of freedom
## Multiple R-squared:  0.7905, Adjusted R-squared:  0.7867 
## F-statistic: 207.9 on 9 and 496 DF,  p-value: < 2.2e-16

All variables ara significant with p<0.05.

#Again run the Newey West Adjusted Regression Estimation

ht<-coeftest(model4, df = Inf, vcov = NeweyWest(model4, lag = 4, prewhite = FALSE)) 
ht

## 
## z test of coefficients:
## 
##                Estimate  Std. Error z value  Pr(>|z|)    
## (Intercept)  4.02796876  0.44209318  9.1111 < 2.2e-16 ***
## crim        -0.01844308  0.00517040 -3.5671  0.000361 ***
## chas         0.09236082  0.04287646  2.1541  0.031231 *  
## nox         -0.96195431  0.19956954 -4.8201 1.435e-06 ***
## rm           0.10024354  0.03844047  2.6078  0.009114 ** 
## dis         -0.04909814  0.01007714 -4.8722 1.103e-06 ***
## rad          0.00856322  0.00268343  3.1911  0.001417 ** 
## ptratio     -0.04397140  0.00501692 -8.7646 < 2.2e-16 ***
## black        0.00047917  0.00019930  2.4042  0.016206 *  
## lstat       -0.02336565  0.00402656 -5.8029 6.519e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

res<- resid (model4, df = Inf, vcov = NeweyWest(model4, lag = 4, prewhite = FALSE))
df1<-cbind(df,res)
bptest (df1$medv_n ~ .-indus -age -zn -tax -medv , data=df1, studentize = TRUE)

## 
##  studentized Breusch-Pagan test
## 
## data:  df1$medv_n ~ . - indus - age - zn - tax - medv
## BP = 13.13, df = 10, p-value = 0.2165

#P > 0.05 so we will not reject the null hyphothesis so there is no hetroscasticity in the model

Autocorrelation Check

#Durbin Watson Test
dwtest (df1$medv_n ~ .-indus -age -zn -tax -medv,data=df1)

## 
##  Durbin-Watson test
## 
## data:  df1$medv_n ~ . - indus - age - zn - tax - medv
## DW = 2.2715, p-value = 0.9971
## alternative hypothesis: true autocorrelation is greater than 0

This test gives p value greater than 0.5 so we will not reject the null hypothesis So there is no anutocorrelation present in the data.

Multicolinearity check

vif(df[,-c(2,3,7,10,14,15)])

##   Variables      VIF
## 1      crim 5.041150
## 2      chas 1.057640
## 3       nox 3.855603
## 4        rm 1.780811
## 5       dis 2.894817
## 6       rad 5.053303
## 7   ptratio 1.580296
## 8     black 1.365060
## 9     lstat 2.786487

Cross-Section Data: Random Split: TRAIN AND TEST (70:30)

set.seed(1234)    
split1=sample.split(df$medv,SplitRatio=0.70)
#train
train=subset(df,split1==TRUE)
#test 
test=subset(df,split1==FALSE)

Re-run Regression

m1=lm(train$medv_n ~ .-indus -age -zn -tax -medv, data=train)

#Extracting the standard robst error from coefftest

k <- m1 %>%  coeftest(m1, df = Inf, vcov = NeweyWest(m1, lag = 4, prewhite = FALSE)) %>% tidy()
k

## # A tibble: 10 x 5
##    term         estimate std.error statistic  p.value
##    <chr>           <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept)  3.55      0.433         8.21 2.22e-16
##  2 crim        -0.00703   0.00509      -1.38 1.67e- 1
##  3 chas         0.101     0.0442        2.29 2.20e- 2
##  4 nox         -0.963     0.220        -4.37 1.23e- 5
##  5 rm           0.147     0.0355        4.14 3.54e- 5
##  6 dis         -0.0452    0.00993      -4.56 5.20e- 6
##  7 rad          0.00377   0.00278       1.36 1.75e- 1
##  8 ptratio     -0.0368    0.00495      -7.44 9.88e-14
##  9 black        0.000562  0.000228      2.46 1.38e- 2
## 10 lstat       -0.0220    0.00386      -5.70 1.19e- 8

#Replacing lm function coefficient with coeftest coefficient

m1$coefficients<- k$estimate
summary(m1)

## 
## Call:
## lm(formula = train$medv_n ~ . - indus - age - zn - tax - medv, 
##     data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40813 -0.10301 -0.01043  0.08599  0.77045 
## 
## Coefficients:
##         Estimate Std. Error t value Pr(>|t|)    
##  [1,]  3.5530220  0.2114260  16.805  < 2e-16 ***
##  [2,] -0.0070322  0.0041823  -1.681  0.09356 .  
##  [3,]  0.1012310  0.0316543   3.198  0.00151 ** 
##  [4,] -0.9633140  0.1472050  -6.544 2.10e-10 ***
##  [5,]  0.1467302  0.0160457   9.145  < 2e-16 ***
##  [6,] -0.0452316  0.0074374  -6.082 3.08e-09 ***
##  [7,]  0.0037667  0.0021876   1.722  0.08597 .  
##  [8,] -0.0368152  0.0049081  -7.501 5.16e-13 ***
##  [9,]  0.0005618  0.0001250   4.495 9.42e-06 ***
## [10,] -0.0219837  0.0020088 -10.944  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1582 on 354 degrees of freedom
## Multiple R-squared:  0.8198, Adjusted R-squared:  0.8152 
## F-statistic: 178.9 on 9 and 354 DF,  p-value: < 2.2e-16

#List of residuals

res1<-resid(m1)             
par(mfrow=c(2,2))
plot(m1)

As the plot between residual and fitted becomes linear so there is no Heteroscedasticity present in the data.

Train data performance

#MSE
model.sum <- summary(m1)
(model.sum$sigma) ^ 2

## [1] 0.02504147

Mean square error in the trained data is 2.5%

Out-of-sample Prediction or test error (MSPE)

model4.pred.test <- predict(m1, newdata = test)
model4.mspe <- mean((model4.pred.test - test$medv_n) ^ 2)
c(RMSE = model4.mspe, R2 = summary(m1)$r.squared)

##       RMSE         R2 
## 0.04096198 0.81976799

Mean Square error in the test dataset is 4.09% and R- square value is 81.97%

Conclusion

We have applied linear regression model with different variables and the best model was the varaibles without crim, indus and age which are insignificant. There is Heteroscedasticity present in the model which was removed by adding residual variable in the dataset.

Boston Housing Price Prediction