In this report I will apply the regression models that I have implemented to accurately predict the housing prices in Boston suburbs. The dataset for this experiment is accessed from the UCI Machine Learning repository via https://archive.ics.uci.edu/ml/datasets/Housing. The report is organized in such a way as to demonstrate the entire process right from getting and cleaning the data, to exploratory analysis of the dataset to understand the distribution and importance of various features in influencing the algorithm, to coming with a hypothesis, training ML models, evaluation of the models, etc.
The dataset consists of 506 observations of 14 attributes. The median value of house price in $1000s, denoted by MEDV, is the outcome or the dependent variable in our model. Below is a brief description of each feature and the outcome in our dataset:
CRIM – per capita crime rate by town ZN – proportion of residential land zoned for lots over 25,000 sq.ft INDUS – proportion of non-retail business acres per town CHAS – Charles River dummy variable (1 if tract bounds river; else 0) NOX – nitric oxides concentration (parts per 10 million) RM – average number of rooms per dwelling AGE – proportion of owner-occupied units built prior to 1940 DIS – weighted distances to five Boston employment centres RAD – index of accessibility to radial highways TAX – full-value property-tax rate per $10,000 PTRATIO – pupil-teacher ratio by town B – 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town LSTAT – % lower status of the population MEDV – Median value of owner-occupied homes in $1000’s
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## -- Attaching packages -------------------------------------------------- tidyverse 1.2.1 --
## v tibble 2.1.3 v purrr 0.3.2
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ----------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(corrplot)
## corrplot 0.84 loaded
library(rpart)
library (caTools)
library (sp)
library (raster)
##
## Attaching package: 'raster'
## The following object is masked from 'package:tidyr':
##
## extract
## The following object is masked from 'package:dplyr':
##
## select
library(usdm)
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library (sandwich)
library(broom)
options (scipen=99999)
myData <- read.csv ("E:/4th sem/BA_Assignment_Group4/Raw/Boston.csv",header=TRUE)
df<-myData
Looking the data
dim(df)
## [1] 506 15
head(df)
## X crim zn indus chas nox rm age dis rad tax ptratio black
## 1 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
Removing the first column
df<- df[,-1]
Structure of the data
str(df)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : int 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(df)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
sum(is.na(df))
## [1] 0
par(mfrow = c(1,2))
boxplot(df)
abline(h = min(df), col = "Blue")
abline(h = max(df), col = "Yellow")
abline(h = quantile(df$medv, c(0.25, 0.75)), col = "Red")
abline(h = quantile(df$black, c(0.25, 0.75)), col = "Red")
#### OUTLIERS TREATMENT, CAPPING OUTLIERS
caps <- quantile(df$medv, probs=c(.05, .95), na.rm = T)
df$medv <- ifelse (df$medv < caps[1],caps[1],df$medv)
df$medv <- ifelse (df$medv > caps[2],caps[2],df$medv)
caps <- quantile(df$crim, probs=c(.05, .95), na.rm = T)
df$crim <- ifelse (df$crim < caps[1],caps[1],df$crim)
df$crim <- ifelse (df$crim > caps[2],caps[2],df$crim)
caps <- quantile(df$indus, probs=c(.05, .95), na.rm = T)
df$indus <- ifelse (df$indus < caps[1],caps[1],df$indus)
df$indus <- ifelse (df$indus > caps[2],caps[2],df$indus)
caps <- quantile(df$dis, probs=c(.05, .95), na.rm = T)
df$dis <- ifelse (df$dis < caps[1],caps[1],df$dis)
df$dis <- ifelse (df$dis > caps[2],caps[2],df$dis)
caps <- quantile(df$rad, probs=c(.05, .95), na.rm = T)
df$rad <- ifelse (df$rad < caps[1],caps[1],df$rad)
df$rad <- ifelse (df$rad > caps[2],caps[2],df$rad)
caps <- quantile(df$tax, probs=c(.05, .95), na.rm = T)
df$tax <- ifelse (df$tax < caps[1],caps[1],df$tax)
df$tax <- ifelse (df$tax > caps[2],caps[2],df$tax)
caps <- quantile(df$zn, probs=c(.05, .95), na.rm = T)
df$zn <- ifelse (df$zn < caps[1],caps[1],df$zn)
df$zn <- ifelse (df$zn > caps[2],caps[2],df$zn)
caps <- quantile(df$black, probs=c(.05, .95), na.rm = T)
df$black <- ifelse (df$black < caps[1],caps[1],df$black)
df$black <- ifelse (df$black > caps[2],caps[2],df$black)
boxplot(df)
Some outliers were present in medv, crim , indus, zn, dis, rad, tax and black. They are treated by replacing the extreme values by 95 and 5 percentile values.
#checking correlation between variables
par(mfrow=c(1,1))
corrplot(cor(df), method = "number", type = "upper", diag = FALSE)
There are no missing values.
From correlation matrix, some of the observations made are as follows:
df %>%
gather(key, val, -medv) %>%
ggplot(aes(x = val, y = medv)) +
geom_point() +
stat_smooth(method = "lm", se = TRUE, col = "blue") +
facet_wrap(~key, scales = "free") +
theme_gray() +
ggtitle("Scatter plot of dependent variables vs Median Value (medv)")
table(df$chas)
##
## 0 1
## 471 35
#Observations
1. Proportion of owner occupied units built prior to 1940 (age) and proportion of blacks by town (black) is heavily skewed to left, while per capita crime rate in town (crim) and weighted mean of distances to five Boston employment centres (dis) is heavily skewed to right. 2. rm is normally distributed with mean of approximately 6. 3. Most of the properties are situated close to the five Boston employment centres (dis skewed to right) 4. There is a high proportion of owner occupied units built prior to 1940 (age skewed to left) and blacks in town (black skewed to right) 5. From scatter plots, it is seen that lstat and rm show strong correlation with medv. 6. 93% of the properties are away from Charles river. The properties bordering the river seems to have higher median prices.
model1 <- lm(medv ~ ., data = df)
model1.sum <- summary(model1)
model1.sum
##
## Call:
## lm(formula = medv ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.2319 -2.5572 -0.4885 1.5944 20.5989
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.799995 4.600436 8.434 3.74e-16 ***
## crim -0.190658 0.089359 -2.134 0.033367 *
## zn 0.033755 0.012335 2.736 0.006434 **
## indus -0.023110 0.058157 -0.397 0.691259
## chas 2.039352 0.752394 2.710 0.006953 **
## nox -17.873037 3.416867 -5.231 2.50e-07 ***
## rm 3.453089 0.362500 9.526 < 2e-16 ***
## age -0.012814 0.011593 -1.105 0.269575
## dis -1.591317 0.195170 -8.154 2.96e-15 ***
## rad 0.304452 0.068996 4.413 1.26e-05 ***
## tax -0.011082 0.003523 -3.146 0.001756 **
## ptratio -0.945110 0.113561 -8.323 8.56e-16 ***
## black 0.009780 0.002648 3.694 0.000246 ***
## lstat -0.438845 0.045729 -9.597 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.122 on 492 degrees of freedom
## Multiple R-squared: 0.7597, Adjusted R-squared: 0.7534
## F-statistic: 119.6 on 13 and 492 DF, p-value: < 2.2e-16
#Looking at model summary, we see that variables crim, indus and age are insignificant
#Building model without variables indus and age, and convering medv into log
medv_n<- log(df$medv)
df<- cbind(df,medv_n)
model2 <- lm(medv_n ~ .-indus -age -medv, data = df)
model2.sum <- summary(model2)
model2.sum
##
## Call:
## lm(formula = medv_n ~ . - indus - age - medv, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.60839 -0.10172 -0.00987 0.07779 0.76885
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.1016772 0.1844688 22.235 < 2e-16 ***
## crim -0.0198838 0.0035893 -5.540 4.93e-08 ***
## zn 0.0010235 0.0004865 2.104 0.03590 *
## chas 0.0816225 0.0299808 2.722 0.00671 **
## nox -0.8390664 0.1282158 -6.544 1.50e-10 ***
## rm 0.0908910 0.0142014 6.400 3.61e-10 ***
## dis -0.0572489 0.0073013 -7.841 2.78e-14 ***
## rad 0.0162700 0.0027002 6.026 3.30e-09 ***
## tax -0.0005146 0.0001269 -4.054 5.85e-05 ***
## ptratio -0.0396055 0.0045315 -8.740 < 2e-16 ***
## black 0.0004556 0.0001061 4.294 2.12e-05 ***
## lstat -0.0229497 0.0017263 -13.294 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1657 on 494 degrees of freedom
## Multiple R-squared: 0.7979, Adjusted R-squared: 0.7934
## F-statistic: 177.3 on 11 and 494 DF, p-value: < 2.2e-16
#Checking vif
vif(df[,-c(3,7,14,15)])
## Variables VIF
## 1 crim 5.165720
## 2 zn 2.180946
## 3 chas 1.067136
## 4 nox 4.062275
## 5 rm 1.832256
## 6 dis 3.713727
## 7 rad 10.086903
## 8 tax 8.217313
## 9 ptratio 1.771196
## 10 black 1.368923
## 11 lstat 2.796591
Multicolinearity between tax and rad as VIF > 10 so we will remove tax because rad is more significant
#Building model without variables indus, age and tax
model3 <- lm(medv_n ~ .-indus -age -tax -medv, data = df)
model3.sum <- summary(model2)
model3.sum
##
## Call:
## lm(formula = medv_n ~ . - indus - age - medv, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.60839 -0.10172 -0.00987 0.07779 0.76885
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.1016772 0.1844688 22.235 < 2e-16 ***
## crim -0.0198838 0.0035893 -5.540 4.93e-08 ***
## zn 0.0010235 0.0004865 2.104 0.03590 *
## chas 0.0816225 0.0299808 2.722 0.00671 **
## nox -0.8390664 0.1282158 -6.544 1.50e-10 ***
## rm 0.0908910 0.0142014 6.400 3.61e-10 ***
## dis -0.0572489 0.0073013 -7.841 2.78e-14 ***
## rad 0.0162700 0.0027002 6.026 3.30e-09 ***
## tax -0.0005146 0.0001269 -4.054 5.85e-05 ***
## ptratio -0.0396055 0.0045315 -8.740 < 2e-16 ***
## black 0.0004556 0.0001061 4.294 2.12e-05 ***
## lstat -0.0229497 0.0017263 -13.294 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1657 on 494 degrees of freedom
## Multiple R-squared: 0.7979, Adjusted R-squared: 0.7934
## F-statistic: 177.3 on 11 and 494 DF, p-value: < 2.2e-16
#Again Checking the VIF
vif(df[,-c(3,7,10,14,15)])
## Variables VIF
## 1 crim 5.151328
## 2 zn 2.085299
## 3 chas 1.057882
## 4 nox 3.870185
## 5 rm 1.802072
## 6 dis 3.648244
## 7 rad 5.053853
## 8 ptratio 1.734311
## 9 black 1.365118
## 10 lstat 2.790153
No multicolinearity as VIF < 10
par(mfrow=c(2,2))
plot(model3)
The top-left is the chart of residuals vs fitted values, while in the bottom-left one, it is standardised residuals on Y axis. If there is absolutely no heteroscedastity, you should see a completely random, equal distribution of points throughout the range of X axis and a flat red line.
bptest (df$medv_n ~ . -indus -age -tax -medv, data=df)
##
## studentized Breusch-Pagan test
##
## data: df$medv_n ~ . - indus - age - tax - medv
## BP = 65.679, df = 10, p-value = 3.004e-10
These test have a p-value less that a significance level of 0.05, therefore we can reject the null hypothesis that the variance of the residuals is constant and infer that heteroscedasticity is indeed present.
#Newey & West (1994) compute this type of estimator
NeweyWest(model2)
## (Intercept) crim zn chas
## (Intercept) 2.472207e-01 -2.014728e-04 1.218395e-04 5.945737e-03
## crim -2.014728e-04 2.717476e-05 -4.952847e-07 2.746371e-05
## zn 1.218395e-04 -4.952847e-07 3.106551e-07 3.483521e-06
## chas 5.945737e-03 2.746371e-05 3.483521e-06 1.757547e-03
## nox -7.593736e-02 -2.490702e-04 -2.026979e-05 -2.992160e-03
## rm -2.021117e-02 3.343812e-05 -1.378503e-05 -5.776250e-04
## dis -5.396112e-03 -3.691817e-06 -4.505236e-06 -1.349409e-04
## rad 1.107975e-03 -7.305302e-06 7.296953e-07 2.566204e-05
## tax -1.304137e-05 9.798847e-08 -5.550768e-09 6.036050e-07
## ptratio -1.384055e-03 2.278064e-06 -4.116985e-08 -8.123242e-06
## black -5.033149e-05 2.497244e-07 -3.947017e-09 -8.444690e-08
## lstat -1.413577e-03 -7.076829e-07 -8.570409e-07 -4.972699e-05
## nox rm dis rad
## (Intercept) -7.593736e-02 -2.021117e-02 -5.396112e-03 1.107975e-03
## crim -2.490702e-04 3.343812e-05 -3.691817e-06 -7.305302e-06
## zn -2.026979e-05 -1.378503e-05 -4.505236e-06 7.296953e-07
## chas -2.992160e-03 -5.776250e-04 -1.349409e-04 2.566204e-05
## nox 4.642186e-02 4.074315e-03 1.951391e-03 -2.110057e-04
## rm 4.074315e-03 1.992746e-03 4.397922e-04 -1.063479e-04
## dis 1.951391e-03 4.397922e-04 1.837258e-04 -2.344133e-05
## rad -2.110057e-04 -1.063479e-04 -2.344133e-05 1.192430e-05
## tax -1.105169e-06 1.308947e-06 3.799659e-07 -2.228867e-07
## ptratio 5.310505e-04 7.867700e-05 1.280890e-05 -5.310961e-06
## black 2.136162e-05 2.039138e-06 6.260831e-07 -1.140361e-07
## lstat 1.826049e-04 1.540193e-04 3.535628e-05 -7.358895e-06
## tax ptratio black lstat
## (Intercept) -1.304137e-05 -1.384055e-03 -5.033149e-05 -1.413577e-03
## crim 9.798847e-08 2.278064e-06 2.497244e-07 -7.076829e-07
## zn -5.550768e-09 -4.116985e-08 -3.947017e-09 -8.570409e-07
## chas 6.036050e-07 -8.123242e-06 -8.444690e-08 -4.972699e-05
## nox -1.105169e-06 5.310505e-04 2.136162e-05 1.826049e-04
## rm 1.308947e-06 7.867700e-05 2.039138e-06 1.540193e-04
## dis 3.799659e-07 1.280890e-05 6.260831e-07 3.535628e-05
## rad -2.228867e-07 -5.310961e-06 -1.140361e-07 -7.358895e-06
## tax 1.481489e-08 -1.077994e-07 2.629797e-09 5.613449e-08
## ptratio -1.077994e-07 2.988973e-05 2.272649e-07 -1.420095e-07
## black 2.629797e-09 2.272649e-07 4.916147e-08 5.832321e-08
## lstat 5.613449e-08 -1.420095e-07 5.832321e-08 1.840695e-05
#The Newey & West (1987) estimator requires specification of the lag and suppression of prewhitening
NeweyWest(model2, lag = 4, prewhite = FALSE)
## (Intercept) crim zn chas
## (Intercept) 1.839433e-01 -1.373423e-04 8.265688e-05 2.417586e-03
## crim -1.373423e-04 2.618724e-05 -4.071213e-07 1.910099e-05
## zn 8.265688e-05 -4.071213e-07 2.496499e-07 2.166001e-06
## chas 2.417586e-03 1.910099e-05 2.166001e-06 1.934723e-03
## nox -5.432961e-02 -1.925773e-04 -1.331698e-05 -2.177983e-03
## rm -1.477209e-02 2.420552e-05 -9.557232e-06 -2.610757e-04
## dis -3.813771e-03 -6.927629e-07 -3.300631e-06 -9.633201e-05
## rad 7.492744e-04 -7.784275e-06 4.936570e-07 1.668569e-05
## tax -1.094528e-05 6.230308e-08 -5.021864e-09 1.084793e-06
## ptratio -1.103870e-03 1.783188e-06 7.847281e-08 1.794108e-05
## black -4.220202e-05 1.698268e-07 -3.201793e-09 4.989522e-07
## lstat -1.016353e-03 -2.692420e-07 -5.523680e-07 -3.421846e-05
## nox rm dis rad
## (Intercept) -5.432961e-02 -1.477209e-02 -3.813771e-03 7.492744e-04
## crim -1.925773e-04 2.420552e-05 -6.927629e-07 -7.784275e-06
## zn -1.331698e-05 -9.557232e-06 -3.300631e-06 4.936570e-07
## chas -2.177983e-03 -2.610757e-04 -9.633201e-05 1.668569e-05
## nox 3.676294e-02 2.634655e-03 1.465688e-03 -1.370064e-04
## rm 2.634655e-03 1.450083e-03 2.988117e-04 -7.085393e-05
## dis 1.465688e-03 2.988117e-04 1.311580e-04 -1.585719e-05
## rad -1.370064e-04 -7.085393e-05 -1.585719e-05 9.953219e-06
## tax -9.781616e-07 9.868642e-07 2.076475e-07 -2.002872e-07
## ptratio 3.790018e-04 6.386596e-05 1.146667e-05 -3.824353e-06
## black 1.603048e-05 1.877798e-06 5.173355e-07 -8.115430e-08
## lstat 7.158574e-05 1.129415e-04 2.305259e-05 -4.675232e-06
## tax ptratio black lstat
## (Intercept) -1.094528e-05 -1.103870e-03 -4.220202e-05 -1.016353e-03
## crim 6.230308e-08 1.783188e-06 1.698268e-07 -2.692420e-07
## zn -5.021864e-09 7.847281e-08 -3.201793e-09 -5.523680e-07
## chas 1.084793e-06 1.794108e-05 4.989522e-07 -3.421846e-05
## nox -9.781616e-07 3.790018e-04 1.603048e-05 7.158574e-05
## rm 9.868642e-07 6.386596e-05 1.877798e-06 1.129415e-04
## dis 2.076475e-07 1.146667e-05 5.173355e-07 2.305259e-05
## rad -2.002872e-07 -3.824353e-06 -8.115430e-08 -4.675232e-06
## tax 1.385921e-08 -1.992146e-08 2.735151e-09 -1.891525e-09
## ptratio -1.992146e-08 2.253698e-05 1.839535e-07 -6.067635e-08
## black 2.735151e-09 1.839535e-07 3.957314e-08 8.682211e-08
## lstat -1.891525e-09 -6.067635e-08 8.682211e-08 1.525464e-05
#bwNeweyWest() can also be passed to kernHAC(), e.g.for the quadratic spectral kernel
kernHAC(model2, bw = bwNeweyWest)
## (Intercept) crim zn chas
## (Intercept) 2.534461e-01 -1.977417e-04 1.213002e-04 5.688605e-03
## crim -1.977417e-04 2.894024e-05 -5.009421e-07 3.729870e-05
## zn 1.213002e-04 -5.009421e-07 3.259741e-07 3.266569e-06
## chas 5.688605e-03 3.729870e-05 3.266569e-06 1.898595e-03
## nox -7.829547e-02 -2.941205e-04 -2.004883e-05 -3.092251e-03
## rm -2.054541e-02 3.606467e-05 -1.384499e-05 -5.526248e-04
## dis -5.553067e-03 -5.203350e-06 -4.638883e-06 -1.337992e-04
## rad 1.109686e-03 -7.476705e-06 7.199800e-07 2.316008e-05
## tax -1.289919e-05 1.122660e-07 -4.672457e-09 5.991687e-07
## ptratio -1.400177e-03 2.157357e-06 1.425548e-08 -9.270018e-06
## black -5.386745e-05 2.596249e-07 -3.875833e-09 2.403154e-07
## lstat -1.466396e-03 -4.760912e-07 -8.738428e-07 -4.543039e-05
## nox rm dis rad
## (Intercept) -7.829547e-02 -2.054541e-02 -5.553067e-03 1.109686e-03
## crim -2.941205e-04 3.606467e-05 -5.203350e-06 -7.476705e-06
## zn -2.004883e-05 -1.384499e-05 -4.638883e-06 7.199800e-07
## chas -3.092251e-03 -5.526248e-04 -1.337992e-04 2.316008e-05
## nox 4.891818e-02 4.126445e-03 2.040090e-03 -2.109591e-04
## rm 4.126445e-03 2.016382e-03 4.499742e-04 -1.064904e-04
## dis 2.040090e-03 4.499742e-04 1.920124e-04 -2.336643e-05
## rad -2.109591e-04 -1.064904e-04 -2.336643e-05 1.203816e-05
## tax -1.345769e-06 1.294066e-06 3.729080e-07 -2.252009e-07
## ptratio 5.304773e-04 7.838874e-05 1.198927e-05 -5.191294e-06
## black 2.234120e-05 2.223894e-06 6.523950e-07 -1.175325e-07
## lstat 1.929992e-04 1.585342e-04 3.702794e-05 -7.522085e-06
## tax ptratio black lstat
## (Intercept) -1.289919e-05 -1.400177e-03 -5.386745e-05 -1.466396e-03
## crim 1.122660e-07 2.157357e-06 2.596249e-07 -4.760912e-07
## zn -4.672457e-09 1.425548e-08 -3.875833e-09 -8.738428e-07
## chas 5.991687e-07 -9.270018e-06 2.403154e-07 -4.543039e-05
## nox -1.345769e-06 5.304773e-04 2.234120e-05 1.929992e-04
## rm 1.294066e-06 7.838874e-05 2.223894e-06 1.585342e-04
## dis 3.729080e-07 1.198927e-05 6.523950e-07 3.702794e-05
## rad -2.252009e-07 -5.191294e-06 -1.175325e-07 -7.522085e-06
## tax 1.511750e-08 -1.210214e-07 3.172883e-09 5.781504e-08
## ptratio -1.210214e-07 3.057456e-05 2.549519e-07 8.819678e-08
## black 3.172883e-09 2.549519e-07 5.186156e-08 6.053389e-08
## lstat 5.781504e-08 8.819678e-08 6.053389e-08 1.895825e-05
# Newey West Adjusted Regression Estimation
ht<-coeftest(model3, df = Inf, vcov = NeweyWest(model3, lag = 4, prewhite = FALSE))
ht
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.01360943 0.43549104 9.2163 < 2.2e-16 ***
## crim -0.01911578 0.00522037 -3.6618 0.0002505 ***
## zn 0.00061046 0.00051527 1.1847 0.2361233
## chas 0.09294084 0.04268523 2.1774 0.0294542 *
## nox -0.95209639 0.19764981 -4.8171 1.457e-06 ***
## rm 0.09828034 0.03886071 2.5290 0.0114375 *
## dis -0.05331846 0.01171117 -4.5528 5.294e-06 ***
## rad 0.00853763 0.00265633 3.2141 0.0013087 **
## ptratio -0.04225658 0.00485076 -8.7113 < 2.2e-16 ***
## black 0.00047828 0.00019916 2.4015 0.0163264 *
## lstat -0.02328544 0.00397583 -5.8567 4.720e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As we can see zn is insignificant so #Building model without variables indus, age, tax and zn
model4 <- lm(medv_n ~ .-indus -age -zn -tax -medv, data = df)
model4.sum <- summary(model4)
model4.sum
##
## Call:
## lm(formula = medv_n ~ . - indus - age - zn - tax - medv, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.58062 -0.10397 -0.00716 0.09115 0.77994
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.0279688 0.1857837 21.681 < 2e-16 ***
## crim -0.0184431 0.0036028 -5.119 4.40e-07 ***
## chas 0.0923608 0.0303271 3.045 0.00245 **
## nox -0.9619543 0.1269205 -7.579 1.72e-13 ***
## rm 0.1002435 0.0142258 7.047 6.17e-12 ***
## dis -0.0490981 0.0065499 -7.496 3.05e-13 ***
## rad 0.0085632 0.0019419 4.410 1.27e-05 ***
## ptratio -0.0439714 0.0043492 -10.110 < 2e-16 ***
## black 0.0004792 0.0001077 4.450 1.06e-05 ***
## lstat -0.0233657 0.0017509 -13.345 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1683 on 496 degrees of freedom
## Multiple R-squared: 0.7905, Adjusted R-squared: 0.7867
## F-statistic: 207.9 on 9 and 496 DF, p-value: < 2.2e-16
All variables ara significant with p<0.05.
#Again run the Newey West Adjusted Regression Estimation
ht<-coeftest(model4, df = Inf, vcov = NeweyWest(model4, lag = 4, prewhite = FALSE))
ht
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.02796876 0.44209318 9.1111 < 2.2e-16 ***
## crim -0.01844308 0.00517040 -3.5671 0.000361 ***
## chas 0.09236082 0.04287646 2.1541 0.031231 *
## nox -0.96195431 0.19956954 -4.8201 1.435e-06 ***
## rm 0.10024354 0.03844047 2.6078 0.009114 **
## dis -0.04909814 0.01007714 -4.8722 1.103e-06 ***
## rad 0.00856322 0.00268343 3.1911 0.001417 **
## ptratio -0.04397140 0.00501692 -8.7646 < 2.2e-16 ***
## black 0.00047917 0.00019930 2.4042 0.016206 *
## lstat -0.02336565 0.00402656 -5.8029 6.519e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
res<- resid (model4, df = Inf, vcov = NeweyWest(model4, lag = 4, prewhite = FALSE))
df1<-cbind(df,res)
bptest (df1$medv_n ~ .-indus -age -zn -tax -medv , data=df1, studentize = TRUE)
##
## studentized Breusch-Pagan test
##
## data: df1$medv_n ~ . - indus - age - zn - tax - medv
## BP = 13.13, df = 10, p-value = 0.2165
#P > 0.05 so we will not reject the null hyphothesis so there is no hetroscasticity in the model
#Durbin Watson Test
dwtest (df1$medv_n ~ .-indus -age -zn -tax -medv,data=df1)
##
## Durbin-Watson test
##
## data: df1$medv_n ~ . - indus - age - zn - tax - medv
## DW = 2.2715, p-value = 0.9971
## alternative hypothesis: true autocorrelation is greater than 0
This test gives p value greater than 0.5 so we will not reject the null hypothesis So there is no anutocorrelation present in the data.
vif(df[,-c(2,3,7,10,14,15)])
## Variables VIF
## 1 crim 5.041150
## 2 chas 1.057640
## 3 nox 3.855603
## 4 rm 1.780811
## 5 dis 2.894817
## 6 rad 5.053303
## 7 ptratio 1.580296
## 8 black 1.365060
## 9 lstat 2.786487
set.seed(1234)
split1=sample.split(df$medv,SplitRatio=0.70)
#train
train=subset(df,split1==TRUE)
#test
test=subset(df,split1==FALSE)
m1=lm(train$medv_n ~ .-indus -age -zn -tax -medv, data=train)
#Extracting the standard robst error from coefftest
k <- m1 %>% coeftest(m1, df = Inf, vcov = NeweyWest(m1, lag = 4, prewhite = FALSE)) %>% tidy()
k
## # A tibble: 10 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.55 0.433 8.21 2.22e-16
## 2 crim -0.00703 0.00509 -1.38 1.67e- 1
## 3 chas 0.101 0.0442 2.29 2.20e- 2
## 4 nox -0.963 0.220 -4.37 1.23e- 5
## 5 rm 0.147 0.0355 4.14 3.54e- 5
## 6 dis -0.0452 0.00993 -4.56 5.20e- 6
## 7 rad 0.00377 0.00278 1.36 1.75e- 1
## 8 ptratio -0.0368 0.00495 -7.44 9.88e-14
## 9 black 0.000562 0.000228 2.46 1.38e- 2
## 10 lstat -0.0220 0.00386 -5.70 1.19e- 8
#Replacing lm function coefficient with coeftest coefficient
m1$coefficients<- k$estimate
summary(m1)
##
## Call:
## lm(formula = train$medv_n ~ . - indus - age - zn - tax - medv,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40813 -0.10301 -0.01043 0.08599 0.77045
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## [1,] 3.5530220 0.2114260 16.805 < 2e-16 ***
## [2,] -0.0070322 0.0041823 -1.681 0.09356 .
## [3,] 0.1012310 0.0316543 3.198 0.00151 **
## [4,] -0.9633140 0.1472050 -6.544 2.10e-10 ***
## [5,] 0.1467302 0.0160457 9.145 < 2e-16 ***
## [6,] -0.0452316 0.0074374 -6.082 3.08e-09 ***
## [7,] 0.0037667 0.0021876 1.722 0.08597 .
## [8,] -0.0368152 0.0049081 -7.501 5.16e-13 ***
## [9,] 0.0005618 0.0001250 4.495 9.42e-06 ***
## [10,] -0.0219837 0.0020088 -10.944 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1582 on 354 degrees of freedom
## Multiple R-squared: 0.8198, Adjusted R-squared: 0.8152
## F-statistic: 178.9 on 9 and 354 DF, p-value: < 2.2e-16
#List of residuals
res1<-resid(m1)
par(mfrow=c(2,2))
plot(m1)
As the plot between residual and fitted becomes linear so there is no Heteroscedasticity present in the data.
#MSE
model.sum <- summary(m1)
(model.sum$sigma) ^ 2
## [1] 0.02504147
Mean square error in the trained data is 2.5%
model4.pred.test <- predict(m1, newdata = test)
model4.mspe <- mean((model4.pred.test - test$medv_n) ^ 2)
c(RMSE = model4.mspe, R2 = summary(m1)$r.squared)
## RMSE R2
## 0.04096198 0.81976799
Mean Square error in the test dataset is 4.09% and R- square value is 81.97%
We have applied linear regression model with different variables and the best model was the varaibles without crim, indus and age which are insignificant. There is Heteroscedasticity present in the model which was removed by adding residual variable in the dataset.