v.srivastavvaibhav@gmail.com

B.I.E.T. Jhansi

Introduction

This project is based on the Boston housing dataset. The data analyzed here were collected by Harrison and Rubinfeld in 1978 to investigate whether clean air influenced the value of houses in Boston. Their results are documented in the paper "Hedonic prices and the demand for clean air", published in the Journal of Environmental Economics and Management 5, 81-102.

Overview

This report examines the influence of several neighborhood attributes on housing prices, in an attempt to identify the most suitable explanatory variables. The specific attributes considered are proximity to the Charles River, distance to the main employment centers, pupil-teacher ratio in schools, and levels of crime. Whereas the original study focused on air pollution, using nitrogen oxide concentration as an explanatory variable, this report examines whether there are other, better explanatory variables for the median value of houses in Boston. The analysis is conducted in the R programming language.

DATA

The dataset (Boston Housing Price) was taken from the StatLib library maintained at Carnegie Mellon University and is freely available for download from the UCI Machine Learning Repository. It consists of 506 observations of 14 attributes. The median value of owner-occupied homes in $1000s, denoted MV in this copy of the data (often called MEDV elsewhere), is the outcome or dependent variable in our model. Below is a brief description of each feature and the outcome:

  1. CRIM – per capita crime rate by town
  2. ZN – proportion of residential land zoned for lots over 25,000 sq.ft
  3. INDUS – proportion of non-retail business acres per town
  4. CHAS – Charles River dummy variable (1 if tract bounds river; else 0)
  5. NOX – nitric oxides concentration (parts per 10 million)
  6. RM – average number of rooms per dwelling
  7. AGE – proportion of owner-occupied units built prior to 1940
  8. DIS – weighted distances to five Boston employment centres
  9. RAD – index of accessibility to radial highways
  10. TAX – full-value property-tax rate per $10,000
  11. PT – pupil-teacher ratio by town
  12. B – 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  13. LSTAT – % lower status of the population
  14. MV – Median value of owner-occupied homes in $1000’s

Objective

The prime objective of this project is to construct a working model capable of predicting the value of houses. To do this, we separate the dataset into features and the target variable. The features 'RM', 'LSTAT', and 'PT' give us quantitative information about each data point, while the target variable, 'MV', is the variable we seek to predict. These are stored in features and prices, respectively, as sketched below.
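A minimal sketch of that split in R, assuming the boston data frame loaded in the next section (the names features and prices are illustrative and do not appear in the report's later code):

# Illustrative split into explanatory features and the prediction target
features <- boston[, c("RM", "LSTAT", "PT")]
prices   <- boston$MV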

Load Dataset

setwd("~/Desktop/Data Analytics Internship/Boston")
library(readxl)
boston <- read_excel("boston.xls")
View(boston)
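If the Excel file is not at hand, a near-identical copy of the data ships with the MASS package (note its lower-case column names, e.g. medv and ptratio instead of MV and PT):

library(MASS)   # includes a built-in copy of the Boston housing data
data(Boston)
str(Boston)     # 506 obs. of 14 variables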

Size of Dataset

dim(boston)
## [1] 506  14

Explore Data

summary(boston)
##       CRIM                ZN             INDUS            CHAS        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       NOX               RM             AGE              DIS        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       RAD              TAX              PT              B         
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      LSTAT             MV       
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Description

library(psych)
describe(boston)
##       vars   n   mean     sd median trimmed    mad    min    max  range
## CRIM     1 506   3.61   8.60   0.26    1.68   0.33   0.01  88.98  88.97
## ZN       2 506  11.36  23.32   0.00    5.08   0.00   0.00 100.00 100.00
## INDUS    3 506  11.14   6.86   9.69   10.93   9.37   0.46  27.74  27.28
## CHAS     4 506   0.07   0.25   0.00    0.00   0.00   0.00   1.00   1.00
## NOX      5 506   0.55   0.12   0.54    0.55   0.13   0.38   0.87   0.49
## RM       6 506   6.28   0.70   6.21    6.25   0.51   3.56   8.78   5.22
## AGE      7 506  68.57  28.15  77.50   71.20  28.98   2.90 100.00  97.10
## DIS      8 506   3.80   2.11   3.21    3.54   1.91   1.13  12.13  11.00
## RAD      9 506   9.55   8.71   5.00    8.73   2.97   1.00  24.00  23.00
## TAX     10 506 408.24 168.54 330.00  400.04 108.23 187.00 711.00 524.00
## PT      11 506  18.46   2.16  19.05   18.66   1.70  12.60  22.00   9.40
## B       12 506 356.67  91.29 391.44  383.17   8.09   0.32 396.90 396.58
## LSTAT   13 506  12.65   7.14  11.36   11.90   7.11   1.73  37.97  36.24
## MV      14 506  22.53   9.20  21.20   21.56   5.93   5.00  50.00  45.00
##        skew kurtosis   se
## CRIM   5.19    36.60 0.38
## ZN     2.21     3.95 1.04
## INDUS  0.29    -1.24 0.30
## CHAS   3.39     9.48 0.01
## NOX    0.72    -0.09 0.01
## RM     0.40     1.84 0.03
## AGE   -0.60    -0.98 1.25
## DIS    1.01     0.46 0.09
## RAD    1.00    -0.88 0.39
## TAX    0.67    -1.15 7.49
## PT    -0.80    -0.30 0.10
## B     -2.87     7.10 4.06
## LSTAT  0.90     0.46 0.32
## MV     1.10     1.45 0.41

Contingency table

This table cross-tabulates our two categorical variables, 'CHAS' and 'RAD'.

ftable(CHAS ~ RAD, data=boston)
##     CHAS   0   1
## RAD             
## 1         19   1
## 2         24   0
## 3         36   2
## 4        102   8
## 5        104  11
## 6         26   0
## 7         17   0
## 8         19   5
## 24       124   8

Boxplot

We draw boxplots of all the variables that affect our study of the target variable 'MV'.

boxplot(boston, col = "grey")  # one boxplot per variable

Plotting predictor variables against the target variable

plot(boston[,c(3,5,6,11,13,14)], pch=3)  # INDUS, NOX, RM, PT, LSTAT and MV

Correlation and near zero variance

A few important properties to check now are the correlation of input features with the dependent variable, and to check if any feature has near zero variance (values not varying much within the column).

cor(boston,boston$MV)
##             [,1]
## CRIM  -0.3883046
## ZN     0.3604453
## INDUS -0.4837252
## CHAS   0.1752602
## NOX   -0.4273208
## RM     0.6953599
## AGE   -0.3769546
## DIS    0.2499287
## RAD   -0.3816262
## TAX   -0.4685359
## PT    -0.5077867
## B      0.3334608
## LSTAT -0.7376627
## MV     1.0000000

We see that the number of rooms, RM, has the strongest positive correlation with the median value of the housing price, while the percentage of lower-status population, LSTAT, and the pupil-teacher ratio, PT, have strong negative correlations. The feature with the weakest correlation to MV is proximity to the Charles River, CHAS.
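For the near-zero-variance check mentioned above, caret's nearZeroVar() can be used (a sketch; the caret package is attached in the partitioning section below):

library(caret)
# Flag columns whose values barely vary; a TRUE in the nzv column marks a problem feature
nearZeroVar(boston, saveMetrics = TRUE)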

House Pricing

hist(boston$MV,xlab="Median Value", main="House Pricing", col="grey")

The right-skewed distribution suggests that a log transformation would be appropriate. Similarly, the variables CRIM, DIS, NOX, and ZN are right skewed, making log transformations appropriate for them as well. The left-skewed distribution of PT suggests that squaring it could make for a better fit.
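A quick visual check of this suggestion, reusing the histogram above (a sketch; the transformed distribution should look noticeably more symmetric):

hist(log(boston$MV), xlab = "log(Median Value)",
     main = "Log-Transformed House Pricing", col = "grey")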

Data Partitioning

We partition the data into training and test sets in a 70/30 ratio.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
set.seed(12345)
#Do data partitioning
inTrain <- createDataPartition(y = boston$MV, p = 0.70, list = FALSE)
training <- boston[inTrain,]
testing <- boston[-inTrain,]
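As a quick sanity check on the split (the expected counts are consistent with the 342 residual degrees of freedom plus 14 coefficients reported in the model summary below):

dim(training)   # expect 356 x 14
dim(testing)    # expect 150 x 14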

Regression Models

General Linear Model

First, let us try a linear regression model with MV as the dependent variable and all the remaining variables as independent variables. We train the model on the training dataset; the coefficients of all the features, along with the intercept, are shown below. Next, we use the trained model to predict the outcome (MV) for the testing dataset. A good metric for the accuracy of the model is the root-mean-squared error (RMSE).

fit.lm <- lm(MV~.,data = training)

#Check Coefficients
data.frame(coef = round(fit.lm$coefficients,2))
##               coef
## (Intercept)  42.96
## CRIM         -0.10
## ZN            0.05
## INDUS         0.00
## CHAS          2.50
## NOX         -19.98
## RM            3.21
## AGE           0.01
## DIS          -1.59
## RAD           0.33
## TAX          -0.01
## PT           -1.04
## B             0.01
## LSTAT        -0.52
set.seed(12345)
#predict on test set
pred.lm <- predict(fit.lm, newdata = testing)

# Root-mean squared error
rmse.lm <- sqrt(sum((pred.lm - testing$MV)^2)/
                   length(testing$MV))
                   
c(RMSE = rmse.lm, R2 = summary(fit.lm)$r.squared)
##      RMSE        R2 
## 4.3819921 0.7239054

We see that the RMSE is 4.381992 and the R2 value is 0.7239 for this model.
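Since this computation recurs for every model below, it could be wrapped in a small helper (a sketch; mathematically identical to the inline version above, as sqrt(sum(d^2)/n) equals sqrt(mean(d^2))):

# Root-mean-squared error between predicted and observed values
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))
rmse(pred.lm, testing$MV)  # reproduces the 4.3819921 reported above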

Linear Model 2

As a first refinement, the log transformation will be applied only to the outcome, fitting log(MV), to determine whether it does indeed provide a better fit.

set.seed(12345)
#Try linear model using all features
fit.lm1 <- lm(log(MV)~.,data = training)

set.seed(12345)
#predict on test set
pred.lm1 <- predict(fit.lm1, newdata = testing)

# Root-mean squared error
rmse.lm1 <- sqrt(sum((exp(pred.lm1) - testing$MV)^2)/
                   length(testing$MV))
                   
c(RMSE = rmse.lm1, R2 = summary(fit.lm1)$r.squared)
##     RMSE       R2 
## 4.146613 0.770014

Let us examine the calculated p-value for each feature in the linear model. Any feature which is not significant (p > 0.05) is not contributing significantly to the model, probably due to multicollinearity with other features. We see that the features ZN, INDUS, and AGE are not significant.

library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
summary(fit.lm1)
## 
## Call:
## lm(formula = log(MV) ~ ., data = training)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73926 -0.10694 -0.01578  0.10220  0.78240 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.4720938  0.2550150  17.537  < 2e-16 ***
## CRIM        -0.0106413  0.0018236  -5.835 1.25e-08 ***
## ZN           0.0011054  0.0007049   1.568  0.11778    
## INDUS        0.0017993  0.0029814   0.603  0.54658    
## CHAS         0.0995050  0.0410116   2.426  0.01577 *  
## NOX         -0.9741578  0.1875861  -5.193 3.56e-07 ***
## RM           0.0655990  0.0205322   3.195  0.00153 ** 
## AGE          0.0004851  0.0006502   0.746  0.45607    
## DIS         -0.0541356  0.0100680  -5.377 1.41e-07 ***
## RAD          0.0148185  0.0033352   4.443 1.20e-05 ***
## TAX         -0.0005851  0.0001871  -3.127  0.00192 ** 
## PT          -0.0437995  0.0064565  -6.784 5.17e-11 ***
## B            0.0003509  0.0001342   2.614  0.00934 ** 
## LSTAT       -0.0277881  0.0023630 -11.760  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1979 on 342 degrees of freedom
## Multiple R-squared:   0.77,  Adjusted R-squared:  0.7613 
## F-statistic: 88.08 on 13 and 342 DF,  p-value: < 2.2e-16

Thus, the log transformation of MV is indeed appropriate. Now, the predictors that are not statistically significant can be removed from the model: INDUS, AGE, and ZN. Moreover, for the purpose of this report, the variable TAX is dropped as well, since it is highly correlated with the proportion of non-retail business acres (INDUS), which is itself not significant.
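Multicollinearity can be checked directly with variance inflation factors from the car package attached above (a sketch):

# VIFs well above roughly 5-10 indicate collinearity among predictors
round(vif(fit.lm1), 2)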

Linear Model 3

Based on all these observations, we now construct a new linear model as below:

log(MV) ~ CRIM + CHAS + NOX + RM + DIS + PT + RAD + B + LSTAT

set.seed(12345)
#Try simple linear model using selected features
fit.lm2 <- lm(formula = log(MV) ~ CRIM + CHAS + NOX + RM + DIS + PT + 
            RAD + B + LSTAT, data = training)

set.seed(12345)
#predict on test set
pred.lm2 <- predict(fit.lm2, newdata = testing)

# Root-mean squared error
rmse.lm2 <- sqrt(sum((exp(pred.lm2) - testing$MV)^2)/
                   length(testing$MV))
                   
c(RMSE = rmse.lm2, R2 = summary(fit.lm2)$r.squared)
##      RMSE        R2 
## 4.1786652 0.7624071

This model is marginally less accurate than linear model 2, based on a slight increase in RMSE and a slight decrease in the R2 value.

Let us plot the predicted vs. actual values of the outcome MV.

# Plot of predicted price vs actual price
plot(exp(pred.lm2),testing$MV, xlab = "Predicted Price", ylab = "Actual Price", col="blue")
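A 45-degree reference line (not part of the original plot) makes over- and under-prediction easier to see:

abline(0, 1, col = "red")  # points on this line would be predicted perfectly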

Diagnostics plots

layout(matrix(c(1,2,3,4),2,2))  # arrange the four diagnostic plots in a 2x2 grid
plot(fit.lm2)  # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage

Conclusion

The goal of this report was to determine the neighborhood attributes that best explain variation in house pricing. Various statistical techniques were used to eliminate predictors and extraneous observations. Examining the final model, one finds, quite reasonably, that house prices are higher in areas with lower crime and lower pupil-teacher ratios. House prices also tend to be higher closer to the Charles River, and houses with more rooms are pricier; since this report is concerned with neighborhood attributes rather than properties of the houses themselves, however, the number of rooms is not of primary interest.

The most interesting factors to consider are nitrogen oxide levels and distance to the main employment centers. On the one hand, people would want to live close to their place of employment. Yet it is reasonable to suggest that pollution levels are higher as one moves closer to these employment centers, and not just nitrogen oxide levels but noise pollution as well. The fitted regression model shows that higher pollution levels decrease house prices to a greater extent than distance to employment centers does. This suggests that people would prefer to live further from their place of employment if it meant lower levels of pollution, which is an interesting point to consider.

On a concluding note, it is important to remember that the data for this report were collected several decades ago. Pollution levels have no doubt changed in the years since, and it would be interesting to examine how that affects house pricing in Boston today.