Synopsis

This study reviews The Boston Housing Dataset , which was originally used in the paper `Hedonic prices and the demand for clean air’ by Harrison, D. and Rubinfeld, D.L.

Our aim is to investigate whether there is a relation between house prices and air pollution and teacher-pupil ratio in the corresponding neighborhood.

Before testing any hypothesis statistically we make some exploratory analysis after preprocessing to get a better grasp of the data.

Preprocessing the Data

First we will load all the libraries on to R that we will use in this study.

setwd("D:/PROGRAMCILIK/SABANCI/518 Visualization/Proje")
library(ggplot2)
library(ggmap)
library(mapproj)

Data

The description of the data variables is as follows:

  • CRIM - per capita crime rate by town
  • ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
  • INDUS - proportion of non-retail business acres per town.
  • CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  • NOX - nitric oxides concentration (parts per 10 million)
  • RM - average number of rooms per dwelling
  • AGE - proportion of owner-occupied units built prior to 1940
  • DIS - weighted distances to five Boston employment centers
  • RAD - index of accessibility to radial highways
  • TAX - full-value property-tax rate per $10,000
  • PTRATIO - pupil-teacher ratio by town
  • B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  • LSTAT - % lower status of the population
  • MEDV - Median value of owner-occupied homes in $1000’s

Exploratory Data Analysis

First we will load the data on to R and will check some summary statistics as follows:

boston<- read.csv("boston.csv")
str(boston)
## 'data.frame':    506 obs. of  16 variables:
##  $ TOWN   : Factor w/ 92 levels "Arlington","Ashland",..: 54 77 77 46 46 46 69 69 69 69 ...
##  $ TRACT  : int  2011 2021 2022 2031 2032 2033 2041 2042 2043 2044 ...
##  $ LON    : num  -71 -71 -70.9 -70.9 -70.9 ...
##  $ LAT    : num  42.3 42.3 42.3 42.3 42.3 ...
##  $ MEDV   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 22.1 16.5 18.9 ...
##  $ CRIM   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ ZN     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ INDUS  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ CHAS   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NOX    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ RM     : num  6.58 6.42 7.18 7 7.15 ...
##  $ AGE    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ DIS    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ RAD    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ TAX    : int  296 242 242 222 222 222 311 311 311 311 ...
##  $ PTRATIO: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
summary(boston)
##                 TOWN         TRACT           LON              LAT       
##  Cambridge        : 30   Min.   :   1   Min.   :-71.29   Min.   :42.03  
##  Boston Savin Hill: 23   1st Qu.:1303   1st Qu.:-71.09   1st Qu.:42.18  
##  Lynn             : 22   Median :3394   Median :-71.05   Median :42.22  
##  Boston Roxbury   : 19   Mean   :2700   Mean   :-71.06   Mean   :42.22  
##  Newton           : 18   3rd Qu.:3740   3rd Qu.:-71.02   3rd Qu.:42.25  
##  Somerville       : 15   Max.   :5082   Max.   :-70.81   Max.   :42.38  
##  (Other)          :379                                                  
##       MEDV            CRIM                ZN             INDUS      
##  Min.   : 5.00   Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46  
##  1st Qu.:17.02   1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19  
##  Median :21.20   Median : 0.25651   Median :  0.00   Median : 9.69  
##  Mean   :22.53   Mean   : 3.61352   Mean   : 11.36   Mean   :11.14  
##  3rd Qu.:25.00   3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10  
##  Max.   :50.00   Max.   :88.97620   Max.   :100.00   Max.   :27.74  
##                                                                     
##       CHAS              NOX               RM             AGE        
##  Min.   :0.00000   Min.   :0.3850   Min.   :3.561   Min.   :  2.90  
##  1st Qu.:0.00000   1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02  
##  Median :0.00000   Median :0.5380   Median :6.208   Median : 77.50  
##  Mean   :0.06917   Mean   :0.5547   Mean   :6.285   Mean   : 68.57  
##  3rd Qu.:0.00000   3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08  
##  Max.   :1.00000   Max.   :0.8710   Max.   :8.780   Max.   :100.00  
##                                                                     
##       DIS              RAD              TAX           PTRATIO     
##  Min.   : 1.130   Min.   : 1.000   Min.   :187.0   Min.   :12.60  
##  1st Qu.: 2.100   1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40  
##  Median : 3.207   Median : 5.000   Median :330.0   Median :19.05  
##  Mean   : 3.795   Mean   : 9.549   Mean   :408.2   Mean   :18.46  
##  3rd Qu.: 5.188   3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20  
##  Max.   :12.127   Max.   :24.000   Max.   :711.0   Max.   :22.00  
## 

Except the “TOWN” feature, all of our variables are numeric ones; hence we can build this as a regression problem.

  • Plotting

Since we have the latitude and longitude information, we can plot variables on a map. In order to do this we will use get_map function from the ggmap package. This will load the Google-map of the specified area. In our case it is the Boston map.

BostonLL <-c(-71.30, 42.00, -70.80, 42.40)
map <- get_map(location = BostonLL, zoom = 11)

First we will plot the house pricings from low to high with color, darker hues indicates higher prices, and vice versa:

mapPoints <- ggmap(map) + geom_point(aes(x = LON, y = LAT,color=MEDV),data = boston, alpha = 0.7,size=3)+ scale_color_gradient(low="#9ebcda",high="#8856a7")+labs(title="Median value of owner-occupied homes in $1000's",y="Latitude",x="Longtitude",color="House Prices (in 1000$'s)" )

mapPoints

To see the prices in categories we will break the prices in to quantiles and then see their distribution on the map.

cutpoints<-quantile(boston$MEDV,seq(0,1,length=4),na.rm=TRUE)
boston$MEDVQuantiles <- cut(boston$MEDV,breaks=cutpoints,include.lowest=TRUE,labels =c("Low priced","Mid Priced","High priced"))
table(boston$MEDVQuantiles)
## 
##  Low priced  Mid Priced High priced 
##         172         168         166
mapPoints2 <- ggmap(map) + geom_point(aes(x = LON, y = LAT,color=MEDVQuantiles),data = boston, alpha = 0.7,size=3) + ggtitle("Median value of owner-occupied homes in $1000's") +facet_grid(.~MEDVQuantiles) +labs(title="Median House Prices Categorized",y="Latitude",x="Longtitude",color="House Prices (in 1000$'s)" )

mapPoints2

Next we will check the nitric oxides concentration:

mapPointsNOX <- ggmap(map) + geom_point(aes(x = LON, y = LAT,color=NOX),data = boston, alpha = 1,size=3)+scale_color_gradient(low="#a1d99b",high="#31a354")+labs(title="Nitric Oxides Concentration ",y="Latitude",x="Longtitude",color="NOX Conc. (parts per 10 million)" )

mapPointsNOX

Finally we check the distribution of pupil-teacher ratio:

mapPointsPTRATIO <- ggmap(map) + geom_point(aes(x = LON, y = LAT,color=PTRATIO),data = boston, alpha = 1,size=3)+scale_color_gradient(low="#fdae6b",high="#e6550d")+labs(title="Pupil Teacher Ratio",y="Latitude",x="Longtitude",color="Pupil-Teacher Ratio by Town" )

mapPointsPTRATIO

Hypothesis

The exploratory analysis shows signs of negative correlation, so we can build our initial hypothesis as that the nitric oxides concentration and pupil teacher ratio has an adverse effect on Median value of owner-occupied homes.

In order to check our hypothesis we will run a linear regression to see their relation with median homes prices and check whether there is a statistically significant relation.

Splitting the Data

In order to validate our model later, we will first split our data into test and train data and build our model on train data and test it later with the test data.

library(caTools)
split=sample.split(boston$MEDV,SplitRatio = 0.7)
train=subset(boston,split==TRUE)
test=subset(boston,split==FALSE)

Relation with House Pricings

linreg =lm(MEDV~NOX+PTRATIO,data=train)
summary(linreg)
## 
## Call:
## lm(formula = MEDV ~ NOX + PTRATIO, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.853  -4.619  -1.183   2.605  33.870 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  74.0638     3.5449  20.893  < 2e-16 ***
## NOX         -28.9137     3.3955  -8.515 4.54e-16 ***
## PTRATIO      -1.9118     0.1802 -10.607  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.493 on 361 degrees of freedom
## Multiple R-squared:  0.3802, Adjusted R-squared:  0.3768 
## F-statistic: 110.7 on 2 and 361 DF,  p-value: < 2.2e-16

The summary statistics of the linear regression shows us that we can reject the null hypothesis for both variables NOX and PTRATIO. In other words there is indeed a correlation between the variables,albeit a negative one.

To further test our hypothesis and validate our model, we will predict the house prices with our model based on these two variables.

linregPrediction<-predict(linreg,newdata=test)
test$prediction <- linregPrediction
linregSSE <- sum((linregPrediction-test$MEDV)^2)
linregSSE
## [1] 6548.048

Below is the distribution of real house price Values of the test data.

cutpoints2<-quantile(test$MEDV,seq(0,1,length=4),na.rm=TRUE)
test$MEDVQuantilesTest <- cut(test$MEDV,breaks=cutpoints2,include.lowest=TRUE,labels =c("Low priced","Mid Priced","High priced"))
table(test$MEDVQuantilesTest)
## 
##  Low priced  Mid Priced High priced 
##          48          47          47
testMEDV <- ggmap(map) + geom_point(aes(x = LON, y = LAT,color=MEDVQuantilesTest),data = test, alpha = 0.7,size=3) +facet_grid(.~MEDVQuantilesTest) +labs(title="House Prices in Test Data", y="Latitude",x="Longtitude", color="House Prices (in 1000$'s)")

testMEDV

And here are the values of the house prices with predicted with the model build from NOX and PTRATIO variables:

cutpoints3<-quantile(test$prediction,seq(0,1,length=4),na.rm=TRUE)
test$MEDVQuantilesprediction <-cut(test$prediction,breaks=cutpoints3,include.lowest=TRUE,labels =c("Low priced predicted","Mid Priced predicted","High priced predicted"))
table(test$MEDVQuantilesprediction)
## 
##  Low priced predicted  Mid Priced predicted High priced predicted 
##                    50                    47                    45
testPrediction <- ggmap(map) + geom_point(aes(x = LON, y = LAT,color=MEDVQuantilesprediction),data = test, alpha = 0.7,size=3) +facet_grid(.~MEDVQuantilesprediction) +labs(title="Predicted House Prices in Test Data", y="Latitude",x="Longtitude",color="House Prices (in 1000$'s)")

testPrediction

Finally we can see the performance of the model with plotting the real values vs. the predicted ones. A perfect match would have resulted dots lying very close to the line.

qplot(x=prediction,y=MEDV,data=test,geom=c("point","smooth"),method="lm", xlab="Predicted Values",ylab="Real Values", main="Real vs. Predicted House Prices" )

Conclusion

In the light of exploratory analysis we build our hypothesis that there is a negative correlation between the house prices and nitric oxides concentration and pupil teacher ratio. Our statistical analysis showed that there is indeed a negative relation. Additionally, to test our claim we build a model to test our hypothesis outside the training data. In order to do this we split our data and use only %70 percent of it to build our model and tested it on the remaining %30. A visual representation of the match of the predicted values with the real ones is also provided. The natural next step would be to test this claim in other cities.