This study reviews The Boston Housing Dataset , which was originally used in the paper `Hedonic prices and the demand for clean air’ by Harrison, D. and Rubinfeld, D.L.
Our aim is to investigate whether there is a relation between house prices and air pollution and teacher-pupil ratio in the corresponding neighborhood.
Before testing any hypothesis statistically we make some exploratory analysis after preprocessing to get a better grasp of the data.
First we will load all the libraries on to R that we will use in this study.
setwd("D:/PROGRAMCILIK/SABANCI/518 Visualization/Proje")
library(ggplot2)
library(ggmap)
library(mapproj)
The description of the data variables is as follows:
First we will load the data on to R and will check some summary statistics as follows:
boston<- read.csv("boston.csv")
str(boston)
## 'data.frame': 506 obs. of 16 variables:
## $ TOWN : Factor w/ 92 levels "Arlington","Ashland",..: 54 77 77 46 46 46 69 69 69 69 ...
## $ TRACT : int 2011 2021 2022 2031 2032 2033 2041 2042 2043 2044 ...
## $ LON : num -71 -71 -70.9 -70.9 -70.9 ...
## $ LAT : num 42.3 42.3 42.3 42.3 42.3 ...
## $ MEDV : num 24 21.6 34.7 33.4 36.2 28.7 22.9 22.1 16.5 18.9 ...
## $ CRIM : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ ZN : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ INDUS : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ CHAS : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NOX : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ RM : num 6.58 6.42 7.18 7 7.15 ...
## $ AGE : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ DIS : num 4.09 4.97 4.97 6.06 6.06 ...
## $ RAD : int 1 2 2 3 3 3 5 5 5 5 ...
## $ TAX : int 296 242 242 222 222 222 311 311 311 311 ...
## $ PTRATIO: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
summary(boston)
## TOWN TRACT LON LAT
## Cambridge : 30 Min. : 1 Min. :-71.29 Min. :42.03
## Boston Savin Hill: 23 1st Qu.:1303 1st Qu.:-71.09 1st Qu.:42.18
## Lynn : 22 Median :3394 Median :-71.05 Median :42.22
## Boston Roxbury : 19 Mean :2700 Mean :-71.06 Mean :42.22
## Newton : 18 3rd Qu.:3740 3rd Qu.:-71.02 3rd Qu.:42.25
## Somerville : 15 Max. :5082 Max. :-70.81 Max. :42.38
## (Other) :379
## MEDV CRIM ZN INDUS
## Min. : 5.00 Min. : 0.00632 Min. : 0.00 Min. : 0.46
## 1st Qu.:17.02 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19
## Median :21.20 Median : 0.25651 Median : 0.00 Median : 9.69
## Mean :22.53 Mean : 3.61352 Mean : 11.36 Mean :11.14
## 3rd Qu.:25.00 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10
## Max. :50.00 Max. :88.97620 Max. :100.00 Max. :27.74
##
## CHAS NOX RM AGE
## Min. :0.00000 Min. :0.3850 Min. :3.561 Min. : 2.90
## 1st Qu.:0.00000 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02
## Median :0.00000 Median :0.5380 Median :6.208 Median : 77.50
## Mean :0.06917 Mean :0.5547 Mean :6.285 Mean : 68.57
## 3rd Qu.:0.00000 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08
## Max. :1.00000 Max. :0.8710 Max. :8.780 Max. :100.00
##
## DIS RAD TAX PTRATIO
## Min. : 1.130 Min. : 1.000 Min. :187.0 Min. :12.60
## 1st Qu.: 2.100 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40
## Median : 3.207 Median : 5.000 Median :330.0 Median :19.05
## Mean : 3.795 Mean : 9.549 Mean :408.2 Mean :18.46
## 3rd Qu.: 5.188 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20
## Max. :12.127 Max. :24.000 Max. :711.0 Max. :22.00
##
Except the “TOWN” feature, all of our variables are numeric ones; hence we can build this as a regression problem.
Since we have the latitude and longitude information, we can plot variables on a map. In order to do this we will use get_map function from the ggmap package. This will load the Google-map of the specified area. In our case it is the Boston map.
BostonLL <-c(-71.30, 42.00, -70.80, 42.40)
map <- get_map(location = BostonLL, zoom = 11)
First we will plot the house pricings from low to high with color, darker hues indicates higher prices, and vice versa:
mapPoints <- ggmap(map) + geom_point(aes(x = LON, y = LAT,color=MEDV),data = boston, alpha = 0.7,size=3)+ scale_color_gradient(low="#9ebcda",high="#8856a7")+labs(title="Median value of owner-occupied homes in $1000's",y="Latitude",x="Longtitude",color="House Prices (in 1000$'s)" )
mapPoints
To see the prices in categories we will break the prices in to quantiles and then see their distribution on the map.
cutpoints<-quantile(boston$MEDV,seq(0,1,length=4),na.rm=TRUE)
boston$MEDVQuantiles <- cut(boston$MEDV,breaks=cutpoints,include.lowest=TRUE,labels =c("Low priced","Mid Priced","High priced"))
table(boston$MEDVQuantiles)
##
## Low priced Mid Priced High priced
## 172 168 166
mapPoints2 <- ggmap(map) + geom_point(aes(x = LON, y = LAT,color=MEDVQuantiles),data = boston, alpha = 0.7,size=3) + ggtitle("Median value of owner-occupied homes in $1000's") +facet_grid(.~MEDVQuantiles) +labs(title="Median House Prices Categorized",y="Latitude",x="Longtitude",color="House Prices (in 1000$'s)" )
mapPoints2
Next we will check the nitric oxides concentration:
mapPointsNOX <- ggmap(map) + geom_point(aes(x = LON, y = LAT,color=NOX),data = boston, alpha = 1,size=3)+scale_color_gradient(low="#a1d99b",high="#31a354")+labs(title="Nitric Oxides Concentration ",y="Latitude",x="Longtitude",color="NOX Conc. (parts per 10 million)" )
mapPointsNOX
Finally we check the distribution of pupil-teacher ratio:
mapPointsPTRATIO <- ggmap(map) + geom_point(aes(x = LON, y = LAT,color=PTRATIO),data = boston, alpha = 1,size=3)+scale_color_gradient(low="#fdae6b",high="#e6550d")+labs(title="Pupil Teacher Ratio",y="Latitude",x="Longtitude",color="Pupil-Teacher Ratio by Town" )
mapPointsPTRATIO
The exploratory analysis shows signs of negative correlation, so we can build our initial hypothesis as that the nitric oxides concentration and pupil teacher ratio has an adverse effect on Median value of owner-occupied homes.
In order to check our hypothesis we will run a linear regression to see their relation with median homes prices and check whether there is a statistically significant relation.
In order to validate our model later, we will first split our data into test and train data and build our model on train data and test it later with the test data.
library(caTools)
split=sample.split(boston$MEDV,SplitRatio = 0.7)
train=subset(boston,split==TRUE)
test=subset(boston,split==FALSE)
linreg =lm(MEDV~NOX+PTRATIO,data=train)
summary(linreg)
##
## Call:
## lm(formula = MEDV ~ NOX + PTRATIO, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.853 -4.619 -1.183 2.605 33.870
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.0638 3.5449 20.893 < 2e-16 ***
## NOX -28.9137 3.3955 -8.515 4.54e-16 ***
## PTRATIO -1.9118 0.1802 -10.607 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.493 on 361 degrees of freedom
## Multiple R-squared: 0.3802, Adjusted R-squared: 0.3768
## F-statistic: 110.7 on 2 and 361 DF, p-value: < 2.2e-16
The summary statistics of the linear regression shows us that we can reject the null hypothesis for both variables NOX and PTRATIO. In other words there is indeed a correlation between the variables,albeit a negative one.
To further test our hypothesis and validate our model, we will predict the house prices with our model based on these two variables.
linregPrediction<-predict(linreg,newdata=test)
test$prediction <- linregPrediction
linregSSE <- sum((linregPrediction-test$MEDV)^2)
linregSSE
## [1] 6548.048
Below is the distribution of real house price Values of the test data.
cutpoints2<-quantile(test$MEDV,seq(0,1,length=4),na.rm=TRUE)
test$MEDVQuantilesTest <- cut(test$MEDV,breaks=cutpoints2,include.lowest=TRUE,labels =c("Low priced","Mid Priced","High priced"))
table(test$MEDVQuantilesTest)
##
## Low priced Mid Priced High priced
## 48 47 47
testMEDV <- ggmap(map) + geom_point(aes(x = LON, y = LAT,color=MEDVQuantilesTest),data = test, alpha = 0.7,size=3) +facet_grid(.~MEDVQuantilesTest) +labs(title="House Prices in Test Data", y="Latitude",x="Longtitude", color="House Prices (in 1000$'s)")
testMEDV
And here are the values of the house prices with predicted with the model build from NOX and PTRATIO variables:
cutpoints3<-quantile(test$prediction,seq(0,1,length=4),na.rm=TRUE)
test$MEDVQuantilesprediction <-cut(test$prediction,breaks=cutpoints3,include.lowest=TRUE,labels =c("Low priced predicted","Mid Priced predicted","High priced predicted"))
table(test$MEDVQuantilesprediction)
##
## Low priced predicted Mid Priced predicted High priced predicted
## 50 47 45
testPrediction <- ggmap(map) + geom_point(aes(x = LON, y = LAT,color=MEDVQuantilesprediction),data = test, alpha = 0.7,size=3) +facet_grid(.~MEDVQuantilesprediction) +labs(title="Predicted House Prices in Test Data", y="Latitude",x="Longtitude",color="House Prices (in 1000$'s)")
testPrediction
Finally we can see the performance of the model with plotting the real values vs. the predicted ones. A perfect match would have resulted dots lying very close to the line.
qplot(x=prediction,y=MEDV,data=test,geom=c("point","smooth"),method="lm", xlab="Predicted Values",ylab="Real Values", main="Real vs. Predicted House Prices" )
In the light of exploratory analysis we build our hypothesis that there is a negative correlation between the house prices and nitric oxides concentration and pupil teacher ratio. Our statistical analysis showed that there is indeed a negative relation. Additionally, to test our claim we build a model to test our hypothesis outside the training data. In order to do this we split our data and use only %70 percent of it to build our model and tested it on the remaining %30. A visual representation of the match of the predicted values with the real ones is also provided. The natural next step would be to test this claim in other cities.