Load Manhattan Building dbf File
library("foreign", lib.loc = "/Library/Frameworks/R.framework/Versions/2.15/Resources/library")
MN <- read.dbf("/Users/telekineticturtle/Desktop/Colorado 13/Quant Methods/Data/mnmappluto.dbf")
Removal of buildings without location information
summary(MN[, c("YCoord", "XCoord")])
## YCoord XCoord
## Min. : 0 Min. : 0
## 1st Qu.:207645 1st Qu.: 986617
## Median :219274 Median : 991591
## Mean :217078 Mean : 977235
## 3rd Qu.:231006 3rd Qu.: 998354
## Max. :259301 Max. :1009761
MN <- MN[MN$YCoord > 0 & MN$XCoord > 0, ]
# SS: dim() returns the number of rows and columns in a table
dim(MN) # AC: If this returns 43318 rows, the above operation was successful.
## [1] 43318 78
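The row filter above relies on every coordinate being present. The summary does not report any NA values, so that appears to hold for MN, but the behavior is worth knowing. The following is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not taken from the PLUTO file). It shows how a missing coordinate would pass through a bare logical index as an all-NA row, and how wrapping the condition in which() avoids that.

```r
# Hypothetical toy data: one row with zero coordinates, one with a missing value.
toy <- data.frame(XCoord = c(986617, 0, NA, 991591),
                  YCoord = c(207645, 0, 219274, 231006))

# Same filter as used for MN: a comparison with NA gives NA, and indexing rows
# by NA produces a row filled with NA values.
kept_raw <- toy[toy$YCoord > 0 & toy$XCoord > 0, ]

# which() keeps only the positions where the condition is TRUE, so rows with
# missing coordinates are dropped along with the zero-coordinate rows.
kept_clean <- toy[which(toy$YCoord > 0 & toy$XCoord > 0), ]

nrow(kept_raw)    # counts the all-NA row as well
nrow(kept_clean)  # only rows with both coordinates present and positive
```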
plot(MN$YCoord ~ MN$XCoord, xlab = "X Coordinate", ylab = "Y Coordinate", main = "Manhattan Building Locations")
Creating a Dummy Variable
# AC: Using the ifelse() and is.na() functions to create a dummy variable
# to indicate whether a building is in an historic district (1) or not
# (0).
MN$HD <- ifelse(is.na(MN[, "HistDist"]), 0, 1)
# AC: We convert MN$HD to a factor because it is a dummy variable for
# categories, not a ratio level variable
MN$HD <- as.factor(MN$HD)
# AC: Replotting with historic district buildings highlighted.
plot(y = MN$YCoord, x = MN$XCoord, col = MN$HD, pch = 16, cex = 0.5, xlab = "X Coordinate",
ylab = "Y Coordinate", main = "Manhattan Building Locations")
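Passing the factor MN$HD as `col` works because base graphics reads a factor as its integer codes, which then index the current palette. As a small sketch of that mapping, assuming made-up district labels (`HistDist`, `hd_colors` and `point_col` below are illustrative names and values, not PLUTO data), the colors can also be named explicitly:

```r
# Hypothetical toy column: NA means the building is not in a historic district.
HistDist <- c(NA, "District A", NA, "District B")

# Same recipe as above: 1 if a district name is present, 0 otherwise.
HD <- as.factor(ifelse(is.na(HistDist), 0, 1))

# The integer codes behind the factor: level "0" is code 1, level "1" is code 2.
as.integer(HD)

# Naming the colors per level makes the mapping visible instead of implicit.
hd_colors <- c("0" = "grey60", "1" = "red")
point_col <- unname(hd_colors[as.character(HD)])

point_col
```

A vector like `point_col` could be passed to `col` in place of the factor if different colors are wanted.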
Making subsets of the data
# AC: Make a subset of Manhattan Buildings that are in an historic
# district using a logical operator and bracket notation.
inHD <- MN[MN$HD == 1, ]
# AC: Now do the same, but for buildings NOT in an historic district.
outHD <- MN[MN$HD == 0, ]
SS: Desired Null Hypothesis: The designation of historic districts has no effect on property values; buildings in a historic district have the same value as those outside of a historic district, and any difference between the two groups is due to random chance.
Test 1 - Property Value Split by Historic Designation
# AC: Two sample t-test (two-sided, Welch; variances not assumed equal)
# comparing
# assessment values for buildings in and out of an historic district.
# Null Hypothesis: The buildings in an historic district have the same
# value as those outside of an historic district, and difference between
# the two groups is due to random chance.
t.test(x = inHD$AssessTot, y = outHD$AssessTot)
##
## Welch Two Sample t-test
##
## data: inHD$AssessTot and outHD$AssessTot
## t = -15.05, df = 43286, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1724546 -1327117
## sample estimates:
## mean of x mean of y
## 1233743 2759574
First, hypothesis test 1 yielded a t value with a large magnitude, indicating that the difference in assessment values between buildings inside and outside historic districts is much greater than would be expected if the variation were due only to random chance. Second, the p-value (< 2.2e-16) indicates that this difference is statistically significant: if the two subsets really had the same mean value, a t statistic as or more extreme than -15 would be extremely unlikely. Third, the sign of the t value is negative, and the mean value for property in historic districts is lower than the mean for property outside historic districts. Assuming the test was appropriate, it indicates that historic districts are associated with lower property values. It does not, of course, say why.
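To make the t value less of a black box, the sketch below recomputes a Welch statistic by hand and compares it with t.test(). The samples `x` and `y` are hypothetical toy values, not assessment data. The point is only that t is the difference in means divided by an unpooled standard error, with degrees of freedom from the Welch-Satterthwaite approximation.

```r
# Hypothetical toy samples (not PLUTO data).
x <- c(10, 12, 9, 11, 13)
y <- c(20, 35, 18, 40, 27, 31)

# Per-sample variance of the mean.
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch statistic: difference in means over the unpooled standard error.
t_manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation for the degrees of freedom.
df_manual <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# t.test() defaults to var.equal = FALSE, which is the Welch test.
res <- t.test(x, y)
c(manual = t_manual, t.test = unname(res$statistic))
c(manual = df_manual, t.test = unname(res$parameter))
```

The sign of t follows the order of the arguments: it is negative whenever the first sample has the smaller mean, as with inHD and outHD above.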
SS: That test was not appropriate for our desired null hypothesis because of a confounding variable: building size. Older buildings may be smaller than newer buildings (on average).
Test 2 - Building Area Split by Historic Designation
# AC: Two sample t-test (two-sided, Welch; variances not assumed equal)
# comparing
# building area for buildings in and out of an historic district. Null
# Hypothesis: The buildings in a historic district have the same area as
# those outside of a historic district, and difference between the two
# groups is due to random chance.
t.test(x = inHD$BldgArea, y = outHD$BldgArea)
##
## Welch Two Sample t-test
##
## data: inHD$BldgArea and outHD$BldgArea
## t = -9.037, df = 15819, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -25103 -16154
## sample estimates:
## mean of x mean of y
## 22050 42678
Hypothesis test 2 yielded another large t value, and the p-value indicates that there is a significant size difference between buildings in historic districts and those outside of historic districts. On average, historic district buildings are smaller than non-historic district buildings (about half the size, actually). These findings suggest that to answer the original question correctly, we should control for building size.
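The "about half the size" figure can be checked directly from the sample means printed in the Test 2 output. This is a quick arithmetic sketch; the variable names are illustrative, and the two numbers are copied from the output above.

```r
# Sample means copied from the Test 2 output above.
mean_in  <- 22050   # buildings inside historic districts
mean_out <- 42678   # buildings outside historic districts

size_ratio <- mean_in / mean_out   # relative size of historic-district buildings
mean_diff  <- mean_in - mean_out   # point estimate of the difference in means

round(size_ratio, 2)
mean_diff
```

The difference in means falls inside the reported 95 percent confidence interval (-25103 to -16154), as it should.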
Test 3 - Property Value Split by Historic Designation after Controlling for Location
SS: Select buildings on the same block as a historic district
# SS: Get a list of all blocks that contain historic buildings.
blocks <- inHD$Block
# SS: Select all buildings (from MN) that are on the same block as
# historic buildings. Save the result as a new object. AC: the %in%
# operator selects all rows where the block column contains values in our
# list of blocks.
HDB <- MN[MN$Block %in% blocks, ]
# AC: Select all rows (buildings) with the same block number as an
# historical district building that are NOT themselves in an historic
# district.
HDB_out <- HDB[HDB$HD == 0, ]
# AC: Select all rows (buildings) with the same block number as an
# historical district building that are themselves in an historic
# district. Note that this data frame will be equivalent to inHD.
HDB_in <- HDB[HDB$HD == 1, ]
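The block-matching step is easier to follow on a few rows. The sketch below repeats it on a hypothetical toy table (`toy`, `hd_blocks` and the `same_block*` names are made up for illustration). It also shows why the in-district subset is unchanged by the block filter: every historic-district building is, by construction, on a block that contains a historic-district building.

```r
# Hypothetical toy table: block numbers and the historic-district dummy.
toy <- data.frame(Block = c(10, 10, 11, 12, 12, 13),
                  HD    = factor(c(1, 0, 0, 1, 1, 0)))

# Blocks that contain at least one historic-district building.
# unique() is optional; %in% gives the same answer with duplicates present.
hd_blocks <- unique(toy$Block[toy$HD == 1])

# Keep every building on those blocks, then split by designation.
same_block     <- toy[toy$Block %in% hd_blocks, ]
same_block_out <- same_block[same_block$HD == 0, ]
same_block_in  <- same_block[same_block$HD == 1, ]

nrow(same_block_out)  # non-designated neighbours on historic blocks
nrow(same_block_in)   # same count as all designated buildings in toy
```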
Hypothesis Test 3
# AC: Two sample t-test (two-sided, Welch; variances not assumed equal)
# comparing
# property value for buildings in and out of an historic district but on
# the same block as buildings in an historic district. Null Hypothesis:
# For blocks that contain buildings within an historic district, the
# buildings in an historic district have the same property value as those
# outside of an historic district, and difference between the two groups
# is due to random chance.
t.test(x = HDB_in$AssessTot, y = HDB_out$AssessTot)
##
## Welch Two Sample t-test
##
## data: HDB_in$AssessTot and HDB_out$AssessTot
## t = -9.728, df = 4349, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1507426 -1001727
## sample estimates:
## mean of x mean of y
## 1233743 2488319
After controlling for location, the historic designation is still associated with different (lower) property values. The p-value is much less than 0.05, which indicates that the difference is still statistically significant. Relative to hypothesis test 1, the confidence interval has shifted toward zero by about $0.2 to $0.3 million (from roughly -$1.7 to -$1.3 million to roughly -$1.5 to -$1.0 million).
SS: Remove properties with a building area of 0.
# SS: Calculate price per square foot for historic buildings and then
# nonhistoric buildings, only for buildings with an area greater than 0.
HDB_in_sqft <- HDB_in[HDB_in$BldgArea > 0, "AssessTot"]/HDB_in[HDB_in$BldgArea >
0, "BldgArea"]
HDB_out_sqft <- HDB_out[HDB_out$BldgArea > 0, "AssessTot"]/HDB_out[HDB_out$BldgArea >
0, "BldgArea"]
# AC: Two sample t-test (two-sided, Welch; variances not assumed equal)
# comparing property value per square foot for buildings in and out of an
# historic district but on the same block as buildings in an historic
# district. Null Hypothesis: For blocks that contain buildings within an
# historic district, buildings in an historic district have the same value
# per square foot as those outside of an historic district, and difference
# between the two groups is due to random chance.
t.test(HDB_in_sqft, HDB_out_sqft)
##
## Welch Two Sample t-test
##
## data: HDB_in_sqft and HDB_out_sqft
## t = -1.664, df = 4521, p-value = 0.09614
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -36.413 2.976
## sample estimates:
## mean of x mean of y
## 66.76 83.48
After controlling for both location and building size, the p-value (0.096) is greater than 0.05, so the null hypothesis cannot be rejected. Although the mean value per square foot for buildings in historic districts is less than for those outside, this could be due entirely to random chance. Note also that the 95% confidence interval crosses 0. My conclusion did not change between test 1 and test 3 (which controlled for location), so the difference maker is most likely building size. Test 2 showed that buildings in historic districts are smaller on average than other buildings. Combined, the four tests suggest that the property value discrepancy between buildings inside and outside historic districts exists because buildings in historic districts are smaller rather than because they are designated historic. Based on these findings and implications, further research could be conducted on the relationship between building size and property value.
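The per-square-foot calculation repeats the same row filter four times, which is easy to get wrong. One way to tidy it is a small helper. `value_per_sqft` below is a hypothetical function, not part of the original analysis, and the toy data are made up.

```r
# Hypothetical helper (not part of the original analysis): assessed value per
# square foot, dropping lots whose building area is zero.
value_per_sqft <- function(df) {
  has_area <- df$BldgArea > 0
  df$AssessTot[has_area] / df$BldgArea[has_area]
}

# Hypothetical toy data; the second lot has no building area and is dropped.
toy <- data.frame(AssessTot = c(500000, 120000, 900000, 300000),
                  BldgArea  = c(5000, 0, 10000, 2500))

value_per_sqft(toy)
```

Under these assumptions, the final test could be written as t.test(value_per_sqft(HDB_in), value_per_sqft(HDB_out)).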
AC: Test to see if there is a correlation between N-S location (YCoord) and property value (AssessTot)
First plot the data.
qqnorm(MN$AssessTot, ylim = c(0, 5e+08), main = "Normal Q-Q Plot for Property Assessment Values") # One outlier omitted
qqline(MN$AssessTot)
qqnorm(MN$YCoord, main = "Normal Q-Q Plot for Y Coordinate")
qqline(MN$YCoord)
plot(MN$AssessTot, MN$YCoord, xlim = c(0, 5e+08), main = "Property Assessment Value by Y Coordinate",
ylab = "Y Coordinate (Larger = Further North)", xlab = "Property Assessment Value ($)") # Some outliers omitted
Perform Correlation Tests
cor.test(MN$AssessTot, MN$YCoord)
##
## Pearson's product-moment correlation
##
## data: MN$AssessTot and MN$YCoord
## t = -12.43, df = 43316, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06902 -0.05025
## sample estimates:
## cor
## -0.05964
cor.test(MN$AssessTot, MN$YCoord, method = "kendall")
##
## Kendall's rank correlation tau
##
## data: MN$AssessTot and MN$YCoord
## z = -60.93, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## -0.1952
cor.test(MN$AssessTot, MN$YCoord, method = "spearman")
## Warning: Cannot compute exact p-values with ties
##
## Spearman's rank correlation rho
##
## data: MN$AssessTot and MN$YCoord
## S = 1.754e+13, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.2947
The large number of buildings in this sample makes a significant result very likely. Therefore, although the correlation between y coordinate and assessment value is weak (-0.060 using Pearson's and -0.195 using Kendall's), it is statistically significant. A rank correlation (e.g. Kendall's) is better in this case because the data are not normally distributed (see the normal q-q plots). Since the two variables are negatively correlated, higher property values are associated with more southerly (downtown) properties. Closer inspection of the scatter plot shows that the spatial pattern of property values is more complicated. Visually, the greatest property values are located at y coordinates between 210,000 and 220,000. A second maximum is located around the southernmost properties.
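The gap between the Pearson and rank coefficients is what one would expect for heavily skewed data. As a toy illustration with hypothetical values (not PLUTO data), a single extreme value weakens Pearson's r for a perfectly monotone relationship, while Kendall's tau and Spearman's rho, which depend only on ordering, are unaffected:

```r
# Hypothetical toy data: y strictly decreases as x increases, with one
# extreme value at the start.
x <- 1:10
y <- c(5000, 90, 80, 70, 60, 50, 40, 30, 20, 10)

pearson  <- cor(x, y, method = "pearson")
kendall  <- cor(x, y, method = "kendall")
spearman <- cor(x, y, method = "spearman")

# Rank-based coefficients see a perfect monotone relationship; Pearson's r is
# pulled well away from -1 by the single large value.
round(c(pearson = pearson, kendall = kendall, spearman = spearman), 3)
```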
Two other notes worth making in this document are 1) that none of the data examined appear normally distributed in normal q-q plots or histograms, and 2) that we have been treating buildings and properties interchangeably, but some properties have multiple buildings. The median and both quartiles are 1 building and the mean is 1.078, but there are some outliers, and several properties in the first 100 rows have 2 buildings. Multiple buildings (NumBldgs) should be reflected by larger building areas (BldgArea), but there may also be an impact on the total assessment of the property (AssessTot). In fact, I found a slightly stronger rank correlation between number of buildings and assessment total (Kendall's tau = 0.202) than between y coordinate and assessment total (tau = -0.195).
summary(MN$NumBldgs)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 1.00 1.00 1.08 1.00 163.00
MN[1:100, "NumBldgs"]
## [1] 163 12 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1
## [18] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [35] 1 1 1 1 6 1 2 1 1 1 1 1 1 1 0 1 1
## [52] 1 1 1 2 1 2 1 1 2 1 1 1 1 1 1 1 1
## [69] 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 0
## [86] 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1
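To put a number on how often a lot carries more than one building, a tabulation is usually enough. The sketch below uses a short made-up vector (`num_bldgs` is hypothetical, not the PLUTO column); the same calls could be pointed at MN$NumBldgs.

```r
# Hypothetical building counts per lot (not PLUTO data).
num_bldgs <- c(1, 1, 2, 1, 0, 1, 6, 1, 2, 1)

# Frequency of each count, and the share of lots with more than one building.
counts      <- table(num_bldgs)
share_multi <- mean(num_bldgs > 1)

counts
share_multi
```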
cor.test(MN$NumBldgs, MN$AssessTot, method = "kendall")
##
## Kendall's rank correlation tau
##
## data: MN$NumBldgs and MN$AssessTot
## z = 52.41, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.202
matrx <- matrix(data = c(1, 2), 1, 2)
layout(matrx)
plot(MN$YCoord ~ MN$XCoord, xlab = "X Coordinate", ylab = "Y Coordinate", main = "Manhattan Building Locations")
plot(MN$AssessTot, MN$YCoord, xlim = c(0, 5e+08), main = "Assessment Value by Y Coord",
ylab = "Y Coordinate (Larger = Further North)", xlab = "Property Assessment Value ($)")