options(scipen=999)
Loading and preparing the data:
Load the data:
library(foreign)
MN.original <- read.dbf("C:/CU BOULDER/Coursework/Y2S2/GEOG 5023 - Quant Methods Geo/Week 3 - Bivariate Regression/HW2/mnmappluto.dbf")
Plot of the original dataset:
# Plot the spatial data
plot(MN.original$YCoord ~ MN.original$XCoord)
Clean the original dataset (e.g., exclude obs that do not have valid geographic coordinates):
# Summarize the spatial data (using slice notation)
summary(MN.original[, c("YCoord", "XCoord")])
## YCoord XCoord
## Min. : 0 Min. : 0
## 1st Qu.:207645 1st Qu.: 986617
## Median :219274 Median : 991591
## Mean :217078 Mean : 977235
## 3rd Qu.:231006 3rd Qu.: 998354
## Max. :259301 Max. :1009761
# identify buildings that have valid coordinate data and extract those
# obs (including all columns in the original dataset)
MN <- MN.original[MN.original$YCoord > 0 & MN.original$XCoord > 0, ]
# check if above operation done correctly; should get 43318 obs and 78
# vars
dim(MN)
## [1] 43318 78
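One caveat about this kind of logical filter: if a coordinate were missing (`NA`) rather than 0, the comparison would return `NA` and the subset would contain an all-`NA` row. The summary above shows no missing coordinates, so this does not affect the result here. The sketch below uses a small made-up data frame (`toy` is hypothetical, not part of the PLUTO data) to show the behavior and one way to guard against it with `which()`.

```r
# Hypothetical toy data: one row with zero coordinates, one with a missing XCoord
toy <- data.frame(
  id     = 1:4,
  XCoord = c(986617, 0, NA, 991591),
  YCoord = c(207645, 0, 219274, 231006)
)

# Same style of filter as used for MN: TRUE, FALSE, NA, TRUE
keep <- toy$XCoord > 0 & toy$YCoord > 0

# Indexing with NA keeps a row filled with NA values
naive <- toy[keep, ]

# which() drops both FALSE and NA, keeping only rows known to be valid
clean <- toy[which(keep), ]

nrow(naive)
nrow(clean)
```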
Plot of the cleaned dataset (including only buildings with valid geographic coordinates):
plot(MN$YCoord ~ MN$XCoord)
Finding famous people with the search function:
bloomberg.bldgs <- grep("bloomberg", MN$OwnerName, ignore.case = TRUE)
MN[bloomberg.bldgs, c("Address", "AssessTot", "OwnerName")]
## Address AssessTot OwnerName
## 25516 17 EAST 79 STREET 465177 BLOOMBERG MICHAEL
Identifying historic districts:
Make a dummy variable to identify buildings in a historic district:
MN$HD <- ifelse(is.na(MN$HistDist), 0, 1)
# check if above operation done correctly; should get a mean of 0.215
summary(MN$HD)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.215 0.000 1.000
# convert dummy variable column to factor
MN$HD <- as.factor(MN$HD)
summary(MN$HD)
## 0 1
## 34024 9294
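The mean of a 0/1 dummy variable is simply the share of observations coded 1, which is why a mean of 0.215 is the expected check value. A short sketch using only the counts reported above (the variable names are illustrative):

```r
# Counts taken from the factor summary above
n_out <- 34024   # buildings outside historic districts (HD == 0)
n_in  <- 9294    # buildings inside historic districts (HD == 1)

# Share of buildings in a historic district = mean of the 0/1 dummy
share_in <- n_in / (n_in + n_out)
round(share_in, 3)

# The same identity on a tiny made-up vector
hd_toy <- c(0, 0, 1, 0, 1)
mean(hd_toy) == sum(hd_toy == 1) / length(hd_toy)
```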
Plot of the historic districts:
plot(y = MN$YCoord, x = MN$XCoord, col = MN$HD, pch = 16, cex = 0.5)
Splitting the dataset by historic district dummy variable:
# should end up with one 'in' object that has 9294 obs and 79 vars
inHD <- MN[MN$HD == 1, ]
dim(inHD)
## [1] 9294 79
# and a second 'out' object that has 34024 obs and 79 vars
outHD <- MN[MN$HD == 0, ]
dim(outHD)
## [1] 34024 79
Hypothesis testing:
Q1: What does hypothesis test 1 tell you?
# hypothesis test #1, two-sided difference of means test Ho: there is no
# difference in the means between buildings in historic districts compared
# to those not in historic districts
t.test(x = inHD$AssessTot, y = outHD$AssessTot)
##
## Welch Two Sample t-test
##
## data: inHD$AssessTot and outHD$AssessTot
## t = -15.05, df = 43286, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1724546 -1327117
## sample estimates:
## mean of x mean of y
## 1233743 2759574
The null hypothesis for test #1 is that there is no difference between the average value of buildings in historic districts and the average value of buildings outside of historic districts. We reject this null hypothesis based on the result of the difference of means test: a t-statistic this large in magnitude (about -15) and a p-value this small (< 2.2e-16) would be very unlikely if the two population means were equal, so we are highly confident that the average value of buildings differs in and out of historic districts. Specifically, the 95% confidence interval for the true difference shows that buildings outside of historic districts are worth, on average, 1.3 to 1.7 million dollars more than those within historic districts.
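To make the output of `t.test()` less of a black box, the following sketch computes the Welch statistic, degrees of freedom, and confidence interval by hand and compares them with `t.test()`. It runs on simulated, skewed data (the vectors `x` and `y` and their parameters are made up for illustration, not drawn from the PLUTO file).

```r
# Simulated stand-ins for two groups of assessed values (illustrative only)
set.seed(1)
x <- rlnorm(200, meanlog = 13.0, sdlog = 1.0)
y <- rlnorm(500, meanlog = 13.5, sdlog = 1.2)

# Per-group variance of the sample mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t-statistic: difference in means over its standard error
se       <- sqrt(vx + vy)
t_manual <- (mean(x) - mean(y)) / se

# Welch-Satterthwaite degrees of freedom
df_manual <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# 95% confidence interval for the difference in means
ci_manual <- (mean(x) - mean(y)) + c(-1, 1) * qt(0.975, df_manual) * se

# Built-in test for comparison (Welch is the default, var.equal = FALSE)
fit <- t.test(x, y)
```

The manual values should agree with `fit$statistic`, `fit$parameter`, and `fit$conf.int` up to floating-point error.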
Q2: What does hypothesis test 2 tell you about the size of building inside and outside of historic districts?
# hypothesis test #2, two-sided difference of means test Ho: there is no
# difference in the average size of buildings in historic districts
# compared to those not in historic districts
t.test(x = inHD$BldgArea, y = outHD$BldgArea)
##
## Welch Two Sample t-test
##
## data: inHD$BldgArea and outHD$BldgArea
## t = -9.037, df = 15819, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -25103 -16154
## sample estimates:
## mean of x mean of y
## 22050 42678
The null hypothesis for test #2 is that there is no difference between the average size of buildings in historic districts and the average size of buildings outside of historic districts. We reject this null hypothesis based on the result of the difference of means test: a t-statistic of about -9 and a p-value below 2.2e-16 would be very unlikely if the two population means were equal, so we are highly confident that the average size of buildings differs in and out of historic districts. Specifically, the sample means show that buildings outside of historic districts have an area about twice as large as those in historic districts, and the 95% confidence interval for the true difference puts them at 16,154 to 25,103 square feet larger on average.
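The "about twice as large" statement and the confidence interval can be cross-checked with a little arithmetic on the numbers printed in the test output. This is only a sketch on the rounded values shown above; the variable names are illustrative.

```r
# Sample means reported by t.test() for BldgArea (square feet)
mean_in  <- 22050   # inside historic districts
mean_out <- 42678   # outside historic districts

# Ratio of the means: how many times larger the average outside building is
ratio <- mean_out / mean_in

# Observed difference (in minus out), matching the sign of the reported CI
diff_means <- mean_in - mean_out

# Reported 95% confidence interval; its midpoint should equal the difference
ci <- c(-25103, -16154)
ci_midpoint <- mean(ci)

round(ratio, 2)
diff_means
ci_midpoint
```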
Q3: After controlling for location, is the historic district designation associated with a difference in property values? Are the buildings in the historic district different from their non-historic neighbors?
# get a list of all blocks that contain historic districts
blocks <- inHD$Block
# select all obs from complete dataset that are on a block with at least one
# building designated as a historic district
HDB <- MN[MN$Block %in% blocks, ]
# place all buildings that are NOT designated as historic but which are
# located on blocks that have at least one building designated as historic
# in the object HDB_out
HDB_out <- HDB[HDB$HD == 0, ]
# place all buildings that are designated as historic and which are
# located on blocks that have at least one building designated as historic
# in the object HDB_in
HDB_in <- HDB[HDB$HD == 1, ]
# hypothesis test #3, two-sided difference of means test Ho: there is no
# difference in the mean value of buildings in historic districts compared
# to those not in historic districts (controlling for location)
t.test(x = HDB_in$AssessTot, y = HDB_out$AssessTot)
##
## Welch Two Sample t-test
##
## data: HDB_in$AssessTot and HDB_out$AssessTot
## t = -9.728, df = 4349, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1507426 -1001727
## sample estimates:
## mean of x mean of y
## 1233743 2488319
We have attempted to create an appropriate comparison group (counterfactual) for buildings designated as historic, namely, buildings that do not have the historic designation but which are located very near (e.g., on the same block as) other buildings that do have the historic designation. The null hypothesis for test #3 is that after “controlling” for location in this way there is no difference between the average value of buildings in historic districts and the average value of buildings outside of historic districts. We reject this null hypothesis based on the result of the difference of means test: a t-statistic of almost -10 and a p-value below 2.2e-16 would be very unlikely if the two population means were equal, so we are highly confident that the average value of buildings differs in and out of historic districts even on shared blocks. Specifically, the 95% confidence interval for the true difference shows that non-historic buildings on these blocks are worth, on average, 1.0 to 1.5 million dollars more than their historic neighbors.
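The block-matching logic can be seen on a small made-up data frame (`toy` and its values are hypothetical): keep every building on a block that contains at least one historic building, then split that subset by designation. Buildings on blocks with no historic building drop out of the comparison entirely.

```r
# Hypothetical toy data: blocks 1 and 3 contain a historic building, block 2 does not
toy <- data.frame(
  Block     = c(1, 1, 2, 2, 3),
  HD        = c(1, 0, 0, 0, 1),
  AssessTot = c(100, 150, 900, 800, 120)
)

# Blocks with at least one historic building
hd_blocks <- unique(toy$Block[toy$HD == 1])

# All buildings on those blocks, historic or not
on_hd_blocks <- toy[toy$Block %in% hd_blocks, ]

# Split into the historic group and its same-block comparison group
toy_in  <- on_hd_blocks[on_hd_blocks$HD == 1, ]
toy_out <- on_hd_blocks[on_hd_blocks$HD == 0, ]

on_hd_blocks
toy_out
```

In this toy case the two expensive buildings on block 2 are excluded, which is the point of the design: the comparison group is limited to neighbors of historic buildings.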
Q4: After controlling for location and building size, do historic and non-historic buildings have significantly different values?
# calculate price/sqft for historic buildings with area > 0
HDB_in_area <- HDB_in[HDB_in$BldgArea > 0, ]
HDB_in_sqft <- HDB_in_area$AssessTot/HDB_in_area$BldgArea
# calculate price/sqft for non-historic buildings with area > 0
HDB_out_area <- HDB_out[HDB_out$BldgArea > 0, ]
HDB_out_sqft <- HDB_out_area$AssessTot/HDB_out_area$BldgArea
# hypothesis test #4, two-sided difference of means test Ho: there is no
# difference in the mean value between buildings in historic districts
# compared to those not in historic districts (controlling for location
# and size)
t.test(HDB_in_sqft, HDB_out_sqft)
##
## Welch Two Sample t-test
##
## data: HDB_in_sqft and HDB_out_sqft
## t = -1.664, df = 4521, p-value = 0.09614
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -36.413 2.976
## sample estimates:
## mean of x mean of y
## 66.76 83.48
We have attempted to create an appropriate comparison group (counterfactual) for buildings designated as historic, namely, buildings that do not have the historic designation but which are located very near (e.g., on the same block as) other buildings that do have the historic designation. Additionally, we now normalize value to a price per square foot of area for each building. The null hypothesis for test #4 is that after “controlling” for location and building size in these ways there is no difference between the average value per square foot of buildings in historic districts and that of buildings outside of historic districts. We cannot reject this null hypothesis at the conventional 5% significance level (95% confidence); we are insufficiently confident that the observed difference in values between the two groups of buildings is not due to chance. This is evidenced by the relatively small t-statistic of about -1.7 and the p-value of almost 0.10. Furthermore, the 95% confidence interval for the true difference contains the value 0 (as expected given the p-value), so the data do not allow us to conclude that the average value of historic buildings differs from that of non-historic buildings.
This is the first time in this series of hypothesis tests that we have been unable to reject the null hypothesis and conclude that there is a difference in the values of the two types of buildings. Failing to reject the null hypothesis is not the same as showing that the two means are equal. Still, the fact that the difference is no longer significant once we “control” for location and size suggests that the difference in value between historic and non-historic buildings observed in the previous tests was not actually due to their historic character, but rather to differences in the locations and sizes of buildings across the two groups (confounding factors) that we did not appropriately address when implementing those tests.
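The link between the confidence interval containing 0 and a p-value above 0.05 can be checked by backing the standard error out of the printed interval and recomputing the t-statistic and p-value. This is a sketch on the rounded numbers from the output above, so small rounding differences are expected; the variable names are illustrative.

```r
# Values printed by t.test() for the per-square-foot comparison
ci <- c(-36.413, 2.976)   # 95% confidence interval for the difference
df <- 4521                # Welch degrees of freedom

# The interval is symmetric around the estimated difference in means
diff_est   <- mean(ci)
half_width <- diff(ci) / 2

# half_width = qt(0.975, df) * se, so the standard error can be recovered
se <- half_width / qt(0.975, df)

# Recompute the t-statistic and the two-sided p-value
t_approx <- diff_est / se
p_approx <- 2 * pt(-abs(t_approx), df)

# The interval contains 0 exactly when the two-sided p-value exceeds 0.05
contains_zero <- ci[1] < 0 & ci[2] > 0

round(t_approx, 3)
round(p_approx, 4)
contains_zero
```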
Correlation:
Q5: Is there a significant correlation between a building's north-south position on the island and its total assessed value? Are downtown buildings worth more than uptown buildings?
Scatterplot of Building Value and North-South Position:
plot(MN$YCoord, MN$AssessTot)
abline(lm(MN$AssessTot ~ MN$YCoord), col = "red")
Correlation between Building Value and North-South Position:
# pearson correlation for continuous variables
cor(MN$YCoord, MN$AssessTot)
## [1] -0.05964
# spearman correlation for ordinal, interval, or ratio variables
cor(MN$YCoord, MN$AssessTot, method = "spearman")
## [1] -0.2947
# correlation test Ho: there is no correlation between building value and
# north-south location
cor.test(MN$YCoord, MN$AssessTot)
##
## Pearson's product-moment correlation
##
## data: MN$YCoord and MN$AssessTot
## t = -12.43, df = 43316, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06902 -0.05025
## sample estimates:
## cor
## -0.05964
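The t-statistic reported by `cor.test()` follows directly from the correlation and the sample size, which helps explain how such a weak correlation can still be highly significant with more than 43,000 observations. A sketch using the rounded values printed above (variable names are illustrative):

```r
# Values printed by cor.test()
r  <- -0.05964   # Pearson correlation
df <- 43316      # degrees of freedom, n - 2

# Test statistic for H0: true correlation equals 0
t_from_r <- r * sqrt(df) / sqrt(1 - r^2)

# Share of variance in one variable linearly associated with the other
r_squared <- r^2

round(t_from_r, 2)
round(r_squared, 4)
```

An r-squared of roughly 0.004 is a reminder that statistical significance here reflects the large sample, not a strong relationship.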
The null hypothesis for the correlation test is that there is no correlation between the value of a building and its north-south position (as indicated by its y-coordinate). We reject this null hypothesis based on the result of the correlation test; we are highly confident that there is a correlation between a building's value and its north-south position. Specifically, the 95% confidence interval for the true correlation shows a small correlation of -0.07 to -0.05 between a building's value and its north-south position, indicating that building value decreases slightly as one moves northward along the island. It does appear that downtown buildings are slightly more valuable than uptown buildings (without adjusting the dataset to address potential confounding factors).
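The Spearman correlation above (-0.29) is noticeably stronger than the Pearson correlation (-0.06). One plausible reason, stated here as an assumption rather than something checked against the data, is that assessed values are heavily right-skewed: a few very large values weaken a linear (Pearson) association, while a rank-based (Spearman) measure is unaffected by how extreme they are. The sketch below illustrates the effect on simulated data (`pos` and `value` are made up).

```r
# Simulated data: value falls with position, with heavy right skew (illustrative only)
set.seed(42)
n     <- 5000
pos   <- runif(n)
value <- exp(-2 * pos + rnorm(n, sd = 1.5))

# Pearson measures linear association on the raw scale
pearson <- cor(pos, value)

# Spearman measures association between ranks
spearman <- cor(pos, value, method = "spearman")

# Ranks do not change under a monotone transform such as log(),
# so Spearman is identical on the raw and log scales
spearman_log <- cor(pos, log(value), method = "spearman")

c(pearson = pearson, spearman = spearman, spearman_log = spearman_log)
```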
Figure showing plots of buildings on the island and correlation between building value and north-south position:
layout(matrix(c(1, 2), 1, 2, byrow = TRUE))
plot(MN$YCoord ~ MN$XCoord, main = "Buildings in Manhattan", ylab = "Y Coordinate",
xlab = "X Coordinate")
plot(MN$YCoord, MN$AssessTot, main = "Building Value and N-S Position: -0.06 Corr.",
ylab = "Total Assessed Value", xlab = "Y Coordinate")
abline(lm(MN$AssessTot ~ MN$YCoord), col = "red")