Homework 2 Hypothesis Testing – Li Xu

GEOG 5023: Quantitative Methods In Geography

Historic Preservation in New York City

Cities all over North America contain “historic” neighborhoods. Historic neighborhoods are areas where the buildings have historical importance not because of their individual significance but because, as a collection, they represent the architectural sensibilities of a particular time period. In Boulder, for example, the Martin Acres subdivision was recently surveyed for possible inclusion in a historic district. Establishing the “significance” of a neighborhood's historical character is a matter for historians, not statisticians. However, there is an important question about historic preservation: does it help or hurt property values?

Once an area is designated a historic district, development restrictions come into force. Typically, these restrictions aim to preserve the historic “look and feel” of the buildings in a neighborhood and thus restrict major modifications to the buildings. The owners of property in historic districts often face significant expenses and restrictions. For example, they might have to maintain the original facade on their building as opposed to replacing it with something more economical like vinyl siding. As a result of these restrictions, historic districts are often opposed by residents.

In Manhattan there are a number of historic districts. In this exercise we will exploit a database describing every building on the island of Manhattan. Your job is to answer the question: does the designation of historic districts affect the value of buildings? This question has important policy significance. The 5th Amendment to the U.S. Constitution states, “... nor shall private property be taken for public use without just compensation.” If designating an area as a historic district reduces the value of a property, owners might be able to sue the government for compensation.

To answer this question you will have to identify all of the buildings that are a part of a historic district and compare the value of those historic properties to properties outside of a historic district. You have to be careful to construct a meaningful comparison. Many factors influence the value of a property, and you must take those factors into account as best you can.

Loading and preparing the data

# Run install.packages('foreign') once before first use

# Load the 'foreign' library in order to read dbase files
library(foreign)

# Import the .dbf data
MN <- read.dbf("C:/Users/Li Xu/Documents/aaa/CU Boulder/GEOG5023/Homework 2 Hypothesis Testing/mnmappluto.dbf")
# or pick the file interactively: MN <- read.dbf(file.choose())
# (choose.files() also works, but only on Windows)

# Remove data with incorrect location information
MN <- MN[MN$YCoord > 0 & MN$XCoord > 0, ]

# Draw data by locations
plot(MN$YCoord ~ MN$XCoord)


Identifying Historic Districts

# Create a dummy variable HD(1=in a historic district, 0=not in a historic
# district)
MN$HD <- ifelse(is.na(MN[, "HistDist"]), 0, 1)

# convert MN$HD to a factor
MN$HD <- as.factor(MN$HD)

# note how the summary changes after changing the 'HD' column to a factor
summary(MN$HD)

##     0     1 
## 34024  9294
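
As a quick check of the recoding logic, the sketch below applies the same ifelse(is.na(...)) step to a small made-up vector. The district names in 'hist_dist' are placeholders for illustration, not values from the PLUTO file. It also prints the integer codes behind the factor, which is what 'col' picks up when a factor is passed to plot().

```r
# Toy stand-in for the HistDist column: NA means "not in a historic district".
# The district names are invented for illustration only.
hist_dist <- factor(c(NA, "District A", NA, "District B", "District A"))

# Same recoding as above: 1 = in a historic district, 0 = not
hd <- as.factor(ifelse(is.na(hist_dist), 0, 1))

# Counts per level, as summary() reports for a factor
hd_counts <- summary(hd)
print(hd_counts)

# Integer codes behind the factor: level "0" -> 1, level "1" -> 2.
# plot(col = hd) uses these codes as indices into the colour palette.
hd_codes <- as.integer(hd)
print(hd_codes)
```

Because the codes start at 1, buildings outside a district get the first palette colour and buildings inside a district get the second.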

# Draw a map of historic districts (red = in an HD, black = not in an HD).
# 'col' changes the color of dots depending upon the value in the 'HD' column
# 'pch' sets the symbol to a solid dot
# 'cex' makes the dot half the normal size
plot(y = MN$YCoord, x = MN$XCoord, col = MN$HD, pch = 16, cex = 0.5)


# Split the 'MN' object based on whether a building is in or out of a
# historic district. inHD stores all buildings in a historic district
# (MN$HD = 1)
inHD <- MN[MN$HD == 1, ]

# outHD stores all buildings outside of a historic district (MN$HD = 0)
outHD <- MN[MN$HD == 0, ]

Hypothesis Testing

# Run a t-test with the null hypothesis that the mean value of buildings in
# historic districts equals the mean value of buildings outside of historic
# districts
t.test(x = inHD$AssessTot, y = outHD$AssessTot)  #Hypothesis Test 1

## 
##  Welch Two Sample t-test
## 
## data:  inHD$AssessTot and outHD$AssessTot 
## t = -15.05, df = 43286, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -1724546 -1327117 
## sample estimates:
## mean of x mean of y 
##   1233743   2759574
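
The output above is labelled "Welch Two Sample t-test", which is what t.test() runs by default when the two groups are not assumed to share a variance. As a minimal sketch of what that test computes, the code below rebuilds the t statistic and the degrees of freedom by hand and compares them with t.test(). The data are simulated (log-normal draws chosen only to look skewed, like property values), and the names 'in_hd' and 'out_hd' are made up for this example.

```r
# Simulated, right-skewed "assessed values" for two groups of unequal size.
# These numbers are made up; they only mimic the shape of the problem.
set.seed(5023)
in_hd  <- rlnorm(300, meanlog = 13.0, sdlog = 1.0)
out_hd <- rlnorm(900, meanlog = 13.4, sdlog = 1.3)

# Squared standard error of each group mean
se2_in  <- var(in_hd)  / length(in_hd)
se2_out <- var(out_hd) / length(out_hd)

# Welch t statistic: difference in means over its standard error
t_manual <- (mean(in_hd) - mean(out_hd)) / sqrt(se2_in + se2_out)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (se2_in + se2_out)^2 /
  (se2_in^2 / (length(in_hd) - 1) + se2_out^2 / (length(out_hd) - 1))

# Built-in test for comparison (var.equal = FALSE is the default)
res <- t.test(x = in_hd, y = out_hd)
print(c(manual_t = t_manual, builtin_t = unname(res$statistic)))
print(c(manual_df = df_manual, builtin_df = unname(res$parameter)))
```

The non-integer degrees of freedom that t.test() reports come from this approximation, which is why 'df' is not simply the number of buildings minus two.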

Question 1. What does hypothesis test 1 tell you?

The very low p-value tells us to reject the null hypothesis: the mean assessed values of the two groups differ significantly. The p-value is not the probability that rejecting is a mistake. It means that, if the two group means were truly equal, a difference at least this large would occur with probability below 2.2e-16. The 95 percent confidence interval for the difference in means (-1724546 to -1327117) lies entirely below zero, so on average buildings in historic districts are assessed at roughly 1.3 to 1.7 million less than buildings outside of them. This is a statement about the group means, not about every individual building. However, other factors are not yet included in this analysis. For example, if historic buildings tend to be smaller (because they are old and were built before skyscrapers were common), they may be worth less than newer buildings simply because they are smaller. Thus we run Hypothesis Test 2 here.

# Run a t-test with the null hypothesis that the mean building size in
# historic districts equals the mean building size outside of historic
# districts
t.test(x = inHD$BldgArea, y = outHD$BldgArea)  #Hypothesis Test 2

## 
##  Welch Two Sample t-test
## 
## data:  inHD$BldgArea and outHD$BldgArea 
## t = -9.037, df = 15819, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -25103 -16154 
## sample estimates:
## mean of x mean of y 
##     22050     42678

Question 2. What does hypothesis test 2 tell you about the size of the buildings inside and outside of historic districts?

Again, the low p-value leads us to reject the null hypothesis: mean building size differs significantly between the two groups. If the mean sizes inside and outside of HDs were truly equal, a difference at least this large would arise with probability below 2.2e-16. The 95 percent confidence interval for the difference in means (-25102.69 to -16153.85) lies entirely below zero, so buildings in HDs are on average about 16,000 to 25,000 square feet smaller than those outside of HDs. Size is therefore a plausible explanation for at least part of the value gap found in Test 1.
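
To see where the confidence interval comes from, the sketch below rebuilds it from the numbers printed for Hypothesis Test 2: the two sample means, the t statistic, and the degrees of freedom. Because those printed values are rounded, the result should only be expected to match the reported interval to within a few units. The variable names are chosen for this example.

```r
# Values copied from the printed output of Hypothesis Test 2
mean_in  <- 22050    # mean BldgArea inside historic districts
mean_out <- 42678    # mean BldgArea outside historic districts
t_stat   <- -9.037
df_welch <- 15819

# Difference in means, and the standard error implied by t = diff / se
diff_means <- mean_in - mean_out
se_diff    <- diff_means / t_stat

# 95% interval: difference +/- critical t value * standard error
t_crit <- qt(0.975, df_welch)
ci <- diff_means + c(-1, 1) * t_crit * se_diff
print(round(ci))
```

The interval excludes zero for the same reason the p-value is small: the difference in means is about nine standard errors away from zero.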

Location is an important component of a property's value. To test the impact of a historic district designation we should revise our test to examine only buildings that have similar locations. One way to do this is to identify buildings that are close to but outside of historic districts. Each building in the database has a block number. Let's revise outHD so that it only includes buildings which are on the same block as a historic district but outside of the district boundaries.

# Get a list of all blocks that contain historic buildings
blocks <- inHD$Block

# Select all buildings (from MN) that are on the same block as historic
# buildings. The line below selects all rows where the block column
# contains values in our list of blocks. Save the result as a new object.
HDB <- MN[MN$Block %in% blocks, ]

# Create the object HDB_out to include buildings outside of HDs but in the
# same block with any buildings in HDs.
HDB_out <- HDB[HDB$HD == 0, ]

# Create the object HDB_in to include buildings in HDs and in historic
# district blocks.
HDB_in <- HDB[HDB$HD == 1, ]

# Run a t-test after controlling for location factor.
t.test(x = HDB_in$AssessTot, y = HDB_out$AssessTot)  #Hypothesis Test 3

## 
##  Welch Two Sample t-test
## 
## data:  HDB_in$AssessTot and HDB_out$AssessTot 
## t = -9.728, df = 4349, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -1507426 -1001727 
## sample estimates:
## mean of x mean of y 
##   1233743   2488319

Question 3. After controlling for location is the historic district designation associated with a difference in property values? Are the buildings in the historic district different from their non-historic neighbors? Use the p-value from hypothesis test 3 to support your conclusions.

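The p-value of this t-test is still very low (< 2.2e-16), so even after restricting the comparison to buildings on the same blocks, historic district designation is associated with a significant difference in property values. The buildings in the historic districts are different from their non-historic neighbors: their mean assessed value (1233743) is about half that of their neighbors (2488319). The test shows an association, not that designation causes the difference, and building size has not yet been controlled for.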
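
The block-matching step used for Test 3 can be checked on a toy data frame. This is a minimal sketch with invented block numbers; 'toy', 'hd_blocks' and the other names are hypothetical and not part of the original analysis.

```r
# Toy data: six buildings on three blocks; HD = 1 marks historic buildings.
# Block numbers are invented for illustration.
toy <- data.frame(
  Block = c(10, 10, 10, 20, 20, 30),
  HD    = c(1,  0,  0,  1,  1,  0)
)

# Blocks that contain at least one historic building
hd_blocks <- unique(toy$Block[toy$HD == 1])

# Keep only buildings on those blocks, then split by designation
on_hd_block <- toy[toy$Block %in% hd_blocks, ]
toy_in  <- on_hd_block[on_hd_block$HD == 1, ]
toy_out <- on_hd_block[on_hd_block$HD == 0, ]

# Non-historic buildings that share a block with historic ones
print(toy_out)
```

Block 30 drops out because it has no historic buildings. Block 20 stays in the historic group but contributes nothing to the comparison group, since every building on it is historic.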

The size of the building is an important determinant of its value. In hypothesis test 3 we did not control for the size of the building. We can do this by calculating the price per square foot:

# We have a problem.  Some buildings have 0 area (square footage).
summary(HDB_in$BldgArea)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        0     4160     6370    22100    13000 17600000


# This could mean the lot is vacant, or it could be an error. Either way it
# makes it hard to compute the price per square foot. We need to exclude
# these zero-area buildings from our t-test.

# Calculate price per square foot for historic buildings, only for buildings
# with an area greater than 0
HDB_in_sqft <- HDB_in[HDB_in$BldgArea > 0, "AssessTot"]/HDB_in[HDB_in$BldgArea > 
    0, "BldgArea"]

# Calculate price per square foot for non-historic buildings
HDB_out_sqft <- HDB_out[HDB_out$BldgArea > 0, "AssessTot"]/HDB_out[HDB_out$BldgArea > 
    0, "BldgArea"]

# Perform the t-test
t.test(x = HDB_in_sqft, y = HDB_out_sqft)  #Hypothesis Test 4

## 
##  Welch Two Sample t-test
## 
## data:  HDB_in_sqft and HDB_out_sqft 
## t = -1.664, df = 4521, p-value = 0.09614
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -36.413   2.976 
## sample estimates:
## mean of x mean of y 
##     66.76     83.48

Question 4. After controlling for location and building size, do historic and non-historic buildings have significantly different values? If your conclusion has changed between the 1st and 4th hypothesis tests explain why.

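The p-value of Test 4 (0.09614) is greater than 0.05, so we fail to reject the null hypothesis: after controlling for both location and building size, there is no significant difference in value between historic and non-historic buildings at the 5% level. Failing to reject is not the same as showing that the two groups are equal. The mean value per square foot in historic districts (66.76) is still lower than outside (83.48), but the 95 percent confidence interval for the difference (-36.413 to 2.976) includes 0, so a difference of this size is consistent with random chance. The conclusion changed between Test 1 and Test 4 because Test 1 compared total assessed values across all of Manhattan without accounting for location or building size. Test 2 showed that buildings in historic districts are much smaller on average, and building size is an important determinant of value, so much of the gap in Test 1 appears to reflect size rather than designation.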
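
The per-square-foot calculation for Test 4 repeats the same filter for the numerator and the denominator. One way to keep the two in step is to compute the filter once inside a small function. The sketch below is a hypothetical helper, not part of the original analysis; 'value_per_sqft' and the toy numbers are invented, and the function assumes a data frame with 'AssessTot' and 'BldgArea' columns.

```r
# Hypothetical helper (not part of the original analysis): assessed value
# per square foot, dropping rows whose building area is zero or missing.
value_per_sqft <- function(buildings) {
  keep <- !is.na(buildings$BldgArea) & buildings$BldgArea > 0
  buildings$AssessTot[keep] / buildings$BldgArea[keep]
}

# Toy data with one zero-area record; values are invented
toy <- data.frame(
  AssessTot = c(500000, 120000, 90000, 300000),
  BldgArea  = c(5000,   0,      1500,  2000)
)

toy_sqft <- value_per_sqft(toy)
print(toy_sqft)
```

With the real data, the two inputs to Test 4 would then be value_per_sqft(HDB_in) and value_per_sqft(HDB_out).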

Correlation

Question 5: Factors other than historic designation affect the price of real estate in MN. Is there a significant correlation between a building's north-south position on the island (YCoord) and its total assessed value (AssessTot)? Are downtown buildings worth more than uptown buildings? Use cor.test() to answer the question.

First of all, we need to check that the variable 'YCoord' is numeric so that it can be used in cor.test(), and how it relates to location.

summary(MN$YCoord)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  189000  208000  220000  220000  231000  259000

'YCoord' is a numeric variable, so it can be used in cor.test(). The values are far too large to be latitudes, so 'YCoord' must be a northing in a projected (Cartesian) coordinate system. The PLUTO data dictionary describes these coordinates as New York-Long Island State Plane. Either way, a greater value corresponds to a location farther north.

By plotting YCoord against building values, we cannot see an apparent correlation between the two variables. However, there is a peak in building values around YCoord values of 210000 to 220000, which is in the mid-south of the whole area.

plot(MN$YCoord, MN$AssessTot)


Now perform a correlation test for a closer look.

cor.test(MN$YCoord, MN$AssessTot)

## 
##  Pearson's product-moment correlation
## 
## data:  MN$YCoord and MN$AssessTot 
## t = -12.43, df = 43316, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0 
## 95 percent confidence interval:
##  -0.06902 -0.05025 
## sample estimates:
##      cor 
## -0.05964

The results (p < 2.2e-16) suggest that the correlation between north-south position and building value is significantly different from 0. The correlation itself is very weak (-0.05964). With more than 43,000 observations, even a correlation this close to zero is statistically significant, so significance here says little about practical importance. Since downtown corresponds to smaller YCoord values and uptown to larger ones, the negative sign indicates that assessed values tend to rise slightly from uptown to downtown.
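
A short calculation shows how such a weak correlation can still produce a large t statistic. The sketch below uses only the correlation and degrees of freedom printed by cor.test() above, plus the standard relation between a Pearson correlation and its t statistic. The variable names are chosen for this example.

```r
# Values copied from the cor.test() output above
r        <- -0.05964
df_resid <- 43316          # n - 2

# t statistic implied by a Pearson correlation: t = r * sqrt(df / (1 - r^2))
t_from_r <- r * sqrt(df_resid / (1 - r^2))
print(t_from_r)

# Share of the variance in AssessTot linearly associated with YCoord
r_squared <- r^2
print(round(100 * r_squared, 2))   # as a percentage
```

The t statistic agrees with the reported value up to rounding, and it is large only because the square root of the degrees of freedom is large. The squared correlation is well under one percent, so north-south position explains almost none of the variation in total assessed value.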

Since we already know from Test 4 that building size is an important component in determining building value, it might be helpful to remove the effect of building size before drawing a conclusion.

# create an object 'MN_sqft' to store price per square foot for all
# buildings with size>0
MN_sqft <- MN[MN$BldgArea > 0, "AssessTot"]/MN[MN$BldgArea > 0, "BldgArea"]

# create an object 'MN_Y' to store YCoords for all buildings with size>0
MN_Y <- MN[MN$BldgArea > 0, "YCoord"]

# run the correlation test after controlling for building size
cor.test(MN_Y, MN_sqft)

## 
##  Pearson's product-moment correlation
## 
## data:  MN_Y and MN_sqft 
## t = 0.2044, df = 40826, p-value = 0.8381
## alternative hypothesis: true correlation is not equal to 0 
## 95 percent confidence interval:
##  -0.008689  0.010711 
## sample estimates:
##      cor 
## 0.001011

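The p-value (0.8381) is far above 0.05, so we cannot reject the null hypothesis that the true correlation is 0. Once building size is accounted for, there is no evidence of a linear relationship between north-south position and value per square foot, and therefore no evidence that downtown buildings are worth more per square foot than uptown buildings.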

Question 6 (optional, programmers only): Use the layout() function to make a chart showing buildings on the island of Manhattan in one panel and a scatter plot of the correlation between YCoord and AssessTot in the other panel. If you need help type ?layout.

# One row, two columns: map in panel 1, scatter plot in panel 2.
# (A 2 x 2 matrix with zeros in the second row would leave half the device empty.)
layout(matrix(c(1, 2), nrow = 1, ncol = 2))
plot(y = MN$YCoord, x = MN$XCoord, col = MN$HD, pch = 16, cex = 0.5, main = "Map of Manhattan Buildings", 
    xlab = "X Coordinate", ylab = "Y Coordinate")
plot(y = MN$AssessTot, x = MN$YCoord, main = "Building Value vs Y Coordinate\n(North-South Position)", 
    xlab = "Y Coordinate", ylab = "Building Value")


Created by: Li Xu; Created on: 02/05/2013; Updated on: 02/06/2013