Hypothesis Testing

Historic Preservation in New York City

Cities all over north America contain “historic” neighborhoods. Historic neighborhoods are areas where the buildings have historical importance not because of their individual significance but because as a collection they represent a the architectural sensibilities of a particular time period. In Boulder, for example, the Martin Acres subdivision was recently surveyed for possible inclusion in a historic district. Establishing the “significance” of a neighborhood's historical character is a matter for historians not statisticians. However, there is an important question about historic preservation - does it help or hurt property values?

Once an area is designated a historic district development restrictions come into force. Typically, these restrictions aim to preserve the historic “look and feel” of the buildings in a neighborhood and thus restrict major modifications to the buildings. The owners of property in historic districts often face significant expenses and restrictions. For example, they might have to maintain the original facade on their building as opposed to replacing it with something more economical like vinyl-siding. As a result of these restrictions the historic districts are often opposed by residents.

In Manhattan there are a number of historic districts. In this exercise we will exploit a database describing every building on the island of Manhattan. Your job is to answer the question - does the designation of historic districts affect the value of buildings? This question has important policy significance, the 5th Amendment to the U.S. Constitution, states, “… nor shall private property be taken for public use without just compensation.” If designating an area as a historic district reduces the values of a property owners might be able to sue the government for compensation.

To answer this question you will have to identify all of the buildings that are a part of a historic district and compare the value of those historic properties to properties outside of a historic district. You have to be careful to construct a meaningful comparison. Many factors influence the value of a property, as best as you can you must take those factors into account.

Loading and preparing the data

We will be working with a dbase file. A dbase file is just another way of storing tabular data. There is little substantive difference between a dbase, csv, or an excel file. In spite of their functional similarity they each get handled differently in R. ArcGIS likes dbase files for some reason, so when you are doing GIS stuff later in your academic career you'll probably use dbase files. To read dbase files into R you have to load a library called “foreign” - its purpose is to read dbase files:

install.packages("foreign")


##Load the "foreign" library in order to read dbase files
library(foreign)

MN<-read.dbf("/Users/seth/Dropbox/GEOG 5023/GEOG 5023 - Spring 2013/Data/mnmappluto.dbf")

The data includes information about every building in Manhattan, over 40,000 buildings. Most buildings have X, Y coordinates allowing you to map them simply by using the plot command.

## Map all of the buildings in Manhattan
plot(MN$YCoord ~ MN$XCoord)

plot of chunk unnamed-chunk-2

Something is wrong. The resulting plot doesn't look at all like a map of Manhattan. Some buildings have not been assigned coordinates and have a X, Y position of 0,0 - These must be removed from the data:

##Look at the range of building x any y coordinates
##The line below uses the slice notation to select all rows
##in the columns "YCoord", "XCoord".
##Note that the coordinate range includes zero.

summary(MN[ ,c("YCoord", "XCoord")])

##      YCoord           XCoord       
##  Min.   :     0   Min.   :      0  
##  1st Qu.:207645   1st Qu.: 986617  
##  Median :219274   Median : 991591  
##  Mean   :217078   Mean   : 977235  
##  3rd Qu.:231006   3rd Qu.: 998354  
##  Max.   :259301   Max.   :1009761

Logical expressions check a value and return “TRUE” or “FALSE.” For example, the logical expression “1 > 3”, if typed into R would return “FALSE.” We can use a logical expression to identify all buildings that have coordinates greater than zero:

#Check each row to see if its Y coordinates are greater than zero
MN$YCoord > 0

#Check each row to see if its X coordinates are greater than zero
MN$XCoord > 0

#Combine the above to identify rows that have both X and Y coordinates above zero 
#The line below is a double logical expression, both parts have to be true
##for R to return "TRUE"
MN$YCoord > 0 & MN$XCoord > 0

Now use “MN$YCoord > 0 & MN$XCoord > 0” to create a new object that includes only the rows that meet both criteria, that is only rows that have non-zero X AND Y coordinates. This is accomplished using the slice notation. We've used this before, MN[rows, columns] where “rows” is a logical expression like the one above used to select rows that meet some criterion. Complete the line below to select only the rows that returned “TRUE” when you entered “MN$YCoord > 0 & MN$XCoord > 0”:

MN <- MN[______________________,]

You got the answer right if your “MN” object looks like this:

##dim() returns the number of rows and comlumns in a table
dim(MN)

## [1] 43318    78

Your new map should look like this:

plot(MN$YCoord ~ MN$XCoord)

plot of chunk unnamed-chunk-4

Now that the data is fixed we can do some fun stuff:

##Find buildings owned by someone famous who lives in MN
##Grep() is a text searching function
##in the line below grep() will identify rows where the owner name includes "trump"
##You can replace "trump" with the last name of anyone who owns a building in Manhattan

trumpBldgs <- grep("trump", MN$OwnerName, ignore.case = TRUE)

#print the address, and value of the building (note the output includes the row #)

MN[trumpBldgs,c("Address", "AssessTot", "OwnerName")]

##                  Address AssessTot             OwnerName
## 13619 200 EAST 69 STREET  53710629  TRUMP PALACE COMPANY
## 39891      1030 3 AVENUE  23940000 TRUMP PLAZA OWNERS IN

Identifying Historic Districts

We are interested in working with buildings that are in a historic district. We need to be able to identify these buildings, but the way the data file is coded is awkward:

summary(MN$HistDist)

If a building is in a historic district we know the name of the district. Notice that over 34000 buildings are a group called “NA's”, these appear near the bottom of the list of districts. These “NA” buildings are not in a historic district, when a building is not in a district, the HistDist variable is left blank - NA is how R indicates that a value is blank or missing. We don't really care which district a building is in, only that it is in a historic district. We'll now re-code the “HistDist” column to make a dummy variable which takes the value of “1” if a building is in a historic district and a value of “0” if it is not in a historic district.

To do this we will use some new functions, is.na(), and ifelse().

##is.na() is a logical expression, it returns true if a value is missing (NA)
##check the first 100 rows of the "HistDist" column to see if they are blank
is.na(MN[1:100,"HistDist"])

In the output the “TRUE” values indicate rows where the HistDist column is missing a value- these are places that are not in a historic district. The function ifelse() takes three arguments, a logical expression, a value for true, and a value for false. If the logical expression evaluates to “TRUE” the first value is used, if not the second value is used. The logical phrase “is.na()” tests each row in the table to see if the HistDist column is empty (NA). If the column is NA then the entry in a new column “HD” is 0 otherwise its 1. The rows where MN$HD == 1 represent buildings that are in a historic district.

##take the previous line and replace the "TRUE" with 0
ifelse(is.na(MN[1:100,"HistDist"]), 0, 1)

Modify the line below so that it runs the ifelse() function on all rows, not the first 100 rows.

##Note: this line will return an error when you try to run it.
##Correct the line by making it evaluate all rows in the MN table
MN$HD<- ifelse(is.na(MN[1:100,"HistDist"]), 0, 1)

After you have corrected the column your output should match the line below:

summary(MN$HD)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.215   0.000   1.000

A variable that is coded 0,1 to refer to absence or presence, respectively, is sometimes called a “dummy variable” or an “indicator variable.” Since MN$HD measures if a building is in or out of a historic district, we should tell R that the numbers in the column are a code for historic districts, we do this by creating a “factor,” a factor is any column that contains categorical data.

# convert MN$HD to a factor
MN$HD <- as.factor(MN$HD)
## note how the summary changes after changing the 'HD' column to a factor
summary(MN$HD)

##     0     1 
## 34024  9294

Now we can draw a very crude map of historic districts.

#'col' changes the color of dots depending upon the value in the 'HD' column
#'pch' sets the symbol to a solid dot
#'cex'  makes the dot .5 the normal size

plot(y = MN$YCoord, x = MN$XCoord, col = MN$HD, pch = 16, cex = 0.5)

plot of chunk unnamed-chunk-13

Finally, split the “MN” object into two tables, one for the historic buildings (inHD) and one for the buildings outside a historic district (outHD):

## YOU SHOULD ADD A COMMENT DESCRIBING THIS LINE
inHD <- MN[MN$HD == 1, ]

## YOU SHOULD ADD A COMMENT DESCRIBING THIS LINE
outHD <- MN[MN$HD == 0, ]

Hypothesis Testing

In this exercise our goal is to explore the effect of historic districts on property values in New York City. Our null hypothesis is that the designation of historic districts has no effect on property values, the buildings in a historic district have the same value as those outside of a historic district, and difference between the two groups is due to random chance.

In R we can test this hypothesis with a function called t.test(), remember that when sample sizes are large z and t-tests are equivalent. The function t.test can take several arguments:

x: the data being tested
y: a comparison group (used for a two sample test)
mu: a fixed value (used for a one sample test)
alternative: can be “two.sided”, “less”, or “greater” this corresponds to different kinds of hypothesis tests. In class we have been doing mostly two-sided tests that aim to establish if a value is not equal to mu or y. A one sided test can be used to test the hypothesis that mu or y is greater or less than x.
conf.level: allows you adjust the significance threshold. By default the conf.level =.95. The function is used by typing “t.test(x,y)”, it evaluates the null hypothesis that the x and y groups have the same mean:

#Insert a comment describing the line below
#In the comment state the null hypothesis
t.test(x = inHD$AssessTot, y = outHD$AssessTot)    #Hypothesis Test 1

## 
##  Welch Two Sample t-test
## 
## data:  inHD$AssessTot and outHD$AssessTot 
## t = -15.05, df = 43286, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -1724546 -1327117 
## sample estimates:
## mean of x mean of y 
##   1233743   2759574

The test above, hypothesis test 1, provides you with a p-value and a t-statistic. In this test the t statistic was very large indicating that the difference between historic and non-historic properties was very large, much larger than we would expect due to random chance (if the two types of properties actually had the same value). The 95 percent confidence interval, reported in the output from the t.test is the confidence interval for the difference between the x and y groups. Notice that the confidence interval does not include zero, this provides further support for your conclusion. Finally, the last line shows you the mean of the x and y groups:

###Question 1. What does hypothesis test 1 tell you?

Remember if the p-value is greater than .05 we accept the null hypothesis that the two groups are the same. If the p-value is less than .05 we reject the null hypothesis. The p-value represents the probability of observing x due to chance if the null hypothesis is true, when this value is less than .05 we say the difference between x and y is “statistically significant”. The t-test is just a formula designed to tell you if two quantities are different. It will not tell you the quantities you have chosen to test are an appropriate way to answer your research question.

Hypothesis test 1 is not a good test of the null hypothesis that buildings in and out of historic districts have the same average value. Hypothesis test 1 compared all of the non-historic buildings in Manhattan to those in a historic district. The non-historic buildings included large high-rise luxury buildings located miles away from any historic district. If historic buildings tend to be smaller (because they are old and built before skyscrapers were common) they may not be worth as much as newer buildings simply because they are smaller.

Run the following test, the column “BldgArea” describes the square footage of each building:

# Add a comment describing this test
t.test(x = inHD$BldgArea, y = outHD$BldgArea)  #Hypothesis Test 2

## 
##  Welch Two Sample t-test
## 
## data:  inHD$BldgArea and outHD$BldgArea 
## t = -9.037, df = 15819, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -25103 -16154 
## sample estimates:
## mean of x mean of y 
##     22050     42678

Question 2. What does hypothesis test 2 tell you about the size of the buildings inside and outside of historic districts?

Location is an important component of a property's value. To test the impact of a historic district designation we should revise our test to examine only buildings that have similar locations. One way to do this is to identify buildings that are close to but outside of historic districts. Each building in the database has a block number. Lets revise outHD so that it only includes buildings which are on the same block as a historic district but outside of the district boundaries.

## Select buildings on the same block as a historic district Get a list of
## all blocks that contain historic buildings
blocks <- inHD$Block

# display the first 5 rows of blocks
head(blocks)

## Select all buildings (from MN) that are on the same block as historic
## buildings The line below selects all rows where the block column
## contains values in our list of blocks Save the result as a new object
HDB <- MN[MN$Block %in% blocks, ]

## Add a comment to describe the line.  What does the object HDB_out
## contain?
HDB_out <- HDB[HDB$HD == 0, ]

## Add a comment to describe the line.  What does the object HDB_in
## contain?
HDB_in <- HDB[HDB$HD == 1, ]

Now we have two files that contain buildings on blocks with contain historic districts, one file describes buildings in the district and the other describes those outside the district boundaries.

t.test(x = HDB_in$AssessTot, y = HDB_out$AssessTot)  #Hypothesis Test 3

## 
##  Welch Two Sample t-test
## 
## data:  HDB_in$AssessTot and HDB_out$AssessTot 
## t = -9.728, df = 4349, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -1507426 -1001727 
## sample estimates:
## mean of x mean of y 
##   1233743   2488319

Question 3. After controlling for location is the historic district designation associated with a difference in property values? Are the buildings in the historic district different from their non-historic neighbors? Use the p-value from hypothesis test 3 to support your conclusions.

The size of the building is an important determinant of its value. In hypothesis test 3 we did not control for the size of the building, we can do this by calculating the price per square foot:

#We have a problem.  Some buildings have 0 area (square footage).
summary(HDB_in$BldgArea)

#this could mean the lot is vacant, it could be an error.
#either way it makes it hard to compute the price per square foot.
#We need to exlude these zero area buildings from out t-test

#Calcuate price per square foot for historic buildings
#Only for buildings with an area greater than 0

HDB_in_sqft <- HDB_in[HDB_in$BldgArea > 0, "AssessTot"] / HDB_in[HDB_in$BldgArea > 0, "BldgArea"] 

#Calcuate price per square foot for non-historic buildings

HDB_out_sqft <- HDB_out[HDB_out$BldgArea > 0, "AssessTot"] / HDB_out[HDB_out$BldgArea > 0, "BldgArea"]

Now, use the objects “HDB_in_sqft” and “HDB_out_sqft” to construct a t-test using the t.test() function. If your output looks like the line below you have correctly constructed the t-test:

## 
##  Welch Two Sample t-test
## 
## data:  HDB_in_sqft and HDB_out_sqft 
## t = -1.664, df = 4521, p-value = 0.09614
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -36.413   2.976 
## sample estimates:
## mean of x mean of y 
##     66.76     83.48

Question 4. After controlling for location and building size, do historic and non-historic buildings have significantly different values? If your conclusion has changed between the 1st and 4th hypothesis tests explain why.

Correlation

Correlations are conceptually similar to the mean and standard deviation in that when measuring the correlation between two variables we have to account for the fact that our our measurements contain uncertainty. We have discussed the distinction between the sample mean, $ \bar{x} $, and the true population mean $ \mu $. Similarly, we use the letter $ r $ to denote the sample correlation, that is the correlation among the variables in our measurements, and the population correlation $ \rho $.

Calculating and testing correlations in R is really easy. There are two primary functions cor() and cor.test(), the former can produce a correlation matrix that tells you the correlation between all pairs of variables. Cor.test() does hypothesis test on observed correlations, it tests the null hypothesis that the observed correlation is zero. You'll find that with large complex data sets very small correlations are commonplace, it can be very useful to use cor.test() in order to see if the observed correlation is significantly different from zero.

## Using the data x and y generated above
##Given the plot is the result surprising?
cor(x,y)

## Error: object 'y' not found

To test the null hypothesis that $ \rho = 0 $ we use a t-test, $ t=r\sqrt{\frac{n-2}{1-r^2}} $. Conceptually, when either the observed correlation, $ r $, or the number of observations $ n $ is large, the t-statistic increases (making it more likely that we will reject the null hypothesis). The function, cor.test() computes this t-statistic and returns a confidence interval for $ \rho $ given the observed data.

There are several different methods for computing correlation. In the Pearson correlation coefficient is used to summarize the relationship between two continuous variables. For example, the relationship between temperature and elevation. Kendall's $ \tau $ (“tau”) and Spearman's rank correlation coefficient compute the association between two variables by sorting each variables and assigning each observation a rank. To specify the type of correlation:

cor(x, y, method = "pearson")  #Pearson is the default.
cor(x, y, method = "kendall")
cor(x, y, method = "spearman")

In this example you can use the Person correlation and therefore do not have to specify a method (Pearson is the default). Once you've prepared your data the function cor() can be used as follows:

###THIS IS JUST AN EXAMPLE IT WILL NOT WORK!
##TO USE THIS EXAMPLE:
##replace aTable with an object and aColumn with a column

cor(aTable$aColumn1, aTable$aColumn3)

#or

cor.test(aTable$aColumn1, aTable$aColumn3)

Examples of various Pearson correlation coefficients and the use of layout() to make complex figures. plot of chunk unnamed-chunk-24

Question 5: Factors other than historic designation affect the price of real estate in MN. Is there a significant correlation between a buildings north-south position on the island (YCoord) and its total assessed value (AssessTot)? Are downtown buildings worth more than uptown buildings? Use `cor.test()` to answer the question.

Question 6 (optional, programmers only): Use the `layout()` function to make a chart showing buildings on the island of Manhattan in one panel and a scatter plot of the correlation between YCoord and AssessTot in the other panel. If you need help type `?layout`.

Hypothesis Testing

Historic Preservation in New York City

Loading and preparing the data

Identifying Historic Districts

Hypothesis Testing

Question 2. What does hypothesis test 2 tell you about the size of the buildings inside and outside of historic districts?

Question 3. After controlling for location is the historic district designation associated with a difference in property values? Are the buildings in the historic district different from their non-historic neighbors? Use the p-value from hypothesis test 3 to support your conclusions.

Question 4. After controlling for location and building size, do historic and non-historic buildings have significantly different values? If your conclusion has changed between the 1st and 4th hypothesis tests explain why.

Correlation

Question 6 (optional, programmers only): Use the `layout()` function to make a chart showing buildings on the island of Manhattan in one panel and a scatter plot of the correlation between YCoord and AssessTot in the other panel. If you need help type `?layout`.

Deliverables

1. Upload your answers to questions 1-6 and all supporting code to RPubs

2. Update your R journal to include descriptions of every function used in the exercise.

Hypothesis Testing

Historic Preservation in New York City

Loading and preparing the data

Identifying Historic Districts

Hypothesis Testing

Question 2. What does hypothesis test 2 tell you about the size of the buildings inside and outside of historic districts?

Question 3. After controlling for location is the historic district designation associated with a difference in property values? Are the buildings in the historic district different from their non-historic neighbors? Use the p-value from hypothesis test 3 to support your conclusions.

Question 4. After controlling for location and building size, do historic and non-historic buildings have significantly different values? If your conclusion has changed between the 1st and 4th hypothesis tests explain why.

Correlation

Question 6 (optional, programmers only): Use the layout() function to make a chart showing buildings on the island of Manhattan in one panel and a scatter plot of the correlation between YCoord and AssessTot in the other panel. If you need help type ?layout.

Deliverables

1. Upload your answers to questions 1-6 and all supporting code to RPubs

2. Update your R journal to include descriptions of every function used in the exercise.

Question 6 (optional, programmers only): Use the `layout()` function to make a chart showing buildings on the island of Manhattan in one panel and a scatter plot of the correlation between YCoord and AssessTot in the other panel. If you need help type `?layout`.