To answer this question you will have to identify all of the buildings that are part of a historic district and compare the value of those historic properties to the value of properties outside a historic district. Be careful to construct a meaningful comparison: many factors influence the value of a property, and you must take those factors into account as best you can.
First, load the file. The file is in .dbf (dBase) format, which is similar to .csv but preferred by GIS programs. To load this kind of file, install a library called "foreign" that can read dBase files:
## Install the library 'foreign' that can read dbase files
install.packages("foreign")
## Error: trying to use CRAN without setting a mirror
## (This error appears when no CRAN mirror has been set; specifying one fixes
## it, e.g. install.packages("foreign", repos = "https://cloud.r-project.org"))
## Load the 'foreign' library
library(foreign)
MN <- read.dbf("/Users/caitlin/Dropbox/CAITLINS DOCUMENTS/CU Boulder/Courses/GEOG 5023 Quant methods - Spielman/Homework problem sets/Homework 2/mnmappluto.dbf")
## Explore the data a bit
names(MN) # Gives you column headers
## [1] "Borough" "Block" "Lot" "CD" "CT2000"
## [6] "CB2000" "SchoolDist" "InstRegion" "Council" "ZipCode"
## [11] "FireComp" "HealthArea" "HealthCtr" "PolicePrct" "Address"
## [16] "ZoneDist1" "ZoneDist2" "Overlay1" "Overlay2" "SPDist1"
## [21] "SPDist2" "AllZoning1" "AllZoning2" "SplitZone" "BldgClass"
## [26] "LandUse" "Easements" "OwnerType" "OwnerName" "LotArea"
## [31] "BldgArea" "ComArea" "ResArea" "OfficeArea" "RetailArea"
## [36] "GarageArea" "StrgeArea" "FactryArea" "OtherArea" "AreaSource"
## [41] "NumBldgs" "NumFloors" "UnitsRes" "UnitsTotal" "LotFront"
## [46] "LotDepth" "BldgFront" "BldgDepth" "ProxCode" "IrrLotCode"
## [51] "LotType" "BsmtCode" "AssessLand" "AssessTot" "ExemptLand"
## [56] "ExemptTot" "YearBuilt" "BuiltCode" "YearAlter1" "YearAlter2"
## [61] "HistDist" "Landmark" "BuiltFAR" "MaxAllwFAR" "BoroCode"
## [66] "BBL" "CondoNo" "Tract2000" "XCoord" "YCoord"
## [71] "ZoneMap" "Sanborn" "TaxMap" "EDesigNum" "PLUTOMapID"
## [76] "Version" "shape_area" "shape_len"
plot(MN$YCoord ~ MN$XCoord) # Plots the X and Y coordinates to see what kind of data we are dealing with.
Something is wrong, as you can see by the outlying dots in the bottom left of the plot. These seem to have 0s for their coordinates.
We can also see that there are problems in the data by looking at a summary of just the X and Y coordinates - note that the Min for both columns is 0:
summary(MN[, c("YCoord", "XCoord")])
## YCoord XCoord
## Min. : 0 Min. : 0
## 1st Qu.:207645 1st Qu.: 986617
## Median :219274 Median : 991591
## Mean :217078 Mean : 977235
## 3rd Qu.:231006 3rd Qu.: 998354
## Max. :259301 Max. :1009761
We need to get rid of these points by using a logical expression that keeps only the rows where both the X and Y coordinates are greater than 0, such as:
MN$YCoord > 0 & MN$XCoord > 0
(If you run this, you'll see you get a long list of TRUEs and FALSEs)
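(Optionally, to preview just the first few values instead of printing the whole vector, you could wrap the expression in head():)
head(MN$YCoord > 0 & MN$XCoord > 0)  # shows only the first six TRUE/FALSE values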
Use bracket subsetting (slicing) with the logical expression above to create a new object that excludes the rows that have 0s for both X and Y. To do this, use the form MN[rows, columns] and insert the logical expression for the rows: that is, select only the rows that returned "TRUE" when you entered "MN$YCoord > 0 & MN$XCoord > 0".
MN <- MN[MN$YCoord > 0 & MN$XCoord > 0, ] # Overwrites the original MN object with a version that keeps only the rows where both the X and Y coordinates are greater than 0, i.e. the rows with 0 coordinates are dropped.
dim(MN) #returns the # of rows and columns in a table
## [1] 43318 78
Plot the new object, MN, which should no longer include the rows with 0s for both X and Y:
plot(MN$YCoord ~ MN$XCoord)
Now we are ready to work with the data. When we ran names(MN) we saw that one of the column names is "OwnerName". We can search for all properties owned by Trump using the grep() function, which does a text search.
trumpBldgs <- grep("trump", MN$OwnerName, ignore.case = TRUE)
MN[trumpBldgs, c("Address", "AssessTot", "OwnerName")] #print the address and value of the buildings (note the output includes the row #)
## Address AssessTot OwnerName
## 13619 200 EAST 69 STREET 53710629 TRUMP PALACE COMPANY
## 39891 1030 3 AVENUE 23940000 TRUMP PLAZA OWNERS IN
Let's take a look at the way the dataset records Historical District:
summary(MN$HistDist)
Running this command displays the names of all the historic districts. It also shows that 34,024 buildings have no historic district assigned (NA values); that is, these buildings are not in historic districts.
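If you want to confirm that count directly, one optional way is to count the NA values:
sum(is.na(MN$HistDist))  # counts buildings with no historic district; should match the 34,024 noted above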
Since we only care whether a building is in a historic district, we'll re-code the "HistDist" column by creating a new column in MN called HD containing a dummy variable, which takes the value 1 if a building is in a historic district and 0 if it is not. Do this using the ifelse() function:
ifelse(is.na(MN[1:100, "HistDist"]), 0, 1) # coding 1s and 0s for the first 100 rows and returning the result
## [1] 1 0 1 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [36] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [71] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0
MN$HD <- ifelse(is.na(MN[, "HistDist"]), 0, 1) # since the above looked good, creating the dummy variable in a new column for ALL rows (this is done by leaving a blank before the comma)
summary(MN$HD)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.215 0.000 1.000
Let R know that the data in MN$HD is categorical, not numerical:
MN$HD <- as.factor(MN$HD)
summary(MN$HD)
## 0 1
## 34024 9294
Plot the historic districts on a map, color coding the 1s and 0s:
#'col' changes the color of dots depending upon the value in the 'HD' column
#'pch' sets the symbol to a solid dot
#'cex' makes the dot .5 the normal size
plot(y = MN$YCoord, x = MN$XCoord, col = MN$HD, pch = 16, cex = 0.5)
Split the MN data set into two tables, one with all the historic sites and the other with non-historic sites:
inHD <- MN[MN$HD == 1, ] ## Creates a new object that contains only the rows of MN where HD equals 1, i.e. a new data set that contains only the historic buildings. The blank after the comma indicates that we want ALL columns.
outHD <- MN[MN$HD == 0, ]
TEST 1 Null hypothesis: the designation of historic districts has no effect on property values, i.e. the buildings in a historic district have the same value as those outside of a historic district, and any difference between the two groups is due to random chance.
t.test(x = inHD$AssessTot, y = outHD$AssessTot)
##
## Welch Two Sample t-test
##
## data: inHD$AssessTot and outHD$AssessTot
## t = -15.05, df = 43286, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1724546 -1327117
## sample estimates:
## mean of x mean of y
## 1233743 2759574
# Testing the two objects we created according to the column w/ total
# property value
The test above, hypothesis test 1, provides you with a p-value and a t-statistic. In this test the t-statistic was very large in magnitude, indicating that the difference between historic and non-historic properties was much larger than we would expect from random chance alone (if the two types of properties actually had the same value). The 95 percent confidence interval reported in the t.test output is the confidence interval for the difference between the x and y group means. Notice that the confidence interval does not include zero; this provides further support for your conclusion. Finally, the last line shows you the mean of the x and y groups.
QUESTION 1: What does this hypothesis test tell you?
Remember, if the p-value is greater than .05 then we fail to reject the null hypothesis that the two groups are the same. If the p-value is less than .05 we reject the null hypothesis. The p-value is the probability of observing a difference at least as large as the one we saw if the null hypothesis were true; when this value is less than .05 we say the difference between x and y is "statistically significant". The t-test is just a formula designed to tell you whether two quantities are different. It will not tell you if the quantities you have chosen to test are an appropriate way to answer your research question.
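As a quick illustration of this logic (simulated, made-up data, not part of the assignment): when two groups are drawn from the same distribution, the t-test p-value will usually be well above .05, so we would not call the difference significant.
set.seed(1)  # make the simulated example reproducible
groupA <- rnorm(100, mean = 50, sd = 10)  # 100 draws from one distribution
groupB <- rnorm(100, mean = 50, sd = 10)  # another 100 draws from the SAME distribution
t.test(x = groupA, y = groupB)$p.value  # usually well above .05 when there is no true difference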
But, hypothesis test 1 is not a good test of the null hypothesis that buildings in and out of historic districts have the same average value. Hypothesis test 1 compared all of the non-historic buildings in Manhattan to those in a historic district. The non-historic buildings include large high-rise luxury buildings located miles away from any historic district. If historic buildings tend to be smaller (because they are old and built before skyscrapers were common) they may not be worth as much as newer buildings simply because they are smaller.
So, let's run a test on buildings of comparable size using the column “BldgArea”, which describes the square footage of each building:
TEST 2 Null hypothesis: there is no difference between the size of properties inside and outside historic districts.
t.test(x = inHD$BldgArea, y = outHD$BldgArea) #Hypothesis Test 2
##
## Welch Two Sample t-test
##
## data: inHD$BldgArea and outHD$BldgArea
## t = -9.037, df = 15819, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -25103 -16154
## sample estimates:
## mean of x mean of y
## 22050 42678
Interpretation of results: There IS a statistically significant difference between the total area (sq. ft) of properties inside historic districts and the total area of buildings outside historic districts. Since the mean of x, properties inside historic districts, is 22,050 sq. ft and the mean of y is 42,678 sq. ft, we know that properties in historic districts tend to be smaller.
QUESTION 2: Is there a significant difference in property value between buildings inside a historic district and those just outside, but on the same block?
Location is an important component of a property's value. To test the impact of a historic district designation we should revise our test to examine only buildings that have similar locations. One way to do this is to identify buildings that are close to, but outside of, historic districts. Each building in the database has a block number. Let's revise outHD so that it only includes buildings that are on the same block as a historic district but outside the district boundaries.
Null hypothesis: There is no difference between property value of buildings inside a historic district and those just outside, on the same block.
Revise “outHD” to include only buildings on the same block as historic buildings:
## Select buildings on the same block as a historic district. Get a list
## of all blocks that contain historic buildings
blocks <- inHD$Block
# display the first six values of blocks
head(blocks)
## [1] 1 1 7 7 7 7
## Select all buildings (from MN) that are on the same block as historic
## buildings. The line below selects all rows where the block column
## contains values in our list of blocks. Save the result as a new object.
## The %in% operator does the matching: for each value in MN$Block it
## returns TRUE if that value appears anywhere in the vector 'blocks'.
HDB <- MN[MN$Block %in% blocks, ]
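## (Aside, not part of the original assignment) A toy example of how %in%
## works: for each element of the left-hand vector it returns TRUE if that
## value appears anywhere in the right-hand vector.
c(1, 7, 9) %in% c(1, 2, 3)  # returns TRUE FALSE FALSE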
## The object HDB_out contains all buildings that are not within the
## historical district, but which are on the same block as other
## historical buildings.
HDB_out <- HDB[HDB$HD == 0, ]
## The object HDB_in contains all buildings that are inside a historical
## district. Note that HDB_in holds the same rows as the earlier object
## 'inHD', because every historic building is, by definition, on a block
## that appears in 'blocks'; we create it anyway so the naming stays
## parallel with HDB_out.
HDB_in <- HDB[HDB$HD == 1, ]
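## (Optional sanity check, not part of the original assignment) Confirm that
## HDB_in and inHD really do contain the same number of rows.
nrow(HDB_in) == nrow(inHD)  # should return TRUE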
Hypothesis Test 3: Null hypothesis: there is no significant difference in the value of a property inside or outside (but on the same block as) the historic district. (Difference in means is 0)
t.test(x = HDB_in$AssessTot, y = HDB_out$AssessTot) #Hypothesis Test 3
##
## Welch Two Sample t-test
##
## data: HDB_in$AssessTot and HDB_out$AssessTot
## t = -9.728, df = 4349, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1507426 -1001727
## sample estimates:
## mean of x mean of y
## 1233743 2488319
Results interpretation: with such a small p-value, we reject the null hypothesis and conclude that there is a statistically significant difference in property values between properties within the historic district and those just outside it (i.e. on the same block). The mean assessed value of properties inside a historic district is $1,233,743, while the mean for properties outside the district is $2,488,319.
However, the size of the building is an important determinant of its value. In hypothesis test 3 we did not control for the size of the building. We can do this by calculating the price per square foot:
# We have a problem. Some buildings have 0 area (square footage).
summary(HDB_in$BldgArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 4160 6370 22100 13000 17600000
# This could mean the lot is vacant, or it could be an error. Either way it
# makes it hard to compute the price per square foot. We need to exclude
# these zero-area buildings from our t-test.
# Calculate price per square foot for historic buildings, only for buildings
# with an area greater than 0
HDB_in_sqft <- HDB_in[HDB_in$BldgArea > 0, "AssessTot"] / HDB_in[HDB_in$BldgArea > 0, "BldgArea"]
# Calculate price per square foot for non-historic buildings
HDB_out_sqft <- HDB_out[HDB_out$BldgArea > 0, "AssessTot"] / HDB_out[HDB_out$BldgArea > 0, "BldgArea"]
Now, use the objects “HDB_in_sqft” and “HDB_out_sqft” to construct a t-test using the t.test() function.
Hypothesis Test 4 Null hypothesis: There is no difference in property price per square foot between buildings within and buildings on the same block as historic districts. (Difference in means is 0)
t.test(x = HDB_in_sqft, y = HDB_out_sqft) #Hypothesis Test 4
##
## Welch Two Sample t-test
##
## data: HDB_in_sqft and HDB_out_sqft
## t = -1.664, df = 4521, p-value = 0.09614
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -36.413 2.976
## sample estimates:
## mean of x mean of y
## 66.76 83.48
Results interpretation: The p-value of .096 is larger than .05, so we FAIL TO REJECT the null hypothesis. There is no statistically significant difference in price per square foot between buildings within the historic district and those on the district's periphery. This conclusion is the opposite of the one made after conducting hypothesis test 1 because we have controlled for two things:
1) for neighborhood (by comparing the buildings in historic districts to those just outside, and therefore excluding buildings that are far away from historic districts), and
2) for size of property (by comparing price per square foot instead of total property value)
Correlations are conceptually similar to the mean and standard deviation in that, when measuring the correlation between two variables, we have to account for the fact that our measurements contain uncertainty. We have discussed the distinction between the sample mean, x̄, and the true population mean, μ. Similarly, we use the letter r to denote the sample correlation (the correlation among the variables in our measurements) and ρ to denote the population correlation.
Calculating and testing correlations in R is easy. There are two primary functions, cor() and cor.test(). The former can produce a correlation matrix that tells you the correlation between all pairs of variables. cor.test() performs a hypothesis test on an observed correlation: it tests the null hypothesis that the true correlation is zero. You'll find that with large, complex data sets very small correlations are commonplace, so it can be very useful to use cor.test() to see whether an observed correlation is significantly different from zero.
Using the data x and y from hypothesis test #4 above (x = HDB_in_sqft and y = HDB_out_sqft): given the plot, is the result surprising?
cor(x, y)
## Error: object 'y' not found
The error occurs because no objects named x and y were ever created; those names were only used as argument names inside t.test(). Note also that cor() expects two vectors of equal length, so the in-district and out-of-district groups cannot simply be correlated element by element.
To test the null hypothesis that ρ=0 we use a t-test:
t = r * sqrt((n - 2) / (1 - r^2))
Conceptually, when either the observed correlation, r, or the number of observations n is large, the t-statistic increases (making it more likely that we will reject the null hypothesis). The function, cor.test() computes this t-statistic and returns a confidence interval for ρ given the observed data.
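To make the formula concrete, here is a hand computation in R with made-up values for r and n (purely illustrative, not from the assignment data):
r <- 0.5                                  # an assumed sample correlation
n <- 30                                   # an assumed number of observation pairs
t_stat <- r * sqrt((n - 2) / (1 - r^2))   # t-statistic for testing rho = 0
t_stat                                    # about 3.06
qt(0.975, df = n - 2)                     # two-sided 5% critical value, about 2.05
Because 3.06 exceeds the critical value, a correlation of 0.5 based on 30 pairs would be judged significantly different from zero; cor.test() carries out this computation (and reports an exact p-value) for you.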
There are several different methods for computing correlation. The Pearson correlation coefficient is used to summarize the relationship between two continuous variables. For example, the relationship between temperature and elevation. Kendall's τ (“tau”) and Spearman's rank correlation coefficient compute the association between two variables by sorting each variable and assigning each observation a rank. To specify the type of correlation:
cor(x, y, method = "pearson")  # the default
cor(x, y, method = "kendall")
cor(x, y, method = "spearman")
cor.test(x, y) with any of these methods gives a bit more test information; cor(x, y) gives only the correlation value, r.
In this example you can use the Pearson correlation and therefore do not have to specify a method (Pearson is the default). Once you've prepared your data, the function cor() can be used as follows:
a) cor(aTable$aColumn1, aTable$aColumn3) Returns only the correlation coefficient, r
b) cor.test(aTable$aColumn1, aTable$aColumn3) Returns more info associated with the correlation test
Question 5:
Factors other than historic designation affect the price of real estate in MN. Is there a significant correlation between a building's north-south position on the island (YCoord) and its total assessed value (AssessTot)? Are downtown buildings worth more than uptown buildings? Use cor.test() to answer the question.
Use a correlation test on the two variables, YCoord and AssessTot.
Null hypothesis: ρ = 0. In other words, the true (population) correlation between YCoord and AssessTot is zero.
cor.test(MN$YCoord, MN$AssessTot)
##
## Pearson's product-moment correlation
##
## data: MN$YCoord and MN$AssessTot
## t = -12.43, df = 43316, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.06902 -0.05025
## sample estimates:
## cor
## -0.05964
Test interpretation: The p-value is very small, so we reject the null hypothesis and conclude that there is some correlation. Since cor = -0.059, there is a slight negative correlation between a building's north-south position and its value: buildings farther north are worth somewhat less than buildings farther south.
Question 6 (optional, programmers only):
Use the layout() function to make a chart showing buildings on the island of Manhattan in one panel and a scatter plot of the correlation between YCoord and AssessTot in the other panel. If you need help type ?layout.
(Skipped)
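For reference, here is a minimal sketch of what such a figure could look like (hypothetical plotting choices, not a submitted answer):
## Two panels side by side: the map of buildings on the left, the
## YCoord vs. AssessTot scatter plot on the right
layout(matrix(c(1, 2), nrow = 1))
plot(y = MN$YCoord, x = MN$XCoord, col = MN$HD, pch = 16, cex = 0.5, main = "Buildings in Manhattan")
plot(x = MN$YCoord, y = MN$AssessTot, pch = 16, cex = 0.5, main = "YCoord vs. AssessTot")
layout(matrix(1))  # reset to a single panel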