Homework 2 – Alex Crawford

GEOG 5023: Quantitative Methods In Geography

Loading & Prepping Data

Load Manhattan Building dbf File

library("foreign", lib.loc = "/Library/Frameworks/R.framework/Versions/2.15/Resources/library")
MN <- read.dbf("/Users/telekineticturtle/Desktop/Colorado 13/Quant Methods/Data/mnmappluto.dbf")

Removal of buildings without location information

summary(MN[, c("YCoord", "XCoord")])

##      YCoord           XCoord       
##  Min.   :     0   Min.   :      0  
##  1st Qu.:207645   1st Qu.: 986617  
##  Median :219274   Median : 991591  
##  Mean   :217078   Mean   : 977235  
##  3rd Qu.:231006   3rd Qu.: 998354  
##  Max.   :259301   Max.   :1009761

MN <- MN[MN$YCoord > 0 & MN$XCoord > 0, ]
# SS: dim() returns the number of rows and columns in a table
dim(MN)  # AC: If this returns 43318 rows, the above operation was successful.

## [1] 43318    78

plot(MN$YCoord ~ MN$XCoord, xlab = "X Coordinate", ylab = "Y Coordinate", main = "Manhattan Building Locations")

[Figure: Manhattan Building Locations (Y Coordinate vs. X Coordinate)]

Identifying Historic Districts

Creating a Dummy Variable

# AC: Using the ifelse() and is.na() functions to create a dummy variable
# to indicate whether a building is in an historic district (1) or not
# (0).
MN$HD <- ifelse(is.na(MN[, "HistDist"]), 0, 1)
# AC: We convert MN$HD to a factor because it is a dummy variable for
# categories, not a ratio level variable
MN$HD <- as.factor(MN$HD)
# AC: Replotting with historic district buildings highlighted.
plot(y = MN$YCoord, x = MN$XCoord, col = MN$HD, pch = 16, cex = 0.5, xlab = "X Coordinate", 
    ylab = "Y Coordinate", main = "Manhattan Building Locations")

[Figure: Manhattan Building Locations, colored by historic district status]

Making subsets of the data

# AC: Make a subset of Manhattan Buildings that are in an historic
# district using a logical operator and bracket notation.
inHD <- MN[MN$HD == 1, ]
# AC: Now do the same, but for buildings NOT in an historic district.
outHD <- MN[MN$HD == 0, ]
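
As a small illustration of the two steps above (building the dummy variable, then subsetting on it), here is a sketch on a made-up data frame. The object `toy` and its values are invented for illustration and are not part of the MapPLUTO data.

```r
# Toy stand-in for the building table; all values are invented.
toy <- data.frame(
  HistDist  = c("Greenwich Village", NA, NA, "SoHo-Cast Iron", NA),
  AssessTot = c(900000, 2500000, 1800000, 1200000, 3100000),
  stringsAsFactors = FALSE
)

# 1 when a district name is present, 0 when it is missing (NA).
toy$HD <- as.factor(ifelse(is.na(toy$HistDist), 0, 1))

# Logical indexing keeps only the rows where the condition is TRUE.
toy_in  <- toy[toy$HD == 1, ]
toy_out <- toy[toy$HD == 0, ]

# Count of buildings in each category.
table(toy$HD)
```

Comparing the factor to `1` works because R compares the factor's labels ("0" and "1") with the value given.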

Hypothesis Testing

SS: Desired Null Hypothesis: The designation of historic districts has no effect on property values; buildings in a historic district have the same value as those outside of a historic district, and any difference between the two groups is due to random chance.

Test 1 - Property Value Split by Historic Designation

# AC: Two sample t-test (two-sided, Welch: variances not assumed equal) comparing
# assessment values for buildings in and out of an historic district.
# Null Hypothesis: The buildings in an historic district have the same
# value as those outside of an historic district, and difference between
# the two groups is due to random chance.
t.test(x = inHD$AssessTot, y = outHD$AssessTot)

## 
##  Welch Two Sample t-test
## 
## data:  inHD$AssessTot and outHD$AssessTot 
## t = -15.05, df = 43286, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -1724546 -1327117 
## sample estimates:
## mean of x mean of y 
##   1233743   2759574

Question 1: What does hypothesis test 1 tell you?

Firstly, hypothesis test 1 yielded a t value with a large magnitude (t = -15.05), indicating that the difference in mean assessment values between buildings inside and outside historic districts is much greater than would be expected if the variation were due only to random chance. Secondly, the p-value (< 2.2e-16) indicates that this difference is statistically significant: if the two subsets really had the same mean value, it would be extremely unlikely to obtain a t statistic as or more extreme than -15. Thirdly, the sign of the t value is negative, and the mean value for property in historic districts ($1,233,743) is lower than the mean for property outside historic districts ($2,759,574). Assuming the test was appropriate, it indicates that historic districts are associated with lower property values. It does not, of course, say why.
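
The output header for test 1 reads "Welch Two Sample t-test" because t.test() does not assume equal variances unless told to. The sketch below, on simulated data (not the assessment values), shows how the default differs from the pooled-variance version requested with var.equal = TRUE. The group names, sizes, and parameters are arbitrary assumptions.

```r
set.seed(1)
# Simulated groups with unequal sizes and spreads; purely illustrative.
small_group <- rnorm(50, mean = 10, sd = 1)
large_group <- rnorm(500, mean = 11, sd = 5)

# Default: var.equal = FALSE, which gives the Welch test.
welch <- t.test(small_group, large_group)
# Classic Student's t-test with a pooled variance estimate.
pooled <- t.test(small_group, large_group, var.equal = TRUE)

welch$method
pooled$method
welch$parameter   # Welch-Satterthwaite degrees of freedom
pooled$parameter  # n1 + n2 - 2 = 548
```

With group sizes and variances as different as those of the historic and non-historic subsets, the Welch version is the safer choice, so the default is arguably appropriate here.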

SS: That test was not appropriate for our desired null hypothesis because of a confounding variable: building size. Older buildings may be smaller than newer buildings (on average).

Test 2 - Building Area Split by Historic Designation

# AC: Two sample t-test (two-sided, Welch: variances not assumed equal) comparing
# building area for buildings in and out of an historic district.  Null
# Hypothesis: The buildings in a historic district have the same area as
# those outside of a historic district, and difference between the two
# groups is due to random chance.
t.test(x = inHD$BldgArea, y = outHD$BldgArea)

## 
##  Welch Two Sample t-test
## 
## data:  inHD$BldgArea and outHD$BldgArea 
## t = -9.037, df = 15819, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -25103 -16154 
## sample estimates:
## mean of x mean of y 
##     22050     42678

Question 2. What does hypothesis test 2 tell you about the size of the buildings inside and outside of historic districts?

Hypothesis test 2 yielded another large t value (t = -9.037), and the p-value indicates that there is a significant size difference between buildings in historic districts and those outside of historic districts. On average, historic district buildings are smaller than non-historic district buildings (about half the size: a mean building area of 22,050 versus 42,678). These findings suggest that to answer the original question correctly, we should control for building size.

Test 3 - Property Value Split by Historic Designation after Controlling for Location
SS: Select buildings on the same block as a historic district

# SS: Get a list of all blocks that contain historic buildings.
blocks <- inHD$Block
# SS: Select all buildings (from MN) that are on the same block as
# historic buildings. Save the result as a new object.  AC: the %in%
# operator selects all rows where the block column contains values in our
# list of blocks.
HDB <- MN[MN$Block %in% blocks, ]
# AC: Select all rows (buildings) with the same block number as an
# historical district building that are NOT themselves in an historic
# district.
HDB_out <- HDB[HDB$HD == 0, ]
# AC: Select all rows (buildings) with the same block number as an
# historical district building that are themselves in an historic
# district.  Note that this data frame will be equivalent to inHD.
HDB_in <- HDB[HDB$HD == 1, ]
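
The block-matching step above can be traced on a tiny example. This is a sketch with invented block numbers. The names `toy`, `hd_blocks`, and `same_block` are hypothetical, and the call to unique() is an optional tidy-up that the code above does not use (%in% gives the same rows either way).

```r
# Invented block numbers and historic-district flags, for illustration only.
toy <- data.frame(
  Block = c(10, 10, 11, 12, 12, 13),
  HD    = c(1, 0, 0, 1, 1, 0)
)

# Blocks that contain at least one historic-district building.
hd_blocks <- unique(toy$Block[toy$HD == 1])

# %in% is TRUE for every row whose Block appears in hd_blocks.
same_block <- toy[toy$Block %in% hd_blocks, ]

# Split the matched blocks by designation.
neighbors_out <- same_block[same_block$HD == 0, ]
neighbors_in  <- same_block[same_block$HD == 1, ]
```

Blocks 11 and 13 drop out because they contain no designated buildings, leaving one non-historic neighbor (on block 10) to compare against three designated buildings.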

Hypothesis Test 3

# AC: Two sample t-test (two-sided, Welch: variances not assumed equal) comparing
# property value for buildings in and out of an historic district but on
# the same block as buildings in an historic district.  Null Hypothesis:
# For blocks that contain buildings within an historic district, the
# buildings in an historic district have the same property value as those
# outside of an historic district, and difference between the two groups
# is due to random chance.
t.test(x = HDB_in$AssessTot, y = HDB_out$AssessTot)

## 
##  Welch Two Sample t-test
## 
## data:  HDB_in$AssessTot and HDB_out$AssessTot 
## t = -9.728, df = 4349, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -1507426 -1001727 
## sample estimates:
## mean of x mean of y 
##   1233743   2488319

Question 3. After controlling for location is the historic district designation associated with a difference in property values? Are the buildings in the historic district different from their non-historic neighbors? Use the p-value from hypothesis test 3 to support your conclusions.

After controlling for location, the historic designation is still associated with different (lower) property values. The p-value (< 2.2e-16) is much less than 0.05; this indicates that the difference is still statistically significant. Relative to hypothesis test 1, the 95% confidence interval has shifted toward zero by about $0.2 to $0.3 million (from roughly -$1.7 million to -$1.3 million in test 1 to roughly -$1.5 million to -$1.0 million in test 3).

Test 4 - Property Value per Area Split by Historic Designation after Controlling for Location

SS: Remove properties with a building area of 0.

# SS: Calculate price per square foot for historic buildings and then
# nonhistoric buildings, only for buildings with an area greater than 0.
HDB_in_sqft <- HDB_in[HDB_in$BldgArea > 0, "AssessTot"]/HDB_in[HDB_in$BldgArea > 
    0, "BldgArea"]
HDB_out_sqft <- HDB_out[HDB_out$BldgArea > 0, "AssessTot"]/HDB_out[HDB_out$BldgArea > 
    0, "BldgArea"]
# AC: Two sample t-test (two-sided, Welch: variances not assumed equal) comparing
# property value per square footage for buildings in and out of an
# historic district but on the same block as buildings in an historic
# district.  Null Hypothesis: For blocks that contain buildings within an
# historic district, buildings in an historic district have the same value
# per square foot as those outside of an historic district, and difference
# between the two groups is due to random chance.
t.test(HDB_in_sqft, HDB_out_sqft)

## 
##  Welch Two Sample t-test
## 
## data:  HDB_in_sqft and HDB_out_sqft 
## t = -1.664, df = 4521, p-value = 0.09614
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  -36.413   2.976 
## sample estimates:
## mean of x mean of y 
##     66.76     83.48

Question 4. After controlling for location and building size, do historic and non-historic buildings have significantly different values? If your conclusion has changed between the 1st and 4th hypothesis tests explain why.

After controlling for both location and building size, the p-value (0.096) is greater than 0.05, so the null hypothesis cannot be rejected. Although the mean value per square foot for buildings in historic districts is less than for those outside, this could be due entirely to random chance. Note also that the 95% confidence interval includes 0. My conclusion did not change between test 1 and test 3 (which controlled for location), so the difference maker is most likely building size. Test 2 showed that buildings in historic districts are smaller on average than other buildings. Combined, Tests 1 through 4 suggest that the property value discrepancy between buildings inside and outside historic districts exists because buildings in historic districts are smaller rather than because they are designated historic. Based on these findings and implications, further research could be conducted on the relationship between building size and property value.
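
The reasoning above (a gap in total value that disappears once value is expressed per square foot) can be reproduced on simulated data. This is only a sketch under stated assumptions: both groups are given the same average price per square foot (80, with a standard deviation of 5), and the "historic" group is simply given smaller areas. None of these numbers come from the Manhattan data.

```r
set.seed(42)
n <- 2000

# Assumption: historic buildings are smaller, but price per square foot
# has the same distribution in both groups.
area_in  <- runif(n, min = 1000, max = 3000)
area_out <- runif(n, min = 3000, max = 9000)
value_in  <- area_in  * rnorm(n, mean = 80, sd = 5)
value_out <- area_out * rnorm(n, mean = 80, sd = 5)

# Total value differs sharply because size differs.
total_test <- t.test(value_in, value_out)

# Value per square foot removes the size effect.
sqft_in  <- value_in / area_in
sqft_out <- value_out / area_out
sqft_test <- t.test(sqft_in, sqft_out)

total_test$p.value
sqft_test$estimate
```

In this simulation the per-square-foot means sit close to 80 in both groups by construction, while the test on totals rejects the null hypothesis. This is the same pattern as the contrast between tests 1 and 4.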

Correlation

AC: Test to see if there is a correlation between N-S location (YCoord) and property value (AssessTot)
First plot the data.

qqnorm(MN$AssessTot, ylim = c(0, 5e+08), main = "Normal Q-Q Plot for Property Assessment Values")  # One outlier omitted
qqline(MN$AssessTot)

[Figure: Normal Q-Q Plot for Property Assessment Values]

qqnorm(MN$YCoord, main = "Normal Q-Q Plot for Y Coordinate")
qqline(MN$YCoord)

[Figure: Normal Q-Q Plot for Y Coordinate]

plot(MN$AssessTot, MN$YCoord, xlim = c(0, 5e+08), main = "Property Assessment Value by Y Coordinate", 
    ylab = "Y Coordinate (Larger = Further North)", xlab = "Property Assessment Value ($)")  # Some outliers omitted

[Figure: Property Assessment Value by Y Coordinate]

Perform Correlation Tests

cor.test(MN$AssessTot, MN$YCoord)

## 
##  Pearson's product-moment correlation
## 
## data:  MN$AssessTot and MN$YCoord 
## t = -12.43, df = 43316, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0 
## 95 percent confidence interval:
##  -0.06902 -0.05025 
## sample estimates:
##      cor 
## -0.05964

cor.test(MN$AssessTot, MN$YCoord, method = "kendall")

## 
##  Kendall's rank correlation tau
## 
## data:  MN$AssessTot and MN$YCoord 
## z = -60.93, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0 
## sample estimates:
##     tau 
## -0.1952

cor.test(MN$AssessTot, MN$YCoord, method = "spearman")

## Warning: Cannot compute exact p-values with ties

## 
##  Spearman's rank correlation rho
## 
## data:  MN$AssessTot and MN$YCoord 
## S = 1.754e+13, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0 
## sample estimates:
##     rho 
## -0.2947

Question 5: Factors other than historic designation affect the price of real estate in MN. Is there a significant correlation between a building's north-south position on the island (YCoord) and its total assessed value (AssessTot)? Are downtown buildings worth more than uptown buildings? Use cor.test() to answer the question.

The large number of buildings in this sample makes a significant result very likely. Therefore, although there is only a weak correlation between y coordinate and assessment value (-0.060 using Pearson's and -0.195 using Kendall's), it is statistically significant (p-value < 2.2e-16 in every test). A rank correlation (e.g. Kendall's) is better in this case because the data are not normally distributed (see the normal q-q plots). Since the two variables are negatively correlated, higher property values are associated with more southerly (downtown) properties. Closer inspection of the scatter plot shows that the spatial pattern of property values is more complicated. Visually, it is apparent that the greatest property values are located between Y coordinates of 210,000 and 220,000. A second maximum is located around the southernmost properties.
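
To illustrate the first point (a very large sample makes even a weak correlation statistically significant), here is a sketch on simulated data. The variables `x` and `y` are invented: `y` is deliberately right-skewed, as assessment values are, and only weakly related to `x`.

```r
set.seed(7)
n <- 20000

# x is symmetric; y is log-normal (strongly right-skewed) and only
# weakly tied to x through the 0.1 coefficient.
x <- rnorm(n)
y <- exp(0.1 * x + rnorm(n))

pearson  <- cor.test(x, y)                       # assumes roughly normal data
spearman <- cor.test(x, y, method = "spearman")  # rank-based, ignores skew

pearson$estimate
spearman$estimate
spearman$p.value
```

The rank correlation stays well below 0.2 in magnitude, yet its p-value is tiny. Significance here says the association is unlikely to be exactly zero, not that it is strong.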

Two other notes worth making in this document are 1) that none of the data examined appear normally distributed when looking at normal q-q plots or histograms and 2) that we have been treating buildings and properties interchangeably, although some properties have multiple buildings. The median and both quartiles are 1 building and the mean is 1.078, but there are some outliers, and several properties in the first 100 rows have 2 or more buildings. Multiple buildings (NumBldgs) should be reflected by larger building areas (BldgArea), but there may still be an impact on total assessment of property (AssessTot). In fact, I found a slightly stronger correlation between number of buildings and assessment total (Kendall's tau = 0.202) than between y coordinate and assessment total (tau = -0.195).

summary(MN$NumBldgs)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    1.00    1.08    1.00  163.00

MN[1:100, "NumBldgs"]

##   [1] 163  12   1   1   1   1   2   1   1   1   1   1   1   1   1   1   1
##  [18]   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
##  [35]   1   1   1   1   6   1   2   1   1   1   1   1   1   1   0   1   1
##  [52]   1   1   1   2   1   2   1   1   2   1   1   1   1   1   1   1   1
##  [69]   1   1   1   1   1   1   1   1   1   1   2   1   1   1   1   1   0
##  [86]   1   1   1   1   1   1   1   1   1   2   1   1   1   1   1

cor.test(MN$NumBldgs, MN$AssessTot, method = "kendall")

## 
##  Kendall's rank correlation tau
## 
## data:  MN$NumBldgs and MN$AssessTot 
## z = 52.41, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0 
## sample estimates:
##   tau 
## 0.202

Question 6: Use the layout() function to make a chart showing buildings on the island of Manhattan in one panel and a scatter plot of the correlation between YCoord and AssessTot in the other panel.

matrx <- matrix(data = c(1, 2), 1, 2)
layout(matrx)
plot(MN$YCoord ~ MN$XCoord, xlab = "X Coordinate", ylab = "Y Coordinate", main = "Manhattan Building Locations")
plot(MN$AssessTot, MN$YCoord, xlim = c(0, 5e+08), main = "Assessment Value by Y Coord", 
    ylab = "Y Coordinate (Larger = Further North)", xlab = "Property Assessment Value ($)")

[Figure: two panels, Manhattan Building Locations (left) and Assessment Value by Y Coord (right)]
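
For reference, the way layout() maps a matrix onto panels can be seen in a minimal sketch with simulated points. The names `panels`, `n_panels`, `px`, and `py` are hypothetical and the data are random. The final call resets the device to a single panel, a step the code above leaves out.

```r
set.seed(3)
# Random stand-ins for coordinates; not the Manhattan data.
px <- runif(100)
py <- runif(100)

# A 1 x 2 matrix: figure 1 goes in the left cell, figure 2 in the right.
panels <- matrix(c(1, 2), nrow = 1, ncol = 2)
n_panels <- layout(panels)   # layout() returns the number of figures

plot(px, py, main = "Panel 1 (left)")
plot(py, px, main = "Panel 2 (right)")

# Restore a single-panel layout for any later plots.
layout(matrix(1))
```

Stacking the panels vertically instead would only require a 2 x 1 matrix, e.g. matrix(c(1, 2), nrow = 2, ncol = 1).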