This lab has two parts (100 points each), and you have two weeks to work on it. Please complete the following questions and submit the finished Rmd and HTML files on Canvas. In order for this file to knit, it needs a file called “lab.css” in the same folder as the .Rmd file. The lab is rendered as an .html file for improved readability, but we recommend working through the .Rmd file and running the code as you go.

Don’t forget to change the name field on line 3 to your first and last name.

Introduction

Cities all over North America contain “historic” neighborhoods. Historic neighborhoods are areas where the buildings have historical importance not because of their individual significance but because, as a collection, they represent the architectural sensibilities of a particular time period. In Boulder, for example, the Martin Acres subdivision was not so long ago surveyed for possible inclusion in a historic district. Establishing the ‘significance’ of a neighborhood’s historical character is a matter for historians, not statisticians. However, there is an important question about historic preservation – does it help or hurt property values?

Part I (100 pts)

Working with real data

This lab deals with “real” data that is not necessarily clean. This is a reminder of what the data you find for your final projects will look like. You will have to load your data, check that all of the variables are in the correct form (i.e., do some need to be factors? do you need to create new variables from the variables that are in the data?), check for missingness in your data (will you remove all missing values? will you explore multiple imputation to deal with missingness?), and also check for values that are impossible or improbable (what will you do with those values? how can you defend what you do with those values?). Make sure to get started on your final projects early so that the cleaning gets done with plenty of time for the fun part - the analysis!

Today, we will be working with a dbase file. A dbase file is just another way of storing tabular data; there is little substantive difference between a dbase, a csv, or an excel file. In spite of their functional similarity, they each get handled differently in R. ArcGIS likes dbase files for some reason, so if you end up doing GIS work in your academic career, you may come across them. To read dbase files into R you have to load a library called foreign – its purpose is to read dbase files and other files foreign to R. First, we’ll check if it’s installed. Run this code chunk on your computer, and install the necessary packages.

# Here's an easy way to check if a package is installed. There
# are additional packages we'll need for this lab, so we'll check those too

packages <- c("foreign", "tidyverse", "sf", "kableExtra", "agricolae")
packages %in% rownames(installed.packages())
## [1] TRUE TRUE TRUE TRUE TRUE
# If necessary

# install.packages("foreign")
# install.packages("sf")

Now that everything is installed we’ll load our libraries

library(foreign) # for reading in a dbf
library(sf)
library(agricolae) # has functions for 2+ hypothesis tests
library(tidyverse) # all the good stuff is in here
library(kableExtra)

Now, we need to get the dbase file. We are working with a shapefile and associated dbase file for Manhattan, New York. The file is included in the lab folder under the data subdirectory. You can load the file like this:

MN <- foreign::read.dbf("data/MNMapPLUTO.dbf")
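
A quick sanity check that the load worked is to look at the dimensions of the resulting data.frame:

dim(MN) # rows are buildings, columns are attributes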

The data include information on just about every building (circa 2010) in Manhattan – over 40,000 buildings. Most buildings have X, Y coordinates allowing you to map them simply by using the ggplot command, specifying those columns to ggplot’s helper function, aes():

# Map all of the buildings in Manhattan
ggplot(data = MN, aes(y = YCoord, x = XCoord)) +
  geom_point() 

Something is wrong. The resulting plot doesn’t look at all like a map of Manhattan. This is because some buildings have not been assigned coordinates and have an X, Y position of (0, 0). These must be removed from the data first:

Q1 (10 pts): Please use the filter function to exclude the rows with the buildings that are not properly geo-referenced.

filt_sites <- MN %>%
  filter(XCoord != 0 & YCoord != 0)
filt_sites
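
To confirm the filter did something, you can compare row counts before and after:

nrow(MN) - nrow(filt_sites) # number of buildings dropped for bad coordinates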

Q2 (15 pts): Please recreate the map using the code from above to check and make sure you got rid of all the problem rows.

ggplot(data = filt_sites, aes(y = YCoord, x = XCoord)) +
  geom_point() 

Your new map should look something like the output from the following code chunk. Here I just used the xlim and ylim functions to restrict the map extent.

ggplot(data = MN, aes(y = YCoord, x = XCoord)) +
  geom_point() +
  xlim(c(970000, 1010000)) +
  ylim(c(190000, 260000))

Identifying Historic Districts

For this lab, we will be working with buildings that are in a historic district. Take a look at those data to see what variables are included in the dataset. ESRI Shapefiles have a peculiar requirement that column names cannot be more than 10 characters. Thus, most columns have abbreviated names that can be a little tricky to parse. Here, we are most interested in AssessTot, which is the total assessed property value, and HistDist, which is the name of the designated historic district that the building is in; later on we’ll be looking at BldgArea, the area of the building (probably in square feet).

Q3 (10 pts): Use glimpse() to identify how the variables are coded in R. A possibly easier way to see how the variables are coded is to look in the “Environment” tab to the right and click on the blue arrow to the left of the MN object.

glimpse(MN)
glimpse(filt_sites) 

We need to be able to identify these buildings, but the way the data file is coded is awkward.

MN$HistDist[1:10] %>% kable %>% kableExtra::kable_styling()
x
NA
NA
NA
NA
NA
Fraunces Tavern Block
NA
South Street Seaport
NA
NA

If a building is in a historic district, we know the name of the district. Notice that over 32,000 buildings are in a group called NA; these appear near the bottom of the list of districts. These NA buildings are not in a historic district. When a building is not in a district, the HistDist variable is left blank – NA is how R indicates that a value is blank or missing. We don’t really care which district a building is in, only that it is in a historic district. We’ll now re-code the HistDist column to make a ‘categorical’ variable which takes the value "In" if a building is in a historic district and the value "Out" if it is not.

To do this we will use some new functions: is.na(), and if_else() from the dplyr package (note that this function is slightly different from the ifelse() function in base R). is.na() is a logical test: it returns TRUE if a value is missing (NA), and FALSE otherwise.

# Check the first 10 rows of the "HistDist" column to
# see if they are blank
is.na(MN[1:10, "HistDist"])
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE

In the output, the TRUE values indicate rows where the HistDist column is missing a value – these are places that are not in a historic district. The function if_else() takes three arguments: a logical expression, a value for TRUE, and a value for FALSE. If the logical expression evaluates to TRUE, the first value is used; if not, the second value is used. The logical test is.na() checks each row in the table to see if the HistDist column is empty (NA). If the column is NA, then the entry in a new column HD is "Out"; otherwise it’s "In". The rows where MN$HD == "In" represent buildings that are in a historic district.

if_else(is.na(MN[1:10, "HistDist"]), "Out", "In")
##  [1] "Out" "Out" "Out" "Out" "Out" "In"  "Out" "In"  "Out" "Out"

We don’t just want to print out on the screen whether a building is “In” or “Out” of a historic district. We want to create a new variable in the MN data.frame that codes this information. We can do that with the dplyr function mutate.

Q4 (15 pts): Please add code to create a new variable HD using the if_else() and is.na() functions.

MN <- MN %>%
  mutate(HD = if_else(is.na(HistDist), "Out", "In"))

Since MN$HD records whether a building is in or out of a historic district, and we have used ‘character’ strings to represent "Out" and "In", R treats this variable as a character vector. What we actually want is a factor – a column that contains categorical data. Each category in the factor is called a level, so, in our case, we want a factor with two levels:

Q5 (10 pts): Please use the mutate function to change HD to a factor.

MN <- mutate(.data = MN,
             HD = factor(HD))
summary(MN$HD) # summary() of a factor shows the count of buildings in each level

Now we can draw a very crude map of historic districts. Uncomment the last line and run the code chunk to create the plot.

R tip: you can comment or uncomment many lines of code at once by highlighting them and pressing CTRL + SHIFT + C (use Command instead of Ctrl on Macs).

# "col" changes the color of dots depending upon the value in the "HD" column
# "pch" sets the symbol to a solid dot
# "cex"  makes the dot .5 the normal size
# Note that setting asp=1 will set the aspect ratio to 1

# plot(YCoord ~ XCoord, data=MN, col=HD, pch=16, cex=.5, asp=1)

You might notice that this map is… not beautiful, and the arguments to the function are kind of cryptic. This issue comes up quite frequently with base R plots. Luckily, the same people who made the tidyverse and dplyr also came up with a visualization package called ggplot2, which we’ve already seen earlier in this tutorial and in lectures. ggplot2 is very powerful, looks a lot better with the default settings, and looks a LOT better once you get good at fiddling with the settings. It is a bit more complicated than just typing plot(), but once you get over the learning curve you won’t regret it.

Extensive documentation on how to use ggplot2 is here: https://ggplot2.tidyverse.org/index.html

There is also a ggplot2 cheat sheet that is really good! At the top of your rstudio window go to the help menu, then go down to Cheat Sheets.

# uncomment the lines below and run the code chunk to create the plot

# ggplot(data = MN, aes(x = XCoord, y = YCoord, color = HD)) +
#   geom_point() +
#   coord_fixed()

This is all well and good, but we are in a geography class, and a lot of you might be interested in learning how to use R as a GIS. So we’re going to quickly dip our toes into spatial data structures with the sf library.

First, we convert these points into a spatial object with a projection using the st_as_sf function. The key pieces of information st_as_sf needs are the x and y coordinate columns, given with the coords argument, and the projection, given with the crs argument. I’m not sure what the coordinate system is, to be honest; I guessed that it was Albers Equal Area (EPSG: 5070; for lat/long the EPSG is 4326) and was absolutely wrong, but we’ll just roll with it.

MN_spatial <- MN %>%
  st_as_sf(coords = c("XCoord", "YCoord"), crs=5070)
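
For what it’s worth, NYC’s MapPLUTO files are typically distributed in EPSG:2263 (NAD83 / New York Long Island, in US feet). A sketch of the “proper” route, assuming that CRS applies to this vintage of the file, uses st_transform() to reproject to lat/long:

# Assumes the coordinates are in EPSG:2263; st_transform() then
# reprojects to longitude/latitude (EPSG:4326)
MN_projected <- MN %>%
  st_as_sf(coords = c("XCoord", "YCoord"), crs = 2263) %>%
  st_transform(4326)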

Now, we map using ggplot, and geom_sf. Note that because I did not input the correct coordinate reference system, these are not the correct lat/longs for New York City.

# uncomment the lines below and run the code chunk to create the plot

# ggplot(MN_spatial) +
#   geom_sf() +
#   ggtitle("How neat is that???")

There are endless ways to modify your data visualization with ggplot. Here’s some further modification, just to show a few options. At each step, I assigned the ggplot to an object. We can call the object to look at the plot, or we can add more components to the plot by using the + sign followed by another ggplot function. You can combine as many ggplot functions as you want when you’re designing a plot.

# uncomment the lines below and run the code chunk to create the plot

# p <- ggplot(data = MN, aes(x = XCoord, y = YCoord, color = HD)) +
#   geom_point()
# p
# uncomment the lines below and run the code chunk to create the plot

# p1 <- p + facet_wrap(~HD)
# p1
# uncomment the lines below and run the code chunk to create the plot

# p2 <- p1 + coord_fixed()
# p2
# uncomment the lines below and run the code chunk to create the plot

# p3<- p2 + theme_bw()
# p3
# uncomment the lines below and run the code chunk to create the plot

# p4 <- p3 +
#   scale_color_manual(values = c("burlywood", "peachpuff"),
#                      name="Historical Districts")
# p4
# uncomment the lines below and run the code chunk to create the plot

# p5<- p4 +
#   xlab("I love geography")+
#   ggtitle(label ="Geog 4023/5023 is my JAM!",
#           subtitle = "A+++ subtitle right here") +
#   ylab("space and place")
# p5
# uncomment the lines below and run the code chunk to create the plot

# p6 <- p5 +
#   theme(legend.position = c(0.5,.5), # this is one way to move the legend around
#         legend.justification = c(0.5,.5), # values go from 0 to 1
#         legend.background = element_rect(colour = "plum", fill = "thistle"),
#         panel.background = element_rect(fill = "rosybrown2"))
# 
# p6

Q6 (15 pts): Make your own sweet map (5 pts). Go to the ggplot2 cheatsheet or online documentation to find one additional component to add to the map (e.g. a different pre-made theme) that is not included in the previous demonstration (5 pts). Make sure everything is properly labelled so that the reader can easily interpret what is going on in the map (5 pts).

Now that we’ve created the crude map, and demonstrated some of the functionality of ggplot2, use your readings and the internet to produce a slightly nicer map using ggplot2, with a legend, title, axis labels, and better colors (the colors() function can come in handy for manual color scaling):

pp <- ggplot(data = MN, aes(y = YCoord, x = XCoord, color = HD)) +
  geom_point(size = 0.5) +
  xlim(c(970000, 1010000)) +
  ylim(c(190000, 260000)) +
  scale_color_manual(values = c("firebrick", "grey70"),
                     name = "Historic District") +
  labs(title = "Historic Districts in Manhattan",
       x = "Easting (X coordinate)",
       y = "Northing (Y coordinate)")
pp <- pp + theme_minimal() # a pre-made theme not used in the demonstration above
pp

Simple Hypothesis Testing

Our goal here is to explore the effect of historic districts (HD) on property values (AssessTot) in New York City.

Q7 (15 pts): Go through the typical workflow (e.g. the steps listed in the lecture slides) for a hypothesis test - write your null and alternative hypotheses (5 pts), determine the correct test via visualization or a Shapiro test (5 pts; the Shapiro test checks the normality of the data and can be run with shapiro.test()), and conduct the test (5 pts). For this lab, you can just assume \(\alpha = 0.05\). (Hint: use a t-test (t.test()) if the data are Gaussian/normally distributed, and a Wilcoxon test (wilcox.test()) if they are not.)

\(H_0\): Assessed property values (AssessTot) are the same for buildings in and out of historic districts.

\(H_A\): Assessed property values differ between buildings in and out of historic districts.

m1 <- lm(AssessTot ~ HD, data = MN)
summary(m1)

# Check the normality of the residuals graphically
par(mfrow = c(1, 2))
hist(m1$residuals)
qqnorm(m1$residuals)
qqline(m1$residuals)

length(m1$residuals)
# shapiro.test(m1$residuals) # shapiro.test() only accepts up to 5000
# observations, so with ~42,000 buildings we judge normality from the plots
# instead - the residuals are clearly not Gaussian distributed

# Since the data are not Gaussian, use the two-sample Wilcoxon rank-sum test
wilcox.test(AssessTot ~ HD, data = MN)
# The p-value (< 2.2e-16) is far below alpha = 0.05, so we reject the null
# hypothesis: assessed property values differ depending on whether a building
# is in a historic district or not

Q8 (10 pts): Does historic district designation affect property value? If yes, how?

Yes, historic district designation does affect property value. The Wilcoxon test from Q7 was highly significant, and comparing the two groups shows that buildings within areas designated as historic districts generally have higher assessed values.
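
One quick way to back up the direction of the effect is to compare the group medians directly (a small sketch using the HD column created in Q4):

MN %>%
  group_by(HD) %>%
  summarise(median_value = median(AssessTot, na.rm = TRUE),
            n = n())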

Part II (100 pts)

Testing Hypotheses on more than two groups

We could also do tests of whether the assessed value differs by a variable that is not dichotomous (like all of the ones done here) but rather categorical (nominal) with more than two levels. For example, what if all Historic Districts are not created equal? Maybe there is a lot of variation between historic districts that we’re not really paying attention to. Consider the following table:

hd_summary <- MN %>%
  filter(AssessTot>0)%>%
  group_by(HistDist) %>%
  summarise(n=n()) %>%
  arrange(desc(n)) 

hd_summary %>%
  slice(1:10)%>%
  kable(caption = paste("There are", nrow(hd_summary)-1, "Historic Districts")) %>%
  kable_styling()
There are 70 Historic Districts
HistDist                               n
NA                                 32295
Greenwich Village                   1876
Upper West Side/Central Park West   1860
Upper East Side                      969
SoHo-Cast Iron                       445
Ladies’ Mile                         351
East Village/ Lower East Side        303
Expanded Carnegie Hill               268
Riverside-West End                   260
Mount Morris Park                    256

Not only are there a lot of historic districts, there is considerable variation in sample size, ranging from 1 to 32,295. There are 4 historic districts with more than 400 buildings, so maybe we’ll focus on those.

Q9 (10 pts): Create a new data frame with only the historical districts that have over 400 buildings in them (4 pts). Hint: create a vector from the hd_summary object first, then use that to filter MN using the %in% operator. Remove the rows that have NAs as well (3 pts)! Remember that NA in this column means not in a historical district. Also remove rows where AssessTot == 0, because that’s obviously silly (3 pts).

big_dist <- hd_summary %>%
  filter(n > 400, !is.na(HistDist)) %>%
  pull(HistDist)

hist_400 <- MN %>% 
  filter(HistDist %in% big_dist,
         !is.na(HistDist),
         AssessTot > 0)
hist_400
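
A quick check that only the four big districts (and no NAs) survived the filter:

hist_400 %>% count(HistDist, sort = TRUE)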
 

Q10 (5 pts): Create a boxplot to visualize the data

ggplot(data = hist_400, aes(x = HistDist, y = AssessTot)) + 
  geom_boxplot() 

Wow that’s some non-normal data, is it not? We can transform it just to get a better look at the variation.

Q11 (5 pts): Make another plot with the AssessTot logged. Hint: you can use scale_y_log10().

ggplot(data = hist_400, aes(x = HistDist, y = AssessTot)) + 
  geom_boxplot() +
  scale_y_log10() 

It looks like maybe you could log-transform and achieve normality?

Q12 (10 pts): Wrangle the data down so that you have only one of our HistDists of interest (5 pts), then run a Shapiro-Wilk normality test on the log of the AssessTot column (5 pts), just to check whether transformation is an option. If the log comes back as NaN, you need to omit the zeros.

single_dist <- hist_400 %>%
  filter(HistDist == big_dist[1]) # big_dist[1] is the district with the most buildings

shapiro.test(log(single_dist$AssessTot)) 

Q12a (5 pts): Is it normal?

No, this is not normally distributed, as indicated by the p-value: it is far below 0.05.

So now, we know that data are not normal, and we know that the means do look different after transforming the variables for visualization purposes. Are they different?

Q13 (10 pts): Conduct a Kruskal-Wallis test to see if the property values (AssessTot) are different from one historic district to the next for our four districts of interest (5 pts). Use the agricolae package so that you can automatically do the contrasts. Don’t forget to adjust the p-value for multiple comparisons (5 pts).

krusk_tst <- kruskal(hist_400$AssessTot,
                     hist_400$HistDist,
                     p.adj = "bonferroni",
                     group = TRUE)
krusk_tst
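
With group = TRUE, kruskal() returns a groups table that assigns letters to the districts; districts that share a letter are not significantly different after the Bonferroni adjustment. You can pull out just that piece:

krusk_tst$groups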

Q13a (5 pts): How do the property values vary from district to district? Let’s say we want to be hip and live in a historic building, but we don’t want to spend a ton of money. Which neighborhood should we try to live in?

We should try to live in Greenwich Village: it has the lowest mean rank in the Kruskal-Wallis output (2029.758), which corresponds to the lowest typical assessed value of the four districts.

But, you know, there might be some other stuff going on. What if the buildings in some historic districts are bigger on average, and so the property values are higher because they’re just bigger, instead of being due to the prestige (or lack thereof) of being in a particular historic district?

Q14 (10 pts): Please create a variable that adjusts the building’s value (AssessTot) by the building’s area (BldgArea) using the mutate function (4 pts). Remember, this is real data, so it’s messy. Make sure to check for 0’s or NA’s in the BldgArea column and filter out those rows as well (3 pts). Then explore the data with a boxplot (3 pts).

hist_400 <- hist_400 %>%
  filter(BldgArea > 0, !is.na(BldgArea)) %>%
  mutate(AssessPerArea = AssessTot / BldgArea)
ggplot(data = hist_400, aes(x= HistDist, y = AssessPerArea)) +
  geom_boxplot() +
  scale_y_log10() 

Q15 (10 pts): Which test are you going to use, and why (5 pts)? Please also write the code to conduct the test (5 pts).

I am planning to use the Kruskal-Wallis test because the data are not normally distributed.

krusk_per_area <- kruskal(hist_400$AssessPerArea,
                          hist_400$HistDist,
                          p.adj = "bonferroni",
                          group = TRUE)
krusk_per_area

Q15a (5 pts): How does this change things?

This result changes things slightly, as the average building size in Greenwich Village is the second smallest, but I believe the cheaper cost offsets this disadvantage and I would still choose to live there.

Sometimes it makes sense to create a categorical variable out of a continuous variable in order to test how those categories affect your variable of interest. For example, maybe you have a hunch that the period before 1900 was a golden age of high-quality building construction, and even though those buildings are old now, they just don’t do construction like they used to. But then, since the 1970s, the effect of “back in the good-old days” construction practices on price would be offset by… new-ness.

Q16 (10 pts): Create a new variable for whether a building was built before 1900, between 1900-1970, or after 1970. Think about what kind of variable this should be in R (i.e., integer, continuous, factor, logical, character). Hint: ifelse() is good, and case_when() is great for this.

hist_400 <- hist_400 %>%
  mutate(building_era = case_when(
    YearBuilt < 1900 ~ "Before 1900",
    YearBuilt <= 1970 ~ "1900-1970",
    TRUE ~ "After 1970"
  ),
  building_era = factor(building_era))
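
Before testing, it’s worth checking how many buildings land in each era:

table(hist_400$building_era)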

Now, let’s go through that workflow one more time. Determine the proper test, make sure to state your hypotheses clearly, conduct the test, do a post-hoc test, account for multiple comparisons, and analyze the contrasts. Show your code and output, and explain your interpretation of the results. We can default to \(\alpha\) = 0.05.

Q17 (3 pts): Visualize the data with a boxplot (or use Shapiro tests) to determine the appropriate test.

ggplot(data = hist_400, aes(x= building_era, y = AssessTot))+
  geom_boxplot() 

Q17a (3 pts): What is an appropriate test?

I believe a Kruskal-Wallis test is the appropriate test because the data are clearly skewed.

Q17b (3 pts): State the hypotheses:

\(H_0\): The median property values are the same across all three building eras.

\(H_A\): The median property values differ between at least two of the building eras.

Q17c (3 pts): Please write the code to conduct the test.

krusk_era <- kruskal(hist_400$AssessTot,
                     hist_400$building_era,
                     p.adj = "bonferroni",
                     group = TRUE)
krusk_era

Q17d (3 pts): Please interpret the results of the above test.

The results show that there is in fact a difference in assessed value between the three eras, so we reject the null hypothesis in favor of the alternative. The letter groups in the output indicate which eras differ from one another after the Bonferroni adjustment.

# code if necessary

That’s it, folks. We used a lot of non-parametric tests due to the nature of the data we were using, but please remember that if the data actually are normal, you really do want to use an ANOVA or another parametric test, since they are more statistically powerful.
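
For reference, if the (possibly transformed) data had passed the normality checks, the parametric route would look something like the sketch below, here applied to the building_era factor from Q16. Our data did not qualify, so treat this as illustration only:

# One-way ANOVA on log-transformed values (hist_400 already excludes zeros),
# followed by Tukey's HSD for post-hoc pairwise contrasts with a built-in
# multiplicity adjustment
anova_fit <- aov(log(AssessTot) ~ building_era, data = hist_400)
summary(anova_fit)
TukeyHSD(anova_fit)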

Acknowledgments

Thanks to Seth Speilman, Carson Farmer, Colleen Reid and other professors and graduate students in the CU Boulder Geography department who contributed to earlier iterations of this lab.