This lab has two parts (each worth 100 points), and you have two weeks to work on it. Please complete the following questions and submit the finished Rmd and HTML files to Canvas. In order for this file to knit, it needs a file called “lab.css” in the same folder as the .Rmd file. The lab is rendered as an .html file for improved readability, but we recommend working through the .Rmd file and running the code as you go.
Don’t forget to change the name field on line 3 to your first and last name.
Cities all over North America contain “historic” neighborhoods. Historic neighborhoods are areas where the buildings have historical importance not because of their individual significance but because, as a collection, they represent the architectural sensibilities of a particular time period. In Boulder, for example, the Martin Acres subdivision was not so long ago surveyed for possible inclusion in a historic district. Establishing the ‘significance’ of a neighborhood’s historical character is a matter for historians, not statisticians. However, there is an important question about historic preservation – does it help or hurt property values?
This lab deals with “real” data that is not necessarily clean. This is a reminder of what the data you find for your final projects will look like. You will have to load your data, check that all of the variables are in the correct form (i.e., do some need to be factors? do you need to create new variables from the variables that are in the data?), check for missingness in your data (will you remove all missing values? will you explore multiple imputation to deal with missingness?), and also check for values that are impossible or improbable (what will you do with those values? how can you defend what you do with those values?). Make sure to get started on your final projects early so that the cleaning gets done with plenty of time for the fun part – the analysis!
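As a sketch of what that cleaning checklist can look like in code (the file name and the group and value columns here are hypothetical, purely for illustration):
# A generic cleaning checklist; file and column names below are hypothetical
library(tidyverse)
dat <- read_csv("my_project_data.csv")        # 1. load the data
glimpse(dat)                                  # 2. check how each variable is coded
dat <- dat %>% mutate(group = factor(group))  #    e.g., recode a character as a factor
colSums(is.na(dat))                           # 3. count missing values per column
dat %>% filter(value <= 0)                    # 4. inspect impossible/improbable values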
Today, we will be working with a dBase file. A dBase file is just another way of storing tabular data; there is little substantive difference between a dBase, CSV, or Excel file. In spite of their functional similarity, they each get handled differently in R. ArcGIS uses dBase files to store the attribute tables of shapefiles, so if you end up doing GIS work in your academic career, you may come across them. To read dBase files into R you have to load a library called foreign – its purpose is to read dBase files and other formats foreign to R. First, we’ll check if it’s installed. Run this code chunk on your computer, and install any packages that are missing.
# Here's an easy way to check if a package is installed. There
# are additional packages we'll need for this lab, so we'll check those too
packages <- c("foreign", "tidyverse", "sf", "kableExtra", "agricolae")
packages %in% rownames(installed.packages())
## [1] TRUE TRUE TRUE TRUE TRUE
# If any of the above returned FALSE, install the missing packages, e.g.:
# install.packages(packages[!packages %in% rownames(installed.packages())])
Now that everything is installed, we’ll load our libraries:
library(foreign) # for reading in a dbf
library(sf)
library(agricolae) # has functions for 2+ hypothesis tests
library(tidyverse) # all the good stuff is in here
library(kableExtra)
Now, we need to get the dBase file. We are working with a shapefile and associated dBase file for Manhattan, New York. The file is included in the lab folder under the data subdirectory. You can load the file like this:
MN <- foreign::read.dbf("data/MNMapPLUTO.dbf")
The data include information on just about every building (circa
2010) in Manhattan – over 40,000 buildings. Most buildings have X, Y
coordinates, allowing you to map them simply by using the ggplot command,
specifying those columns to ggplot’s helper function,
aes():
# Map all of the buildings in Manhattan
ggplot(data = MN, aes(y = YCoord, x = XCoord)) +
geom_point()
Something is wrong. The resulting plot doesn’t look at all like a map of Manhattan. This is because some buildings have not been assigned coordinates and have an X, Y position of (0, 0). These must be removed from the data first:
Q1 (10 pts): Please use the filter function to exclude the rows with the buildings that are not properly geo-referenced.
filt_sites <- MN %>%
filter(XCoord != 0 & YCoord != 0)
head(filt_sites)  # peek at the filtered data rather than printing every row
Q2 (15 pts): Please recreate the map using the code from above to check and make sure you got rid of all the problem rows.
ggplot(data = filt_sites, aes(y = YCoord, x = XCoord)) +
geom_point()
Your new map should look something like the output from the following
code chunk. Here I just used the xlim and ylim
functions to restrict the map extent.
ggplot(data = MN, aes(y=YCoord, x= XCoord))+
geom_point() +
xlim(c(970000,1010000))+
ylim(c(190000, 260000))
For this lab, we will be working with buildings that are in a historic district. Take a look at those data to see what variables are included in the dataset. ESRI Shapefiles have a peculiar requirement that column names cannot be more than 10 characters, so most columns have abbreviated names that can be a little tricky to parse. Here, we are most interested in AssessTot, which is the total assessed property value; HistDist, which is the name of the designated historic district the building is in; and, later on, BldgArea, the area of the building (probably in square feet).
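If you want to see the full list of those abbreviated column names, names() will print them:
# Print all of the (10-character-max) column names in the attribute table
names(MN)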
Q3 (10 pts): Use glimpse() to identify how the
variables are coded in R. A possibly easier way to see how
the variables are coded is to look in the “Environment” tab to the right
and click on the blue arrow to the left of the MN object.
glimpse(MN)
glimpse(filt_sites)
We need to be able to identify these buildings, but the way the data file is coded is awkward.
MN$HistDist[1:10] %>% kable %>% kableExtra::kable_styling()
| x |
|---|
| NA |
| NA |
| NA |
| NA |
| NA |
| Fraunces Tavern Block |
| NA |
| South Street Seaport |
| NA |
| NA |
If a building is in a historic district, we know the name of the district. Notice that over 32,000 buildings fall into a group called NA; these appear near the bottom of the list of districts. These NA buildings are not in a historic
district. When a building is not in a district, the
HistDist variable is left blank – NA is how
R indicates that a value is blank or missing. We don’t
really care which district a building is in, only that it is in a
historic district. We’ll now re-code the HistDist column to
make a ‘categorical’ variable which takes the value of "In"
if a building is in a historic district and a value of
"Out" if it is not in a historic district.
To do this we will use some new functions, is.na(), and
if_else() from the dplyr package (note, this
function is slightly different from the ifelse()
function in base R). is.na() is a logical
expression, it returns TRUE if a value is missing
(NA), and FALSE otherwise.
# Check the first 10 rows of the "HistDist" column to
# see if they are blank
is.na(MN[1:10, "HistDist"])
## [1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE
In the output the TRUE values indicate rows where the
HistDist column is missing a value – these are places that
are not in a historic district. The function if_else()
takes three arguments, a logical expression, a value for
TRUE, and a value for FALSE. If the logical
expression evaluates to TRUE, the first value is used; if not, the second value is used. The logical expression is.na() tests each row in the table to see if the HistDist column is empty (NA). If the column is NA, then the entry in a new column HD is "Out"; otherwise it’s "In". The rows where MN$HD == "In" represent buildings that are in a historic district.
if_else(is.na(MN[1:10, "HistDist"]), "Out", "In")
## [1] "Out" "Out" "Out" "Out" "Out" "In" "Out" "In" "Out" "Out"
We don’t just want to print out on the screen whether a building is
“In” or “Out” of a historic district. We want to create a new variable
in the MN data.frame that codes this information. We can do that with
the dplyr function mutate.
Q4 (15 pts): Please add code to create a new variable HD using the if_else() and is.na() functions.
MN$HD <- if_else(is.na(MN$HistDist), "Out", "In")
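Equivalently, since the paragraph above introduced mutate, the same recode can be written in dplyr style (same result, just the tidyverse idiom):
# Same recode as above, using mutate() instead of base assignment
MN <- MN %>%
  mutate(HD = if_else(is.na(HistDist), "Out", "In"))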
Since MN$HD measures if a building is in or out of a
historic district, and we have used ‘character’ strings to represent
"Out" and "In", R treats this
variable as a character vector. What we actually
want is a factor – which is a column that contains
categorical data. Each category in the factor is called a
level, so that, in our case, we want a factor
with two levels:
Q5 (10 pts): Please use the mutate function to change HD to a factor.
MN <- mutate(.data = MN,
             HD = factor(HD))
summary(MN$HD)  # check the two levels and their counts instead of printing every value
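If factors are new to you, here is a tiny standalone illustration (toy data, not from the lab):
# Toy example: a character vector becomes a factor with two levels
x <- c("In", "Out", "Out", "In")
f <- factor(x)
levels(f)  # "In" "Out" -- levels are sorted alphabetically by default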
Now we can draw a very crude map of historic districts. Uncomment the last line and run the code chunk to create the plot.
R tip: you can uncomment or comment many lines of code by highlighting the code and pressing CTRL + SHIFT + C (may be command instead of ctrl for macs)
# "col" changes the color of dots depending upon the value in the "HD" column
# "pch" sets the symbol to a solid dot
# "cex" makes the dot .5 the normal size
# Note that setting asp=1 will set the aspect ratio to 1
# plot(YCoord ~ XCoord, data=MN, col=HD, pch=16, cex=.5, asp=1)
You might notice that this map is… not beautiful, and the arguments
to the function are kind of cryptic. This is an issue that comes up
quite frequently with base R plots. Luckily, the same
people that made tidyverse and dplyr also came up with a visualization
package called ggplot2, which we’ve already seen earlier in
this tutorial and in lectures. ggplot2 is very powerful, looks a lot
better in the default settings, and looks a LOT better once you get good
at fiddling with the settings. It is a bit more complicated than just
typing plot() but once you get over the learning curve you
won’t regret it.
Extensive documentation on how to use ggplot2 is here: https://ggplot2.tidyverse.org/index.html
There is also a ggplot2 cheat sheet that is really good! At the top of your rstudio window go to the help menu, then go down to Cheat Sheets.
# uncomment the lines below and run the code chunk to create the plot
# ggplot(data = MN, aes(x = XCoord, y = YCoord, color = HD)) +
# geom_point() +
# coord_fixed()
This is all well and good, but we are in a geography class, and a lot
of you might be interested in learning how to use R as a GIS. So we’re
going to quickly dip our toes into spatial data structures with the
sf library.
First, we convert these points into a spatial object with a projection using the st_as_sf function. The key pieces of information that st_as_sf needs are the x and y coordinates, given with the coords argument, and the projection, given with the crs argument. I’m not sure what the coordinate system is, to be honest; I guessed that it was Albers Equal Area (EPSG:5070; for lat/long the EPSG is 4326) and was absolutely wrong, but we’ll just roll with it.
MN_spatial <- MN %>%
st_as_sf(coords = c("XCoord", "YCoord"), crs=5070)
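(For what it’s worth, the MapPLUTO documentation lists the projection as New York State Plane Long Island, NAD83, in feet – EPSG:2263 – so a sketch like the one below should give correctly located geometry. Treat the EPSG code as an assumption to verify against the metadata that ships with your copy of the data.)
# Sketch, assuming the metadata confirms EPSG:2263 (NY State Plane Long Island, ft)
MN_2263 <- MN %>%
  st_as_sf(coords = c("XCoord", "YCoord"), crs = 2263)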
Now, we map using ggplot, and geom_sf. Note that because I did not input the correct coordinate reference system, these are not the correct lat/longs for New York City.
# uncomment the lines below and run the code chunk to create the plot
# ggplot(MN_spatial) +
# geom_sf() +
# ggtitle("How neat is that???")
There are endless ways to modify your data visualization with ggplot. Here’s some further modification, just to show a few options. At each step, I assigned the ggplot to an object. We can call the object to look at the plot, or we can add more components to the plot by using the + sign followed by another ggplot function. You can combine as many ggplot functions together as you want when you’re designing a plot.
# uncomment the lines below and run the code chunk to create the plot
# p <- ggplot(data = MN, aes(x = XCoord, y = YCoord, color = HD)) +
# geom_point()
# p
facet_wrap splits it into separate panels for each factor level in the HD column. This is handy when there are many overlapping points, for example.
# uncomment the lines below and run the code chunk to create the plot
# p1 <- p + facet_wrap(~HD)
# p1
coord_fixed keeps the coordinates fixed, so that a unit of x is the same size on the screen as a unit of y.
# uncomment the lines below and run the code chunk to create the plot
# p2 <- p1 + coord_fixed()
# p2
theme_bw and other theme_xyz functions are premade themes (there are several).
# uncomment the lines below and run the code chunk to create the plot
# p3<- p2 + theme_bw()
# p3
The values argument sets the colors, and name sets the name for the color section of the legend.
# uncomment the lines below and run the code chunk to create the plot
# p4 <- p3 +
# scale_color_manual(values = c("burlywood", "peachpuff"),
# name="Historical Districts")
# p4
xlab, ylab, and ggtitle allow you to name the axes and title.
# uncomment the lines below and run the code chunk to create the plot
# p5<- p4 +
# xlab("I love geography")+
# ggtitle(label ="Geog 4023/5023 is my JAM!",
# subtitle = "A+++ subtitle right here") +
# ylab("space and place")
# p5
The theme function allows us to modify almost anything manually. legend.position is particularly useful here.
# uncomment the lines below and run the code chunk to create the plot
# p6 <- p5 +
# theme(legend.position = c(0.5,.5), # this is one way to move the legend around
# legend.justification = c(0.5,.5), # values go from 0 to 1
# legend.background = element_rect(colour = "plum", fill = "thistle"),
# panel.background = element_rect(fill = "rosybrown2"))
#
# p6
Q6 (15 pts): Make your own sweet map (5 pts). Go to the ggplot2 cheatsheet or online documentation to find one additional component to add to the map (e.g. a different pre-made theme) that is not included in the previous demonstration (5 pts). Make sure everything is properly labelled so that the reader can easily interpret what is going on in the map (5 pts).
Now that we’ve created the crude map, and demonstrated some of the
functionality of ggplot2, use your readings and the internet to produce
a slightly nicer map using ggplot2, with a legend, title, axis
labels, and better colors (the colors() function can come
in handy for manual color scaling):
# Map color to HD so the map gets a legend showing district membership
pp <- ggplot(data = MN, aes(y = YCoord, x = XCoord, color = HD)) +
  geom_point(size = 0.5) +
  xlim(c(970000, 1010000)) +
  ylim(c(190000, 260000)) +
  scale_color_manual(values = c("firebrick", "grey70"),  # In, Out (alphabetical)
                     name = "Historic District") +
  xlab("X coordinate (ft)") +
  ylab("Y coordinate (ft)") +
  ggtitle("Historic Buildings in Manhattan, New York City")
pp <- pp + theme_minimal()
pp
Our goal here is to explore the effect of historic districts
(HD) on property values (AssessTot) in New
York City.
Q7 (15 pts): Go through the typical workflow (e.g., the steps listed in the lecture slides) for a hypothesis test - write your null and alternative hypotheses (5 pts), determine the correct test via visualization or a Shapiro test (5 pts; the Shapiro test checks normality of the data and can be done with shapiro.test()), and conduct the test (5 pts). For this lab, you can just assume \(\alpha = 0.05\). (Hint: use a t-test if the data are Gaussian/Normal distributed (R function t.test()) and a Wilcoxon test if the data are not Gaussian (R function wilcox.test()).)
\(H_0\): Assessed property values (AssessTot) do not differ between buildings inside and outside of historic districts.
\(H_A\): Assessed property values differ between buildings inside and outside of historic districts.
# Fit a simple model so we can inspect the residuals for normality
m1 <- lm(AssessTot ~ HD, data = MN)
summary(m1)
par(mfrow = c(1, 2))
hist(m1$residuals)
qqnorm(m1$residuals)
qqline(m1$residuals)
length(m1$residuals)
# shapiro.test() only accepts 3 to 5000 observations, so the full data are too
# big; the histogram and Q-Q plot already make it clear the residuals are not
# Gaussian. (You could also test a random subsample, e.g.
# shapiro.test(sample(m1$residuals, 5000)).)
# Since the data are not Gaussian, use the two-sample Wilcoxon (Mann-Whitney)
# test of AssessTot between the HD groups
wilcox.test(AssessTot ~ HD, data = MN)
## Based on the result of the Wilcoxon test (p-value < 2.2e-16), we reject the
## null hypothesis in favor of the alternative: the p-value is far below 0.05,
## so property values do depend on whether a building is in a historic district.
Q8 (10 pts): Does historic district designation affect property value? If yes, how?
Yes, historic district designation does affect property value. The test tells us that the two groups differ significantly, and comparing the groups (e.g., their medians) shows that buildings within designated historic districts generally have higher assessed values.
We could also do tests of whether the assessed value differs by a variable that is not dichotomous (like all of the ones done here) but rather categorical (nominal) with more than two levels. For example, what if all Historic Districts are not created equal? Maybe there is a lot of variation between historic districts that we’re not really paying attention to. Consider the following table:
hd_summary <- MN %>%
filter(AssessTot>0)%>%
group_by(HistDist) %>%
summarise(n=n()) %>%
arrange(desc(n))
hd_summary %>%
slice(1:10)%>%
kable(caption = paste("There are", nrow(hd_summary)-1, "Historic Districts")) %>%
kable_styling()
| HistDist | n |
|---|---|
| NA | 32295 |
| Greenwich Village | 1876 |
| Upper West Side/Central Park West | 1860 |
| Upper East Side | 969 |
| SoHo-Cast Iron | 445 |
| Ladies’ Mile | 351 |
| East Village/ Lower East Side | 303 |
| Expanded Carnegie Hill | 268 |
| Riverside-West End | 260 |
| Mount Morris Park | 256 |
Not only are there a lot of historic districts, there is considerable variation in group size, ranging from 1 building up to 32,295 (the NA, non-district group). There are 4 historic districts with more than 400 buildings, so maybe we’ll focus on those.
Q9 (10 pts): Create a new data frame with only the
historical districts that have over 400 buildings in them (4 pts).
Hint, create a vector from the hd_summary object first, then
use that to filter from MN using the %in% operator. Remove the rows that
have NAs as well (3 pts)! Remember that NA in this column means not in a
historical district. Also remove rows where AssessTot == 0
because that’s obviously silly (3 pts).
big_dist <- hd_summary %>%
filter(n > 400, !is.na(HistDist)) %>%
pull(HistDist)
hist_400 <- MN %>%
filter(HistDist %in% big_dist,
!is.na(HistDist),
AssessTot > 0)
head(hist_400)
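As an optional sanity check, we can count how many buildings remain in each district:
# Optional check: buildings per remaining district
count(hist_400, HistDist, sort = TRUE)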
Q10 (5 pts): Create a boxplot to visualize the data
ggplot(data = hist_400, aes(x = HistDist, y = AssessTot)) +
geom_boxplot()
Wow that’s some non-normal data, is it not? We can transform it just to get a better look at the variation.
Q11 (5 pts): Make another plot with the
AssessTot logged. Hint: you can use
scale_y_log10().
ggplot(data = hist_400, aes(x = HistDist, y = AssessTot)) +
geom_boxplot() +
scale_y_log10()
It looks like maybe you could log-transform and achieve normality?
Q12 (10 pts): Wrangle the data down so that you have
only one of our HistDists of interest (5 pts), then run a
Shapiro-Wilk Normality Test on the log of the AssessTot
column (5 pts), just to check if transformation is an option. If the log comes back as NaN, you need to omit the zeros.
# big_dist[1] is the largest district (Greenwich Village, per the table above)
single_dist <- hist_400 %>%
  filter(HistDist == big_dist[1])
shapiro.test(log(single_dist$AssessTot))
Q12a (5 pts): Is it normal?
No, this is not normally distributed, as indicated by the p-value: it is much, much lower than 0.05.
So now, we know that data are not normal, and we know that the means do look different after transforming the variables for visualization purposes. Are they different?
Q13 (10 pts): Conduct a Kruskal-Wallis test to see
if the property values (AssessTot) are different from one
historic district to the next for our four districts of interest (5
pts). Use the agricolae package so that you can
automatically do the contrasts. Don’t forget to adjust the p-value for
multiple comparisons (5 pts).
krusk_tst <- kruskal(hist_400$AssessTot,
hist_400$HistDist,
p.adj = "bonferroni",
group = TRUE)
krusk_tst
Q13a (5 pts): How do the property values vary from district to district? Let’s say we want to be hip and live in a historic building, but we don’t want to spend a ton of money. Which neighborhood should we try to live in?
We should try to live in Greenwich Village: it has the lowest mean rank of assessed value (2029.758) among the four districts, meaning the lowest typical property values.
But, you know, there might be some other stuff going on. What if the buildings in some historic districts are bigger on average, and so the property values are higher because they’re just bigger, instead of being due to the prestige (or lack thereof) of being in a particular historic district?
Q14 (10 pts): Please create a variable that adjusts the building’s value (AssessTot) by the building’s area
(BldgArea) using the mutate function (4 pts).
Remember, this is real data, so it’s messy. Make sure to check for 0’s
or NA’s in the BldgArea column and filter out those rows as
well (3 pts). Then explore the data with boxplot (3 pts).
hist_400 <- hist_400 %>%
filter(BldgArea > 0, !is.na(BldgArea)) %>%
mutate(AssessPerArea = AssessTot / BldgArea)
ggplot(data = hist_400, aes(x= HistDist, y = AssessPerArea)) +
geom_boxplot() +
scale_y_log10()
Q15 (10 pts): Which test are you going to use, and why (5 pts)? Please also write the code to conduct the test (5 pts).
I am planning to use the Kruskal-Wallis test because the data are not normally distributed.
# Kruskal-Wallis on the area-adjusted values, following the same pattern as Q13
krusk_area_tst <- kruskal(hist_400$AssessPerArea,
                          hist_400$HistDist,
                          p.adj = "bonferroni",
                          group = TRUE)
krusk_area_tst
Q15a (5 pts): How does this change things?
This result changes things slightly: Greenwich Village’s assessed value per unit of area is the second smallest rather than the smallest, but I believe the cheaper overall cost offsets this disadvantage, so I would still choose to live there.
Sometimes, it makes sense to create a categorical variable out of a continuous variable in order to test how those categories affect your variable of interest. For example, maybe you have a hunch that the years before 1900 were a golden age of high-quality building construction, and even though those buildings are old now, they just don’t do construction like they used to. But then, since the 1970s, the effect of “back in the good-old-days” construction practices on price would be offset by… new-ness.
Q16 (10 pts): Create a new variable for whether a
building was built before 1900, between 1900-1970, or after 1970. Think
about what kind of variable this should be in R (i.e., integer,
continuous, factor, logical, character). Hint:
ifelse() is good, and case_when() is great for
this.
# case_when() reads more cleanly than nested ifelse() calls for 3+ categories
hist_400 <- hist_400 %>%
  mutate(building_era = case_when(
           YearBuilt < 1900 ~ "Before 1900",
           YearBuilt <= 1970 ~ "1900-1970",
           TRUE ~ "After 1970"
         ),
         building_era = factor(building_era))
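A quick check that the recode did what we expect:
# Sanity check: how many buildings fall into each era?
table(hist_400$building_era)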
Now, let’s go through that workflow one more time. Determine the proper test, make sure to state your hypotheses clearly, conduct the test, do a post-hoc test, account for multiple comparisons, and analyze the contrasts. Show your code and output, and explain your interpretation of the results. We can default to \(\alpha\) = 0.05.
Q17 (3 pts): Visualize data with boxplot (or use shapiro tests) to determine the appropriate test
ggplot(data = hist_400, aes(x= building_era, y = AssessTot))+
geom_boxplot()
Q17a (3 pts): What is an appropriate test?
I believe a Kruskal-Wallis test is the appropriate test because the data are clearly skewed.
Q17b (3 pts): State the hypotheses:
\(H_0\): The median property values are the same across all three building eras.
\(H_A\): The median property values differ between the three building eras.
Q17c (3 pts): Please write the code to conduct the test.
# Renamed to reflect what it tests: assessed value across building eras
krusk_era_result <- kruskal(hist_400$AssessTot,
                            hist_400$building_era,
                            p.adj = "bonferroni",
                            group = TRUE)
krusk_era_result
Q17d (3 pts): Please Interpret your results of the above test.
The results show that there is in fact a difference in assessed value between the three eras, so we reject the null hypothesis in favor of the alternative.
# code if necessary
That’s it, folks. We used a lot of non-parametric tests due to the nature of the data we were working with, but please remember that if the data actually are normal, you really do want to use an ANOVA or other such test, since parametric tests are more statistically powerful.
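For reference, here is roughly what the parametric version of the Q13 comparison would have looked like had the data been normal – a sketch only, since these data clearly are not:
# Parametric analogue (only appropriate for Gaussian data): one-way ANOVA,
# then Tukey's HSD for the pairwise contrasts. droplevels() removes the
# unused factor levels left over from filtering down to four districts.
m_aov <- aov(AssessTot ~ droplevels(HistDist), data = hist_400)
summary(m_aov)
TukeyHSD(m_aov)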
Thanks to Seth Spielman, Carson Farmer, Colleen Reid, and other professors and graduate students in the CU Boulder Geography department who contributed to earlier iterations of this lab.