require(tidyverse)
require(sf)
require(ggspatial)
require(maptiles)
require(tidyterra)
require(osmdata)
Advanced Geoprocessing in R: Long-Term Legacies of Redlining in Chicago
Introduction: Redlining
As we have discussed in several of our classes, the current structure of cities reflects very long-term forms of exclusion and inequality. In the United States, one of the most important of these is the practice of redlining, which substantially restricted access to capital for communities of color while funneling subsidized loans into predominantly white neighborhoods, which, through a variety of mechanisms, were made inaccessible to non-whites.
Several researchers have argued that these practices were an essential component of institutionalized racism that has continued to reproduce structural inequalities in wealth and urban segregation long after the redlining maps were drawn. In this tutorial, we will look at the relationship between areas that were redlined in Chicago and present inequalities in the city. To be clear, correlation does not mean causation, but to find substantial present differences across neighborhoods that were redlined approximately 80 years ago would be quite compelling evidence that redlining has had a long-term legacy.
Following a heroic data collection and digitization effort by the members of the Mapping Inequality Project at the University of Richmond, shapefiles showing the federal government’s Home Owners’ Loan Corporation (HOLC) lending maps are now available for most large and medium-sized cities in the United States. This is an unprecedented resource that allows us to investigate the present-day legacies of redlining.
Before beginning the tutorial, I would like you to refresh your memory about redlining and, in particular, the HOLC program. There is a very clear introduction available at the Mapping Inequality Project site.
Task 1. After thoroughly investigating the website, answer these questions. What was the HOLC program? Why have many researchers in the past several decades suggested that it has been a major force in perpetuating urban segregation in US cities?
Objectives
- Get further practice with projections, intersections, and mapping
- Learn how to construct basic data visualizations with ggplot()
- Learn how to access features from OpenStreetMap with the osmdata package
- Learn how to estimate intensive variables using areal interpolation
- Practice interpreting the result of simple geospatial analyses
The Data
- Redlines.gpkg shows areas graded under the HOLC program. The data were digitized by the Mapping Inequality Project.
- chicago_boundary.kml is the city boundary of Chicago, sourced from the Chicago Data Portal
- AirBnBPricesPerBedroom.csv is the price per bedroom of AirBnB listings in Chicago in 2020, sourced from insideairbnb.com
- GroceryStores.csv contains location information for all grocery stores in Chicago as of 2020, sourced from the Chicago Data Portal
- cps_demographics_2024.csv contains demographic information on students at all primary and secondary schools in Chicago as of 2024, sourced from the Chicago Data Portal
- BloodPBLevelsCommAreas.csv is a table with survey-based estimates of the percentage of children with elevated blood lead levels, as of 1999 and 2013, in each of Chicago’s official community areas (basically, neighborhoods), sourced from the Chicago Data Portal
- chicago_community_areas.gpkg contains the boundaries of Chicago’s official community areas, sourced from the Chicago Data Portal
- We will also be downloading data on the locations of various urban amenities from OpenStreetMap using the osmdata package
The data you will need for this tutorial can be found in this zip file.
Packages
- tidyverse for data wrangling
- sf for vector data geoprocessing
- ggspatial for mapping
- maptiles for map tiles
- tidyterra for mapping those tiles
- osmdata to access data from OpenStreetMap
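If any of these packages are missing from your R installation, a one-time install from CRAN (all six are on CRAN) should look something like this:
#One-time installation of the packages used in this tutorial:
install.packages(c("tidyverse", "sf", "ggspatial",
                   "maptiles", "tidyterra", "osmdata"))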
HOLC Data
Let’s start with the boundaries of the HOLC-graded areas, along with the city limits of Chicago:
setwd("G:/My Drive/My Classes/World Cities/Labs/Redlining")
holc <- st_read("redlines.gpkg")Reading layer `redlines' from data source
`G:\My Drive\My Classes\World Cities\Labs\Redlining\redlines.gpkg'
using driver `GPKG'
Simple feature collection with 5 features and 1 field
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -87.94011 ymin: 41.60414 xmax: -87.52414 ymax: 42.28303
Geodetic CRS: GCS_unknown
chicago <- st_read("chicago_boundary.kml")Reading layer `Layer #0' from data source
`G:\My Drive\My Classes\World Cities\Labs\Redlining\chicago_boundary.kml'
using driver `KML'
Simple feature collection with 1 feature and 2 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -87.94011 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
Geodetic CRS: WGS 84
plot(holc)
plot(st_geometry(chicago))
In order to assess the evidence for long-term impacts of HOLC’s maps, we need to use some tricks from the previous tutorial to estimate values within HOLC-graded areas using data for other types of areal units.
First, however, because some of the datasets we’ll be using cover different extents, we should use st_intersection() to make sure everything is clipped to the Chicago city limits:
chicago_holc <- chicago %>%
st_transform(crs=st_crs(holc))
holc <- holc %>%
  st_intersection(chicago_holc)
Warning: attribute variables are assumed to be spatially constant throughout
all geometries
plot(st_geometry(holc))
Summarizing data collected at point locations with sf
We’re going to start with data on the price per bedroom of full AirBnB rentals across the city. The openly available website insideairbnb.com tracks AirBnB listings in dozens of cities around the world, providing data to support policy debates about the platform’s effects on cities. Here, I’m providing you with a prepared dataset that has been pared down to focus only on full-unit or full-house rentals, with the rental price standardized by dividing it by the number of bedrooms.
Thinking back to our discussion of rent gap theory, these AirBnB prices can be thought of as a very literal indicator of the potential rent from properties in different parts of the city. That means they reflect both land values and public interest in short-term stays in different parts of town.
We’ll start by reading in the data, which is in a CSV spreadsheet:
airbnb <- read.csv("AirBnBPricesPerBedroom.csv", stringsAsFactors = FALSE)
head(airbnb)
  latitude longitude price
1 41.85495 -87.69696 35.0
2 41.90289 -87.68182 60.0
3 41.91769 -87.63788 65.0
4 41.91183 -87.64000 57.5
5 41.89617 -87.66041 49.5
6 41.92679 -87.65521 112.0
Then we’ll use a pipeline to convert the spreadsheet into an sf object, reproject it to the same projection as the holc layer, and then conduct a join by location operation to add data from the holc layer to the airbnb layer. Join by location is similar to intersection, except that instead of creating a new layer with the combined geometries of the intersected layers, it just adds the data from one layer to the other.
We can conduct a join by location using the st_join() function:
airbnb <- airbnb %>%
st_as_sf(coords=c("longitude", "latitude"),
crs=st_crs(4326)) %>%
#Remember from before that the above lines convert the data frame into an sf
#object. The coords argument allows you to specify which columns hold the
#x and y coordinates. Remember the EPSG number 4326 is just WGS 1984.
st_transform(crs=st_crs(holc)) %>%
st_join(holc)
#The st_join() function performs an operation called join by location.
#all it does is take variables from features in one layer and adds
#them to features in another layer that intersects them. in this case,
#we're adding information on the holc grade as of 1940 to every one
#of our airbnb points
head(airbnb)
Simple feature collection with 6 features and 4 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -87.69696 ymin: 41.85495 xmax: -87.63788 ymax: 41.92679
Geodetic CRS: GCS_unknown
price holc_grade Name Description geometry
1 35.0 C CHICAGO POINT (-87.69696 41.85495)
2 60.0 D CHICAGO POINT (-87.68182 41.90289)
3 65.0 D CHICAGO POINT (-87.63788 41.91769)
4 57.5 D CHICAGO POINT (-87.64 41.91183)
5 49.5 D CHICAGO POINT (-87.66041 41.89617)
6 112.0 D CHICAGO POINT (-87.65521 41.92679)
If a feature in the object we are joining data to using st_join() does not intersect any feature in the other object, R will assign that feature a value of NA for all the variables from the other object. That means we can use missing values for holc_grade, for example, to filter out all features that are outside the Chicago city boundaries:
airbnb <- airbnb %>%
  filter(!is.na(holc_grade))
Now let’s have a quick look at AirBnB prices across Chicago:
osm_basemap <- get_tiles(chicago, provider = "OpenStreetMap")
ggplot(data = chicago)+
geom_spatraster_rgb(data=osm_basemap) +
#st_cast() allows you to convert sf objects to different formats; here,
#we're turning the Chicago boundary into a line, as opposed to a polygon,
#so that the resulting map will have just the outline of the city, without
#a colored fill:
geom_sf(data = chicago %>% st_cast("MULTILINESTRING"), color = "black", linewidth=1.5)+
geom_sf(data=airbnb, aes(color = price), size=0.35)+
scale_color_viridis_c() +
annotation_scale(location = "bl") +
theme_minimal()+
ggtitle("AirBnb Prices in Chicago in 2020")+
labs(
color = "Price per Bedroom",
x = "",
y = ""
  )
There seems to be some important geographic variation here, but it’s hard to be sure from just a visual inspection. To look at this more closely, we’ll use ggplot to make a boxplot comparing the price distributions across the different HOLC grades.
If you aren’t familiar with boxplots, please look at the explanation at this link
ggplot(data=airbnb, aes(x=holc_grade, y=price))+
#Just like we've gotten used to specifying SpatRaster and sf geometries,
#we can specify the plot types we want using different geom_ functions:
geom_boxplot(color="gray30", fill="black", size=0.75)+
#Because we've been working with maps up to now, we haven't had to worry
#much about the x- and y-axes, but now we need to tell ggplot2 the type of
#data we are using to define these coordinates. The format is similar to
#what we've done with fill and color, just applied to locations on the plot:
scale_x_discrete("HOLC Grade in 1940")+
#You can use the \n escape code to insert a line-break in text in R:
scale_y_continuous("Air BnB Price Per\nBedroom, 2020")+
theme_bw()+
theme(axis.title.x=element_text(size=18, colour="black", face="bold"),
axis.text.x=element_text(size=16, colour="black", face="bold"),
axis.ticks.x=element_blank(),
axis.title.y=element_text(size=18, colour="black", face="bold"),
axis.text.y=element_text(size=16, colour="black", face="bold"),
axis.ticks.y=element_blank(),
plot.title = element_text(size=28, colour="black", face="bold"),
strip.text=element_text(size=16, colour="black", face="bold"),
legend.title=element_text(size=18, colour="black", face="bold"),
legend.text=element_text(size=16, colour="black", face="bold"),
legend.position = "bottom")Warning: Removed 382 rows containing non-finite outside the scale range
(`stat_boxplot()`).
#This last set of lines set general features of the plot arrangement.
#They're just a set of configurations I often use for publications, which
#you can play around with.Task 2: Change the fill and color for the boxplot, as well as some features in the theme() argument and redo the plot. What patterns do you see in the results? How might you explain them (it might help to look at the map and consider wehre the Ungraded areas are)?
Task 3: Using the techniques you have learned above, read in the cps_demographics_2024.csv file and convert it to an sf points layer (remember that the projection will be WGS 1984, since the coordinates are latitude and longitude). Have a look at the data and use it to create a boxplot showing the distribution of the proportion of each school’s population that is low income by HOLC grade.
Counting points in areas in sf
Now we’re going to do something similar, but this time, instead of summarizing a numerical variable, we’ll be computing the density of points in the different HOLC-graded areas.
Because we’ll be doing calculations of distances and areas, we’ll need to project our layers into a local projection. In the United States, the most common projections used for city- and state-scale maps are probably those from the state plane system. The state plane system, like the UTM system we’ve used before, is a set of projections optimized for specific locations, in this case in the United States. Most states in the US have two or three state plane projections, optimized for different parts of the state depending on its size and shape.
Task 4: For Chicago, we should be using State Plane Illinois East, and to keep things consistent, let’s have the units in meters. Find the EPSG number for this projection at spatialreference.org. Then use the st_crs() function to create a CRS object called il_east that we can use to refer to this projection in the rest of the code. If you did things correctly, running il_east on its own line should produce the following output:
Coordinate Reference System:
User input: EPSG:2790
wkt:
PROJCRS["NAD83(HARN) / Illinois East",
BASEGEOGCRS["NAD83(HARN)",
DATUM["NAD83 (High Accuracy Reference Network)",
ELLIPSOID["GRS 1980",6378137,298.257222101,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4152]],
CONVERSION["SPCS83 Illinois East zone (meters)",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",36.6666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",-88.3333333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",0.999975,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",300000,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",0,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["easting (X)",east,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["northing (Y)",north,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Engineering survey, topographic mapping."],
AREA["United States (USA) - Illinois - counties of Boone; Champaign; Clark; Clay; Coles; Cook; Crawford; Cumberland; De Kalb; De Witt; Douglas; Du Page; Edgar; Edwards; Effingham; Fayette; Ford; Franklin; Gallatin; Grundy; Hamilton; Hardin; Iroquois; Jasper; Jefferson; Johnson; Kane; Kankakee; Kendall; La Salle; Lake; Lawrence; Livingston; Macon; Marion; Massac; McHenry; McLean; Moultrie; Piatt; Pope; Richland; Saline; Shelby; Vermilion; Wabash; Wayne; White; Will; Williamson."],
BBOX[37.06,-89.28,42.5,-87.02]],
ID["EPSG",2790]]
Now we can read in the points layer that we’ll be summarizing. It includes the locations of grocery stores across the city as of 2020:
grocery <- read.csv("GroceryStores.csv", stringsAsFactors = FALSE)
grocery <- grocery %>%
st_as_sf(coords=c("Longitude", "Latitude"),
crs=st_crs(4326)) %>%
st_transform(crs=il_east)
plot(st_geometry(grocery))
Note that the only real difference here from what we did with the airbnb data is that we are reprojecting into the State Plane Illinois East projection.
Now we’ll reproject our HOLC areas to match:
holc_il <- holc %>%
  st_transform(crs = il_east)
Because we are going to look at the density of grocery stores in each area, we also need to calculate the area of each HOLC grade. Remember that st_area() uses map units, so everything is in square meters. Below, we’re dividing by 1,000,000 to convert from square meters to square kilometers:
holc_il <- holc_il %>%
mutate(area = st_area(holc_il)/1000000)
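As an aside, if you’d rather avoid the magic number, the units package (which sf uses under the hood for its area and distance measures) can do the conversion explicitly. A sketch of an equivalent line, assuming units is installed (it comes along with sf):
#Equivalent conversion using the units package: set_units() converts
#the square meters returned by st_area() into square kilometers:
holc_il <- holc_il %>%
  mutate(area = units::set_units(st_area(holc_il), km^2))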
plot(holc_il[,c("geom", "area")])Now we join the data on the HOLC area to the grocery layer:
grocery <- grocery %>%
  st_join(holc_il)
Then we can use group_by() to group by HOLC grade and then summarise() to count the points and pull the area for each grade:
grocery_holc <- grocery %>%
filter(!is.na(holc_grade)) %>%
group_by(holc_grade) %>%
summarise(stores=n(),
area=min(area)) %>%
  #The n() function simply counts the number (n) of observations in each
  #group when summarizing.
  #Because each HOLC grade in holc_il is only one multipolygon object, area is
  #the entire area under each HOLC grade, so we don't want to add the areas
  #up, and instead just take the minimum value. The areas within each grade
  #will all be equal, so we could just as easily use max, median, or mean
  #to summarize them.
mutate(storesperkm = as.numeric(stores/area))
#the as.numeric() part is to turn the result into a plain number; otherwise,
#R will keep the units from the area calculation attached to the value.
Now we can plot our results:
ggplot(grocery_holc,
aes(x=holc_grade, y=storesperkm))+
geom_bar(stat="identity", fill="black", color="gray", linewidth=1)+
#This is the only major difference from the boxplot - here we are
#using a bar graph format. stat="identity" means that we already have
#defined y-values. Without that, geom_bar() will just count up the number
#of observations of each value of a categorical variable supplied as the
#x-axis variable.
scale_x_discrete("HOLC Grade in 1940")+
scale_y_continuous("Grocery Stores per Square KM, 2020")+
theme_minimal()+
theme(axis.title.x=element_text(size=18, colour="black", face="bold"),
axis.text.x=element_text(size=16, colour="black", face="bold"),
axis.ticks.x=element_blank(),
axis.title.y=element_text(size=18, colour="black", face="bold"),
axis.text.y=element_text(size=16, colour="black", face="bold"),
axis.ticks.y=element_blank(),
plot.title = element_text(size=28, colour="black", face="bold"),
strip.text=element_text(size=16, colour="black", face="bold"),
legend.title=element_text(size=18, colour="black", face="bold"),
legend.text=element_text(size=16, colour="black", face="bold"),
legend.position = "bottom")Task 5: Use the above technique to conduct your own analysis of the point locations of AirBnBs (regardless of price), producing a polished bar graph. What patterns do you see? What might explain these patterns?
Adding data from OpenStreetMap with osmdata
Even though the Chicago Data Portal has a massive catalog of data about the city (no, really, you should check it out!), there are still many more urban amenities that we might be interested in investigating.
The cool thing is that we can get up-to-date location data for a huge range of feature types for free from OpenStreetMap using the osmdata package. This package is super cool because it gives you access to what is probably the largest and most accurate global GIS database, allowing you to extract detailed features (sometimes individual buildings) in cities around the world.
There’s a bit of a workflow that you have to use to extract OpenStreetMap data, but it’s quite manageable:
- Define a bounding box for the area you want to get data from (a bounding box is just a rectangle defining the boundaries of an area)
- Open a connection to the Overpass API, which allows you to extract objects from OpenStreetMap
- Define what kinds of objects you want using OpenStreetMap’s key/value system (see a table with a complete list of keys and values here)
- Download the objects, which will be automatically converted into sf
Okay, so first we’ll get the bounding box. We can either build one ourselves, extract one from an existing sf object, or use a placename to get one from OpenStreetMap.
Let’s demo the last two of those options, since we’re unlikely to need to use the first option very much:
#Option 1: Get a bounding box from an sf object using the st_bbox() function:
bbox_1 <- holc %>%
#If you have a layer that's not in WGS 1984, you need to reproject it, b/c
#that's what OSM uses. We're already in WGS 1984, so we don't need to worry
#about this next line:
#st_transform("epsg:4326") %>%
st_bbox()
#Option 2: Use osmdata's getbb() function with a placename
bbox_2 <- getbb("Chicago, IL, USA")
#Let's quickly compare them:
bbox_1
     xmin      ymin      xmax      ymax
-87.94011  41.64454 -87.52414  42.02304
bbox_2
        min       max
x -87.94009 -87.52408
y  41.64453  42.02304
In this case, the methods give practically identical results, so we would probably be safe using either. Just for consistency, let’s stick with the one we got from our previous objects. Now we need to open a connection to the Overpass API and specify what we want to get using OSM’s key/value system.
To figure out what we want, we can use osmdata’s available_features() function to list all the keys and available_tags() function to list the values for a specific key:
available_features()
  [1] "4wd_only"                "abandoned"
[3] "abutters" "access"
[5] "addr" "addr:city"
[7] "addr:conscriptionnumber" "addr:country"
[9] "addr:county" "addr:district"
[11] "addr:flats" "addr:full"
[13] "addr:hamlet" "addr:housename"
[15] "addr:housenumber" "addr:inclusion"
[17] "addr:interpolation" "addr:place"
[19] "addr:postbox" "addr:postcode"
[21] "addr:province" "addr:state"
[23] "addr:street" "addr:subdistrict"
[25] "addr:suburb" "addr:unit"
[27] "admin_level" "aeroway"
[29] "agricultural" "alcohol"
[31] "alt_name" "amenity"
[33] "area" "atv"
[35] "backward" "barrier"
[37] "basin" "bdouble"
[39] "bicycle" "bicycle_road"
[41] "biergarten" "boat"
[43] "border_type" "boundary"
[45] "brand" "bridge"
[47] "building" "building:architecture"
[49] "building:colour" "building:fireproof"
[51] "building:flats" "building:material"
[53] "building:min_level" "building:part"
[55] "building:soft_storey" "bus_bay"
[57] "busway" "capacity"
[59] "castle_type" "change"
[61] "charge" "clothes"
[63] "construction" "construction#Railways"
[65] "construction_date" "covered"
[67] "craft" "crossing"
[69] "crossing:island" "cuisine"
[71] "cutting" "cycleway"
[73] "cycleway:left" "cycleway:left:oneway"
[75] "cycleway:right" "cycleway:right:oneway"
[77] "denomination" "destination"
[79] "diet:*" "direction"
[81] "dispensing" "disused"
[83] "dog" "drinking_water"
[85] "drinking_water:legal" "drive_in"
[87] "drive_through" "ele"
[89] "electric_bicycle" "electrified"
[91] "embankment" "embedded_rails"
[93] "emergency" "end_date"
[95] "entrance" "est_width"
[97] "fee" "female"
[99] "fire_object:type" "fire_operator"
[101] "fire_rank" "food"
[103] "foot" "footway"
[105] "ford" "forestry"
[107] "forward" "frequency"
[109] "fuel" "gauge"
[111] "golf_cart" "goods"
[113] "hazard" "hazmat"
[115] "healthcare" "healthcare:counselling"
[117] "healthcare:speciality" "height"
[119] "hgv" "highway"
[121] "historic" "horse"
[123] "hot_water" "ice_road"
[125] "incline" "industrial"
[127] "inline_skates" "inscription"
[129] "int_name" "internet_access"
[131] "junction" "kerb"
[133] "landuse" "lane_markings"
[135] "lanes" "lanes:bus"
[137] "lanes:psv" "layer"
[139] "leaf_cycle" "leaf_type"
[141] "leisure" "lhv"
[143] "lit" "loc_name"
[145] "location" "male"
[147] "man_made" "max_age"
[149] "max_level" "maxaxleload"
[151] "maxheight" "maxlength"
[153] "maxspeed" "maxstay"
[155] "maxweight" "maxwidth"
[157] "military" "min_age"
[159] "min_level" "minspeed"
[161] "mofa" "moped"
[163] "motor_vehicle" "motorboat"
[165] "motorcar" "motorcycle"
[167] "motorroad" "mountain_pass"
[169] "mtb:description" "mtb:scale"
[171] "name" "name:left"
[173] "name:right" "name_1"
[175] "name_2" "narrow"
[177] "nat_name" "natural"
[179] "nickname" "noexit"
[181] "non_existent_levels" "nudism"
[183] "office" "official_name"
[185] "old_name" "oneway"
[187] "oneway:bicycle" "openfire"
[189] "opening_hours" "opening_hours:drive_through"
[191] "operator" "orientation"
[193] "oven" "overtaking"
[195] "parking" "parking:condition"
[197] "parking:lane" "passing_places"
[199] "place" "power"
[201] "power_supply" "priority"
[203] "priority_road" "produce"
[205] "proposed" "protected_area"
[207] "psv" "public_transport"
[209] "railway" "railway:preserved"
[211] "railway:track_ref" "recycling_type"
[213] "ref" "ref_name"
[215] "reg_name" "religion"
[217] "religious_level" "rental"
[219] "residential" "roadtrain"
[221] "route" "sac_scale"
[223] "sauna" "service"
[225] "service_times" "shelter_type"
[227] "shop" "short_name"
[229] "shoulder" "shower"
[231] "sidewalk" "site"
[233] "ski" "smoking"
[235] "smoothness" "social_facility"
[237] "sorting_name" "speed_pedelec"
[239] "sport" "start_date"
[241] "step_count" "substation"
[243] "surface" "tactile_paving"
[245] "tank" "tidal"
[247] "toilets" "toilets:wheelchair"
[249] "toll" "topless"
[251] "tourism" "tracks"
[253] "tracktype" "traffic_calming"
[255] "traffic_sign" "trail_visibility"
[257] "trailblazed" "trailblazed:visibility"
[259] "tunnel" "turn"
[261] "type" "unisex"
[263] "usage" "vehicle"
[265] "vending" "voltage"
[267] "water" "wheelchair"
[269] "wholesale" "width"
[271] "winter_road" "wood"
Okay, that’s a lot to wade through, but a lot of the more interesting point locations we might be looking for are likely to be classified as either “amenity” or “shop”. Let’s have a look at the available values for amenity features in Chicago:
available_tags("amenity")# A tibble: 136 × 2
Key Value
<chr> <chr>
1 amenity animal_boarding
2 amenity animal_breeding
3 amenity animal_shelter
4 amenity animal_training
5 amenity arts_centre
6 amenity atm
7 amenity baby_hatch
8 amenity baking_oven
9 amenity bank
10 amenity bar
# ℹ 126 more rows
So we have plenty of options (and you’ll probably want to look up some of these to see exactly how OSM defines the different categories, as some are a bit ambiguous).
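For example, if you wanted to browse the retail categories instead, the same function works with the "shop" key:
#available_tags() works for any feature key; this lists the values OSM
#recognizes under "shop":
available_tags("shop")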
For demonstration purposes, let’s just use clinics, as a way of seeing if we have inequalities in health accessibility across HOLC grades:
#Remember the steps: 1) set the bounding box; 2) connect to Overpass (that's
#what opq() is doing); 3) set the features you want; 4) convert to sf objects
clinics <- bbox_1 %>%
opq() %>%
add_osm_feature(key = "amenity", value = "clinic") %>%
osmdata_sf()
clinics
Object of class 'osmdata' with:
$bbox : 41.644543121783,-87.9401140825235,42.0230385861476,-87.5241371038952
$overpass_call : The call submitted to the overpass API
$meta : metadata including timestamp and version numbers
$osm_points : 'sf' Simple Features Collection with 1338 points
$osm_lines : NULL
$osm_polygons : 'sf' Simple Features Collection with 107 polygons
$osm_multilines : NULL
$osm_multipolygons : 'sf' Simple Features Collection with 1 multipolygons
The resulting object has everything we got from the Overpass API for the key/value combination we supplied in the area defined by our bounding box. The output above tells us all the stuff in the object.
Notice that we’re getting all the different types of geometries that are in the OpenStreetMap database that fit the criteria, including polygons for those clinics that actually have their building footprints in the OSM database. It’s up to us to extract the parts we want. Let’s just get the point locations and see how we did:
clinics <- clinics$osm_points
plot(st_geometry(clinics))
Now we can apply the method from above to compare clinic access across HOLC grades:
clinics %>%
st_transform(il_east) %>%
st_join(holc_il) %>%
filter(!is.na(holc_grade)) %>%
group_by(holc_grade) %>%
summarise(area = min(area),
clinics = n()) %>%
mutate(clinics_sq_km = as.numeric(clinics/area)) %>%
ggplot(aes(x = holc_grade, y = clinics_sq_km)) +
geom_bar(stat = "identity") +
scale_x_discrete("HOLC Grade in 1940") +
scale_y_continuous("Clinics Per Sq. KM") +
  theme_linedraw()
Task 6: Download a new set of amenity features from OSM and create a bar graph comparing their concentration across HOLC grades.
Areal interpolation with intensive variables
The walkshed analysis in our tutorial on Manila introduced you to one example of a broader set of techniques that we call areal interpolation, which is the fancy name for how we estimated the population within various transportation walksheds based on their spatial overlap with areas with measured population numbers.
Now we’re going to extend those techniques to look at how we can estimate these values not just for extensive variables (extensive variables are variables that are really just a count of something, like the number of people or the total GDP of an area), but also for intensive variables (intensive variables are rates or relative measures, like median income, proportion of the population that identifies as female, etc.).
Because they are already normalized for the size of units in which they are measured, we can approach intensive variables in a different way from the extensive variable (population) we were working with last time.
To use areal interpolation with intensive variables, we basically just need to compute a weighted mean of the value/s of interest, with the weights defined by the proportion of the total area in our target polygons accounted for by the polygons for which we have the intensive measures of interest.
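To make that concrete, here’s a toy example with made-up numbers. Suppose one HOLC area overlaps two community areas: 3 sq. km of overlap with a community area where the measured rate is 10%, and 1 sq. km of overlap with one where it is 2%:
#Made-up numbers: the intensive variable measured in each overlapping
#source polygon, and the area of each intersection:
rates <- c(10, 2)
overlap_km2 <- c(3, 1)

#The areal-interpolation estimate is just the area-weighted mean,
#which here works out to 8 percent:
sum(rates * overlap_km2) / sum(overlap_km2)

#equivalently:
weighted.mean(rates, w = overlap_km2)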
Here, we’re going to use this technique to estimate the percentage of children under 6 with elevated blood lead levels in each of the different HOLC grade areas. The City of Chicago conducted studies of children’s blood lead levels as part of a remediation program in 1999 and 2013 and released these data at the scale of community areas (which can roughly be understood as neighborhoods).
Just like we needed to translate transportation walksheds into districts and municipalities in the previous tutorial, this time we’ll need to translate HOLC grade areas into community areas.
First, we’ll read in the data from the blood lead (PB) monitoring:
pb <- read.csv("BloodPBLevelsCommAreas.csv", stringsAsFactors = FALSE)
head(pb)
  CommArea   CommAreaName ElevBloodPb1999Pct ElevBloodPb2013Pct
1 1 Rogers Park 10.5 0.8
2 2 West Ridge 6.0 1.1
3 3 Uptown 11.3 0.6
4 4 Lincoln Square 7.2 0.9
5 5 North Center 6.4 0.1
6 6 Lake View 4.0 0.2
These data contain, for each community area, the estimated percentage of children under 6 with elevated blood lead levels, based on the 1999 and 2013 surveys. By 2013, thankfully, the numbers were quite low, but they were very problematic in 1999.
Conducting table joins in tidyverse
You’ll notice that pb has no spatial information other than the names of the community areas. This is actually a pretty common situation when working with GIS data - it’s not at all rare to have datasets in which the data are measured for specific areal units, like cities, counties, or, in this case, neighborhoods, while nevertheless lacking information on these locations’ boundaries.
What we generally need to do in these cases is to find a separate dataset that has location information, along with some common column identifying the locations that we can use to perform a table join. Table joins match data from one table to another based on a common column that can be used to identify which rows in Table 1 correspond to which rows in Table 2.
For example:
Table 1:
| Store | Zip Code |
|---|---|
| A | 18042 |
| B | 43210 |
| C | 43260 |
| D | 64850 |
| E | 64865 |
Table 2:
| Zip Code | Median HH Income ($) |
|---|---|
| 64850 | 105000 |
| 18042 | 55000 |
| 43260 | 45000 |
| 43210 | 40000 |
| 64865 | 80000 |
Table 1 after table joining Table 2:
| Store | Zip Code | Median HH Income ($) |
|---|---|---|
| A | 18042 | 55000 |
| B | 43210 | 40000 |
| C | 43260 | 45000 |
| D | 64850 | 105000 |
| E | 64865 | 80000 |
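In R, table joins like this one are handled by dplyr’s _join() family of functions, which we’ll use on our real data below. Just to make the example concrete, here’s a sketch that reproduces the toy join above (the table and column names are made up for this illustration):
#The two example tables as data frames:
stores <- data.frame(store = c("A", "B", "C", "D", "E"),
                     zip = c(18042, 43210, 43260, 64850, 64865))
incomes <- data.frame(zip = c(64850, 18042, 43260, 43210, 64865),
                      median_hh_income = c(105000, 55000, 45000, 40000, 80000))

#left_join() matches rows on the shared zip column and copies the income
#values over, producing the joined table shown above:
stores %>%
  left_join(incomes, by = "zip")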
Fortunately for us, the Chicago Data Portal also has a file with the boundaries of the official community areas. We’ll need to read in those data and table join them to our blood lead level data:
commareas <- st_read("chicago_community_areas.gpkg") Reading layer `ChicagoCommunityAreas' from data source
`G:\My Drive\My Classes\World Cities\Labs\Redlining\chicago_community_areas.gpkg'
using driver `GPKG'
Simple feature collection with 77 features and 9 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -87.94011 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
Geodetic CRS: WGS 84
plot(st_geometry(commareas))
head(commareas)
Simple feature collection with 6 features and 9 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -87.7069 ymin: 41.79448 xmax: -87.58001 ymax: 41.99076
Geodetic CRS: WGS 84
area area_num_1 area_numbe comarea comarea_id community perimeter
1 0 35 35 0 0 DOUGLAS 0
2 0 36 36 0 0 OAKLAND 0
3 0 37 37 0 0 FULLER PARK 0
4 0 38 38 0 0 GRAND BOULEVARD 0
5 0 39 39 0 0 KENWOOD 0
6 0 4 4 0 0 LINCOLN SQUARE 0
shape_area shape_len geom
1 46004621 31027.05 MULTIPOLYGON (((-87.60914 4...
2 16913961 19565.51 MULTIPOLYGON (((-87.59215 4...
3 19916705 25339.09 MULTIPOLYGON (((-87.6288 41...
4 48492503 28196.84 MULTIPOLYGON (((-87.60671 4...
5 29071742 23325.17 MULTIPOLYGON (((-87.59215 4...
6 71352328 36624.60 MULTIPOLYGON (((-87.67441 4...
You can see from here that the commareas layer has an entry for the community area name, so we should be able to match it up, but it’s not in the same case as the community area names in the dataset with the blood lead estimates.
unique(commareas$community)
 [1] "DOUGLAS"                "OAKLAND"                "FULLER PARK"
[4] "GRAND BOULEVARD" "KENWOOD" "LINCOLN SQUARE"
[7] "WASHINGTON PARK" "HYDE PARK" "WOODLAWN"
[10] "ROGERS PARK" "JEFFERSON PARK" "FOREST GLEN"
[13] "NORTH PARK" "ALBANY PARK" "PORTAGE PARK"
[16] "IRVING PARK" "DUNNING" "MONTCLARE"
[19] "BELMONT CRAGIN" "WEST RIDGE" "HERMOSA"
[22] "AVONDALE" "LOGAN SQUARE" "HUMBOLDT PARK"
[25] "WEST TOWN" "AUSTIN" "WEST GARFIELD PARK"
[28] "EAST GARFIELD PARK" "NEAR WEST SIDE" "NORTH LAWNDALE"
[31] "UPTOWN" "SOUTH LAWNDALE" "LOWER WEST SIDE"
[34] "NEAR SOUTH SIDE" "ARMOUR SQUARE" "NORWOOD PARK"
[37] "NEAR NORTH SIDE" "LOOP" "SOUTH SHORE"
[40] "CHATHAM" "AVALON PARK" "SOUTH CHICAGO"
[43] "BURNSIDE" "MCKINLEY PARK" "LAKE VIEW"
[46] "CALUMET HEIGHTS" "ROSELAND" "NORTH CENTER"
[49] "PULLMAN" "SOUTH DEERING" "EAST SIDE"
[52] "WEST PULLMAN" "RIVERDALE" "HEGEWISCH"
[55] "GARFIELD RIDGE" "ARCHER HEIGHTS" "BRIGHTON PARK"
[58] "BRIDGEPORT" "NEW CITY" "WEST ELSDON"
[61] "GAGE PARK" "CLEARING" "WEST LAWN"
[64] "CHICAGO LAWN" "WEST ENGLEWOOD" "ENGLEWOOD"
[67] "GREATER GRAND CROSSING" "LINCOLN PARK" "ASHBURN"
[70] "AUBURN GRESHAM" "BEVERLY" "WASHINGTON HEIGHTS"
[73] "MOUNT GREENWOOD" "MORGAN PARK" "OHARE"
[76] "EDGEWATER" "EDISON PARK"
unique(pb$CommAreaName)
 [1] "Rogers Park"            "West Ridge"             "Uptown"
[4] "Lincoln Square" "North Center" "Lake View"
[7] "Lincoln Park" "Near North Side" "Edison Park"
[10] "Norwood Park" "Jefferson Park" "Forest Glen"
[13] "North Park" "Albany Park" "Portage Park"
[16] "Irving Park" "Dunning" "Montclaire"
[19] "Belmont Cragin" "Hermosa" "Avondale"
[22] "Logan Square" "Humboldt park" "West Town"
[25] "Austin" "West Garfield Park" "East Garfield Park"
[28] "Near West Side" "North Lawndale" "South Lawndale"
[31] "Lower West Side" "Loop" "Near South Side"
[34] "Armour Square" "Douglas" "Oakland"
[37] "Fuller Park" "Grand Boulevard" "Kenwood"
[40] "Washington Park" "Hyde Park" "Woodlawn"
[43] "South Shore" "Chatham" "Avalon Park"
[46] "South Chicago" "Burnside" "Calumet Heights"
[49] "Roseland" "Pullman" "South Deering"
[52] "East Side" "West Pullman" "Riverdale"
[55] "Hegewisch" "Garfield Ridge" "Archer Heights"
[58] "O'Hare" "Brighton Park" "McKinley Park"
[61] "Bridgeport" "New City" "West Elsdon"
[64] "Gage Park" "Clearing" "West Lawn"
[67] "Chicago Lawn" "West Englewood" "Englewood"
[70] "Greater Grand Crossing" "Ashburn" "Auburn Gresham"
[73] "Beverly" "Washington Heights" "Mount Greenwood"
[76] "Morgan Park" "Edgewater" "Chicago"
[79] "Unknown"
One option here would be to use R functions to convert everything into either uppercase or lowercase. Because that’s a pretty handy trick when cleaning data, here’s how we’d do it:
commareas <- commareas %>%
mutate(community_lower_case = tolower(community))
pb <- pb %>%
mutate(community_upper_case = toupper(CommAreaName))
pb$community_upper_case
 [1] "ROGERS PARK"            "WEST RIDGE"             "UPTOWN"
[4] "LINCOLN SQUARE" "NORTH CENTER" "LAKE VIEW"
[7] "LINCOLN PARK" "NEAR NORTH SIDE" "EDISON PARK"
[10] "NORWOOD PARK" "JEFFERSON PARK" "FOREST GLEN"
[13] "NORTH PARK" "ALBANY PARK" "PORTAGE PARK"
[16] "IRVING PARK" "DUNNING" "MONTCLAIRE"
[19] "BELMONT CRAGIN" "HERMOSA" "AVONDALE"
[22] "LOGAN SQUARE" "HUMBOLDT PARK" "WEST TOWN"
[25] "AUSTIN" "WEST GARFIELD PARK" "EAST GARFIELD PARK"
[28] "NEAR WEST SIDE" "NORTH LAWNDALE" "SOUTH LAWNDALE"
[31] "LOWER WEST SIDE" "LOOP" "NEAR SOUTH SIDE"
[34] "ARMOUR SQUARE" "DOUGLAS" "OAKLAND"
[37] "FULLER PARK" "GRAND BOULEVARD" "KENWOOD"
[40] "WASHINGTON PARK" "HYDE PARK" "WOODLAWN"
[43] "SOUTH SHORE" "CHATHAM" "AVALON PARK"
[46] "SOUTH CHICAGO" "BURNSIDE" "CALUMET HEIGHTS"
[49] "ROSELAND" "PULLMAN" "SOUTH DEERING"
[52] "EAST SIDE" "WEST PULLMAN" "RIVERDALE"
[55] "HEGEWISCH" "GARFIELD RIDGE" "ARCHER HEIGHTS"
[58] "O'HARE" "BRIGHTON PARK" "MCKINLEY PARK"
[61] "BRIDGEPORT" "NEW CITY" "WEST ELSDON"
[64] "GAGE PARK" "CLEARING" "WEST LAWN"
[67] "CHICAGO LAWN" "WEST ENGLEWOOD" "ENGLEWOOD"
[70] "GREATER GRAND CROSSING" "ASHBURN" "AUBURN GRESHAM"
[73] "BEVERLY" "WASHINGTON HEIGHTS" "MOUNT GREENWOOD"
[76] "MORGAN PARK" "EDGEWATER" "CHICAGO"
[79] "UNKNOWN"
Those functions are pretty easy - they do exactly what they say.
You might have noticed that there’s another option, however. Both objects also have columns that appear to be assigning numerical codes to the community areas:
unique(commareas$area_num_1)
 [1] "35" "36" "37" "38" "39" "4"  "40" "41" "42" "1"  "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "2" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29"
[31] "3" "30" "31" "33" "34" "10" "8" "32" "43" "44" "45" "46" "47" "59" "6"
[46] "48" "49" "5" "50" "51" "52" "53" "54" "55" "56" "57" "58" "60" "61" "62"
[61] "63" "64" "65" "66" "67" "68" "69" "7" "70" "71" "72" "73" "74" "75" "76"
[76] "77" "9"
unique(pb$CommArea)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
[51] 51 52 53 54 55 56 57 76 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
[76] 75 77 0 99
So now we could use any of three options to join the two tables: 1) community area names in uppercase; 2) community area names in lowercase; or 3) community area numbers.
In any case (pun intended), we are ready to do the table join and then intersect the result with the redlined areas layer. We’ll use the left_join() function, which conducts a table join and keeps all observations from the table to which you are joining (right_join(), not surprisingly, does the opposite). inner_join() keeps only the observations that match in both tables, while full_join() keeps all observations from both tables, whether they have a match or not.
All the _join() functions other than st_join() automatically look for columns that have the same name in each table and then try to match those up. If the columns are named something different in the different tables, we have to use a special syntax to tell R which columns to use.
commareas <- commareas %>%
left_join(pb, by = join_by(community == community_upper_case))
plot(commareas[,c("geom", "ElevBloodPb1999Pct")])Now we’re ready to reproject this and intersect it with the HOLC grades:
commareas_holc <- commareas %>%
st_transform(il_east) %>%
  st_intersection(holc_il)
Warning: attribute variables are assumed to be spatially constant throughout
all geometries
Now, just like when we were conducting our walkshed analysis, we need to compute the area of the intersected polygons:
commareas_holc <- commareas_holc %>%
  mutate(area_int = st_area(commareas_holc)/1000000)
Here’s where things get a bit different, because we’re working with an intensive variable (in this case, a percentage). Instead of computing the intersected polygons’ share of the area of the original community areas, we multiply our intensive variables by the intersected polygon areas and then add them up for all the different HOLC grades. We also add up the total area of intersection between each HOLC grade and the community areas layer:
commareas_holc <- commareas_holc %>%
mutate(ebpb1999_wt = ElevBloodPb1999Pct * area_int,
ebpb2013_wt = ElevBloodPb2013Pct * area_int) %>%
filter(!is.na(holc_grade)) %>%
group_by(holc_grade) %>%
summarise(ebpb1999_wt = sum(ebpb1999_wt, na.rm=TRUE),
ebpb2013_wt = sum(ebpb2013_wt, na.rm=TRUE),
            area_int = sum(area_int))
Now we need to divide the sum of the weighted intensive variables by the total intersected area in each HOLC grade to get our estimate of the percentage of children under 6 with elevated blood lead levels in the two survey years:
commareas_holc <- commareas_holc %>%
mutate(ebpb1999 = ebpb1999_wt/area_int,
         ebpb2013 = ebpb2013_wt/area_int)
Task 7: Create a boxplot comparing the differences in children’s blood lead levels across HOLC grades in Chicago in 1999 and 2013. You will need to use the pivot_longer() function that we encountered last time to accomplish this. Once you have your plot, interpret your results. What might account for the patterns that you see?
Task 8: Use the st_interpolate_aw() function to reproduce these results. Remember that in this case you are working with intensive, rather than extensive, data.
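As a hint, the key argument in st_interpolate_aw() is extensive. A rough sketch of what the call might look like (a sketch only; the exact pipeline is up to you):
#st_interpolate_aw() area-weights the attribute columns of a source layer
#onto target geometries. extensive = FALSE requests area-weighted means,
#which is what intensive variables like these percentages call for:
pb_aw <- commareas %>%
  st_transform(il_east) %>%
  select(ElevBloodPb1999Pct, ElevBloodPb2013Pct) %>%
  st_interpolate_aw(holc_il, extensive = FALSE)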