require(tidyverse)
require(sf)
require(ggspatial)
require(maptiles)
require(tidyterra)
Advanced Geoprocessing in R: Long-Term Legacies of Redlining in Chicago
Introduction: Redlining
The current structure of cities reflects very long-term forms of exclusion and inequality. In the United States, one of the most important of these has been redlining, which substantially restricted access to capital for communities of color while funneling subsidized loans into predominantly white neighborhoods, which, through a variety of mechanisms, were made inaccessible to non-whites.
Several researchers have argued that these practices were an essential component of institutionalized racism that has continued to reproduce structural inequalities in wealth and urban segregation long after the redlining maps were drawn. In this tutorial, we will look at the relationship between areas that were redlined in Chicago and present-day inequalities in the city, some of which have substantial environmental justice implications. To be clear, correlation does not imply causation, but finding substantial present-day differences across neighborhoods that were redlined approximately 80 years ago would be quite compelling evidence that redlining has had a long-term legacy.
Following a heroic data collection and digitization effort by the members of the Mapping Inequality Project at the University of Richmond, shapefiles showing the federal government’s Home Owners’ Loan Corporation (HOLC) lending maps are now available for most large and medium-sized cities in the United States. This is an unprecedented resource that allows us to investigate the present-day legacies of redlining.
Before beginning the tutorial, I would like you to refresh your memory about redlining and, in particular, the HOLC program. There is a very clear introduction available at the Mapping Inequality Project site.
After thoroughly investigating the website, answer these questions. What was the HOLC program? Why have many researchers in the past several decades suggested that it has been a major force in perpetuating urban segregation in US cities?
Objectives
- Get further practice with projections, intersections, and mapping with vectors, as well as constructing data visualizations with ggplot()
- Learn how to estimate intensive and extensive variables using areal interpolation
- Practice interpreting the results of simple geospatial analyses
The Data
- Redlines.gpkg and its associated files show areas marked under the HOLC program. The data were digitized by the Mapping Inequality Project.
- chicago_boundary.kml is the city boundary of Chicago, sourced from the Chicago Data Portal
- GroceryStores.csv contains location information for all grocery stores in Chicago as of 2020, sourced from the Chicago Data Portal
- BloodPBLevelsCommAreas.csv is a table with survey-based estimates of the percentage of children with elevated blood lead levels, as of 1999 and 2013, in each of Chicago’s official community areas (basically, neighborhoods), sourced from the Chicago Data Portal
- chicago_community_areas.gpkg contains the boundaries of Chicago’s official community areas, sourced from the Chicago Data Portal
- ACS_5_year_comm_areas.csv is a table with demographic and income data from the US Census Bureau’s representative sample survey, the American Community Survey, which estimates 5-year moving averages of important census variables for community areas as of 2023
- Building_Code_Scofflaw.csv contains the locations of buildings whose landlords own at least three buildings with unaddressed code violations
The data you will need for this tutorial can be found in this zip file.
Packages
- tidyverse for data wrangling
- sf for vector data geoprocessing
- ggspatial for mapping
- maptiles for adding contextual basemaps to your maps
- tidyterra for mapping those tiles
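If you haven’t used these packages before, you can install them all at once; this is a one-time step (a sketch to run in the console rather than in your script):
#One-time installation of the packages this tutorial loads above:
install.packages(c("tidyverse", "sf", "ggspatial", "maptiles", "tidyterra"))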
HOLC Data
Let’s start with the boundaries of the HOLC-graded areas, along with the city limits of Chicago. We’re using the same packages and functions that we used in our previous tutorial on the Amazon:
setwd("G:/My Drive/Classes/GIS for Global Environment/Labs/R_Tutorials_/Vector Geoprocessing in R")
holc <- st_read("redlines.gpkg")
Reading layer `redlines' from data source
`G:\My Drive\Classes\GIS for Global Environment\Labs\R_Tutorials_\Vector Geoprocessing in R\redlines.gpkg'
using driver `GPKG'
Simple feature collection with 5 features and 1 field
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -87.94011 ymin: 41.60414 xmax: -87.52414 ymax: 42.28303
Geodetic CRS: +proj=longlat +datum=WGS84 +no_defs
chicago <- st_read("chicago_boundary.kml")
Reading layer `Layer #0' from data source
`G:\My Drive\Classes\GIS for Global Environment\Labs\R_Tutorials_\Vector Geoprocessing in R\chicago_boundary.kml'
using driver `KML'
Simple feature collection with 1 feature and 2 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -87.94011 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
Geodetic CRS: WGS 84
plot(holc)
plot(st_geometry(chicago))
In order to assess the evidence for long-term impacts of HOLC’s maps, we need to use some tricks from the previous tutorial to estimate values within HOLC-graded areas using data for other types of areal units.
First, however, because some of the datasets we’ll be using cover different extents, we should use st_intersection() to make sure everything is clipped to the Chicago city limits:
chicago_holc <- chicago |>
st_transform(crs=st_crs(holc))
holc <- holc |>
  st_intersection(chicago_holc)
Warning: attribute variables are assumed to be spatially constant throughout
all geometries
(This warning is expected: sf is simply reminding us that it copies each feature’s attribute values to the clipped geometries without adjusting them.)
plot(st_geometry(holc))
Summarizing data collected at point locations with sf
We’re going to start with data on grocery store locations (where grocery store is pretty broadly defined and includes some things closer to convenience stores) as of 2020.
We’ll start by reading in the data, which comes as a CSV spreadsheet - a very common format for point data, since all you need are latitude and longitude coordinates:
groceries <- read.csv("GroceryStores.csv", stringsAsFactors = FALSE)
head(groceries)
  Store.Name Longitude Latitude
1 Jewel - Osco -87.78556 41.92624
2 Mariano's Fresh Market -87.64738 41.88056
3 Mariano's Fresh Market -87.66564 41.83651
4 Aldi -87.58747 41.75278
5 Food 4 Less -87.66390 41.76517
6 Whole Foods Market -87.64733 41.88250
Then we’ll use a pipeline to convert the table we just read in into an sf object, reproject it to the same projection as the holc layer, and then conduct a join by location operation to add data from the holc layer to the groceries layer. Join by location is similar to intersection, except that instead of creating a new layer with the combined geometries of the intersected layers, it just adds the data from one layer to the other.
We can conduct a join by location using the st_join() function:
groceries <- groceries |>
st_as_sf(coords=c("Longitude", "Latitude"),
crs=st_crs(4326)) |>
#These lines convert the data frame into an sf
#object. The coords argument allows you to specify which columns hold the
#x and y coordinates. EPSG number 4326 is just WGS 1984. It is important to remember that the coordinates are in x, y order, where Longitude is the x coordinate and Latitude is the y coordinate.
st_transform(crs=st_crs(holc)) |>
st_join(holc)
#The st_join() function performs an operation called join by location.
#all it does is take variables from features in one layer and adds
#them to features in another layer that intersects them. in this case,
#we're adding information on the holc grade as of 1940 to every one
#of our grocery store points
head(groceries)
Simple feature collection with 6 features and 4 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -87.78556 ymin: 41.75278 xmax: -87.58747 ymax: 41.92624
Geodetic CRS: +proj=longlat +datum=WGS84 +no_defs
Store.Name holc_grade Name Description
1 Jewel - Osco Ungraded CHICAGO
2 Mariano's Fresh Market D CHICAGO
3 Mariano's Fresh Market D CHICAGO
4 Aldi Ungraded CHICAGO
5 Food 4 Less C CHICAGO
6 Whole Foods Market Ungraded CHICAGO
geometry
1 POINT (-87.78556 41.92624)
2 POINT (-87.64738 41.88056)
3 POINT (-87.66564 41.83651)
4 POINT (-87.58747 41.75278)
5 POINT (-87.6639 41.76517)
6 POINT (-87.64733 41.8825)
If a feature in the object we are joining data to using st_join() does not intersect any feature in the other object, R will assign that feature a value of NA for all the variables from the other object. That means we can then fill in those missing values to designate grocery stores in areas that were not graded in the 1940s.
groceries <- groceries |>
mutate(holc_grade = case_when(is.na(holc_grade) ~ "Ungraded",
                                .default = holc_grade))
Look up the case_when() function’s help file. Read it carefully and then explain the arguments in the function as it is used in the previous code block. Describe how it works in plain terms.
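As a point of reference while you read the help file, here is a minimal toy illustration of case_when() on made-up data (the vector x and the cutoff are invented for this sketch):
#A made-up vector with a missing value:
x <- c(2, NA, 7)
case_when(is.na(x) ~ "missing", #conditions are checked in order
          x > 5 ~ "big", #applies only where earlier conditions were FALSE
          .default = "small") #covers everything left over
[1] "small"   "missing" "big"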
Now let’s have a quick look at grocery stores across Chicago:
osm_basemap <- get_tiles(chicago, provider = "OpenStreetMap")
ggplot(data = chicago)+
geom_spatraster_rgb(data=osm_basemap) +
#st_cast() allows you to convert sf objects to different formats; here,
#we're turning the Chicago boundary into a line, as opposed to a polygon,
#so that the resulting map will have just the outline of the city, without
#a colored fill:
geom_sf(data = chicago |> st_cast("MULTILINESTRING"), color = "black", linewidth=1.5)+
geom_sf(data=groceries, size=1,
aes(color = holc_grade)) +
scale_color_viridis_d("HOLC Grade") +
annotation_scale(location = "bl") +
theme_minimal()+
ggtitle("Grocery Stores in Chicago in 2020")+
labs(
x = "",
y = ""
  )
Counting points in areas in sf
For years, there has been a lot of discussion about the problem of food deserts - areas with very limited access to grocery stores. We might think that redlined areas could have been less attractive to investment and might go on to host fewer grocery stores.
Let’s see how many grocery stores are in each HOLC class:
#table is a helpful function that takes categorical
#columns (or multiple categorical columns) and
#shows you how many times each value (or each
#combination of values, if you have multiple
#columns) shows up.
table(groceries$holc_grade)
A B C D Ungraded
2 18 107 60 75
The raw totals of grocery stores are not particularly informative if we are interested in the long-term impacts of redlining. Explain why the values we just calculated aren’t a great way to address this question.
Now, let’s examine the density of points in the different HOLC graded areas.
Because we’ll be doing calculations of distances and areas, we’ll need to project our layers into a local projection. In the United States, the most common projections used for city- and state-scale maps are probably those from the state plane system. Like the UTM system we’ve used before, the state plane system is a set of projections optimized for specific locations, in this case in the United States. Most states in the US have two or three state plane projections, optimized for different parts of the state depending on its size and shape.
The most commonly used projection for Chicago is State Plane Illinois East. To keep things consistent, let’s have the units in meters. Find the EPSG number for this projection at spatialreference.org. Then use the st_crs() function to create a CRS object called il_east that we can use to refer to this projection in the rest of the code. Add the code you create to your tutorial report. If you did things correctly, running il_east on its own line should produce the following output:
Coordinate Reference System:
User input: EPSG:2790
wkt:
PROJCRS["NAD83(HARN) / Illinois East",
BASEGEOGCRS["NAD83(HARN)",
DATUM["NAD83 (High Accuracy Reference Network)",
ELLIPSOID["GRS 1980",6378137,298.257222101,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4152]],
CONVERSION["SPCS83 Illinois East zone (meter)",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",36.6666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",-88.3333333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",0.999975,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",300000,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",0,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["easting (X)",east,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["northing (Y)",north,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Engineering survey, topographic mapping."],
AREA["United States (USA) - Illinois - counties of Boone; Champaign; Clark; Clay; Coles; Cook; Crawford; Cumberland; De Kalb; De Witt; Douglas; Du Page; Edgar; Edwards; Effingham; Fayette; Ford; Franklin; Gallatin; Grundy; Hamilton; Hardin; Iroquois; Jasper; Jefferson; Johnson; Kane; Kankakee; Kendall; La Salle; Lake; Lawrence; Livingston; Macon; Marion; Massac; McHenry; McLean; Moultrie; Piatt; Pope; Richland; Saline; Shelby; Vermilion; Wabash; Wayne; White; Will; Williamson."],
BBOX[37.06,-89.27,42.5,-87.02]],
ID["EPSG",2790]]
Now we’ll reproject our layers into State Plane Illinois East:
holc_il <- holc |>
st_transform(crs = il_east)
groceries <- groceries |>
st_transform(crs = st_crs(holc_il))
# We're using a trick here, using st_crs() to get
# the CRS information for a layer that's already in
# the projection we want to set the parameters for
# our reprojection operation.
Because we are going to look at the density of grocery stores in each area, we also need to calculate the area of each HOLC grade. Remember that st_area() uses map units, so everything is in meters. Below, we’re dividing by 1,000,000 to convert from square meters to square kilometers:
holc_il <- holc_il |>
mutate(area = st_area(holc_il)/1000000)
plot(holc_il[,c("geom", "area")])
Now we join the data on the HOLC area to the grocery layer:
groceries_holc <- groceries |>
  st_join(holc_il[,"area"])
Then we can use group_by() to group by HOLC grade and then summarise() to count the points and pull out the area of each grade:
groceries_holc <- groceries_holc |>
filter(!is.na(holc_grade)) |>
group_by(holc_grade) |>
summarise(stores=n(),
area=min(area)) |>
# n() function is used to simply
#count the number (n) of observations in each group when summarizing.
#because each HOLC grade in holc_il is only one multipolygon object, area is
#the entire area under each HOLC grade, so we don't want
#to add them up, and instead just take the minimum value. the areas within
#each grade will all be equal, so we could just as easily use max, median,
#or mean to summarize them.
mutate(storesperkm = as.numeric(stores/area))
#the as.numeric() part is to turn the result into a plain number; otherwise,
#R will keep the result as a quantity with units attached.
Now we can plot our results:
ggplot(groceries_holc,
aes(x=holc_grade, y=storesperkm))+
geom_bar(stat="identity", fill="black", color="gray", linewidth=1)+
#Here we are
#using a bar graph format. stat="identity" means that we already have
#defined y-values. Without that, geom_bar() will just count up the number
#of observations of each value of a categorical variable supplied as the
#x-axis variable.
scale_x_discrete("HOLC Grade in 1940")+
scale_y_continuous("Grocery Stores per Square KM, 2020")+
theme_minimal()+
theme(axis.title.x=element_text(size=18, colour="black", face="bold"),
axis.text.x=element_text(size=16, colour="black", face="bold"),
axis.ticks.x=element_blank(),
axis.title.y=element_text(size=18, colour="black", face="bold"),
axis.text.y=element_text(size=16, colour="black", face="bold"),
axis.ticks.y=element_blank(),
plot.title = element_text(size=28, colour="black", face="bold"),
strip.text=element_text(size=16, colour="black", face="bold"),
legend.title=element_text(size=18, colour="black", face="bold"),
legend.text=element_text(size=16, colour="black", face="bold"),
legend.position = "bottom")Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_bar()`).
Read in the Building_Code_Scofflaw.csv file and convert it to an sf object. This file shows the point locations of buildings whose landlords have at least three buildings with building code violations that they have failed to correct. Compute the density of these buildings and produce a figure like that above. Include your code and results and reflect on what you find in your tutorial report.
Areal interpolation
Overview of areal interpolation
Areal interpolation is a fancy name for estimating the values of variables in one set of polygons based on the variable values in an overlapping - but different - set of polygons. Why would you want to do that? Well, it’s quite common for spatial data - often for privacy reasons or because the data in question are collected only for a sample, not a population - to be available only as a summary for a particular area. The thing is, sometimes we want to know the values of these variables in another set of areas defined for a completely different purpose, whose boundaries differ from the areas for which we have data, even though both sets of boundaries overlap in actual space.
This is a bit hard to wrap your head around without some more visuals, so please take a moment to look through the logic behind areal interpolation at this link.
The strategy for areal interpolation depends on whether we’re interpolating extensive variables, which are really just counts of something, like the number of people or the total GDP of an area, or intensive variables, which are rates or relative measures, like median income or the proportion of the population that identifies as female.
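To make the distinction concrete, here is a tiny made-up example with one target polygon overlapping two source polygons (all of the numbers are invented for this sketch):
#Extensive (counts): weight each source value by the fraction of the SOURCE
#polygon's area that falls inside the target, then take a weighted SUM:
pop <- c(1000, 500) #populations of the two source polygons
w_source <- c(0.4, 1.0) #40% of source 1 and all of source 2 lie in the target
sum(pop * w_source) #estimated target population: 1000*0.4 + 500*1 = 900

#Intensive (rates): weight each source value by the share of the TARGET's
#area contributed by that source (weights sum to 1), then take a weighted MEAN:
income <- c(60000, 30000) #median incomes of the two source polygons
w_target <- c(0.5, 0.5) #each source covers half of the target's area
sum(income * w_target) #estimated target median income: 45000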
Table joins in the tidyverse
Chicago has its own official set of spatial subdivisions, called community areas, which you can think of as the city’s rough attempt to define crisp boundaries for distinct neighborhoods. Because community areas are central to how Chicago’s municipal government - and the city’s residents - think about and interact with the city, a lot of the information on Chicago’s City Data Portal is presented for community areas.
Let’s start by reading in and having a look at the community areas:
commareas <- st_read("chicago_community_areas.gpkg")
Reading layer `ChicagoCommunityAreas' from data source
`G:\My Drive\Classes\GIS for Global Environment\Labs\R_Tutorials_\Vector Geoprocessing in R\chicago_community_areas.gpkg'
using driver `GPKG'
Simple feature collection with 77 features and 9 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -87.94011 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
Geodetic CRS: WGS 84
plot(st_geometry(commareas))
commareas
Simple feature collection with 77 features and 9 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -87.94011 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
Geodetic CRS: WGS 84
First 10 features:
area area_num_1 area_numbe comarea comarea_id community perimeter
1 0 35 35 0 0 DOUGLAS 0
2 0 36 36 0 0 OAKLAND 0
3 0 37 37 0 0 FULLER PARK 0
4 0 38 38 0 0 GRAND BOULEVARD 0
5 0 39 39 0 0 KENWOOD 0
6 0 4 4 0 0 LINCOLN SQUARE 0
7 0 40 40 0 0 WASHINGTON PARK 0
8 0 41 41 0 0 HYDE PARK 0
9 0 42 42 0 0 WOODLAWN 0
10 0 1 1 0 0 ROGERS PARK 0
shape_area shape_len geom
1 46004621 31027.05 MULTIPOLYGON (((-87.60914 4...
2 16913961 19565.51 MULTIPOLYGON (((-87.59215 4...
3 19916705 25339.09 MULTIPOLYGON (((-87.6288 41...
4 48492503 28196.84 MULTIPOLYGON (((-87.60671 4...
5 29071742 23325.17 MULTIPOLYGON (((-87.59215 4...
6 71352328 36624.60 MULTIPOLYGON (((-87.67441 4...
7 42373881 28175.32 MULTIPOLYGON (((-87.60604 4...
8 45105380 29746.71 MULTIPOLYGON (((-87.58038 4...
9 57815180 46936.96 MULTIPOLYGON (((-87.57714 4...
10 51259902 34052.40 MULTIPOLYGON (((-87.65456 4...
This geopackage file basically just has the boundaries of the community areas, along with some identifying information - no demographic or environmental data. This is actually a pretty common situation when working with GIS data: it’s not at all rare to have a dataset whose values are measured for specific areal units, like cities, counties, or, in this case, neighborhoods, while nevertheless lacking information on those locations’ boundaries.
What we generally need to do in these cases is to find a separate dataset that has location information, along with some common column identifying the locations that we can use to perform a table join. Table joins match data from one table to another based on a common column that can be used to identify which rows in Table 1 correspond to which rows in Table 2.
For example:
Table 1:
| Store | Zip Code |
|---|---|
| A | 18042 |
| B | 43210 |
| C | 43260 |
| D | 64850 |
| E | 64865 |
Table 2:
| Zip Code | Median HH Income ($) |
|---|---|
| 64850 | 105000 |
| 18042 | 55000 |
| 43260 | 45000 |
| 43210 | 40000 |
| 64865 | 80000 |
Table 1 after table joining Table 2:
| Store | Zip Code | Median HH Income ($) |
|---|---|---|
| A | 18042 | 55000 |
| B | 43210 | 40000 |
| C | 43260 | 45000 |
| D | 64850 | 105000 |
| E | 64865 | 80000 |
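Here is what that join looks like in R, with the toy tables above typed in by hand (column names lightly adapted to valid R names; the left_join() function used here is introduced just below):
stores <- data.frame(Store = c("A", "B", "C", "D", "E"),
                     ZipCode = c(18042, 43210, 43260, 64850, 64865))
incomes <- data.frame(ZipCode = c(64850, 18042, 43260, 43210, 64865),
                      MedianHHIncome = c(105000, 55000, 45000, 40000, 80000))
#left_join() matches rows on the shared ZipCode column:
left_join(stores, incomes, by = "ZipCode")
  Store ZipCode MedianHHIncome
1     A   18042          55000
2     B   43210          40000
3     C   43260          45000
4     D   64850         105000
5     E   64865          80000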
So let’s get some data that we can attach to these community areas. The Chicago City Data Portal hosts a dataset that presents some basic income and demographic data for community areas based on the US Census’s American Community Survey 5-year estimates - a representative sample survey used to estimate important demographic variables as 5-year moving averages between the full census years that take place every decade. Let’s read in the data for 2023 now:
commareas_acs <-
read.csv("ACS_5_year_comm_areas.csv",
stringsAsFactors = FALSE)
head(commareas_acs)
  ACS.Year Community.Area Under..25000 X.25000.to..49999 X.50000.to..74999
1 2023 ALBANY PARK 1269 1916 1801
2 2023 ARCHER HEIGHTS 223 752 441
3 2023 ARMOUR SQUARE 701 798 370
4 2023 ASHBURN 797 1351 1985
5 2023 AUBURN GRESHAM 2541 2451 1592
6 2023 AUSTIN 5506 5084 3600
X.75000.to..125000 X.125000.. Male.0.to.17 Male.18.to.24 Male.25.to.34
1 2306 3379 4799 2955 4513
2 795 739 1927 732 1102
3 637 597 1300 487 871
4 3014 2735 5150 1964 2881
5 2202 1850 5803 1836 2964
6 4047 3725 12399 4668 6681
Male.35.to.49 Male.50.to.64 Male.65. Female.0.to.17 Female.18.to.24
1 5442 4354 2287 4913 2405
2 1240 1417 1035 1502 899
3 1174 1177 1032 1022 517
4 4178 4228 2473 5138 2248
5 3431 3469 3223 5305 2304
6 9920 8110 6175 12163 4575
Female.25.to.34 Female.35.to.49 Female.50.to.64 Female.65.. Total.Population
1 4116 5228 3764 3054 47830
2 978 1167 854 1021 13875
3 965 1163 1378 2063 13149
4 2774 4582 4044 3184 42842
5 3057 4522 5213 5357 46483
6 6643 10092 10040 8811 100278
White Black.or.African.American American.Indian.or.Alaska.Native Asian
1 21496 2228 759 7124
2 6232 10 108 679
3 2556 1487 107 8402
4 11297 18124 697 436
5 760 43414 119 399
6 10447 73602 531 678
Native.Hawaiian.or.Pacific.Islander Other.Race Multiracial
1 1 7888 8334
2 0 3705 3142
3 61 212 325
4 0 7772 4517
5 0 993 798
6 21 8512 6487
White.Not.Hispanic.or.Latino Hispanic.or.Latino Record.ID
1 16115 21108 2023_ALBANY PARK
2 2043 11097 2023_ARCHER HEIGHTS
3 2226 565 2023_ARMOUR SQUARE
4 3774 19917 2023_ASHBURN
5 491 1577 2023_AUBURN GRESHAM
6 5386 19591 2023_AUSTIN
Generally, the US Census presents data as counts of the households or people in an area with different characteristics. In this case, the numbers are broken down by income, gender, and race. For example, Under..25000 is the number of households in the community area with an income below $25,000. The non-income variables, on the other hand, are in terms of individuals. Hence, Male.0.to.17 designates the estimated number of male individuals under the age of 18 resident in the community area.
So how do we perform a table join with these two datasets? We’ll use the left_join() function, which conducts a table join and keeps all observations from the table to which you are joining (right_join(), not surprisingly, keeps all observations from the table being joined in, while inner_join() keeps only the observations that match a row in the other table, and full_join() keeps all observations from both tables, whether they match or not).
All the _join() functions other than st_join() automatically look for columns that have the same name in each table and then try to match those up. If the columns are named something different in the different tables, we have to use a special syntax to tell R which columns to use.
commareas <- commareas |>
left_join(commareas_acs,
by = join_by(community == Community.Area))
#The by = join_by() argument is what tells R which
#columns in each table should be matched. In this case,
#the commareas object has a column called community that
#has the community area names, while the commareas_acs
#object calls the same column Community.Area
commareas
Simple feature collection with 77 features and 38 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -87.94011 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
Geodetic CRS: WGS 84
First 10 features:
area area_num_1 area_numbe comarea comarea_id community perimeter
1 0 35 35 0 0 DOUGLAS 0
2 0 36 36 0 0 OAKLAND 0
3 0 37 37 0 0 FULLER PARK 0
4 0 38 38 0 0 GRAND BOULEVARD 0
5 0 39 39 0 0 KENWOOD 0
6 0 4 4 0 0 LINCOLN SQUARE 0
7 0 40 40 0 0 WASHINGTON PARK 0
8 0 41 41 0 0 HYDE PARK 0
9 0 42 42 0 0 WOODLAWN 0
10 0 1 1 0 0 ROGERS PARK 0
shape_area shape_len ACS.Year Under..25000 X.25000.to..49999
1 46004621 31027.05 2023 689 571
2 16913961 19565.51 2023 361 228
3 19916705 25339.09 2023 186 113
4 48492503 28196.84 2023 913 944
5 29071742 23325.17 2023 564 296
6 71352328 36624.60 2023 576 1022
7 42373881 28175.32 2023 1142 717
8 45105380 29746.71 2023 301 255
9 57815180 46936.96 2023 1039 1223
10 51259902 34052.40 2023 1038 1986
X.50000.to..74999 X.75000.to..125000 X.125000.. Male.0.to.17 Male.18.to.24
1 526 431 737 1630 1654
2 144 125 182 772 104
3 36 61 34 225 149
4 651 1406 1292 2160 807
5 116 646 1254 1502 480
6 1173 2162 4410 3588 1258
7 274 317 267 2178 497
8 309 731 2078 1599 3025
9 622 917 807 2564 1958
10 1477 2460 2802 4392 2431
Male.25.to.34 Male.35.to.49 Male.50.to.64 Male.65. Female.0.to.17
1 1445 1400 1276 1149 1464
2 185 236 236 142 565
3 75 87 134 365 212
4 2105 1857 2299 1347 2902
5 1052 1167 1308 1034 1548
6 4535 5778 3366 1739 3553
7 673 1005 749 465 1647
8 2476 1535 1527 1245 1619
9 1809 1861 2246 1019 2410
10 4440 6095 5495 2837 4879
Female.18.to.24 Female.25.to.34 Female.35.to.49 Female.50.to.64 Female.65..
1 1511 1994 1463 1626 2017
2 211 351 591 564 414
3 86 110 194 269 280
4 1080 2370 2930 2586 2365
5 450 1156 1542 1566 1564
6 1589 5379 5576 3357 2462
7 793 1189 1602 1001 564
8 3157 2614 1916 1683 1696
9 1801 1958 2533 2372 1905
10 4537 5128 5772 4080 3081
Total.Population White Black.or.African.American
1 18629 2278 12400
2 4373 159 3946
3 2186 148 1850
4 24808 1339 21947
5 14368 2958 9791
6 42180 28731 1623
7 12362 242 11691
8 24092 11509 6143
9 24437 2807 18965
10 53169 25129 13602
American.Indian.or.Alaska.Native Asian Native.Hawaiian.or.Pacific.Islander
1 214 2006 0
2 59 51 0
3 7 7 0
4 38 204 24
5 18 697 4
6 280 4345 0
7 10 0 0
8 49 3763 0
9 50 867 0
10 459 2750 0
Other.Race Multiracial White.Not.Hispanic.or.Latino Hispanic.or.Latino
1 797 936 2042 1416
2 14 143 155 131
3 22 151 86 232
4 241 1015 1296 809
5 124 775 2804 383
6 2662 4539 25776 8311
7 138 281 230 171
8 604 2024 10805 1762
9 863 886 2438 783
10 5278 5951 23158 10599
Record.ID geom
1 2023_DOUGLAS MULTIPOLYGON (((-87.60914 4...
2 2023_OAKLAND MULTIPOLYGON (((-87.59215 4...
3 2023_FULLER PARK MULTIPOLYGON (((-87.6288 41...
4 2023_GRAND BOULEVARD MULTIPOLYGON (((-87.60671 4...
5 2023_KENWOOD MULTIPOLYGON (((-87.59215 4...
6 2023_LINCOLN SQUARE MULTIPOLYGON (((-87.67441 4...
7 2023_WASHINGTON PARK MULTIPOLYGON (((-87.60604 4...
8 2023_HYDE PARK MULTIPOLYGON (((-87.58038 4...
9 2023_WOODLAWN MULTIPOLYGON (((-87.57714 4...
10 2023_ROGERS PARK MULTIPOLYGON (((-87.65456 4...
Areal interpolation with extensive variables
Because the variables from the census are counts of either individuals or households, they are extensive variables. Because areal interpolation involves intersecting the two vector layers in question, they have to be in the same projection. You can see from the output above that commareas is in WGS 1984, and you’ll remember that we reprojected the HOLC areas already. So our first step is to reproject commareas into State Plane Illinois East (Meters).
Reproject the commareas object into an object called commareas_il, which is in State Plane Illinois East (Meters). Provide the code you use to accomplish this task.
Once we have the Community Areas in the correct projection, we can use them to estimate the values of any of the variables in the commareas_il object across the HOLC areas.
For extensive variables, we assume that whatever we are counting is evenly distributed across the polygons (for example, with the census data, we are assuming that population is evenly distributed within each community area). Based on that assumption, we can estimate the amount of any extensive variable from the commareas_il object for each HOLC zone by calculating the proportion of each community area that falls in the zone and using that proportion to compute a weighted sum over all of the community areas intersecting the zone. Here’s how that works:
#First, you need to compute the areas for the polygons with your source data
holc_census <- commareas_il |>
#note that the geometry column for this object is called "geom" - it
#might be called something else in other objects, so double check
mutate(comm_area_km2 = st_area(geom)/1000000) |>
#Then, you intersect the source with your target polygons
st_intersection(holc_il) |>
#And compute the intersecting area and then the proportion of the total
#source polygon that falls within each target polygon
mutate(intersect_area = st_area(geom)/1000000, #compute intersected areas
weight = intersect_area/comm_area_km2) |> #compute weights
group_by(holc_grade) |> #now we group the polygons by the unique identifier
#for the original target polygons
summarise(Total.Population = sum(Total.Population*weight),
            White.Not.Hispanic.or.Latino = sum(White.Not.Hispanic.or.Latino*weight))
Warning: attribute variables are assumed to be spatially constant throughout
all geometries
#Then we add up the weighted values of whatever extensive variables we are
#interested in
holc_census
Simple feature collection with 5 features and 3 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 332577.8 ymin: 552874.5 xmax: 367345.8 ymax: 594869
Projected CRS: NAD83(HARN) / Illinois East
# A tibble: 5 × 4
holc_grade Total.Population White.Not.Hispanic.or.…¹ geom
<chr> [1] [1] <MULTIPOLYGON [m]>
1 A 15661. 10783. (((354189.4 561409.9, 35…
2 B 213490. 72771. (((346300.4 581308.8, 34…
3 C 1059653. 339571. (((346240.3 567580.4, 34…
4 D 544234. 165002. (((353055.3 557653.4, 35…
5 Ungraded 813849. 267390. (((349119.6 567254, 3491…
# ℹ abbreviated name: ¹White.Not.Hispanic.or.Latino
Once we have estimated those population levels in areas of different grades, we can compute more values of interest. For example, we can compute the estimated percentage of the population in each HOLC zone identified as white (not Hispanic or Latino):
holc_census |>
mutate(pct_white = White.Not.Hispanic.or.Latino/Total.Population*100) |>
  select(holc_grade, pct_white)
Simple feature collection with 5 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 332577.8 ymin: 552874.5 xmax: 367345.8 ymax: 594869
Projected CRS: NAD83(HARN) / Illinois East
# A tibble: 5 × 3
holc_grade pct_white geom
<chr> [1] <MULTIPOLYGON [m]>
1 A 68.9 (((354189.4 561409.9, 354189.4 561410.5, 354189.3 561412…
2 B 34.1 (((346300.4 581308.8, 346299.2 581349.1, 346298.6 581367…
3 C 32.0 (((346240.3 567580.4, 346200.1 568292.8, 347251.2 568295…
4 D 30.3 (((353055.3 557653.4, 353055.1 557653.1, 353053.2 557651…
5 Ungraded 32.9 (((349119.6 567254, 349107.5 567253.3, 349106.7 567253.3…
Here, we can clearly see that A-graded HOLC areas retain a disproportionately white population.
Estimate another extensive variable of interest across the HOLC areas. Then convert it into a variable that will be more comparable across the areas (for example, you can compute the variable relative to the total area, total households, or, as above, total population in the HOLC area). Create a plot to show your result. Provide the code you use to conduct the estimation and generate the plot and briefly interpret what you find.
Areal interpolation with intensive variables
Here, we’re going to use areal interpolation to estimate the percentage of children under 6 with elevated blood lead levels in each of the different HOLC grade areas over fifty years after the grades were assigned. The City of Chicago conducted studies of children’s blood lead levels as part of a remediation program in 1999 and 2013 and released these data at the scale of community areas (which can roughly be understood as neighborhoods).
First, we’ll read in the data from the blood lead (Pb) monitoring:
pb <- read.csv("BloodPBLevelsCommAreas.csv", stringsAsFactors = FALSE)
head(pb)
  CommArea CommAreaName ElevBloodPb1999Pct ElevBloodPb2013Pct
1 1 Rogers Park 10.5 0.8
2 2 West Ridge 6.0 1.1
3 3 Uptown 11.3 0.6
4 4 Lincoln Square 7.2 0.9
5 5 North Center 6.4 0.1
6 6 Lake View 4.0 0.2
These data contain, for each community area, the estimated percentage of children under 6 with elevated blood lead levels, based on the 1999 and 2013 surveys. By 2013, thankfully, the numbers were quite low, but they were very problematic in 1999.
Conducting table joins in tidyverse
You’ll notice that pb has no spatial information other than the names of the community areas.
You can see below that the commareas_il object has a column with the community area names, so we should be able to match them up, but the names are not in the same case as the community area names in the dataset with the blood lead estimates.
unique(commareas_il$community)
 [1] "DOUGLAS" "OAKLAND" "FULLER PARK"
[4] "GRAND BOULEVARD" "KENWOOD" "LINCOLN SQUARE"
[7] "WASHINGTON PARK" "HYDE PARK" "WOODLAWN"
[10] "ROGERS PARK" "JEFFERSON PARK" "FOREST GLEN"
[13] "NORTH PARK" "ALBANY PARK" "PORTAGE PARK"
[16] "IRVING PARK" "DUNNING" "MONTCLARE"
[19] "BELMONT CRAGIN" "WEST RIDGE" "HERMOSA"
[22] "AVONDALE" "LOGAN SQUARE" "HUMBOLDT PARK"
[25] "WEST TOWN" "AUSTIN" "WEST GARFIELD PARK"
[28] "EAST GARFIELD PARK" "NEAR WEST SIDE" "NORTH LAWNDALE"
[31] "UPTOWN" "SOUTH LAWNDALE" "LOWER WEST SIDE"
[34] "NEAR SOUTH SIDE" "ARMOUR SQUARE" "NORWOOD PARK"
[37] "NEAR NORTH SIDE" "LOOP" "SOUTH SHORE"
[40] "CHATHAM" "AVALON PARK" "SOUTH CHICAGO"
[43] "BURNSIDE" "MCKINLEY PARK" "LAKE VIEW"
[46] "CALUMET HEIGHTS" "ROSELAND" "NORTH CENTER"
[49] "PULLMAN" "SOUTH DEERING" "EAST SIDE"
[52] "WEST PULLMAN" "RIVERDALE" "HEGEWISCH"
[55] "GARFIELD RIDGE" "ARCHER HEIGHTS" "BRIGHTON PARK"
[58] "BRIDGEPORT" "NEW CITY" "WEST ELSDON"
[61] "GAGE PARK" "CLEARING" "WEST LAWN"
[64] "CHICAGO LAWN" "WEST ENGLEWOOD" "ENGLEWOOD"
[67] "GREATER GRAND CROSSING" "LINCOLN PARK" "ASHBURN"
[70] "AUBURN GRESHAM" "BEVERLY" "WASHINGTON HEIGHTS"
[73] "MOUNT GREENWOOD" "MORGAN PARK" "OHARE"
[76] "EDGEWATER" "EDISON PARK"
unique(pb$CommAreaName)
 [1] "Rogers Park" "West Ridge" "Uptown"
[4] "Lincoln Square" "North Center" "Lake View"
[7] "Lincoln Park" "Near North Side" "Edison Park"
[10] "Norwood Park" "Jefferson Park" "Forest Glen"
[13] "North Park" "Albany Park" "Portage Park"
[16] "Irving Park" "Dunning" "Montclaire"
[19] "Belmont Cragin" "Hermosa" "Avondale"
[22] "Logan Square" "Humboldt park" "West Town"
[25] "Austin" "West Garfield Park" "East Garfield Park"
[28] "Near West Side" "North Lawndale" "South Lawndale"
[31] "Lower West Side" "Loop" "Near South Side"
[34] "Armour Square" "Douglas" "Oakland"
[37] "Fuller Park" "Grand Boulevard" "Kenwood"
[40] "Washington Park" "Hyde Park" "Woodlawn"
[43] "South Shore" "Chatham" "Avalon Park"
[46] "South Chicago" "Burnside" "Calumet Heights"
[49] "Roseland" "Pullman" "South Deering"
[52] "East Side" "West Pullman" "Riverdale"
[55] "Hegewisch" "Garfield Ridge" "Archer Heights"
[58] "O'Hare" "Brighton Park" "McKinley Park"
[61] "Bridgeport" "New City" "West Elsdon"
[64] "Gage Park" "Clearing" "West Lawn"
[67] "Chicago Lawn" "West Englewood" "Englewood"
[70] "Greater Grand Crossing" "Ashburn" "Auburn Gresham"
[73] "Beverly" "Washington Heights" "Mount Greenwood"
[76] "Morgan Park" "Edgewater" "Chicago"
[79] "Unknown"
One option here would be to use R functions to convert everything into either uppercase or lowercase. Because that’s a pretty handy trick when cleaning data, here’s how we’d do it:
commareas_il <- commareas_il |>
mutate(community_lower_case = str_to_lower(community))
pb <- pb |>
mutate(community_upper_case = str_to_upper(CommAreaName))
pb$community_upper_case
 [1] "ROGERS PARK" "WEST RIDGE" "UPTOWN"
[4] "LINCOLN SQUARE" "NORTH CENTER" "LAKE VIEW"
[7] "LINCOLN PARK" "NEAR NORTH SIDE" "EDISON PARK"
[10] "NORWOOD PARK" "JEFFERSON PARK" "FOREST GLEN"
[13] "NORTH PARK" "ALBANY PARK" "PORTAGE PARK"
[16] "IRVING PARK" "DUNNING" "MONTCLAIRE"
[19] "BELMONT CRAGIN" "HERMOSA" "AVONDALE"
[22] "LOGAN SQUARE" "HUMBOLDT PARK" "WEST TOWN"
[25] "AUSTIN" "WEST GARFIELD PARK" "EAST GARFIELD PARK"
[28] "NEAR WEST SIDE" "NORTH LAWNDALE" "SOUTH LAWNDALE"
[31] "LOWER WEST SIDE" "LOOP" "NEAR SOUTH SIDE"
[34] "ARMOUR SQUARE" "DOUGLAS" "OAKLAND"
[37] "FULLER PARK" "GRAND BOULEVARD" "KENWOOD"
[40] "WASHINGTON PARK" "HYDE PARK" "WOODLAWN"
[43] "SOUTH SHORE" "CHATHAM" "AVALON PARK"
[46] "SOUTH CHICAGO" "BURNSIDE" "CALUMET HEIGHTS"
[49] "ROSELAND" "PULLMAN" "SOUTH DEERING"
[52] "EAST SIDE" "WEST PULLMAN" "RIVERDALE"
[55] "HEGEWISCH" "GARFIELD RIDGE" "ARCHER HEIGHTS"
[58] "O'HARE" "BRIGHTON PARK" "MCKINLEY PARK"
[61] "BRIDGEPORT" "NEW CITY" "WEST ELSDON"
[64] "GAGE PARK" "CLEARING" "WEST LAWN"
[67] "CHICAGO LAWN" "WEST ENGLEWOOD" "ENGLEWOOD"
[70] "GREATER GRAND CROSSING" "ASHBURN" "AUBURN GRESHAM"
[73] "BEVERLY" "WASHINGTON HEIGHTS" "MOUNT GREENWOOD"
[76] "MORGAN PARK" "EDGEWATER" "CHICAGO"
[79] "UNKNOWN"
Those functions are pretty easy - they do exactly what they say.
You might have noticed that there’s another option, however. Both objects also have columns that appear to be assigning numerical codes to the community areas:
unique(commareas_il$area_num_1)
 [1] "35" "36" "37" "38" "39" "4" "40" "41" "42" "1" "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "2" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29"
[31] "3" "30" "31" "33" "34" "10" "8" "32" "43" "44" "45" "46" "47" "59" "6"
[46] "48" "49" "5" "50" "51" "52" "53" "54" "55" "56" "57" "58" "60" "61" "62"
[61] "63" "64" "65" "66" "67" "68" "69" "7" "70" "71" "72" "73" "74" "75" "76"
[76] "77" "9"
unique(pb$CommArea)
 [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
[51] 51 52 53 54 55 56 57 76 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
[76] 75 77 0 99
So now we could use any of three options to join the two tables: 1) community area names in uppercase; 2) community area names in lowercase; or 3) community area numbers.
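For instance, option 3 would look something like this sketch - note that area_num_1 is stored as character while CommArea is numeric, so one of them has to be converted before joining:
#Option 3 as a sketch: joining on the numeric community area codes.
commareas_il |>
  mutate(area_num = as.numeric(area_num_1)) |>
  left_join(pb, by = join_by(area_num == CommArea))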
In any case (pun intended), we are ready to do the table join and then intersect with the holc_il object.
commareas_il <- commareas_il |>
left_join(pb, by = join_by(community == community_upper_case))
plot(commareas_il[,c("geom", "ElevBloodPb1999Pct")])
Now we’re ready to use areal interpolation to estimate these variables for the HOLC areas. This time, however, we are working with an intensive variable, a percentage, so we will compute a weighted mean, where the weights are based on the proportion of the target polygon that falls in each source polygon:
commareas_pb <- commareas_il |>
st_intersection(holc_il) |> #As before, we start by intersecting the layers
mutate(area_int = st_area(geom)/1000000) |> #Then compute the intersected areas
group_by(holc_grade) |> #Then group by our ID for the target polygons
mutate(weight = area_int/sum(area_int)) |> #Then compute the weight as the
#Proportion of the total area in the group in each intersection
summarise(ElevBloodPb1999Pct = sum(ElevBloodPb1999Pct*weight,
                                     na.rm=TRUE))
Warning: attribute variables are assumed to be spatially constant throughout
all geometries
#Then we compute a weighted sum for the intensive variable (note that here
#we are ignoring areas that have no data (the white polygons in the map
#above)). Because the weights all add up to one, this is equivalent to
#computing a weighted mean.
commareas_pb
Simple feature collection with 5 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 332577.8 ymin: 552874.5 xmax: 367345.8 ymax: 594869
Projected CRS: NAD83(HARN) / Illinois East
# A tibble: 5 × 3
holc_grade ElevBloodPb1999Pct geom
<chr> [1] <MULTIPOLYGON [m]>
1 A 1.27 (((354189.4 561409.9, 354189.4 561410.5, 354189…
2 B 9.95 (((346300.4 581308.8, 346299.2 581349.1, 346298…
3 C 13.8 (((346240.3 567580.4, 346200.1 568292.8, 347251…
4 D 16.4 (((353055.3 557653.4, 353055.1 557653.1, 353053…
5 Ungraded 9.52 (((349119.6 567254, 349107.5 567253.3, 349106.7…
Create boxplots showing the differences in children’s blood lead levels across HOLC grades in Chicago in 1999 and 2013. Once you have your plot, interpret your results. What might account for the patterns that you see? Be sure to include the code you use to compute your estimates and make the plot.