Advanced Geoprocessing in R: Long-Term Legacies of Redlining in Chicago

Chicago

The Chicago “L.”Photo by Benoit Debaix on Unsplash.

Introduction: Redlining

The current structure of cities reflects very long-term forms of exclusion and inequality. In the United States, one of the most important of these has been redlining, which substantially restricted access to capital for communities of color while funneling subsidized loans into predominantly white neighborhoods, which, through a variety of mechanisms, were made inaccessible to non-whites.

Several researchers have argued that these practices were an essential component of institutionalized racism that has continued to reproduce structural inequalities in wealth and urban segregation long after the redlining maps were drawn. In this tutorial, we will look at the relationship between areas that were redlined in Chicago and present inequalities in the city, some of which have substantial environmental justice implications. To be clear, correlation does not mean causation, but to find substantial present differences across neighborhoods that were redlined approximately 80 years ago would be quite compelling evidence that redlining has had a long-term legacy.

Following a heroic data collection and digitization effort by the members of the Mapping Inequality Project at the University of Richmond, shapefiles showing the federal government’s Home Owners’ Loan Corporation (HOLC) lending maps are now available for most large and medium-sized cities in the United States. This is an unprecedented resource allowing us to investigate the legacies of redlining at present.

Before beginning the tutorial, I would like you to refresh your memory about redlining and, in particular, the HOLC program. There is a very clear introduction available at the Mapping Inequality Project site.

Deliverable 1

After thoroughly investigating the website, answer these questions. What was the HOLC program? Why have many researchers in the past several decades suggested that it has been a major force in perpetuating urban segregation in US cities?

Objectives

  • Get further practice with projections, intersections, and mapping with vectors, as well as constructing data visualizations with ggplot()
  • Learn how to estimate intensive and extensive variables using areal interpolation
  • Practice interpreting the results of simple geospatial analyses

The Data

  • Redlines.gpkg and its associated files show areas marked under the HOLC program. The data were digitized by the Mapping Inequality Project.
  • chicago_boundary.kml is the city boundary of Chicago, sourced from the Chicago Data Portal
  • GroceryStores.csv contains location information for all grocery stores in Chicago as of 2020, sourced from the Chicago Data Portal
  • BloodPBLevelsCommAreas.csv is a table with results from survey studies of the percentage of children residing in Chicago’s official community areas, sourced from the Chicago Data Portal (basically, neighborhoods) with elevated blood lead levels as of 1999 and 2013
  • chicago_community_areas.gpkg contains the boundaries of Chicago’s official community areas, sourced from the Chicago Data Portal
  • ACS_5_year_comm_areas.csv is a table with demographic and income data from the US Census Bureau’s representative sample survey, the American Community Survey, which estimates 5-year moving averages of important census variables for community areas as of 2023
  • Building_Code_Scofflaw.csv contains the locations of buildings whose landlords own at least three buildings with unaddressed code violations

The data you will need for this tutorial can be found in this zip file.

Packages

  • tidyverse for data wrangling
  • sf for vector data geoprocessing
  • ggspatial for mapping
  • maptiles for adding contextual basemaps to your maps
  • tidyterra for mapping those tiles
require(tidyverse)
require(sf)
require(ggspatial)
require(maptiles)
require(tidyterra)

HOLC Data

Let’s start with the boundaries of the HOLC-graded areas, along with the city limits of Chicago. We’re using the same packages and functions that we used in our previous tutorial on the Amazon:

setwd("G:/My Drive/Classes/GIS for Global Environment/Labs/R_Tutorials_/Vector Geoprocessing in R")

holc <- st_read("redlines.gpkg")
Reading layer `redlines' from data source 
  `G:\My Drive\Classes\GIS for Global Environment\Labs\R_Tutorials_\Vector Geoprocessing in R\redlines.gpkg' 
  using driver `GPKG'
Simple feature collection with 5 features and 1 field
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -87.94011 ymin: 41.60414 xmax: -87.52414 ymax: 42.28303
Geodetic CRS:  +proj=longlat +datum=WGS84 +no_defs
chicago <- st_read("chicago_boundary.kml")
Reading layer `Layer #0' from data source 
  `G:\My Drive\Classes\GIS for Global Environment\Labs\R_Tutorials_\Vector Geoprocessing in R\chicago_boundary.kml' 
  using driver `KML'
Simple feature collection with 1 feature and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -87.94011 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
Geodetic CRS:  WGS 84
plot(holc)

plot(st_geometry(chicago))

In order to assess the evidence for long-term impacts of HOLC’s maps, we need to use some tricks from the previous tutorial to estimate values within HOLC-graded areas using data for other types of areal units.

First, however, because some of the datasets we’ll be using cover different extents, we should use st_intersection() to make sure everything is clipped to the Chicago city limits:

chicago_holc <- chicago |>
  st_transform(crs=st_crs(holc))

holc <- holc |>
  st_intersection(chicago_holc)
Warning: attribute variables are assumed to be spatially constant throughout
all geometries
plot(st_geometry(holc))

Summarizing data collected at point locations with sf

We’re going to start with data on grocery store locations (where grocery store is pretty broadly defined and includes some things closer to convenience stores) as of 2020.

We’ll start by reading in the data, which is in a CSV spreadsheet, which is a very common format for point data, for which you only need latitude and longitude coordinates:

groceries <- read.csv("GroceryStores.csv", stringsAsFactors = FALSE)

head(groceries)
              Store.Name Longitude Latitude
1           Jewel - Osco -87.78556 41.92624
2 Mariano's Fresh Market -87.64738 41.88056
3 Mariano's Fresh Market -87.66564 41.83651
4                   Aldi -87.58747 41.75278
5            Food 4 Less -87.66390 41.76517
6     Whole Foods Market -87.64733 41.88250

Then we’ll use a pipeline to convert the table we just read in into an sf object, reproject it to the same projection as the holc layer, and then conduct a join by location operation to add data from the red layer to the airbnb layer. Join by location is similar to intersection, except that instead of creating a new layer with the combined geometries of the intersected layers, it just adds the data from one layer to the other.

We can conduct a join by location using the st_join() function:

groceries <- groceries |>
  st_as_sf(coords=c("Longitude", "Latitude"),
           crs=st_crs(4326)) |>
  #These lines convert the data frame into an sf
  #object. The coords argument allows you to specify which columns hold the
  #x and y coordinates. EPSG number 4326 is just WGS 1984. It is important to remember that the coordinates are in x, y order, where Longitude is the x coordinate and Latitude is the y coordinate.
  st_transform(crs=st_crs(holc)) |>
  st_join(holc)
  #The st_join() function performs an operation called join by location.
  #all it does is take variables from features in one layer and adds 
  #them to features in another layer that intersects them. in this case,
  #we're adding information on the holc grade as of 1940 to every one
  #of our grocery store points

head(groceries)
Simple feature collection with 6 features and 4 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -87.78556 ymin: 41.75278 xmax: -87.58747 ymax: 41.92624
Geodetic CRS:  +proj=longlat +datum=WGS84 +no_defs
              Store.Name holc_grade    Name Description
1           Jewel - Osco   Ungraded CHICAGO            
2 Mariano's Fresh Market          D CHICAGO            
3 Mariano's Fresh Market          D CHICAGO            
4                   Aldi   Ungraded CHICAGO            
5            Food 4 Less          C CHICAGO            
6     Whole Foods Market   Ungraded CHICAGO            
                    geometry
1 POINT (-87.78556 41.92624)
2 POINT (-87.64738 41.88056)
3 POINT (-87.66564 41.83651)
4 POINT (-87.58747 41.75278)
5  POINT (-87.6639 41.76517)
6  POINT (-87.64733 41.8825)

If a feature in the object we are joining data to using st_join() does not intersect any feature in the other object, R will assign that feature a value of NA for all the variables from the other object. That means we can then fill in those missing values to designate grocery stores in areas that were not graded in the 1940s.

groceries <- groceries |>
  mutate(holc_grade = case_when(is.na(holc_grade) ~ "Ungraded",
                                .default = holc_grade))
Deliverable 1

Look up the case_when function’s help file. Read it carefully and then explain the arguments in the function as it is used in the previous code block. Describe how it is working in plain terms.

Now let’s have a quick look at grocery stores across Chicago:

osm_basemap <- get_tiles(chicago, provider = "OpenStreetMap")

ggplot(data = chicago)+
  geom_spatraster_rgb(data=osm_basemap) +
  #st_cast() allows you to convert sf objects to different formats; here,
  #we're turning the Chicago boundary into a line, as opposed to a polygon,
  #so that the resulting map will have just the outline of the city, without
  #a colored fill:
  geom_sf(data = chicago |> st_cast("MULTILINESTRING"), color = "black", linewidth=1.5)+
  geom_sf(data=groceries, size=1,
          aes(color = holc_grade)) +
  scale_color_viridis_d("HOLC Grade") +
  annotation_scale(location = "bl") +
  theme_minimal()+
  ggtitle("Grocery Stores in Chicago in 2020")+
  labs(
    x = "",
    y = ""
  )

Counting points in areas in sf

For years, there has been a lot of discussion about the problem of food deserts - areas with very limited access to grocery stores. We might think that redlined areas could have been less attractive to investment and might go on to host fewer grocery stores.

Let’s see how many grocery stores are in each HOLC class:

#table is a helpful function that takes categorical
#columns (or multiple categorical columns) and
#shows you how many times each value (or each 
#combination of values, if you have multiple 
#columns) shows up.

table(groceries$holc_grade)

       A        B        C        D Ungraded 
       2       18      107       60       75 
Deliverable 2

The raw totals of grocery stores are not particularly informative if we are interested in the long-term impacts of redlining. Explain why the values we just calculated aren’t a great way to address this question.

Now, let’s examine the density of points in the different HOLC graded areas.

Because we’ll be doing calculations of distances and areas, we’ll need to project our layers into a local projection. In the United States, the most common projections used for city- and state-scale maps are probably those from the state plane system. The state plane system, like the UTM system we’ve used before, is a set of projections optimized for specific locations, in this case in the United States. Most states in the us have two or three state plane projections, optimized for different parts of the state depending on its size and shape.

Deliverable 3

The most commonly used projection for Chicago is State Plane Illinois East. To keep things consistent, let’s have the units in meters. Find the EPSG number for this projection at spatialreference.org. Then use the st_crs() function to create a CRS object called il_east that we can use to refer to this projection in the rest of the code. Add the code you create to your tutorial report. If you did things correctly, running the il_east on its own line should produce the following output:

Coordinate Reference System:
  User input: EPSG:2790 
  wkt:
PROJCRS["NAD83(HARN) / Illinois East",
    BASEGEOGCRS["NAD83(HARN)",
        DATUM["NAD83 (High Accuracy Reference Network)",
            ELLIPSOID["GRS 1980",6378137,298.257222101,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4152]],
    CONVERSION["SPCS83 Illinois East zone (meter)",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",36.6666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",-88.3333333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",0.999975,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",300000,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["easting (X)",east,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["northing (Y)",north,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Engineering survey, topographic mapping."],
        AREA["United States (USA) - Illinois - counties of Boone; Champaign; Clark; Clay; Coles; Cook; Crawford; Cumberland; De Kalb; De Witt; Douglas; Du Page; Edgar; Edwards; Effingham; Fayette; Ford; Franklin; Gallatin; Grundy; Hamilton; Hardin; Iroquois; Jasper; Jefferson; Johnson; Kane; Kankakee; Kendall; La Salle; Lake; Lawrence; Livingston; Macon; Marion; Massac; McHenry; McLean; Moultrie; Piatt; Pope; Richland; Saline; Shelby; Vermilion; Wabash; Wayne; White; Will; Williamson."],
        BBOX[37.06,-89.27,42.5,-87.02]],
    ID["EPSG",2790]]

Now we’ll reproject our layers into State Plane Illinois East:

holc_il <- holc |>
  st_transform(crs = il_east)

groceries <- groceries |>
  st_transform(crs = st_crs(holc_il))

# We're using a trick here, using st_crs() to get
# the CRS information for a layer that's already in
# the projection we want to set the parameters for
# our reprojection operation.

Because we are going to look at the density of grocery stores in each area, we also need to calculate the area of each HOLC grade. Remember that st_area uses map units, so everything is in meters. Below, we’re dividing by 1,000,000 to convert from square meters to square kilometers:

holc_il <- holc_il |>
  mutate(area = st_area(holc_il)/1000000)

plot(holc_il[,c("geom", "area")])

Now we join the data on the HOLC area to the grocery layer:

groceries_holc <- groceries |>
  st_join(holc_il[,"area"])

Then we can use group_by() to group by HOLC grade and then summarise() to add up the points and the area in each grade:

groceries_holc <- groceries_holc |>
  filter(!is.na(holc_grade)) |>
  group_by(holc_grade) |>
  summarise(stores=n(),
            area=min(area)) |>
  # n() function is used to simply
  #count the number (n) of observations in each group when summarizing.
  #because each HOLC grade in holc_il is only one multipolygon object, area is
  #the entire area under each HOLC grade, so we don't want
  #to add them up, and instead just take the minimum value. the areas within
  #each grade will all be equal, so we could just as easily use max, median,
  #or mean to summarize them.
  mutate(storesperkm = as.numeric(stores/area))
  #the as.numeric() part is to turn the result into a plain number; otherwise,
  #R will treat the value as if it's in meters.

Now we can plot our results:

ggplot(groceries_holc,
       aes(x=holc_grade, y=storesperkm))+
  geom_bar(stat="identity", fill="black", color="gray", linewidth=1)+
  #Here we are
  #using a bar graph format. stat="identity" means that we already have
  #defined y-values. Without that, geom_bar() will just count up the number
  #of observations of each value of a categorical variable supplied as the
  #x-axis variable.
  scale_x_discrete("HOLC Grade in 1940")+
  scale_y_continuous("Grocery Stores per Square KM, 2020")+
  theme_minimal()+
  theme(axis.title.x=element_text(size=18, colour="black", face="bold"),
        axis.text.x=element_text(size=16, colour="black", face="bold"),
        axis.ticks.x=element_blank(),
        axis.title.y=element_text(size=18, colour="black", face="bold"),
        axis.text.y=element_text(size=16, colour="black", face="bold"),
        axis.ticks.y=element_blank(),
        plot.title = element_text(size=28, colour="black", face="bold"),
        strip.text=element_text(size=16, colour="black", face="bold"),
        legend.title=element_text(size=18, colour="black", face="bold"),
        legend.text=element_text(size=16, colour="black", face="bold"),
        legend.position = "bottom")
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_bar()`).

Deliverable 4

Read in the Building_Code_Scofflaw.csv file and convert it to an sf object. This file shows the point locations of buildings whose landlords have at least three buildings with building code violations that they have failed to correct. Compute the density of these buildings and produce a figure like that above. Include your code and results and reflect on what you find in your tutorial report.

Areal interpolation

Overview of areal interpolation

Areal interpolation is a fancy name for estimating the values of variables in one set of polygons based on the variable values in an overlapping - but different - set of polygons. Why would you want to do that? Well, it’s quite common for spatial data - often for privacy reasons or because the data in question are collected only for a sample, not a population - to be available only as a summary for a particular area. The thing is, sometimes we want to know the values of these variables in another set of areas defined for a completely different purpose, whose boundaries differ from the areas for which we have data, even though both sets of boundaries overlap in actual space.

This is a bit hard to wrap your head around without some more visuals, so please take a moment to look through the logic behind areal interpolation at this link.

The strategy for areal interpolation depends on whether we’re interpolating extensive variables, which are really just a count of something, like the number of people or the total GDP of an area, or intensive variables, which are rates or relative measures, like median income, proportion of the population that identifies as female.

Table joins in the tidyverse

Chicago has its own official set of spatial subdivisions, called community areas, which you can think of as the city’s rough attempt to define crisp boundaries for distinct neighborhoods. Because community areas are central to how Chicago’s municipal government - and the city’s residents - think about and interact with the city, a lot of the information on Chicago’s City Data Portal is presented for community areas.

Let’s start by reading in and having a look at the community areas:

commareas <- st_read("chicago_community_areas.gpkg") 
Reading layer `ChicagoCommunityAreas' from data source 
  `G:\My Drive\Classes\GIS for Global Environment\Labs\R_Tutorials_\Vector Geoprocessing in R\chicago_community_areas.gpkg' 
  using driver `GPKG'
Simple feature collection with 77 features and 9 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -87.94011 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
Geodetic CRS:  WGS 84
plot(st_geometry(commareas))

commareas
Simple feature collection with 77 features and 9 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -87.94011 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
Geodetic CRS:  WGS 84
First 10 features:
   area area_num_1 area_numbe comarea comarea_id       community perimeter
1     0         35         35       0          0         DOUGLAS         0
2     0         36         36       0          0         OAKLAND         0
3     0         37         37       0          0     FULLER PARK         0
4     0         38         38       0          0 GRAND BOULEVARD         0
5     0         39         39       0          0         KENWOOD         0
6     0          4          4       0          0  LINCOLN SQUARE         0
7     0         40         40       0          0 WASHINGTON PARK         0
8     0         41         41       0          0       HYDE PARK         0
9     0         42         42       0          0        WOODLAWN         0
10    0          1          1       0          0     ROGERS PARK         0
   shape_area shape_len                           geom
1    46004621  31027.05 MULTIPOLYGON (((-87.60914 4...
2    16913961  19565.51 MULTIPOLYGON (((-87.59215 4...
3    19916705  25339.09 MULTIPOLYGON (((-87.6288 41...
4    48492503  28196.84 MULTIPOLYGON (((-87.60671 4...
5    29071742  23325.17 MULTIPOLYGON (((-87.59215 4...
6    71352328  36624.60 MULTIPOLYGON (((-87.67441 4...
7    42373881  28175.32 MULTIPOLYGON (((-87.60604 4...
8    45105380  29746.71 MULTIPOLYGON (((-87.58038 4...
9    57815180  46936.96 MULTIPOLYGON (((-87.57714 4...
10   51259902  34052.40 MULTIPOLYGON (((-87.65456 4...

This geopackage file basically just has the boundaries of the community areas, along with some identifying information. This is actually a pretty common situation when working with GIS data - it’s not at all rare to have datasets in which the data are measured for specific areal units, like cities, counties, or, in this case, neighborhoods, while nevertheless lacking information on these locations’ boundaries.

What we generally need to do in these cases is to find a separate dataset that has location information, along with some common column identifying the locations that we can use to perform a table join. Table joins match data from one table to another based on a common column that can be used to identify which rows in Table 1 correspond to which rows in Table 2.

For example:

Table 1:

Store Zip Code
A 18042
B 43210
C 43260
D 64850
E 64865

Table 2:

Zip Code Median HH Income ($)
64850 105000
18042 55000
43260 45000
43210 40000
64865 80000

Table 1 after table joining Table 2:

Store Zip Code Median HH Income ($)
A 18042 55000
B 43210 40000
C 43260 45000
D 64850 105000
E 64865 80000

So let’s get some data that we can attach to these community areas. The Chicago City Data Portal hosts a dataset that presents some basic income and demographic data for community areas based on the US Census’s American Community Survey 5-year estimates, which is a representative sample survey used to estimate important demographic variables on a 5-year moving average between the full census years that take place every decade. Let’s read data for 2023 in now:

commareas_acs <-
  read.csv("ACS_5_year_comm_areas.csv",
           stringsAsFactors = FALSE)

head(commareas_acs)
  ACS.Year Community.Area Under..25000 X.25000.to..49999 X.50000.to..74999
1     2023    ALBANY PARK         1269              1916              1801
2     2023 ARCHER HEIGHTS          223               752               441
3     2023  ARMOUR SQUARE          701               798               370
4     2023        ASHBURN          797              1351              1985
5     2023 AUBURN GRESHAM         2541              2451              1592
6     2023         AUSTIN         5506              5084              3600
  X.75000.to..125000 X.125000.. Male.0.to.17 Male.18.to.24 Male.25.to.34
1               2306       3379         4799          2955          4513
2                795        739         1927           732          1102
3                637        597         1300           487           871
4               3014       2735         5150          1964          2881
5               2202       1850         5803          1836          2964
6               4047       3725        12399          4668          6681
  Male.35.to.49 Male.50.to.64 Male.65. Female.0.to.17 Female.18.to.24
1          5442          4354     2287           4913            2405
2          1240          1417     1035           1502             899
3          1174          1177     1032           1022             517
4          4178          4228     2473           5138            2248
5          3431          3469     3223           5305            2304
6          9920          8110     6175          12163            4575
  Female.25.to.34 Female.35.to.49 Female.50.to.64 Female.65.. Total.Population
1            4116            5228            3764        3054            47830
2             978            1167             854        1021            13875
3             965            1163            1378        2063            13149
4            2774            4582            4044        3184            42842
5            3057            4522            5213        5357            46483
6            6643           10092           10040        8811           100278
  White Black.or.African.American American.Indian.or.Alaska.Native Asian
1 21496                      2228                              759  7124
2  6232                        10                              108   679
3  2556                      1487                              107  8402
4 11297                     18124                              697   436
5   760                     43414                              119   399
6 10447                     73602                              531   678
  Native.Hawaiian.or.Pacific.Islander Other.Race Multiracial
1                                   1       7888        8334
2                                   0       3705        3142
3                                  61        212         325
4                                   0       7772        4517
5                                   0        993         798
6                                  21       8512        6487
  White.Not.Hispanic.or.Latino Hispanic.or.Latino           Record.ID
1                        16115              21108    2023_ALBANY PARK
2                         2043              11097 2023_ARCHER HEIGHTS
3                         2226                565  2023_ARMOUR SQUARE
4                         3774              19917        2023_ASHBURN
5                          491               1577 2023_AUBURN GRESHAM
6                         5386              19591         2023_AUSTIN

Generally, the US Census presents data in terms of population counts of the number of households or people in an area with different characteristics. In this case, the numbers are broken down by income, gender, and race. For example, Under..25000 is the number of households in the community area with an income below $25,000. The non-income variables, on the other hand, are in terms of individuals. Hence, Male.0.to.17 designates the estimated number of male individuals under the age of 18 resident in the community area.

So how do we perform a table join with these two datasets? We’ll use the left_join() function, which conducts a table join and keeps all observations from the table to which you are joining (right_join(), not surprisingly, does the opposite, while inner_join() keeps all observations from both tables, whether they match to a row in the other table or not).

All the _join() functions other than st_join() automatically look for columns that have the same name in each table and then try to match those up. If the columns are named something different in the different tables, we have to use a special syntax to tell R which columns to use.

commareas <- commareas |>
  left_join(commareas_acs,
            by = join_by(community == Community.Area))

#The by = join_by() argument is what tells R which
#columns in each table should be matched. In this case,
#the commareas object has a column called community that
#has the community area names, while the commareas_acs
#object calls the same column Community.Area

commareas
Simple feature collection with 77 features and 38 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -87.94011 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
Geodetic CRS:  WGS 84
First 10 features:
   area area_num_1 area_numbe comarea comarea_id       community perimeter
1     0         35         35       0          0         DOUGLAS         0
2     0         36         36       0          0         OAKLAND         0
3     0         37         37       0          0     FULLER PARK         0
4     0         38         38       0          0 GRAND BOULEVARD         0
5     0         39         39       0          0         KENWOOD         0
6     0          4          4       0          0  LINCOLN SQUARE         0
7     0         40         40       0          0 WASHINGTON PARK         0
8     0         41         41       0          0       HYDE PARK         0
9     0         42         42       0          0        WOODLAWN         0
10    0          1          1       0          0     ROGERS PARK         0
   shape_area shape_len ACS.Year Under..25000 X.25000.to..49999
1    46004621  31027.05     2023          689               571
2    16913961  19565.51     2023          361               228
3    19916705  25339.09     2023          186               113
4    48492503  28196.84     2023          913               944
5    29071742  23325.17     2023          564               296
6    71352328  36624.60     2023          576              1022
7    42373881  28175.32     2023         1142               717
8    45105380  29746.71     2023          301               255
9    57815180  46936.96     2023         1039              1223
10   51259902  34052.40     2023         1038              1986
   X.50000.to..74999 X.75000.to..125000 X.125000.. Male.0.to.17 Male.18.to.24
1                526                431        737         1630          1654
2                144                125        182          772           104
3                 36                 61         34          225           149
4                651               1406       1292         2160           807
5                116                646       1254         1502           480
6               1173               2162       4410         3588          1258
7                274                317        267         2178           497
8                309                731       2078         1599          3025
9                622                917        807         2564          1958
10              1477               2460       2802         4392          2431
   Male.25.to.34 Male.35.to.49 Male.50.to.64 Male.65. Female.0.to.17
1           1445          1400          1276     1149           1464
2            185           236           236      142            565
3             75            87           134      365            212
4           2105          1857          2299     1347           2902
5           1052          1167          1308     1034           1548
6           4535          5778          3366     1739           3553
7            673          1005           749      465           1647
8           2476          1535          1527     1245           1619
9           1809          1861          2246     1019           2410
10          4440          6095          5495     2837           4879
   Female.18.to.24 Female.25.to.34 Female.35.to.49 Female.50.to.64 Female.65..
1             1511            1994            1463            1626        2017
2              211             351             591             564         414
3               86             110             194             269         280
4             1080            2370            2930            2586        2365
5              450            1156            1542            1566        1564
6             1589            5379            5576            3357        2462
7              793            1189            1602            1001         564
8             3157            2614            1916            1683        1696
9             1801            1958            2533            2372        1905
10            4537            5128            5772            4080        3081
   Total.Population White Black.or.African.American
1             18629  2278                     12400
2              4373   159                      3946
3              2186   148                      1850
4             24808  1339                     21947
5             14368  2958                      9791
6             42180 28731                      1623
7             12362   242                     11691
8             24092 11509                      6143
9             24437  2807                     18965
10            53169 25129                     13602
   American.Indian.or.Alaska.Native Asian Native.Hawaiian.or.Pacific.Islander
1                               214  2006                                   0
2                                59    51                                   0
3                                 7     7                                   0
4                                38   204                                  24
5                                18   697                                   4
6                               280  4345                                   0
7                                10     0                                   0
8                                49  3763                                   0
9                                50   867                                   0
10                              459  2750                                   0
   Other.Race Multiracial White.Not.Hispanic.or.Latino Hispanic.or.Latino
1         797         936                         2042               1416
2          14         143                          155                131
3          22         151                           86                232
4         241        1015                         1296                809
5         124         775                         2804                383
6        2662        4539                        25776               8311
7         138         281                          230                171
8         604        2024                        10805               1762
9         863         886                         2438                783
10       5278        5951                        23158              10599
              Record.ID                           geom
1          2023_DOUGLAS MULTIPOLYGON (((-87.60914 4...
2          2023_OAKLAND MULTIPOLYGON (((-87.59215 4...
3      2023_FULLER PARK MULTIPOLYGON (((-87.6288 41...
4  2023_GRAND BOULEVARD MULTIPOLYGON (((-87.60671 4...
5          2023_KENWOOD MULTIPOLYGON (((-87.59215 4...
6   2023_LINCOLN SQUARE MULTIPOLYGON (((-87.67441 4...
7  2023_WASHINGTON PARK MULTIPOLYGON (((-87.60604 4...
8        2023_HYDE PARK MULTIPOLYGON (((-87.58038 4...
9         2023_WOODLAWN MULTIPOLYGON (((-87.57714 4...
10     2023_ROGERS PARK MULTIPOLYGON (((-87.65456 4...

Areal interpolation with extensive variables

Because the variables from the census are counts of either individuals or households, they are extensive variables. Becuase areal interpolation involves intersecting the two vector layers in question, they have to be in the same projection. You can see from the output above that commareas is in WGS 1984, and you’ll remember that we reprojected the HOLC areas already. So our first step is to reproject commareas into State Plane Illinois East (Meters).

Deliverable 5

Reproject the commareas object into an object called commareas_il, which is in State Plane Illinois East (Meters). Provide the code you use to accomplish this task.

Once we have the Community Areas in the correct projection, we can use them to estimate the values of any of the variables in the commareas_il object across the HOLC areas.

For extensive variables, we assume that whatever we are counting is evenly distributed across the polygons (for example, with the census data, we are assuming that population is evenly distributed within the community areas). Based on that assumption, we can estimate the amount of any extensive variable from the commareas_il object for each HOLC zone by calculating the proportion of each community area that falls in each HOLC zone and using that percentage to compute a weighted sum of the population of all community areas intersecting each zone. Here’s how that works:

#First, you need to compute the areas for the polygons with your source data
holc_census <- commareas_il |>
  #note that the geometry column for this object is called "geom" - it
  #might be called something else in other objects, so double check
  mutate(comm_area_km2 = st_area(geom)/1000000) |>
  #Then, you intersect the source with your target polygons
  st_intersection(holc_il) |>
  #And compute the intersecting area and then the proportion of the total
  #source polygon that falls within each target polygon
  mutate(intersect_area = st_area(geom)/1000000, #compute intersected areas
         weight = intersect_area/comm_area_km2) |> #compute weights   
  group_by(holc_grade) |> #now we group the polygons by the unique identifier
  #for the original target polygons
  summarise(Total.Population = sum(Total.Population*weight),
            White.Not.Hispanic.or.Latino = sum(White.Not.Hispanic.or.Latino*weight))
Warning: attribute variables are assumed to be spatially constant throughout
all geometries
  #Then we add up the weighted values of whatever extensive variables we are
  #interested in

holc_census
Simple feature collection with 5 features and 3 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 332577.8 ymin: 552874.5 xmax: 367345.8 ymax: 594869
Projected CRS: NAD83(HARN) / Illinois East
# A tibble: 5 × 4
  holc_grade Total.Population White.Not.Hispanic.or.…¹                      geom
  <chr>                   [1]                      [1]        <MULTIPOLYGON [m]>
1 A                    15661.                   10783. (((354189.4 561409.9, 35…
2 B                   213490.                   72771. (((346300.4 581308.8, 34…
3 C                  1059653.                  339571. (((346240.3 567580.4, 34…
4 D                   544234.                  165002. (((353055.3 557653.4, 35…
5 Ungraded            813849.                  267390. (((349119.6 567254, 3491…
# ℹ abbreviated name: ¹​White.Not.Hispanic.or.Latino

Once we have estimated those population levels in areas of different grades, we can compute more values of interest. For example, we can compute the estimated percentage of the population in each HOLC zone identified as white (not Hispanic or Latino):

holc_census |>
  mutate(pct_white = White.Not.Hispanic.or.Latino/Total.Population*100) |>
  select(holc_grade, pct_white)
Simple feature collection with 5 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 332577.8 ymin: 552874.5 xmax: 367345.8 ymax: 594869
Projected CRS: NAD83(HARN) / Illinois East
# A tibble: 5 × 3
  holc_grade pct_white                                                      geom
  <chr>            [1]                                        <MULTIPOLYGON [m]>
1 A               68.9 (((354189.4 561409.9, 354189.4 561410.5, 354189.3 561412…
2 B               34.1 (((346300.4 581308.8, 346299.2 581349.1, 346298.6 581367…
3 C               32.0 (((346240.3 567580.4, 346200.1 568292.8, 347251.2 568295…
4 D               30.3 (((353055.3 557653.4, 353055.1 557653.1, 353053.2 557651…
5 Ungraded        32.9 (((349119.6 567254, 349107.5 567253.3, 349106.7 567253.3…

Here, we can clearly see that A-graded HOLC areas retain a disproportionately white population.

Deliverable 6

Estimate another extensive variable of interest across the HOLC areas. Then convert it into a variable that will be more comparable across the areas (for example, you can compute the variable relative to the total area, total households, or, as above, total population in the HOLC area). Create a plot to show your result. Provide the code you use to conduct the estimation and generate the plot and briefly interpret what you find.

Areal interpolation with intensive variables

Here, we’re going to use areal interpolation to estimate the percentage of children under 6 with elevated blood lead levels in each of the different HOLC grade areas over fifty years after the grades were assigned. The City of Chicago conducted studies of children’s blood lead levels as part of a remediation program in 1999 and 2013 and released these data at the scale of community areas (which can roughly be understood as neighborhoods).

First, we’ll read in the data from the blood lead (PB) monitoring:

pb <- read.csv("BloodPBLevelsCommAreas.csv", stringsAsFactors = FALSE)
head(pb)
  CommArea   CommAreaName ElevBloodPb1999Pct ElevBloodPb2013Pct
1        1    Rogers Park               10.5                0.8
2        2     West Ridge                6.0                1.1
3        3         Uptown               11.3                0.6
4        4 Lincoln Square                7.2                0.9
5        5   North Center                6.4                0.1
6        6      Lake View                4.0                0.2

These data contain, for each community area, the estimated percentage of children under 6 with elevated blood lead levels, based on the 1999 and 2013 surveys. By 2013, thankfully, the numbers were quite low, but they were very problematic in 1999.

Conducting table joins in tidyverse

You’ll notice that pb has no spatial information other than the names of the community areas.

You can see from here that the commareas_il object has an entry for the community area name, so we should be able to match it up, but it’s not in the same case as the community area names in the dataset with the blood lead estimates.

unique(commareas_il$community)
 [1] "DOUGLAS"                "OAKLAND"                "FULLER PARK"           
 [4] "GRAND BOULEVARD"        "KENWOOD"                "LINCOLN SQUARE"        
 [7] "WASHINGTON PARK"        "HYDE PARK"              "WOODLAWN"              
[10] "ROGERS PARK"            "JEFFERSON PARK"         "FOREST GLEN"           
[13] "NORTH PARK"             "ALBANY PARK"            "PORTAGE PARK"          
[16] "IRVING PARK"            "DUNNING"                "MONTCLARE"             
[19] "BELMONT CRAGIN"         "WEST RIDGE"             "HERMOSA"               
[22] "AVONDALE"               "LOGAN SQUARE"           "HUMBOLDT PARK"         
[25] "WEST TOWN"              "AUSTIN"                 "WEST GARFIELD PARK"    
[28] "EAST GARFIELD PARK"     "NEAR WEST SIDE"         "NORTH LAWNDALE"        
[31] "UPTOWN"                 "SOUTH LAWNDALE"         "LOWER WEST SIDE"       
[34] "NEAR SOUTH SIDE"        "ARMOUR SQUARE"          "NORWOOD PARK"          
[37] "NEAR NORTH SIDE"        "LOOP"                   "SOUTH SHORE"           
[40] "CHATHAM"                "AVALON PARK"            "SOUTH CHICAGO"         
[43] "BURNSIDE"               "MCKINLEY PARK"          "LAKE VIEW"             
[46] "CALUMET HEIGHTS"        "ROSELAND"               "NORTH CENTER"          
[49] "PULLMAN"                "SOUTH DEERING"          "EAST SIDE"             
[52] "WEST PULLMAN"           "RIVERDALE"              "HEGEWISCH"             
[55] "GARFIELD RIDGE"         "ARCHER HEIGHTS"         "BRIGHTON PARK"         
[58] "BRIDGEPORT"             "NEW CITY"               "WEST ELSDON"           
[61] "GAGE PARK"              "CLEARING"               "WEST LAWN"             
[64] "CHICAGO LAWN"           "WEST ENGLEWOOD"         "ENGLEWOOD"             
[67] "GREATER GRAND CROSSING" "LINCOLN PARK"           "ASHBURN"               
[70] "AUBURN GRESHAM"         "BEVERLY"                "WASHINGTON HEIGHTS"    
[73] "MOUNT GREENWOOD"        "MORGAN PARK"            "OHARE"                 
[76] "EDGEWATER"              "EDISON PARK"           
unique(pb$CommAreaName)
 [1] "Rogers Park"            "West Ridge"             "Uptown"                
 [4] "Lincoln Square"         "North Center"           "Lake View"             
 [7] "Lincoln Park"           "Near North Side"        "Edison Park"           
[10] "Norwood Park"           "Jefferson Park"         "Forest Glen"           
[13] "North Park"             "Albany Park"            "Portage Park"          
[16] "Irving Park"            "Dunning"                "Montclaire"            
[19] "Belmont Cragin"         "Hermosa"                "Avondale"              
[22] "Logan Square"           "Humboldt park"          "West Town"             
[25] "Austin"                 "West Garfield Park"     "East Garfield Park"    
[28] "Near West Side"         "North Lawndale"         "South Lawndale"        
[31] "Lower West Side"        "Loop"                   "Near South Side"       
[34] "Armour Square"          "Douglas"                "Oakland"               
[37] "Fuller Park"            "Grand Boulevard"        "Kenwood"               
[40] "Washington Park"        "Hyde Park"              "Woodlawn"              
[43] "South Shore"            "Chatham"                "Avalon Park"           
[46] "South Chicago"          "Burnside"               "Calumet Heights"       
[49] "Roseland"               "Pullman"                "South Deering"         
[52] "East Side"              "West Pullman"           "Riverdale"             
[55] "Hegewisch"              "Garfield Ridge"         "Archer Heights"        
[58] "O'Hare"                 "Brighton Park"          "McKinley Park"         
[61] "Bridgeport"             "New City"               "West Elsdon"           
[64] "Gage Park"              "Clearing"               "West Lawn"             
[67] "Chicago Lawn"           "West Englewood"         "Englewood"             
[70] "Greater Grand Crossing" "Ashburn"                "Auburn Gresham"        
[73] "Beverly"                "Washington Heights"     "Mount Greenwood"       
[76] "Morgan Park"            "Edgewater"              "Chicago"               
[79] "Unknown"               

One option here would be to use R functions to convert everything into either uppercase or lowercase. Because that’s a pretty handy trick when cleaning data, here’s how we’d to it:

commareas_il <- commareas_il |>
  mutate(community_lower_case = str_to_lower(community))

pb <- pb |>
  mutate(community_upper_case = str_to_upper(CommAreaName))

pb$community_upper_case
 [1] "ROGERS PARK"            "WEST RIDGE"             "UPTOWN"                
 [4] "LINCOLN SQUARE"         "NORTH CENTER"           "LAKE VIEW"             
 [7] "LINCOLN PARK"           "NEAR NORTH SIDE"        "EDISON PARK"           
[10] "NORWOOD PARK"           "JEFFERSON PARK"         "FOREST GLEN"           
[13] "NORTH PARK"             "ALBANY PARK"            "PORTAGE PARK"          
[16] "IRVING PARK"            "DUNNING"                "MONTCLAIRE"            
[19] "BELMONT CRAGIN"         "HERMOSA"                "AVONDALE"              
[22] "LOGAN SQUARE"           "HUMBOLDT PARK"          "WEST TOWN"             
[25] "AUSTIN"                 "WEST GARFIELD PARK"     "EAST GARFIELD PARK"    
[28] "NEAR WEST SIDE"         "NORTH LAWNDALE"         "SOUTH LAWNDALE"        
[31] "LOWER WEST SIDE"        "LOOP"                   "NEAR SOUTH SIDE"       
[34] "ARMOUR SQUARE"          "DOUGLAS"                "OAKLAND"               
[37] "FULLER PARK"            "GRAND BOULEVARD"        "KENWOOD"               
[40] "WASHINGTON PARK"        "HYDE PARK"              "WOODLAWN"              
[43] "SOUTH SHORE"            "CHATHAM"                "AVALON PARK"           
[46] "SOUTH CHICAGO"          "BURNSIDE"               "CALUMET HEIGHTS"       
[49] "ROSELAND"               "PULLMAN"                "SOUTH DEERING"         
[52] "EAST SIDE"              "WEST PULLMAN"           "RIVERDALE"             
[55] "HEGEWISCH"              "GARFIELD RIDGE"         "ARCHER HEIGHTS"        
[58] "O'HARE"                 "BRIGHTON PARK"          "MCKINLEY PARK"         
[61] "BRIDGEPORT"             "NEW CITY"               "WEST ELSDON"           
[64] "GAGE PARK"              "CLEARING"               "WEST LAWN"             
[67] "CHICAGO LAWN"           "WEST ENGLEWOOD"         "ENGLEWOOD"             
[70] "GREATER GRAND CROSSING" "ASHBURN"                "AUBURN GRESHAM"        
[73] "BEVERLY"                "WASHINGTON HEIGHTS"     "MOUNT GREENWOOD"       
[76] "MORGAN PARK"            "EDGEWATER"              "CHICAGO"               
[79] "UNKNOWN"               

Those functions are pretty easy - they do exactly what they say.

You might have noticed that there’s another option, however. Both objects also have columns that appear to be assigning numerical codes to the community areas:

unique(commareas_il$area_num_1)
 [1] "35" "36" "37" "38" "39" "4"  "40" "41" "42" "1"  "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "2"  "20" "21" "22" "23" "24" "25" "26" "27" "28" "29"
[31] "3"  "30" "31" "33" "34" "10" "8"  "32" "43" "44" "45" "46" "47" "59" "6" 
[46] "48" "49" "5"  "50" "51" "52" "53" "54" "55" "56" "57" "58" "60" "61" "62"
[61] "63" "64" "65" "66" "67" "68" "69" "7"  "70" "71" "72" "73" "74" "75" "76"
[76] "77" "9" 
unique(pb$CommArea)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
[26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
[51] 51 52 53 54 55 56 57 76 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
[76] 75 77  0 99

So now we could use any of three options to join the two tables: 1) community area names in uppercase; 2) community area names in lowercase; or 3) community area numbers.

In any case (pun intended) we are ready to do the table join and then intersect with the holc_il object.

commareas_il <- commareas_il |>
  left_join(pb, by = join_by(community == community_upper_case))

plot(commareas_il[,c("geom", "ElevBloodPb1999Pct")])

Now we’re ready to use areal interpolation to estimate these variables for the HOLC areas. This time, however, we are working with an intensive variable, a percentage. Because we are working with an intensive variable, we will be computing a weighted mean, where the weights are based on the proportion of the target polygon that falls in each source polygon:

commareas_pb <- commareas_il |>
  st_intersection(holc_il) |> #As before, we start by intersecting the layers
  mutate(area_int = st_area(geom)/1000000) |> #Then compute the intersected areas
  group_by(holc_grade) |> #Then group by our ID for the target polygons
  mutate(weight = area_int/sum(area_int)) |> #Then compute the weight as the
  #Proportion of the total area in the group in each intersection
  summarise(ElevBloodPb1999Pct = sum(ElevBloodPb1999Pct*weight,
                                      na.rm=TRUE))
Warning: attribute variables are assumed to be spatially constant throughout
all geometries
  #Then we compute a weighted sum for the intensive variable (note that here
  #we are ignoring areas that have no data (the white polygons in the map
  #above)). Becauce the weights all add up to one, this is equivalent to
  #computing a weighted mean.

commareas_pb
Simple feature collection with 5 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 332577.8 ymin: 552874.5 xmax: 367345.8 ymax: 594869
Projected CRS: NAD83(HARN) / Illinois East
# A tibble: 5 × 3
  holc_grade ElevBloodPb1999Pct                                             geom
  <chr>                     [1]                               <MULTIPOLYGON [m]>
1 A                        1.27 (((354189.4 561409.9, 354189.4 561410.5, 354189…
2 B                        9.95 (((346300.4 581308.8, 346299.2 581349.1, 346298…
3 C                       13.8  (((346240.3 567580.4, 346200.1 568292.8, 347251…
4 D                       16.4  (((353055.3 557653.4, 353055.1 557653.1, 353053…
5 Ungraded                 9.52 (((349119.6 567254, 349107.5 567253.3, 349106.7…
Deliverable 7

Create boxplots showing the differences in children’s blood lead levels across HOLC grades in Chicago in 1999 and 2013. Once you have your plot, interpret your results. What might account for the patterns that you see? Be sure to include the code you use to compute your estimates and make the plot.

Congratulations! You’re done!