This second lab assignment is built with aggregating data from different sources in mind, tied to the zip code field. This reuses the geopackage file made from last weeks lab data read as three sf objects of NYC zip code areas, food retail stores, and health facilities. [Note this assumes a file structure wherein all tables are already stored under a ./data directory].
First, The COVID-10 data must be joined to the NYC zip code area
data. The ZIP code polygon layer created in the previous lab was reused
by reading the nyc_zip_codes layer from the saved
GeoPackage. For the COVID data, one wee+kly file,
tests-by-zcta_2021_04_23.csv, was selected for this lab. The ZIP-level
COVID table was then joined to the ZIP polygon sf object using
left_join(), which matches the polygon field
ZIPCODE to the COVID table field
MODIFIED_ZCTA. (Left input was used so that spatial
geometry would be preserved in the product!).
gpkg_file <- "data/sf_objects.gpkg"
zip_sf <- st_read(gpkg_file, layer = "nyc_zip_codes", quiet = TRUE)
covid_df <- read_csv("data/tests-by-zcta_2021_04_23.csv", show_col_types = FALSE) # A Week of COVID ZIP-level data
# Conv ZIP join field to character so it matches ZIPCODE
covid_df <- covid_df %>%
mutate(MODIFIED_ZCTA = as.character(MODIFIED_ZCTA))
# left join for COVID attributes
zpNYC <- zip_sf %>%
left_join(covid_df, by = c("ZIPCODE" = "MODIFIED_ZCTA"))
#result
glimpse(zpNYC)
## Rows: 263
## Columns: 25
## $ ZIPCODE <chr> "11436", "11213", "11212", "11225", "11218", "11226"…
## $ BLDGZIP <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0…
## $ PO_NAME <chr> "Jamaica", "Brooklyn", "Brooklyn", "Brooklyn", "Broo…
## $ POPULATION <dbl> 18681, 62426, 83866, 56527, 72280, 106132, 92561, 67…
## $ AREA <dbl> 22699295, 29631004, 41972104, 23698630, 36868799, 39…
## $ STATE <chr> "NY", "NY", "NY", "NY", "NY", "NY", "NY", "NY", "NY"…
## $ COUNTY <chr> "Queens", "Kings", "Kings", "Kings", "Kings", "Kings…
## $ ST_FIPS <chr> "36", "36", "36", "36", "36", "36", "36", "36", "36"…
## $ CTY_FIPS <chr> "081", "047", "047", "047", "047", "047", "047", "04…
## $ URL <chr> "http://www.usps.com/", "http://www.usps.com/", "htt…
## $ SHAPE_AREA <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ SHAPE_LEN <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ NEIGHBORHOOD_NAME <chr> "South Jamaica/South Ozone Park", "Crown Heights (Ea…
## $ BOROUGH_GROUP <chr> "Queens", "Brooklyn", "Brooklyn", "Brooklyn", "Brook…
## $ label <chr> "11436", "11213", "11212", "11225", "11218", "11226"…
## $ lat <dbl> 40.67582, 40.67107, 40.66293, 40.66306, 40.64348, 40…
## $ lon <dbl> -73.79662, -73.93633, -73.91301, -73.95423, -73.9760…
## $ COVID_CASE_COUNT <dbl> 1888, 5166, 7182, 3833, 6199, 7279, 8429, 5380, 1104…
## $ COVID_CASE_RATE <dbl> 9419.96, 7996.75, 9709.74, 6664.50, 8377.49, 7476.75…
## $ POP_DENOMINATOR <dbl> 20042.54, 64601.26, 73966.99, 57513.69, 73995.92, 97…
## $ COVID_DEATH_COUNT <dbl> 64, 203, 330, 177, 218, 368, 256, 206, 380, 219, 113…
## $ COVID_DEATH_RATE <dbl> 319.32, 314.24, 446.14, 307.75, 294.61, 378.00, 284.…
## $ PERCENT_POSITIVE <dbl> 17.57, 13.72, 15.64, 11.62, 13.93, 13.33, 15.64, 15.…
## $ TOTAL_COVID_TESTS <dbl> 11082, 38560, 47319, 33709, 45884, 56287, 55444, 350…
## $ geom <POLYGON [°]> POLYGON ((-73.80585 40.6829..., POLYGON ((-7…
sum(is.na(zpNYC$COVID_CASE_COUNT))
## [1] 74
After the join, the new zpNYC objects contains bot the
original polygon geometry together with the weekly COVID variables (such
as cases, death counts, test counts, and % positive). It seems that some
ZIP polygons didn’t actually receive COVID values, probably because the
ZIP poly layer contains more USPS area codes than the modified ZCTA
geography used in the COVID table.
Next, the retail food store point data were aggregated to the ZIP
polygon layer so that the number of selected food retail establishments
in each ZIP code area could be counted. (Since the simplified food-store
layer saved from the previous lab no longer included the
establishment-type field needed for filtering, I went back to the
original shapefile and rebuilt the point layer with the full attributes
intact). I then used st_join() to assign store points to
ZIP polygons, filtered the points using the Estbl_T field,
counted the selected stores by ZIPCODE, and joined the resulting counts
back to the zpNYC object from Task 1 so that the COVID
variables and food-store totals would remain in the same final sf
layer.
food_sf_full <- st_read("data/nycFoodStore.shp", quiet = TRUE)
food_sf_full <- st_transform(food_sf_full, st_crs(zip_sf))
food_sf_nyc_full <- st_join(food_sf_full, zip_sf, left = FALSE) # spatial join for food-store points
#gilter to establishment types and count stores by ZIP code
food_counts <- food_sf_nyc_full %>%
filter(str_detect(Estbl_T, "[AJD]")) %>%
st_drop_geometry() %>%
count(ZIPCODE, name = "FoodStoreNum")
# Join store counts back to the ZIP-COVID sf object from Task 1
zpNYC <- zpNYC %>%
select(-any_of("FoodStoreNum")) %>%
left_join(food_counts, by = "ZIPCODE") %>%
mutate(FoodStoreNum = replace_na(FoodStoreNum, 0))
#check:
summary(zpNYC$FoodStoreNum)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 10.00 44.33 81.00 262.00
plot(zpNYC["FoodStoreNum"],
main = "Food Stores per ZIP")
This produced a ZIP-level store count field named
FoodStoreNum, which could then be used together with the
COVID variables already joined in Task 1.
Similar to the last step, now health-facility points will be
aggregated to the ZIP code polygons so I could cound how many fell
within each area. (Again, as s with the food-store data, the simplified
health-facility layer saved from the previous lab did not retain the
subtype field needed for filtering, so I re-read the original
health-facility CSV and rebuilt the full point layer.)
st_join() was then used to assign the facility points to
polygons, then I filtered the facilities usign the
Short Description field, counted the nursing homes by
ZIPCODE, and joined the resulting counts back to the
zpNYC object from prior.
# (re)read the original health facility CSV so the subtype field is available
health_raw <- read_csv("data/NYS_Health_Facility.csv", show_col_types = FALSE)
# conv- the facility table to an sf point object
health_sf_full <- health_raw %>%
filter(
!is.na(`Facility Longitude`),
!is.na(`Facility Latitude`)
) %>%
st_as_sf(
coords = c("Facility Longitude", "Facility Latitude"),
crs = 4326,
remove = FALSE
)
health_sf_nyc_full <- st_join(health_sf_full, zip_sf, left = FALSE) # spatial join
nh_counts <- health_sf_nyc_full %>% # Filter to nursing homes
filter(`Short Description` == "NH") %>%
st_drop_geometry() %>%
count(ZIPCODE, name = "NursingHomeNum") # count facilities by ZIP
# join nursing-home counts back to the ZIP sf object() from Task 2)
zpNYC <- zpNYC %>%
select(-any_of("NursingHomeNum")) %>%
left_join(nh_counts, by = "ZIPCODE") %>%
mutate(NursingHomeNum = replace_na(NursingHomeNum, 0))
#check
summary(zpNYC$NursingHomeNum)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.6844 1.0000 12.0000
plot(zpNYC["NursingHomeNum"],
main = "Nursing Homes per ZIP")
This added a ZIP-level nursing-home count field,
NursingHomeNum, to the same zpNYC object that
already contained the COVID variables and food-store counts from the
earlier tasks. (Notably, it seems that a lot of nursing homes are
centered around Far Rockaway…)
For this second part, I will join the Census ACS population, race,
and age data to the NYC Planning Census Tract Data as an attribute join.
Since the tract shapefile and ACS table did not begin with the same
matching identifier, I first had to derive compatible tract codes in
both datasets. On the tract side, I created a county FIPS field from
borough names and then built a full tract FIPS code by combining the
county FIPS with the 2010 tract code. As for the ACS file, the CSV was
read (skipping the extra metadata line), and tract codes were extracted
from GEO_ID. These were then joined conservatively to the
tract polygons.
# NYC census tracts from Department of Planning
nycCensus <- st_read(
"data/2010 Census Tracts/geo_export_1dc7b645-647b-4806-b9a0-7b79660f120a.shp",
quiet = TRUE,
stringsAsFactors = FALSE
)
str(nycCensus)
## Classes 'sf' and 'data.frame': 2165 obs. of 12 variables:
## $ boro_code : chr "5" "1" "1" "1" ...
## $ boro_ct201: chr "5000900" "1009800" "1010000" "1010200" ...
## $ boro_name : chr "Staten Island" "Manhattan" "Manhattan" "Manhattan" ...
## $ cdeligibil: chr "E" "I" "I" "I" ...
## $ ct2010 : chr "000900" "009800" "010000" "010200" ...
## $ ctlabel : chr "9" "98" "100" "102" ...
## $ ntacode : chr "SI22" "MN19" "MN19" "MN17" ...
## $ ntaname : chr "West New Brighton-New Brighton-St. George" "Turtle Bay-East Midtown" "Turtle Bay-East Midtown" "Midtown-Midtown South" ...
## $ puma : chr "3903" "3808" "3808" "3807" ...
## $ shape_area: num 2497010 1906016 1860938 1860993 1864600 ...
## $ shape_leng: num 7729 5534 5692 5688 5693 ...
## $ geometry :sfc_MULTIPOLYGON of length 2165; first list element: List of 1
## ..$ :List of 1
## .. ..$ : num [1:28, 1:2] -74.1 -74.1 -74.1 -74.1 -74.1 ...
## ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
## ..- attr(*, "names")= chr [1:11] "boro_code" "boro_ct201" "boro_name" "cdeligibil" ...
# add a cntyFIPS column to nycCensus based on borough names && create complete tractFIPS code by concatanating the county fip and census tract fip.
nycCensus <- nycCensus %>%
mutate(
cntyFIPS = case_when(
boro_name == "Bronx" ~ "005",
boro_name == "Brooklyn" ~ "047",
boro_name == "Manhattan" ~ "061",
boro_name == "Queens" ~ "081",
boro_name == "Staten Island" ~ "085"
),
tractFIPS = paste0(cntyFIPS, ct2010)
)
# The ACS data from CENSUS website, 2018 data on population
# used readLines to bypass the second line/row in the csv file.
acsData <- readLines("data/ACSDP5Y2018.DP05_data_with_overlays_2020-04-22T132935.csv") %>%
magrittr::extract(-2) %>%
textConnection() %>%
read.csv(header = TRUE, quote = "\"") %>%
as_tibble() %>%
select(
GEO_ID,
totPop = DP05_0001E,
elderlyPop = DP05_0024E, # >= 65
malePop = DP05_0002E,
femalePop = DP05_0003E,
whitePop = DP05_0037E,
blackPop = DP05_0038E,
asianPop = DP05_0067E,
hispanicPop = DP05_0071E,
adultPop = DP05_0021E,
citizenAdult = DP05_0087E
) %>%
mutate(
censusCode = str_sub(GEO_ID, -9, -1),
across(c(totPop, elderlyPop, malePop, femalePop,
whitePop, blackPop, asianPop, hispanicPop,
adultPop, citizenAdult), as.numeric)
)
acsData %>%
slice(1:10)
## # A tibble: 10 × 12
## GEO_ID totPop elderlyPop malePop femalePop whitePop blackPop asianPop
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1400000US3600… 7080 51 6503 577 1773 4239 130
## 2 1400000US3600… 4542 950 2264 2278 2165 1279 119
## 3 1400000US3600… 5634 710 2807 2827 2623 1699 226
## 4 1400000US3600… 5917 989 2365 3552 2406 2434 68
## 5 1400000US3600… 2765 76 1363 1402 585 1041 130
## 6 1400000US3600… 9409 977 4119 5290 3185 4487 29
## 7 1400000US3600… 4600 648 2175 2425 479 2122 27
## 8 1400000US3600… 172 0 121 51 69 89 14
## 9 1400000US3600… 5887 548 2958 2929 903 1344 68
## 10 1400000US3600… 2868 243 1259 1609 243 987 0
## # ℹ 4 more variables: hispanicPop <dbl>, adultPop <dbl>, citizenAdult <dbl>,
## # censusCode <chr>
# Merge (JOIN) ACS data to the census tracts
# join by attributes /columns
popData <- merge(nycCensus, acsData, by.x = "tractFIPS", by.y = "censusCode")
# verification
sum(popData$totPop, na.rm = TRUE)
## [1] 8443713
st_crs(popData)
## Coordinate Reference System:
## User input: WGS84(DD)
## wkt:
## GEOGCRS["WGS84(DD)",
## DATUM["WGS84",
## ELLIPSOID["WGS84",6378137,298.257223563,
## LENGTHUNIT["metre",1,
## ID["EPSG",9001]]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["geodetic longitude",east,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["geodetic latitude",north,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]]]
popNYC <- st_transform(popData, st_crs(zpNYC)) # zpNYC is the JOINED zip code data from task 1.
# this may take a long time to run
plot(popNYC["asianPop"],
main = "Asian Population by Census Tract")
As an example metric, Asian population was mapped per tract, which seems to align with expectation; prominent centers showing up in Chinatown (Manhattan) and Flushing (Queens).
Finally, the ACS census data will be combined with the cumulative
outputs from tasks 3 and 4 to produce a final ZIP-level sf
object containing all variables. To do this, I used a spatial join
between the ZIP geometries and tract centroids, then summarized
(interpolated) the tract-level ACS variables within each area. One final
map (again of Asian Population, this time per ZIP area) was made to
verify the aggregation worked.
# Aggregate the ACS data from census tracts to ZIP code areas
covidPopZipNYC <- st_join(
zpNYC %>% select(-any_of(c("totPop", "elderlyPop", "malePop", "femalePop",
"whitePop", "blackPop", "asianPop",
"hispanicPop", "adultPop", "citizenAdult"))),
popNYC %>% st_centroid(),
join = st_contains,
left = TRUE
) %>%
group_by(
ZIPCODE, BLDGZIP, PO_NAME, POPULATION, AREA, STATE, COUNTY,
ST_FIPS, CTY_FIPS, URL, SHAPE_AREA, SHAPE_LEN,
NEIGHBORHOOD_NAME, BOROUGH_GROUP, label, lat, lon,
COVID_CASE_COUNT, COVID_CASE_RATE, POP_DENOMINATOR,
COVID_DEATH_COUNT, COVID_DEATH_RATE, PERCENT_POSITIVE,
TOTAL_COVID_TESTS, FoodStoreNum, NursingHomeNum
) %>%
summarise(
totPop = sum(totPop, na.rm = TRUE),
elderlyPop = sum(elderlyPop, na.rm = TRUE),
malePop = sum(malePop, na.rm = TRUE),
femalePop = sum(femalePop, na.rm = TRUE),
whitePop = sum(whitePop, na.rm = TRUE),
blackPop = sum(blackPop, na.rm = TRUE),
asianPop = sum(asianPop, na.rm = TRUE),
hispanicPop = sum(hispanicPop, na.rm = TRUE),
adultPop = sum(adultPop, na.rm = TRUE),
citizenAdult = sum(citizenAdult, na.rm = TRUE)
)
## Warning: st_centroid assumes attributes are constant over geometries
## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by ZIPCODE, BLDGZIP, PO_NAME, POPULATION,
## AREA, STATE, COUNTY, ST_FIPS, CTY_FIPS, URL, SHAPE_AREA, SHAPE_LEN,
## NEIGHBORHOOD_NAME, BOROUGH_GROUP, label, lat, lon, COVID_CASE_COUNT, …,
## FoodStoreNum, and NursingHomeNum.
## ℹ Output is grouped by ZIPCODE, BLDGZIP, PO_NAME, POPULATION, AREA, STATE,
## COUNTY, ST_FIPS, CTY_FIPS, URL, SHAPE_AREA, SHAPE_LEN, NEIGHBORHOOD_NAME,
## BOROUGH_GROUP, label, lat, lon, COVID_CASE_COUNT, …, TOTAL_COVID_TESTS, and
## FoodStoreNum.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(ZIPCODE, BLDGZIP, PO_NAME, POPULATION, AREA, STATE,
## COUNTY, ST_FIPS, CTY_FIPS, URL, SHAPE_AREA, SHAPE_LEN, NEIGHBORHOOD_NAME,
## BOROUGH_GROUP, label, lat, lon, COVID_CASE_COUNT, COVID_CASE_RATE,
## POP_DENOMINATOR, COVID_DEATH_COUNT, COVID_DEATH_RATE, PERCENT_POSITIVE,
## TOTAL_COVID_TESTS, FoodStoreNum, NursingHomeNum))` for per-operation grouping
## (`?dplyr::dplyr_by`) instead.
# verify the data
head(covidPopZipNYC)
## Simple feature collection with 6 features and 36 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -74.0473 ymin: 40.68846 xmax: -73.94922 ymax: 40.80055
## Geodetic CRS: WGS 84
## # A tibble: 6 × 37
## # Groups: ZIPCODE, BLDGZIP, PO_NAME, POPULATION, AREA, STATE, COUNTY,
## # ST_FIPS, CTY_FIPS, URL, SHAPE_AREA, SHAPE_LEN, NEIGHBORHOOD_NAME,
## # BOROUGH_GROUP, label, lat, lon, COVID_CASE_COUNT, COVID_CASE_RATE,
## # POP_DENOMINATOR, COVID_DEATH_COUNT, COVID_DEATH_RATE, PERCENT_POSITIVE,
## # TOTAL_COVID_TESTS, FoodStoreNum [6]
## ZIPCODE BLDGZIP PO_NAME POPULATION AREA STATE COUNTY ST_FIPS CTY_FIPS URL
## <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 00083 0 Central… 25 3.83e7 NY New Y… 36 061 http…
## 2 10001 0 New York 22413 1.78e7 NY New Y… 36 061 http…
## 3 10002 0 New York 81305 2.63e7 NY New Y… 36 061 http…
## 4 10003 0 New York 55878 1.55e7 NY New Y… 36 061 http…
## 5 10004 0 New York 2187 6.71e5 NY New Y… 36 061 http…
## 6 10004 0 New York 2187 1.20e6 NY New Y… 36 061 http…
## # ℹ 27 more variables: SHAPE_AREA <dbl>, SHAPE_LEN <dbl>,
## # NEIGHBORHOOD_NAME <chr>, BOROUGH_GROUP <chr>, label <chr>, lat <dbl>,
## # lon <dbl>, COVID_CASE_COUNT <dbl>, COVID_CASE_RATE <dbl>,
## # POP_DENOMINATOR <dbl>, COVID_DEATH_COUNT <dbl>, COVID_DEATH_RATE <dbl>,
## # PERCENT_POSITIVE <dbl>, TOTAL_COVID_TESTS <dbl>, FoodStoreNum <int>,
## # NursingHomeNum <int>, totPop <dbl>, elderlyPop <dbl>, malePop <dbl>,
## # femalePop <dbl>, whitePop <dbl>, blackPop <dbl>, asianPop <dbl>, …
sum(covidPopZipNYC$totPop, na.rm = TRUE)
## [1] 8446394
# see where those NAs are lol
covidPopZipNYC %>%
filter(is.na(totPop))
## Simple feature collection with 0 features and 36 fields
## Geometry type: POLYGON
## Bounding box: xmin: NA ymin: NA xmax: NA ymax: NA
## Geodetic CRS: WGS 84
## # A tibble: 0 × 37
## # Groups: ZIPCODE, BLDGZIP, PO_NAME, POPULATION, AREA, STATE, COUNTY,
## # ST_FIPS, CTY_FIPS, URL, SHAPE_AREA, SHAPE_LEN, NEIGHBORHOOD_NAME,
## # BOROUGH_GROUP, label, lat, lon, COVID_CASE_COUNT, COVID_CASE_RATE,
## # POP_DENOMINATOR, COVID_DEATH_COUNT, COVID_DEATH_RATE, PERCENT_POSITIVE,
## # TOTAL_COVID_TESTS, FoodStoreNum [0]
## # ℹ 37 variables: ZIPCODE <chr>, BLDGZIP <chr>, PO_NAME <chr>,
## # POPULATION <dbl>, AREA <dbl>, STATE <chr>, COUNTY <chr>, ST_FIPS <chr>,
## # CTY_FIPS <chr>, URL <chr>, SHAPE_AREA <dbl>, SHAPE_LEN <dbl>,
## # NEIGHBORHOOD_NAME <chr>, BOROUGH_GROUP <chr>, label <chr>, lat <dbl>,
## # lon <dbl>, COVID_CASE_COUNT <dbl>, COVID_CASE_RATE <dbl>,
## # POP_DENOMINATOR <dbl>, COVID_DEATH_COUNT <dbl>, COVID_DEATH_RATE <dbl>,
## # PERCENT_POSITIVE <dbl>, TOTAL_COVID_TESTS <dbl>, FoodStoreNum <int>, …
# final verification map
plot(covidPopZipNYC["asianPop"],
main = "Asian Population by ZIP Code")