GTECH 385 – R Week 08 Assignment

This second lab assignment is built with aggregating data from different sources in mind, tied to the zip code field. This reuses the geopackage file made from last weeks lab data read as three sf objects of NYC zip code areas, food retail stores, and health facilities. [Note this assumes a file structure wherein all tables are already stored under a ./data directory].

Task 1 – Joining ZIP codes to COVID data:

First, The COVID-10 data must be joined to the NYC zip code area data. The ZIP code polygon layer created in the previous lab was reused by reading the nyc_zip_codes layer from the saved GeoPackage. For the COVID data, one wee+kly file, tests-by-zcta_2021_04_23.csv, was selected for this lab. The ZIP-level COVID table was then joined to the ZIP polygon sf object using left_join(), which matches the polygon field ZIPCODE to the COVID table field MODIFIED_ZCTA. (Left input was used so that spatial geometry would be preserved in the product!).

gpkg_file <- "data/sf_objects.gpkg"
zip_sf <- st_read(gpkg_file, layer = "nyc_zip_codes", quiet = TRUE)

covid_df <- read_csv("data/tests-by-zcta_2021_04_23.csv", show_col_types = FALSE) # A Week of COVID ZIP-level data

# Conv ZIP join field to character so it matches ZIPCODE
covid_df <- covid_df %>%
  mutate(MODIFIED_ZCTA = as.character(MODIFIED_ZCTA))

# left join for COVID attributes
zpNYC <- zip_sf %>%
  left_join(covid_df, by = c("ZIPCODE" = "MODIFIED_ZCTA"))

#result
glimpse(zpNYC)

## Rows: 263
## Columns: 25
## $ ZIPCODE           <chr> "11436", "11213", "11212", "11225", "11218", "11226"…
## $ BLDGZIP           <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0…
## $ PO_NAME           <chr> "Jamaica", "Brooklyn", "Brooklyn", "Brooklyn", "Broo…
## $ POPULATION        <dbl> 18681, 62426, 83866, 56527, 72280, 106132, 92561, 67…
## $ AREA              <dbl> 22699295, 29631004, 41972104, 23698630, 36868799, 39…
## $ STATE             <chr> "NY", "NY", "NY", "NY", "NY", "NY", "NY", "NY", "NY"…
## $ COUNTY            <chr> "Queens", "Kings", "Kings", "Kings", "Kings", "Kings…
## $ ST_FIPS           <chr> "36", "36", "36", "36", "36", "36", "36", "36", "36"…
## $ CTY_FIPS          <chr> "081", "047", "047", "047", "047", "047", "047", "04…
## $ URL               <chr> "http://www.usps.com/", "http://www.usps.com/", "htt…
## $ SHAPE_AREA        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ SHAPE_LEN         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ NEIGHBORHOOD_NAME <chr> "South Jamaica/South Ozone Park", "Crown Heights (Ea…
## $ BOROUGH_GROUP     <chr> "Queens", "Brooklyn", "Brooklyn", "Brooklyn", "Brook…
## $ label             <chr> "11436", "11213", "11212", "11225", "11218", "11226"…
## $ lat               <dbl> 40.67582, 40.67107, 40.66293, 40.66306, 40.64348, 40…
## $ lon               <dbl> -73.79662, -73.93633, -73.91301, -73.95423, -73.9760…
## $ COVID_CASE_COUNT  <dbl> 1888, 5166, 7182, 3833, 6199, 7279, 8429, 5380, 1104…
## $ COVID_CASE_RATE   <dbl> 9419.96, 7996.75, 9709.74, 6664.50, 8377.49, 7476.75…
## $ POP_DENOMINATOR   <dbl> 20042.54, 64601.26, 73966.99, 57513.69, 73995.92, 97…
## $ COVID_DEATH_COUNT <dbl> 64, 203, 330, 177, 218, 368, 256, 206, 380, 219, 113…
## $ COVID_DEATH_RATE  <dbl> 319.32, 314.24, 446.14, 307.75, 294.61, 378.00, 284.…
## $ PERCENT_POSITIVE  <dbl> 17.57, 13.72, 15.64, 11.62, 13.93, 13.33, 15.64, 15.…
## $ TOTAL_COVID_TESTS <dbl> 11082, 38560, 47319, 33709, 45884, 56287, 55444, 350…
## $ geom              <POLYGON [°]> POLYGON ((-73.80585 40.6829..., POLYGON ((-7…

sum(is.na(zpNYC$COVID_CASE_COUNT))

## [1] 74

After the join, the new zpNYC objects contains bot the original polygon geometry together with the weekly COVID variables (such as cases, death counts, test counts, and % positive). It seems that some ZIP polygons didn’t actually receive COVID values, probably because the ZIP poly layer contains more USPS area codes than the modified ZCTA geography used in the COVID table.

Task 2 – Aggregating NYC food retail data to ZIP codes

Next, the retail food store point data were aggregated to the ZIP polygon layer so that the number of selected food retail establishments in each ZIP code area could be counted. (Since the simplified food-store layer saved from the previous lab no longer included the establishment-type field needed for filtering, I went back to the original shapefile and rebuilt the point layer with the full attributes intact). I then used st_join() to assign store points to ZIP polygons, filtered the points using the Estbl_T field, counted the selected stores by ZIPCODE, and joined the resulting counts back to the zpNYC object from Task 1 so that the COVID variables and food-store totals would remain in the same final sf layer.

food_sf_full <- st_read("data/nycFoodStore.shp", quiet = TRUE)
food_sf_full <- st_transform(food_sf_full, st_crs(zip_sf))

food_sf_nyc_full <- st_join(food_sf_full, zip_sf, left = FALSE) # spatial join for food-store points

#gilter to establishment types and count stores by ZIP code
food_counts <- food_sf_nyc_full %>%
  filter(str_detect(Estbl_T, "[AJD]")) %>%
  st_drop_geometry() %>%
  count(ZIPCODE, name = "FoodStoreNum")

# Join  store counts back to the ZIP-COVID sf object from Task 1
zpNYC <- zpNYC %>%
  select(-any_of("FoodStoreNum")) %>%
  left_join(food_counts, by = "ZIPCODE") %>%
  mutate(FoodStoreNum = replace_na(FoodStoreNum, 0))

#check:
summary(zpNYC$FoodStoreNum)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00   10.00   44.33   81.00  262.00

plot(zpNYC["FoodStoreNum"],
     main = "Food Stores per ZIP")

This produced a ZIP-level store count field named FoodStoreNum, which could then be used together with the COVID variables already joined in Task 1.

Task 3 – Aggregating Nursing homes to ZIP codes

Similar to the last step, now health-facility points will be aggregated to the ZIP code polygons so I could cound how many fell within each area. (Again, as s with the food-store data, the simplified health-facility layer saved from the previous lab did not retain the subtype field needed for filtering, so I re-read the original health-facility CSV and rebuilt the full point layer.) st_join() was then used to assign the facility points to polygons, then I filtered the facilities usign the Short Description field, counted the nursing homes by ZIPCODE, and joined the resulting counts back to the zpNYC object from prior.

# (re)read the original health facility CSV so the subtype field is available
health_raw <- read_csv("data/NYS_Health_Facility.csv", show_col_types = FALSE)

# conv- the facility table to an sf point object
health_sf_full <- health_raw %>%
  filter(
    !is.na(`Facility Longitude`),
    !is.na(`Facility Latitude`)
  ) %>%
  st_as_sf(
    coords = c("Facility Longitude", "Facility Latitude"),
    crs = 4326,
    remove = FALSE
  )

health_sf_nyc_full <- st_join(health_sf_full, zip_sf, left = FALSE) # spatial join

nh_counts <- health_sf_nyc_full %>% # Filter to nursing homes
  filter(`Short Description` == "NH") %>%
  st_drop_geometry() %>%
  count(ZIPCODE, name = "NursingHomeNum") # count facilities by ZIP

# join nursing-home counts back to the ZIP sf object() from Task 2)
zpNYC <- zpNYC %>%
  select(-any_of("NursingHomeNum")) %>%
  left_join(nh_counts, by = "ZIPCODE") %>%
  mutate(NursingHomeNum = replace_na(NursingHomeNum, 0))

#check
summary(zpNYC$NursingHomeNum)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.6844  1.0000 12.0000

plot(zpNYC["NursingHomeNum"],
     main = "Nursing Homes per ZIP")

This added a ZIP-level nursing-home count field, NursingHomeNum, to the same zpNYC object that already contained the COVID variables and food-store counts from the earlier tasks. (Notably, it seems that a lot of nursing homes are centered around Far Rockaway…)

Task 4 - ACS Join

For this second part, I will join the Census ACS population, race, and age data to the NYC Planning Census Tract Data as an attribute join. Since the tract shapefile and ACS table did not begin with the same matching identifier, I first had to derive compatible tract codes in both datasets. On the tract side, I created a county FIPS field from borough names and then built a full tract FIPS code by combining the county FIPS with the 2010 tract code. As for the ACS file, the CSV was read (skipping the extra metadata line), and tract codes were extracted from GEO_ID. These were then joined conservatively to the tract polygons.

# NYC census tracts from Department of Planning
nycCensus <- st_read(
  "data/2010 Census Tracts/geo_export_1dc7b645-647b-4806-b9a0-7b79660f120a.shp",
  quiet = TRUE,
  stringsAsFactors = FALSE
)

str(nycCensus)

## Classes 'sf' and 'data.frame':   2165 obs. of  12 variables:
##  $ boro_code : chr  "5" "1" "1" "1" ...
##  $ boro_ct201: chr  "5000900" "1009800" "1010000" "1010200" ...
##  $ boro_name : chr  "Staten Island" "Manhattan" "Manhattan" "Manhattan" ...
##  $ cdeligibil: chr  "E" "I" "I" "I" ...
##  $ ct2010    : chr  "000900" "009800" "010000" "010200" ...
##  $ ctlabel   : chr  "9" "98" "100" "102" ...
##  $ ntacode   : chr  "SI22" "MN19" "MN19" "MN17" ...
##  $ ntaname   : chr  "West New Brighton-New Brighton-St. George" "Turtle Bay-East Midtown" "Turtle Bay-East Midtown" "Midtown-Midtown South" ...
##  $ puma      : chr  "3903" "3808" "3808" "3807" ...
##  $ shape_area: num  2497010 1906016 1860938 1860993 1864600 ...
##  $ shape_leng: num  7729 5534 5692 5688 5693 ...
##  $ geometry  :sfc_MULTIPOLYGON of length 2165; first list element: List of 1
##   ..$ :List of 1
##   .. ..$ : num [1:28, 1:2] -74.1 -74.1 -74.1 -74.1 -74.1 ...
##   ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
##  - attr(*, "sf_column")= chr "geometry"
##  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
##   ..- attr(*, "names")= chr [1:11] "boro_code" "boro_ct201" "boro_name" "cdeligibil" ...

# add a cntyFIPS column to nycCensus based on borough names && create complete tractFIPS code by concatanating the county fip and census tract fip.
nycCensus <- nycCensus %>%
  mutate(
    cntyFIPS = case_when(
      boro_name == "Bronx" ~ "005",
      boro_name == "Brooklyn" ~ "047",
      boro_name == "Manhattan" ~ "061",
      boro_name == "Queens" ~ "081",
      boro_name == "Staten Island" ~ "085"
    ),
    tractFIPS = paste0(cntyFIPS, ct2010)
  )

# The ACS data from CENSUS website, 2018 data on population
# used readLines to bypass the second line/row in the csv file. 
acsData <- readLines("data/ACSDP5Y2018.DP05_data_with_overlays_2020-04-22T132935.csv") %>%
  magrittr::extract(-2) %>%
  textConnection() %>%
  read.csv(header = TRUE, quote = "\"") %>%
  as_tibble() %>%
  select(
    GEO_ID,
    totPop = DP05_0001E,
    elderlyPop = DP05_0024E, # >= 65
    malePop = DP05_0002E,
    femalePop = DP05_0003E,
    whitePop = DP05_0037E,
    blackPop = DP05_0038E,
    asianPop = DP05_0067E,
    hispanicPop = DP05_0071E,
    adultPop = DP05_0021E,
    citizenAdult = DP05_0087E
  ) %>%
  mutate(
    censusCode = str_sub(GEO_ID, -9, -1),
    across(c(totPop, elderlyPop, malePop, femalePop,
             whitePop, blackPop, asianPop, hispanicPop,
             adultPop, citizenAdult), as.numeric)
  )

acsData %>%
  slice(1:10)

## # A tibble: 10 × 12
##    GEO_ID         totPop elderlyPop malePop femalePop whitePop blackPop asianPop
##    <chr>           <dbl>      <dbl>   <dbl>     <dbl>    <dbl>    <dbl>    <dbl>
##  1 1400000US3600…   7080         51    6503       577     1773     4239      130
##  2 1400000US3600…   4542        950    2264      2278     2165     1279      119
##  3 1400000US3600…   5634        710    2807      2827     2623     1699      226
##  4 1400000US3600…   5917        989    2365      3552     2406     2434       68
##  5 1400000US3600…   2765         76    1363      1402      585     1041      130
##  6 1400000US3600…   9409        977    4119      5290     3185     4487       29
##  7 1400000US3600…   4600        648    2175      2425      479     2122       27
##  8 1400000US3600…    172          0     121        51       69       89       14
##  9 1400000US3600…   5887        548    2958      2929      903     1344       68
## 10 1400000US3600…   2868        243    1259      1609      243      987        0
## # ℹ 4 more variables: hispanicPop <dbl>, adultPop <dbl>, citizenAdult <dbl>,
## #   censusCode <chr>

# Merge (JOIN) ACS data to the census tracts
# join by attributes /columns
popData <- merge(nycCensus, acsData, by.x = "tractFIPS", by.y = "censusCode")

# verification
sum(popData$totPop, na.rm = TRUE)

## [1] 8443713

st_crs(popData)

## Coordinate Reference System:
##   User input: WGS84(DD) 
##   wkt:
## GEOGCRS["WGS84(DD)",
##     DATUM["WGS84",
##         ELLIPSOID["WGS84",6378137,298.257223563,
##             LENGTHUNIT["metre",1,
##                 ID["EPSG",9001]]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["geodetic longitude",east,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["geodetic latitude",north,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]]]

popNYC <- st_transform(popData, st_crs(zpNYC)) # zpNYC is the JOINED zip code data from task 1.

# this may take a long time to run
plot(popNYC["asianPop"],
     main = "Asian Population by Census Tract")

As an example metric, Asian population was mapped per tract, which seems to align with expectation; prominent centers showing up in Chinatown (Manhattan) and Flushing (Queens).

Task 5 - Aggregating ACS data by ZIP code

Finally, the ACS census data will be combined with the cumulative outputs from tasks 3 and 4 to produce a final ZIP-level sf object containing all variables. To do this, I used a spatial join between the ZIP geometries and tract centroids, then summarized (interpolated) the tract-level ACS variables within each area. One final map (again of Asian Population, this time per ZIP area) was made to verify the aggregation worked.

# Aggregate the ACS data from census tracts to ZIP code areas
covidPopZipNYC <- st_join(
  zpNYC %>% select(-any_of(c("totPop", "elderlyPop", "malePop", "femalePop",
                             "whitePop", "blackPop", "asianPop",
                             "hispanicPop", "adultPop", "citizenAdult"))),
  popNYC %>% st_centroid(),
  join = st_contains,
  left = TRUE
) %>%
  group_by(
    ZIPCODE, BLDGZIP, PO_NAME, POPULATION, AREA, STATE, COUNTY,
    ST_FIPS, CTY_FIPS, URL, SHAPE_AREA, SHAPE_LEN,
    NEIGHBORHOOD_NAME, BOROUGH_GROUP, label, lat, lon,
    COVID_CASE_COUNT, COVID_CASE_RATE, POP_DENOMINATOR,
    COVID_DEATH_COUNT, COVID_DEATH_RATE, PERCENT_POSITIVE,
    TOTAL_COVID_TESTS, FoodStoreNum, NursingHomeNum
  ) %>%
  summarise(
    totPop = sum(totPop, na.rm = TRUE),
    elderlyPop = sum(elderlyPop, na.rm = TRUE),
    malePop = sum(malePop, na.rm = TRUE),
    femalePop = sum(femalePop, na.rm = TRUE),
    whitePop = sum(whitePop, na.rm = TRUE),
    blackPop = sum(blackPop, na.rm = TRUE),
    asianPop = sum(asianPop, na.rm = TRUE),
    hispanicPop = sum(hispanicPop, na.rm = TRUE),
    adultPop = sum(adultPop, na.rm = TRUE),
    citizenAdult = sum(citizenAdult, na.rm = TRUE)
  )

## Warning: st_centroid assumes attributes are constant over geometries

## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by ZIPCODE, BLDGZIP, PO_NAME, POPULATION,
##   AREA, STATE, COUNTY, ST_FIPS, CTY_FIPS, URL, SHAPE_AREA, SHAPE_LEN,
##   NEIGHBORHOOD_NAME, BOROUGH_GROUP, label, lat, lon, COVID_CASE_COUNT, …,
##   FoodStoreNum, and NursingHomeNum.
## ℹ Output is grouped by ZIPCODE, BLDGZIP, PO_NAME, POPULATION, AREA, STATE,
##   COUNTY, ST_FIPS, CTY_FIPS, URL, SHAPE_AREA, SHAPE_LEN, NEIGHBORHOOD_NAME,
##   BOROUGH_GROUP, label, lat, lon, COVID_CASE_COUNT, …, TOTAL_COVID_TESTS, and
##   FoodStoreNum.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(ZIPCODE, BLDGZIP, PO_NAME, POPULATION, AREA, STATE,
##   COUNTY, ST_FIPS, CTY_FIPS, URL, SHAPE_AREA, SHAPE_LEN, NEIGHBORHOOD_NAME,
##   BOROUGH_GROUP, label, lat, lon, COVID_CASE_COUNT, COVID_CASE_RATE,
##   POP_DENOMINATOR, COVID_DEATH_COUNT, COVID_DEATH_RATE, PERCENT_POSITIVE,
##   TOTAL_COVID_TESTS, FoodStoreNum, NursingHomeNum))` for per-operation grouping
##   (`?dplyr::dplyr_by`) instead.

# verify the data
head(covidPopZipNYC)

## Simple feature collection with 6 features and 36 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -74.0473 ymin: 40.68846 xmax: -73.94922 ymax: 40.80055
## Geodetic CRS:  WGS 84
## # A tibble: 6 × 37
## # Groups:   ZIPCODE, BLDGZIP, PO_NAME, POPULATION, AREA, STATE, COUNTY,
## #   ST_FIPS, CTY_FIPS, URL, SHAPE_AREA, SHAPE_LEN, NEIGHBORHOOD_NAME,
## #   BOROUGH_GROUP, label, lat, lon, COVID_CASE_COUNT, COVID_CASE_RATE,
## #   POP_DENOMINATOR, COVID_DEATH_COUNT, COVID_DEATH_RATE, PERCENT_POSITIVE,
## #   TOTAL_COVID_TESTS, FoodStoreNum [6]
##   ZIPCODE BLDGZIP PO_NAME  POPULATION   AREA STATE COUNTY ST_FIPS CTY_FIPS URL  
##   <chr>   <chr>   <chr>         <dbl>  <dbl> <chr> <chr>  <chr>   <chr>    <chr>
## 1 00083   0       Central…         25 3.83e7 NY    New Y… 36      061      http…
## 2 10001   0       New York      22413 1.78e7 NY    New Y… 36      061      http…
## 3 10002   0       New York      81305 2.63e7 NY    New Y… 36      061      http…
## 4 10003   0       New York      55878 1.55e7 NY    New Y… 36      061      http…
## 5 10004   0       New York       2187 6.71e5 NY    New Y… 36      061      http…
## 6 10004   0       New York       2187 1.20e6 NY    New Y… 36      061      http…
## # ℹ 27 more variables: SHAPE_AREA <dbl>, SHAPE_LEN <dbl>,
## #   NEIGHBORHOOD_NAME <chr>, BOROUGH_GROUP <chr>, label <chr>, lat <dbl>,
## #   lon <dbl>, COVID_CASE_COUNT <dbl>, COVID_CASE_RATE <dbl>,
## #   POP_DENOMINATOR <dbl>, COVID_DEATH_COUNT <dbl>, COVID_DEATH_RATE <dbl>,
## #   PERCENT_POSITIVE <dbl>, TOTAL_COVID_TESTS <dbl>, FoodStoreNum <int>,
## #   NursingHomeNum <int>, totPop <dbl>, elderlyPop <dbl>, malePop <dbl>,
## #   femalePop <dbl>, whitePop <dbl>, blackPop <dbl>, asianPop <dbl>, …

sum(covidPopZipNYC$totPop, na.rm = TRUE)

## [1] 8446394

# see where those NAs are lol 
covidPopZipNYC %>%
  filter(is.na(totPop))

## Simple feature collection with 0 features and 36 fields
## Geometry type: POLYGON
## Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
## Geodetic CRS:  WGS 84
## # A tibble: 0 × 37
## # Groups:   ZIPCODE, BLDGZIP, PO_NAME, POPULATION, AREA, STATE, COUNTY,
## #   ST_FIPS, CTY_FIPS, URL, SHAPE_AREA, SHAPE_LEN, NEIGHBORHOOD_NAME,
## #   BOROUGH_GROUP, label, lat, lon, COVID_CASE_COUNT, COVID_CASE_RATE,
## #   POP_DENOMINATOR, COVID_DEATH_COUNT, COVID_DEATH_RATE, PERCENT_POSITIVE,
## #   TOTAL_COVID_TESTS, FoodStoreNum [0]
## # ℹ 37 variables: ZIPCODE <chr>, BLDGZIP <chr>, PO_NAME <chr>,
## #   POPULATION <dbl>, AREA <dbl>, STATE <chr>, COUNTY <chr>, ST_FIPS <chr>,
## #   CTY_FIPS <chr>, URL <chr>, SHAPE_AREA <dbl>, SHAPE_LEN <dbl>,
## #   NEIGHBORHOOD_NAME <chr>, BOROUGH_GROUP <chr>, label <chr>, lat <dbl>,
## #   lon <dbl>, COVID_CASE_COUNT <dbl>, COVID_CASE_RATE <dbl>,
## #   POP_DENOMINATOR <dbl>, COVID_DEATH_COUNT <dbl>, COVID_DEATH_RATE <dbl>,
## #   PERCENT_POSITIVE <dbl>, TOTAL_COVID_TESTS <dbl>, FoodStoreNum <int>, …

# final verification map
plot(covidPopZipNYC["asianPop"],
     main = "Asian Population by ZIP Code")