In today’s class we’ll learn a few spatial analysis tools so that we can select data based on location.
New functions and concepts:
`na.rm = TRUE`, `pivot_wider()`, `na.value`, `everything()`, `st_read()`, `st_join()`, `st_crs()`, `st_transform()`, `filter()` within a ggplot
- Managing missing values in calculations
- Assignment questions and overview
- Import spatial data from other sources
  - NYC Open Data
  - NHGIS
  - NYC Neighborhood Tabulation Areas
- Final Projects discussion
- Assignment 9
Missing values are a part of messy, real-world data. Understanding how missing data are defined in R and how to perform operations with them will be a critical component of your data cleaning and analysis work.
Simply put, a missing value is a way to signal an absence of information in a dataset. It’s the equivalent of a blank cell in an Excel spreadsheet. In R, missing values typically look like an NA appearing in a variable, a vector, or a dataframe.
library(tidyverse)
# create a variable with a value of NA
var1 <- NA
# create a vector of numeric values with one NA value
vector1 <- c(4, 6, 2, 8, var1, 9)
# view structure of vector1
str(vector1)
## num [1:6] 4 6 2 8 NA 9
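You can test which elements are missing with is.na(), which returns TRUE for each NA:
# check each element of vector1 for missingness
is.na(vector1)
## [1] FALSE FALSE FALSE FALSE  TRUE FALSE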
However, you may also encounter missing values in datasets that aren’t the equivalent of blank cells. Sometimes the creators of a dataset will use a numeric value to indicate missing data (such as 999) or a string of characters (such as "N/A" or "–"). Recall our work with the ProPublica desegregation dataset from lessons 1 and 2:
# load propublica desegregation data
deseg_pp_raw <- read_csv("data/raw/invol_data_propublica.csv")
# view glimpse of df
glimpse(deseg_pp_raw)
## Rows: 769
## Columns: 5
## $ District.Name <chr> "Abbeville 60", "Aberdeen School Dist", "Acadia Parish",…
## $ City <chr> "Abbeville", "Aberdeen", "Crowley", "St Louis", "Alabast…
## $ State <chr> "SC", "MS", "LA", "MO", "AL", "FL", "NC", "TN", "TX", "A…
## $ Year.Lifted <chr> "1984", "STILL OPEN", "1981", "1999", "STILL OPEN", "197…
## $ Year.Placed <chr> "N/A", "1969", "N/A", "N/A", "1963", "N/A", "N/A", "1966…
The use of a character-based representation of a missing value can make your dataset messier than it needs to be. In the Year.Placed column of the ProPublica dataset, the values should probably be numeric, but the presence of "N/A" causes R to play it safe and treat every value in the column as a character.
As you begin to work with a new dataset, you should always investigate and document the following:
- How are missing values represented in the dataset?
- Which variables contain missing values, and how many?
- Why might those values be missing?
You may want to revisit these questions more than once during the exploratory and cleaning phases of your data analysis work. Understanding the context of missing values in your data will help you produce more precise analysis products.
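One way to start investigating is to count how often a suspected sentinel value appears in each column (a quick sketch using the raw ProPublica data loaded above):
# how many times does the "N/A" sentinel appear in each column?
deseg_pp_raw %>%
  summarise(across(everything(), ~ sum(.x == "N/A", na.rm = TRUE)))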
After documenting your missing data, you’ll want to make sure that all of your missing values are represented by NA and not numeric or character values. This can be accomplished several ways:
Define the `NA` value while reading data into R:
# define the dataset's NA value during data import
deseg_pp_clean_na <- read_csv("data/raw/invol_data_propublica.csv",
na = "N/A")
# view glimpse of cleaned version of propublica data
glimpse(deseg_pp_clean_na)
## Rows: 769
## Columns: 5
## $ District.Name <chr> "Abbeville 60", "Aberdeen School Dist", "Acadia Parish",…
## $ City <chr> "Abbeville", "Aberdeen", "Crowley", "St Louis", "Alabast…
## $ State <chr> "SC", "MS", "LA", "MO", "AL", "FL", "NC", "TN", "TX", "A…
## $ Year.Lifted <chr> "1984", "STILL OPEN", "1981", "1999", "STILL OPEN", "197…
## $ Year.Placed <dbl> NA, 1969, NA, NA, 1963, NA, NA, 1966, NA, NA, NA, 1968, …
# note: the Year.Placed column is now a numeric column
Force `NA` conversion by changing the column's data type:
# change a column's data type from character to numeric
deseg_pp_raw %>%
  mutate(Year.Placed = as.numeric(Year.Placed)) %>%
glimpse()
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## Rows: 769
## Columns: 5
## $ District.Name <chr> "Abbeville 60", "Aberdeen School Dist", "Acadia Parish",…
## $ City <chr> "Abbeville", "Aberdeen", "Crowley", "St Louis", "Alabast…
## $ State <chr> "SC", "MS", "LA", "MO", "AL", "FL", "NC", "TN", "TX", "A…
## $ Year.Lifted <chr> "1984", "STILL OPEN", "1981", "1999", "STILL OPEN", "197…
## $ Year.Placed <dbl> NA, 1969, NA, NA, 1963, NA, NA, 1966, NA, NA, NA, 1968, …
# Again, the Year.Placed column is now numeric and the "N/A" values were converted to `NA`.
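A third option is dplyr's na_if(), which replaces a specific value with NA (a sketch, equivalent to the import-time approach above):
# swap the "N/A" sentinel for NA, then convert the column to numeric
deseg_pp_raw %>%
  mutate(Year.Placed = as.numeric(na_if(Year.Placed, "N/A"))) %>%
  glimpse()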
It’s important not to ignore missing values when you are trying to run calculations with your data. In fact, R will not let you ignore them:
# attempt to sum the values in vector1
sum(vector1)
## [1] NA
# attempt to summarise and extract the value of the median year deseg orders were placed
deseg_pp_clean_na %>%
  summarise(median_yr_placed = median(Year.Placed)) %>%
  pull(median_yr_placed) # pull() selects a column and transforms it into a vector
## [1] NA
To run a calculation on a vector or column that contains NA values, you will need to include an optional argument found in many R functions: na.rm = TRUE. Adding this argument tells R to ignore the NA values when running that function.
sum(vector1, na.rm = TRUE)
## [1] 29
deseg_pp_clean_na %>%
  summarise(median_yr_placed = median(Year.Placed, na.rm = TRUE)) %>%
  pull(median_yr_placed)
## [1] 1969
CAUTION: Do not get in the habit of adding na.rm = TRUE to functions the first time you run them. Be sure to first understand the nature of your missing data before you start ignoring it in your calculations.
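For example, before ignoring the missing Year.Placed values, you might look at which records they belong to (one possible check):
# which states have districts missing Year.Placed?
deseg_pp_clean_na %>%
  filter(is.na(Year.Placed)) %>%
  count(State, sort = TRUE)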
Download the median rent for every county in New York (the variable is called MEDIAN CONTRACT RENT in the ACS).
1. Create a choropleth map of the median rent in each county in New York.
2. Identify areas that are affordable for a person making New York minimum wage. Assume 30% of income for rent and 40 hours a week at the NY minimum wage (currently the minimum wage is $12.50 outside of NYC, Westchester, and Long Island, source); a quick sanity check on this arithmetic follows below.
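As a sanity check on the affordability threshold outside the NYC area (a hand calculation, using the assignment's assumptions):
# 30% of monthly income at $12.50/hour, 40 hours/week, 4 weeks/month
0.3 * 12.50 * 40 * 4
## [1] 600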
# First load in all of the packages we'll use
library(tidyverse)
library(tidycensus)
options(tigris_use_cache = TRUE)
library(sf)
library(scales)
library(RColorBrewer)
library(viridis)
# load all acs variables
acs201519 <- load_variables(2019, "acs5", cache = TRUE)
# Import median rent for every county in NY
raw_ny_county_rent <- get_acs(geography = "county",
                              variables = "B25058_001",
                              state = "NY",
                              geometry = TRUE,
                              year = 2019)
## Getting data from the 2015-2019 5-year ACS
# process it into a nice dataframe
ny_county_rent <- raw_ny_county_rent %>%
rename(median_rent = estimate,
rent_moe = moe) %>%
select(-variable) %>%
  # estimate affordable monthly rent as 30% of monthly income:
  # minimum wage * 40 hours a week * 4 weeks in a month
  # use the link above to get the minimum wage in each part of the state
  mutate(min_wage = case_when(NAME == "Westchester County, New York" ~ 14,
                              NAME == "Suffolk County, New York" ~ 14,
                              NAME == "Nassau County, New York" ~ 14,
                              NAME == "New York County, New York" ~ 15,
                              NAME == "Queens County, New York" ~ 15,
                              NAME == "Kings County, New York" ~ 15,
                              NAME == "Bronx County, New York" ~ 15,
                              NAME == "Richmond County, New York" ~ 15,
                              TRUE ~ 12.5),
         affordable_rent = 0.3 * min_wage * 40 * 4,
         affordable = if_else(affordable_rent > median_rent, "yes", "no"))
summary(ny_county_rent$median_rent)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 510.0 612.5 678.0 805.7 878.0 1651.0
hist(ny_county_rent$median_rent)
ggplot() +
geom_sf(data = ny_county_rent, mapping = aes(fill = median_rent), color = "#ffffff") +
theme_void() +
scale_fill_distiller(breaks=c(510, 612.5, 805.7, 878),
direction = 1,
name="Median Rent ($)",
labels=dollar_format(accuracy = 1L)) +
labs(
title = "New York, Median Rent by County",
subtitle = "Breaks = Quartiles seems like a good idea, but the numbers are meaningless and confusing",
caption = "Source: American Community Survey, 2015-19"
)
ggplot() +
geom_sf(data = ny_county_rent, mapping = aes(fill = median_rent), color = "#ffffff") +
theme_void() +
scale_fill_distiller(breaks=c(750, 1000, 1250, 1500),
direction = 1,
name="Median Rent ($)",
labels=dollar_format(accuracy = 1L)) +
labs(
title = "New York, Median Rent by County",
caption = "Source: American Community Survey, 2015-19"
)
ggplot() +
geom_sf(data = ny_county_rent, mapping = aes(fill = affordable), color = "#e7e9f4") +
theme_void() +
scale_fill_manual(values = c('yes' = '#8da0cb', 'no' = '#fc8d62'),
name = "Median rent affordable?") +
labs(
title = "New York, Rent Affordability by County",
subtitle = "Counties where a full-time worker making $15/hour can afford the median rent",
caption = "Source: American Community Survey, 2015-19"
)
Download the PEOPLE REPORTING ANCESTRY table for every census tract in Kings County (Brooklyn)
When you download more than one variable from the census with geography (or a whole table), the geometry column can get in the way of making your dataset tidy. In this case, I will import the geometry with one variable, then import the whole table without the geometry, process it, and join the two together.
# import the geometry with one variable
raw_ancestry_shp <- get_acs(geography = "tract",
                            variables = "B04006_001",
                            state = "NY",
                            county = "Kings",
                            geometry = TRUE,
                            year = 2019)
## Getting data from the 2015-2019 5-year ACS
raw_ancestry <- get_acs(geography = "tract",
                        table = "B04006",
                        state = "NY",
                        county = "Kings",
                        geometry = FALSE,
                        year = 2019)
## Getting data from the 2015-2019 5-year ACS
# manipulate the data easily, and then rejoin the shapes later
west_indian <- raw_ancestry %>%
  pivot_wider(names_from = variable, values_from = c(estimate, moe)) %>%
  # use rowSums() to add up columns; across() lets you select which columns to sum
  mutate(west_indian = rowSums(across(estimate_B04006_094:estimate_B04006_106))) %>%
  # keep the individual West Indian columns to check your sum
  select(GEOID, NAME, west_indian, estimate_B04006_094:estimate_B04006_106) %>%
  # now that I've checked the table, I can remove each individual West Indian variable
  select(GEOID, NAME, west_indian)
ancestry_shp <- raw_ancestry_shp %>%
select(GEOID, estimate, moe, geometry) %>%
rename(total = estimate,
total_moe = moe) %>%
left_join(west_indian, by = "GEOID") %>%
mutate(pct_west_indian = round(west_indian/total, 3)) %>%
select(GEOID, NAME, total, west_indian, pct_west_indian, total_moe, geometry)
# check the values of pct_west_indian to see how to map
summary(ancestry_shp$pct_west_indian)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0070 0.0570 0.2390 0.3628 1.3580 9
hist(ancestry_shp$pct_west_indian)
ggplot() +
geom_sf(data = ancestry_shp, mapping = aes(fill = pct_west_indian), color = "#ffffff") +
theme_void() +
scale_fill_distiller(breaks=c(0, .2, .4, .6, .8, 1),
direction = 1,
# na.value = "#fafafa",
na.value = "transparent",
name="Percent West Indian Ancestry (%)",
labels=percent_format(accuracy = 1L)) +
labs(
title = "Brooklyn, West Indian Ancestry by Census Tract",
caption = "Source: American Community Survey, 2015-19"
)
Let’s correct the values that are above 100%
ancestry_check <- raw_ancestry %>%
pivot_wider(names_from = variable, values_from = c(estimate, moe)) %>%
select(GEOID, estimate_B04006_001, moe_B04006_001,
moe_B04006_094:moe_B04006_106,
estimate_B04006_094:estimate_B04006_106) %>%
mutate(west_indian = rowSums(across(estimate_B04006_094:estimate_B04006_106)),
pct_west_indian = round(west_indian/estimate_B04006_001, 3)) %>%
select(GEOID, pct_west_indian, estimate_B04006_001, moe_B04006_001,
estimate_B04006_094, moe_B04006_094, everything()) %>%
rename(estimate_total = estimate_B04006_001,
moe_total = moe_B04006_001,
estimate_wi1 = estimate_B04006_094,
moe_wi1 = moe_B04006_094)
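To get a sense of how noisy the summed West Indian counts are, tidycensus also includes moe_sum() for aggregating margins of error. A minimal sketch, assuming the long-format raw_ancestry data and the same subgroup variable codes used above:
# approximate the MOE of the summed West Indian estimate for each tract
west_indian_moe <- raw_ancestry %>%
  filter(variable %in% sprintf("B04006_%03d", 94:106)) %>%
  group_by(GEOID) %>%
  summarise(west_indian = sum(estimate),
            west_indian_moe = moe_sum(moe, estimate))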
In this case, the margins of error are high, but I think the areas with a West Indian percentage above 100% really do have a very high proportion of people with West Indian ancestry. I am going to change those values to 100% and put a disclaimer on my map.
ancestry_shp_corrected <- ancestry_shp %>%
  mutate(pct_west_indian = if_else(pct_west_indian > 1, 1, pct_west_indian))
ggplot() +
geom_sf(data = ancestry_shp_corrected, mapping = aes(fill = pct_west_indian), color = "#ffffff") +
theme_void() +
scale_fill_distiller(breaks=c(0, .2, .4, .6, .8, 1),
direction = 1,
# na.value = "#fafafa",
na.value = "transparent",
name="West Indian Ancestry (%)",
labels=percent_format(accuracy = 1L)) +
labs(
title = "Brooklyn, West Indian Ancestry by Census Tract",
caption = "Source: ACS, 2015-19, ** populations are estimates"
)
Let’s add more context by adding the shape of Brooklyn to the map
kings_shp <- get_acs(state = "NY", county = "Kings", geography = "county",
                     variables = "B04006_001", geometry = TRUE, year = 2019)
## Getting data from the 2015-2019 5-year ACS
ggplot() +
geom_sf(data = ancestry_shp_corrected, mapping = aes(fill = pct_west_indian), color = "#ffffff") +
theme_void() +
scale_fill_distiller(breaks=c(0, .2, .4, .6, .8, 1),
direction = 1,
# na.value = "#fafafa",
na.value = "transparent",
name="West Indian Ancestry (%)",
labels=percent_format(accuracy = 1L)) +
labs(
title = "Brooklyn, West Indian Ancestry by Census Tract",
caption = "Source: ACS, 2015-19, ** populations are estimates"
) +
geom_sf(data = kings_shp, color = "blue", fill = NA)
ny_shp <- get_acs(state = "NY", county = c("Kings", "New York", "Queens", "Richmond", "Bronx"),
                  geography = "county",
                  variables = "B04006_001", geometry = TRUE, year = 2019)
## Getting data from the 2015-2019 5-year ACS
ggplot() +
geom_sf(data = ancestry_shp_corrected, mapping = aes(fill = pct_west_indian), color = "#ffffff") +
theme_void() +
scale_fill_distiller(breaks=c(0, .2, .4, .6, .8, 1),
direction = 1,
# na.value = "#fafafa",
na.value = "transparent",
name="West Indian Ancestry (%)",
labels=percent_format(accuracy = 1L)) +
labs(
title = "Brooklyn, West Indian Ancestry by Census Tract",
caption = "Source: ACS, 2015-19, ** populations are estimates"
) +
geom_sf(data = ny_shp, color = "blue", fill = NA)
You can import shapefiles or GeoJSON files from any source using the sf function st_read().
The New York City Department of City Planning publishes useful shapefiles on its website, including the 2020 Neighborhood Tabulation Areas.
NHGIS, from IPUMS, is another source of spatial census data.
raw_nabes <- st_read("data/raw/nynta2020_21c/nynta2020.shp")
## Reading layer `nynta2020' from data source
## `/Users/sarahodges/spatial/NewSchool/methods1-materials-fall2021/class9/data/raw/nynta2020_21c/nynta2020.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 262 features and 11 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
## Projected CRS: NAD83 / New York Long Island (ftUS)
ggplot() +
geom_sf(data = ancestry_shp_corrected, mapping = aes(fill = pct_west_indian), color = "#ffffff") +
theme_void() +
scale_fill_distiller(breaks=c(0, .2, .4, .6, .8, 1),
direction = 1,
# na.value = "#fafafa",
na.value = "transparent",
name="West Indian Ancestry (%)",
labels=percent_format(accuracy = 1L)) +
labs(
title = "Brooklyn, West Indian Ancestry by Census Tract",
caption = "Source: ACS, 2015-19, ** populations are estimates"
) +
geom_sf(data = raw_nabes, color = "black", fill = NA, lwd = .5)
You can use spatial functions in the `sf` package to identify which neighborhood each census tract is in, so you can:
- filter tracts by neighborhood
- aggregate tract-level data up to the neighborhood level
You can use a spatial join to add the data from one spatial data frame to another. Spatial joins don’t need a common ID; instead, they join data based on spatial relationships.
The sf function for a spatial join is st_join(). The documentation gives details on the requirements. Most importantly, the spatial data frames need to overlap in space, and they need to use the same projection.
- Use st_crs() to print their projections in the console
- See the st_join() help section for details

st_crs(ancestry_shp_corrected)
## Coordinate Reference System:
## User input: NAD83
## wkt:
## GEOGCRS["NAD83",
## DATUM["North American Datum 1983",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["latitude",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["longitude",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4269]]
st_crs(raw_nabes)
## Coordinate Reference System:
## User input: NAD83 / New York Long Island (ftUS)
## wkt:
## PROJCRS["NAD83 / New York Long Island (ftUS)",
## BASEGEOGCRS["NAD83",
## DATUM["North American Datum 1983",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4269]],
## CONVERSION["SPCS83 New York Long Island zone (US Survey feet)",
## METHOD["Lambert Conic Conformal (2SP)",
## ID["EPSG",9802]],
## PARAMETER["Latitude of false origin",40.1666666666667,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8821]],
## PARAMETER["Longitude of false origin",-74,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8822]],
## PARAMETER["Latitude of 1st standard parallel",41.0333333333333,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8823]],
## PARAMETER["Latitude of 2nd standard parallel",40.6666666666667,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8824]],
## PARAMETER["Easting at false origin",984250,
## LENGTHUNIT["US survey foot",0.304800609601219],
## ID["EPSG",8826]],
## PARAMETER["Northing at false origin",0,
## LENGTHUNIT["US survey foot",0.304800609601219],
## ID["EPSG",8827]]],
## CS[Cartesian,2],
## AXIS["easting (X)",east,
## ORDER[1],
## LENGTHUNIT["US survey foot",0.304800609601219]],
## AXIS["northing (Y)",north,
## ORDER[2],
## LENGTHUNIT["US survey foot",0.304800609601219]],
## USAGE[
## SCOPE["Engineering survey, topographic mapping."],
## AREA["United States (USA) - New York - counties of Bronx; Kings; Nassau; New York; Queens; Richmond; Suffolk."],
## BBOX[40.47,-74.26,41.3,-71.8]],
## ID["EPSG",2263]]
There is a lot of info about their projections. The key information is in the last line: the EPSG ID.
If you are working with New York City data, you want the projection to be EPSG:2263. So we will transform the projection of ancestry_shp_corrected to 2263:
ancestry_shp_corrected_2263 <- ancestry_shp_corrected %>%
st_transform(2263)
st_crs(ancestry_shp_corrected_2263)
## Coordinate Reference System:
## User input: EPSG:2263
## wkt:
## PROJCRS["NAD83 / New York Long Island (ftUS)",
## BASEGEOGCRS["NAD83",
## DATUM["North American Datum 1983",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4269]],
## CONVERSION["SPCS83 New York Long Island zone (US Survey feet)",
## METHOD["Lambert Conic Conformal (2SP)",
## ID["EPSG",9802]],
## PARAMETER["Latitude of false origin",40.1666666666667,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8821]],
## PARAMETER["Longitude of false origin",-74,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8822]],
## PARAMETER["Latitude of 1st standard parallel",41.0333333333333,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8823]],
## PARAMETER["Latitude of 2nd standard parallel",40.6666666666667,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8824]],
## PARAMETER["Easting at false origin",984250,
## LENGTHUNIT["US survey foot",0.304800609601219],
## ID["EPSG",8826]],
## PARAMETER["Northing at false origin",0,
## LENGTHUNIT["US survey foot",0.304800609601219],
## ID["EPSG",8827]]],
## CS[Cartesian,2],
## AXIS["easting (X)",east,
## ORDER[1],
## LENGTHUNIT["US survey foot",0.304800609601219]],
## AXIS["northing (Y)",north,
## ORDER[2],
## LENGTHUNIT["US survey foot",0.304800609601219]],
## USAGE[
## SCOPE["Engineering survey, topographic mapping."],
## AREA["United States (USA) - New York - counties of Bronx; Kings; Nassau; New York; Queens; Richmond; Suffolk."],
## BBOX[40.47,-74.26,41.3,-71.8]],
## ID["EPSG",2263]]
# remove unnecessary fields in the neighborhood shapefile
nabes <- raw_nabes %>%
select(BoroName, CountyFIPS, NTA2020, NTAName)
ancestry_shp_nabes <- ancestry_shp_corrected_2263 %>%
st_join(nabes, join = st_intersects)
This creates a data frame with duplicate census tracts whenever a tract intersects more than one neighborhood. That works for this case, where I want a spatial data frame of all census tracts in Crown Heights (North).
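If you instead want each tract assigned to exactly one neighborhood, st_join() also accepts largest = TRUE, which keeps only the neighborhood with the largest overlap for each tract (a sketch; slower, but it avoids duplicates):
# assign each tract to the single neighborhood it overlaps most
ancestry_shp_nabes_unique <- ancestry_shp_corrected_2263 %>%
  st_join(nabes, largest = TRUE)
For the Crown Heights example below, though, the duplicated version is what we want.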
crown_heights_n_west_indian <- ancestry_shp_nabes %>%
filter(NTAName == "Crown Heights (North)")
ggplot() +
geom_sf(data = crown_heights_n_west_indian, mapping = aes(fill = pct_west_indian), color = "#ffffff") +
theme_void() +
scale_fill_distiller(breaks=c(0, .2, .4, .6, .8, 1),
direction = 1,
# na.value = "#fafafa",
na.value = "transparent",
name="West Indian Ancestry (%)",
labels=percent_format(accuracy = 1L)) +
labs(
title = "Crown Heights North, West Indian Ancestry by Census Tract",
caption = "Source: ACS, 2015-19 (population estimates), NYC Planning"
) +
geom_sf(data = nabes %>% filter(NTAName == "Crown Heights (North)"), color = "blue", fill = NA)
west_indian_nabe_stats <- st_drop_geometry(ancestry_shp_nabes) %>%
group_by(NTAName) %>%
summarise(Borough = first(BoroName),
`Est. Total Population` = sum(total),
`Est. Total West Indian Population` = sum(west_indian)) %>%
mutate(`Est. Percent West Indian Ancestry` = percent(`Est. Total West Indian Population`/`Est. Total Population`, accuracy = 1L))
west_indian_nabe_stats
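Since the summary table is a regular data frame, you can sort it to surface the neighborhoods with the largest estimated West Indian populations (one option):
# top 10 neighborhoods by estimated West Indian population
west_indian_nabe_stats %>%
  arrange(desc(`Est. Total West Indian Population`)) %>%
  head(10)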
Assignment 9a: Reading, final project edition
Assignment 9: Spatial research question with R
Answer a simple research question about New York City using census data at the tract level and at least two shapefiles from NYC Open Data. You can build upon the maps that you created for assignment 8, or use completely different data.
Download American Community Survey data or 2020 Decennial Census data of your choice at the census tract level, with geography. Download two shapefiles from the NYC Open Data portal. Do a similar level of analysis to what we did in class with the West Indian population.
Assignment 9b: Research Journal, final project edition
For the last installment of the research journal assignment, submit a research journal idea related to your final project. Write a 1-2 sentence description of a research idea that you are considering for your final project, and what data you could use or create to study it. You may also want to include any additional information that would help explain your idea.
This doesn’t need to be your final project in the end, though it can be. If you haven’t selected a topic yet, use this assignment to explore possibilities. If you have, this should be a short intro to your research question.