In today’s class we’ll learn a few spatial analysis tools so that we can select data based on location.
New functions and concepts:
`na.rm = TRUE`, `pivot_wider()`, `na.value`, `everything()`, `st_read()`, `st_join()`, `st_crs()`, `st_transform()`, `filter()` within a ggplot
- Managing missing values in calculations
- Assignment questions and overview
- Import spatial data from other sources
  - NYC Open Data
  - NHGIS
  - NYC Neighborhood Tabulation Areas
- Final Projects discussion
- Assignment 9
Missing values are a part of messy, real-world data. Understanding how missing data are defined in R and how to perform operations with them will be a critical component of your data cleaning and analysis work.
Simply put, a missing value is a way to signal an absence of information in a dataset. It’s the equivalent of a blank cell in an Excel spreadsheet. In R, missing values typically look like an NA appearing in a variable, a vector, or a dataframe.
library(tidyverse)
# create a variable with a value of NA
var1 <- NA
# create a vector of numeric values with one NA value
vector1 <- c(4, 6, 2, 8, var1, 9)
# view structure of vector1
str(vector1)
## num [1:6] 4 6 2 8 NA 9
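You can test which elements are missing with is.na(), which returns TRUE for each NA:
# check each element of vector1 for missingness
is.na(vector1)
## [1] FALSE FALSE FALSE FALSE  TRUE FALSE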
However, you may also encounter missing values in datasets that aren’t the equivalent of blank cells. Sometimes the creators of a dataset will use a numeric value to indicate missing data (such as 999) or a string of characters (such as "N/A" or "–"). Recall our work with the ProPublica desegregation dataset from lessons 1 and 2:
# load propublica desegregation data
deseg_pp_raw <- read_csv("data/raw/invol_data_propublica.csv")
# view glimpse of df
glimpse(deseg_pp_raw)
## Rows: 769
## Columns: 5
## $ District.Name <chr> "Abbeville 60", "Aberdeen School Dist", "Acadia Parish",…
## $ City <chr> "Abbeville", "Aberdeen", "Crowley", "St Louis", "Alabast…
## $ State <chr> "SC", "MS", "LA", "MO", "AL", "FL", "NC", "TN", "TX", "A…
## $ Year.Lifted <chr> "1984", "STILL OPEN", "1981", "1999", "STILL OPEN", "197…
## $ Year.Placed <chr> "N/A", "1969", "N/A", "N/A", "1963", "N/A", "N/A", "1966…
The use of a character-based representation of a missing value can make your dataset messier than it needs to be. In the Year.Placed column of the ProPublica dataset, the values should probably be numeric, but the presence of "N/A" causes R to play it safe and treat every value in the column as a character.
As you begin to work with a new dataset, you should always investigate and document the following:
- How are missing values represented in the dataset?
- Which variables contain missing values, and how many?
- Why might those values be missing?
You may want to revisit these questions more than once during the exploratory and cleaning phases of your data analysis work. Understanding the context of missing values in your data will help you produce more precise analysis products.
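One way to start investigating is to count how often a suspected sentinel value appears in each column (a quick sketch using the raw ProPublica data loaded above):
# how many times does the "N/A" sentinel appear in each column?
deseg_pp_raw %>%
  summarise(across(everything(), ~ sum(.x == "N/A", na.rm = TRUE)))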
After documenting your missing data, you’ll want to make sure that all of your missing values are represented by NA and not numeric or character values. This can be accomplished several ways:
Define the `NA` value while reading data into R:
# define the dataset's NA value during data import
deseg_pp_clean_na <- read_csv("data/raw/invol_data_propublica.csv",
na = "N/A")
# view glimpse of cleaned version of propublica data
glimpse(deseg_pp_clean_na)
## Rows: 769
## Columns: 5
## $ District.Name <chr> "Abbeville 60", "Aberdeen School Dist", "Acadia Parish",…
## $ City <chr> "Abbeville", "Aberdeen", "Crowley", "St Louis", "Alabast…
## $ State <chr> "SC", "MS", "LA", "MO", "AL", "FL", "NC", "TN", "TX", "A…
## $ Year.Lifted <chr> "1984", "STILL OPEN", "1981", "1999", "STILL OPEN", "197…
## $ Year.Placed <dbl> NA, 1969, NA, NA, 1963, NA, NA, 1966, NA, NA, NA, 1968, …
# note: the Year.Placed column is now a numeric column
Force `NA` conversion by changing the column's data type:
# change a column's data type from character to numeric
deseg_pp_raw %>%
  mutate(Year.Placed = as.numeric(Year.Placed)) %>%
glimpse()
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
## Rows: 769
## Columns: 5
## $ District.Name <chr> "Abbeville 60", "Aberdeen School Dist", "Acadia Parish",…
## $ City <chr> "Abbeville", "Aberdeen", "Crowley", "St Louis", "Alabast…
## $ State <chr> "SC", "MS", "LA", "MO", "AL", "FL", "NC", "TN", "TX", "A…
## $ Year.Lifted <chr> "1984", "STILL OPEN", "1981", "1999", "STILL OPEN", "197…
## $ Year.Placed <dbl> NA, 1969, NA, NA, 1963, NA, NA, 1966, NA, NA, NA, 1968, …
# Again, the Year.Placed column is now numeric and the "N/A" values were converted to `NA`.
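A third option is dplyr's na_if(), which replaces a specific value with NA (a sketch, equivalent to the import-time approach above):
# swap the "N/A" sentinel for NA, then convert the column to numeric
deseg_pp_raw %>%
  mutate(Year.Placed = as.numeric(na_if(Year.Placed, "N/A"))) %>%
  glimpse()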
It’s important not to ignore missing values when you are trying to run calculations with your data. In fact, R will not let you ignore them:
# attempt to sum the values in vector1
sum(vector1)
## [1] NA
# attempt to summarise and extract the value of the median year deseg orders were placed
deseg_pp_clean_na %>%
  summarise(median_yr_placed = median(Year.Placed)) %>%
  pull(median_yr_placed) # pull() selects a column and transforms it into a vector
## [1] NA
To run a calculation on a vector or column that contains NA values, you will need to include an optional argument found in many R functions: na.rm = TRUE. Adding this argument tells R to ignore the NA values when running that function.
sum(vector1, na.rm = TRUE)
## [1] 29
deseg_pp_clean_na %>%
  summarise(median_yr_placed = median(Year.Placed, na.rm = TRUE)) %>%
  pull(median_yr_placed)
## [1] 1969
CAUTION: Do not get in the habit of adding na.rm = TRUE to functions the first time you run them. Be sure to first understand the nature of your missing data before you start ignoring it in your calculations.
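For example, before ignoring the missing Year.Placed values, you might look at which records they belong to (one possible check):
# which states have districts missing Year.Placed?
deseg_pp_clean_na %>%
  filter(is.na(Year.Placed)) %>%
  count(State, sort = TRUE)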
Download the median rent for every county in New York (the variable is called MEDIAN CONTRACT RENT in the ACS).
1. Create a choropleth map of the median rent in each county in New York.
2. Identify areas that are affordable for a person making New York minimum wage. Assume 30% of income for rent and 40 hours a week at the NY minimum wage (currently the minimum wage is $12.50 outside of NYC, Westchester, and Long Island, source); a quick sanity check on this arithmetic follows below.
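As a sanity check on the affordability threshold outside the NYC area (a hand calculation, using the assignment's assumptions):
# 30% of monthly income at $12.50/hour, 40 hours/week, 4 weeks/month
0.3 * 12.50 * 40 * 4
## [1] 600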
# First load in all of the packages we'll use
library(tidyverse)
library(tidycensus)
options(tigris_use_cache = TRUE)
library(sf)
library(scales)
library(RColorBrewer)
library(viridis)
# load all acs variables
acs201519 <- load_variables(2019, "acs5", cache = TRUE)
# Import median rent for every county in NY
raw_ny_county_rent <- get_acs(geography = "county",
                              variables = "B25058_001",
                              state = "NY",
                              geometry = TRUE,
                              year = 2019)
## Getting data from the 2015-2019 5-year ACS
# process it into a nice dataframe
ny_county_rent <- raw_ny_county_rent %>%
rename(median_rent = estimate,
rent_moe = moe) %>%
select(-variable) %>%
  # estimate affordable monthly rent as 30% of monthly income:
  # minimum wage * 40 hours a week * 4 weeks in a month
  # use the link above to get the minimum wage in each part of the state
  mutate(min_wage = case_when(NAME == "Westchester County, New York" ~ 14,
                              NAME == "Suffolk County, New York" ~ 14,
                              NAME == "Nassau County, New York" ~ 14,
                              NAME == "New York County, New York" ~ 15,
                              NAME == "Queens County, New York" ~ 15,
                              NAME == "Kings County, New York" ~ 15,
                              NAME == "Bronx County, New York" ~ 15,
                              NAME == "Richmond County, New York" ~ 15,
                              TRUE ~ 12.5),
         affordable_rent = 0.3 * min_wage * 40 * 4,
         affordable = if_else(affordable_rent > median_rent, "yes", "no"))
summary(ny_county_rent$median_rent)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 510.0 612.5 678.0 805.7 878.0 1651.0
hist(ny_county_rent$median_rent)
ggplot() +
geom_sf(data = ny_county_rent, mapping = aes(fill = median_rent), color = "#ffffff") +
theme_void() +
scale_fill_distiller(breaks=c(510, 612.5, 805.7, 878),
direction = 1,
name="Median Rent ($)",
labels=dollar_format(accuracy = 1L)) +
labs(
title = "New York, Median Rent by County",
subtitle = "Breaks = Quartiles seems like a good idea, but the numbers are meaningless and confusing",
caption = "Source: American Community Survey, 2015-19"
)
ggplot() +
geom_sf(data = ny_county_rent, mapping = aes(fill = median_rent), color = "#ffffff") +
theme_void() +
scale_fill_distiller(breaks=c(750, 1000, 1250, 1500),
direction = 1,
name="Median Rent ($)",
labels=dollar_format(accuracy = 1L)) +
labs(
title = "New York, Median Rent by County",
caption = "Source: American Community Survey, 2015-19"
)
ggplot() +
geom_sf(data = ny_county_rent, mapping = aes(fill = affordable), color = "#e7e9f4") +
theme_void() +
scale_fill_manual(values = c('yes' = '#8da0cb', 'no' = '#fc8d62'),
name = "Median rent affordable?") +
labs(
title = "New York, Rent Affordability by County",
subtitle = "Counties where a full-time worker making $15/hour can afford the median rent",
caption = "Source: American Community Survey, 2015-19"
)
Download the PEOPLE REPORTING ANCESTRY table for every census tract in Kings County (Brooklyn)
When you download more than one variable from the census with geography (or a whole table), the geometry column can get in the way of making your dataset tidy. In this case, I will import the geometry with one variable, then import the whole table without the geometry, process it, and join the two together.
# import the geometry with one variable
raw_ancestry_shp <- get_acs(geography = "tract",
                            variables = "B04006_001",
                            state = "NY",
                            county = "Kings",
                            geometry = TRUE,
                            year = 2019)
## Getting data from the 2015-2019 5-year ACS
raw_ancestry <- get_acs(geography = "tract",
                        table = "B04006",
                        state = "NY",
                        county = "Kings",
                        geometry = FALSE,
                        year = 2019)
## Getting data from the 2015-2019 5-year ACS
# manipulate the data easily, and then rejoin the shapes later
west_indian <- raw_ancestry %>%
  pivot_wider(names_from = variable, values_from = c(estimate, moe)) %>%
  # use rowSums() to add up columns; across() lets you select which columns to sum
  mutate(west_indian = rowSums(across(estimate_B04006_094:estimate_B04006_106))) %>%
  # keep the individual West Indian columns to check your sum
  select(GEOID, NAME, west_indian, estimate_B04006_094:estimate_B04006_106) %>%
  # now that I've checked the table, I can remove each individual West Indian variable
  select(GEOID, NAME, west_indian)
ancestry_shp <- raw_ancestry_shp %>%
select(GEOID, estimate, moe, geometry) %>%
rename(total = estimate,
total_moe = moe) %>%
left_join(west_indian, by = "GEOID") %>%
mutate(pct_west_indian = round(west_indian/total, 3)) %>%
select(GEOID, NAME, total, west_indian, pct_west_indian, total_moe, geometry)
# check the values of pct_west_indian to see how to map
summary(ancestry_shp$pct_west_indian)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0070 0.0570 0.2390 0.3628 1.3580 9
hist(ancestry_shp$pct_west_indian)
ggplot() +
geom_sf(data = ancestry_shp, mapping = aes(fill = pct_west_indian), color = "#ffffff") +
theme_void() +
scale_fill_distiller(breaks=c(0, .2, .4, .6, .8, 1),
direction = 1,
# na.value = "#fafafa",
na.value = "transparent",
name="Percent West Indian Ancestry (%)",
labels=percent_format(accuracy = 1L)) +
labs(
title = "Brooklyn, West Indian Ancestry by Census Tract",
caption = "Source: American Community Survey, 2015-19"
)
Let’s correct the values that are above 100%
ancestry_check <- raw_ancestry %>%
pivot_wider(names_from = variable, values_from = c(estimate, moe)) %>%
select(GEOID, estimate_B04006_001, moe_B04006_001,
moe_B04006_094:moe_B04006_106,
estimate_B04006_094:estimate_B04006_106) %>%
mutate(west_indian = rowSums(across(estimate_B04006_094:estimate_B04006_106)),
pct_west_indian = round(west_indian/estimate_B04006_001, 3)) %>%
select(GEOID, pct_west_indian, estimate_B04006_001, moe_B04006_001,
estimate_B04006_094, moe_B04006_094, everything()) %>%
rename(estimate_total = estimate_B04006_001,
moe_total = moe_B04006_001,
estimate_wi1 = estimate_B04006_094,
moe_wi1 = moe_B04006_094)
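To get a sense of how noisy the summed West Indian counts are, tidycensus also includes moe_sum() for aggregating margins of error. A minimal sketch, assuming the long-format raw_ancestry data and the same subgroup variable codes used above:
# approximate the MOE of the summed West Indian estimate for each tract
west_indian_moe <- raw_ancestry %>%
  filter(variable %in% sprintf("B04006_%03d", 94:106)) %>%
  group_by(GEOID) %>%
  summarise(west_indian = sum(estimate),
            west_indian_moe = moe_sum(moe, estimate))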
In this case, the margins of error are high, but I think the areas with a West Indian percentage above 100% really do have a very high proportion of people with West Indian ancestry. I am going to change those values to 100% and put a disclaimer on my map.
ancestry_shp_corrected <- ancestry_shp %>%
  mutate(pct_west_indian = if_else(pct_west_indian > 1, 1, pct_west_indian))
ggplot() +
geom_sf(data = ancestry_shp_corrected, mapping = aes(fill = pct_west_indian), color = "#ffffff") +
theme_void() +
scale_fill_distiller(breaks=c(0, .2, .4, .6, .8, 1),
direction = 1,
# na.value = "#fafafa",
na.value = "transparent",
name="West Indian Ancestry (%)",
labels=percent_format(accuracy = 1L)) +
labs(
title = "Brooklyn, West Indian Ancestry by Census Tract",
caption = "Source: ACS, 2015-19, ** populations are estimates"
)
Let’s add more context by adding the shape of Brooklyn to the map
kings_shp <- get_acs(state = "NY", county = "Kings", geography = "county",
                     variables = "B04006_001", geometry = TRUE, year = 2019)
## Getting data from the 2015-2019 5-year ACS
ggplot() +
geom_sf(data = ancestry_shp_corrected, mapping = aes(fill = pct_west_indian), color = "#ffffff") +
theme_void() +
scale_fill_distiller(breaks=c(0, .2, .4, .6, .8, 1),
direction = 1,
# na.value = "#fafafa",
na.value = "transparent",
name="West Indian Ancestry (%)",
labels=percent_format(accuracy = 1L)) +
labs(
title = "Brooklyn, West Indian Ancestry by Census Tract",
caption = "Source: ACS, 2015-19, ** populations are estimates"
) +
geom_sf(data = kings_shp, color = "blue", fill = NA)
ny_shp <- get_acs(state = "NY", county = c("Kings", "New York", "Queens", "Richmond", "Bronx"),
                  geography = "county",
                  variables = "B04006_001", geometry = TRUE, year = 2019)
## Getting data from the 2015-2019 5-year ACS
ggplot() +
geom_sf(data = ancestry_shp_corrected, mapping = aes(fill = pct_west_indian), color = "#ffffff") +
theme_void() +
scale_fill_distiller(breaks=c(0, .2, .4, .6, .8, 1),
direction = 1,
# na.value = "#fafafa",
na.value = "transparent",
name="West Indian Ancestry (%)",
labels=percent_format(accuracy = 1L)) +
labs(
title = "Brooklyn, West Indian Ancestry by Census Tract",
caption = "Source: ACS, 2015-19, ** populations are estimates"
) +
geom_sf(data = ny_shp, color = "blue", fill = NA)
You can import shapefiles or GeoJSON files from any source using the sf function st_read().
The New York City Department of City Planning publishes useful shapefiles on its website, including the 2020 Neighborhood Tabulation Areas.
NHGIS, from IPUMS, is another source of spatial census data.
raw_nabes <- st_read("data/raw/nynta2020_21c/nynta2020.shp")
## Reading layer `nynta2020' from data source
## `/Users/sarahodges/spatial/NewSchool/methods1-materials-fall2021/class9/data/raw/nynta2020_21c/nynta2020.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 262 features and 11 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
## Projected CRS: NAD83 / New York Long Island (ftUS)
ggplot() +
geom_sf(data = ancestry_shp_corrected, mapping = aes(fill = pct_west_indian), color = "#ffffff") +
theme_void() +
scale_fill_distiller(breaks=c(0, .2, .4, .6, .8, 1),
direction = 1,
# na.value = "#fafafa",
na.value = "transparent",
name="West Indian Ancestry (%)",
labels=percent_format(accuracy = 1L)) +
labs(
title = "Brooklyn, West Indian Ancestry by Census Tract",
caption = "Source: ACS, 2015-19, ** populations are estimates"
) +
geom_sf(data = raw_nabes, color = "black", fill = NA, lwd = .5)
You can use spatial functions in the `sf` package to identify which neighborhood each census tract is in, so you can:
- filter tracts by neighborhood
- aggregate tract-level data up to the neighborhood level
You can use a spatial join to add the data from one spatial data frame to another. Spatial joins don’t need a common ID; instead, they join data based on spatial relationships.
The sf function for a spatial join is st_join(). The documentation gives details on the requirements. Most importantly, the spatial data frames need to overlap in space, and they need to use the same projection.
- Use st_crs() to print their projections in the console
- See the st_join() help section for details

st_crs(ancestry_shp_corrected)
## Coordinate Reference System:
## User input: NAD83
## wkt:
## GEOGCRS["NAD83",
## DATUM["North American Datum 1983",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["latitude",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["longitude",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4269]]
st_crs(raw_nabes)
## Coordinate Reference System:
## User input: NAD83 / New York Long Island (ftUS)
## wkt:
## PROJCRS["NAD83 / New York Long Island (ftUS)",
## BASEGEOGCRS["NAD83",
## DATUM["North American Datum 1983",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4269]],
## CONVERSION["SPCS83 New York Long Island zone (US Survey feet)",
## METHOD["Lambert Conic Conformal (2SP)",
## ID["EPSG",9802]],
## PARAMETER["Latitude of false origin",40.1666666666667,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8821]],
## PARAMETER["Longitude of false origin",-74,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8822]],
## PARAMETER["Latitude of 1st standard parallel",41.0333333333333,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8823]],
## PARAMETER["Latitude of 2nd standard parallel",40.6666666666667,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8824]],
## PARAMETER["Easting at false origin",984250,
## LENGTHUNIT["US survey foot",0.304800609601219],
## ID["EPSG",8826]],
## PARAMETER["Northing at false origin",0,
## LENGTHUNIT["US survey foot",0.304800609601219],
## ID["EPSG",8827]]],
## CS[Cartesian,2],
## AXIS["easting (X)",east,
## ORDER[1],
## LENGTHUNIT["US survey foot",0.304800609601219]],
## AXIS["northing (Y)",north,
## ORDER[2],
## LENGTHUNIT["US survey foot",0.304800609601219]],
## USAGE[
## SCOPE["Engineering survey, topographic mapping."],
## AREA["United States (USA) - New York - counties of Bronx; Kings; Nassau; New York; Queens; Richmond; Suffolk."],
## BBOX[40.47,-74.26,41.3,-71.8]],
## ID["EPSG",2263]]
There is a lot of info about their projections. The key information is in the last line: the EPSG ID.
If you are working with New York City data, you want the projection to be EPSG:2263. So we will transform the projection of ancestry_shp_corrected to 2263:
ancestry_shp_corrected_2263 <- ancestry_shp_corrected %>%
st_transform(2263)
st_crs(ancestry_shp_corrected_2263)
## Coordinate Reference System:
## User input: EPSG:2263
## wkt:
## PROJCRS["NAD83 / New York Long Island (ftUS)",
## BASEGEOGCRS["NAD83",
## DATUM["North American Datum 1983",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4269]],
## CONVERSION["SPCS83 New York Long Island zone (US Survey feet)",
## METHOD["Lambert Conic Conformal (2SP)",
## ID["EPSG",9802]],
## PARAMETER["Latitude of false origin",40.1666666666667,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8821]],
## PARAMETER["Longitude of false origin",-74,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8822]],
## PARAMETER["Latitude of 1st standard parallel",41.0333333333333,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8823]],
## PARAMETER["Latitude of 2nd standard parallel",40.6666666666667,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8824]],
## PARAMETER["Easting at false origin",984250,
## LENGTHUNIT["US survey foot",0.304800609601219],
## ID["EPSG",8826]],
## PARAMETER["Northing at false origin",0,
## LENGTHUNIT["US survey foot",0.304800609601219],
## ID["EPSG",8827]]],
## CS[Cartesian,2],
## AXIS["easting (X)",east,
## ORDER[1],
## LENGTHUNIT["US survey foot",0.304800609601219]],
## AXIS["northing (Y)",north,
## ORDER[2],
## LENGTHUNIT["US survey foot",0.304800609601219]],
## USAGE[
## SCOPE["Engineering survey, topographic mapping."],
## AREA["United States (USA) - New York - counties of Bronx; Kings; Nassau; New York; Queens; Richmond; Suffolk."],
## BBOX[40.47,-74.26,41.3,-71.8]],
## ID["EPSG",2263]]
# remove unnecessary fields in the neighborhood shapefile
nabes <- raw_nabes %>%
select(BoroName, CountyFIPS, NTA2020, NTAName)
ancestry_shp_nabes <- ancestry_shp_corrected_2263 %>%
st_join(nabes, join = st_intersects)
This creates a data frame with duplicate census tracts whenever a tract intersects more than one neighborhood. That works for this case, where I want a spatial data frame of all census tracts in Crown Heights (North).
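If you instead want each tract assigned to exactly one neighborhood, st_join() also accepts largest = TRUE, which keeps only the neighborhood with the largest overlap for each tract (a sketch; slower, but it avoids duplicates):
# assign each tract to the single neighborhood it overlaps most
ancestry_shp_nabes_unique <- ancestry_shp_corrected_2263 %>%
  st_join(nabes, largest = TRUE)
For the Crown Heights example below, though, the duplicated version is what we want.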
crown_heights_n_west_indian <- ancestry_shp_nabes %>%
filter(NTAName == "Crown Heights (North)")
ggplot() +
geom_sf(data = crown_heights_n_west_indian, mapping = aes(fill = pct_west_indian), color = "#ffffff") +
theme_void() +
scale_fill_distiller(breaks=c(0, .2, .4, .6, .8, 1),
direction = 1,
# na.value = "#fafafa",
na.value = "transparent",
name="West Indian Ancestry (%)",
labels=percent_format(accuracy = 1L)) +
labs(
title = "Crown Heights North, West Indian Ancestry by Census Tract",
caption = "Source: ACS, 2015-19 (population estimates), NYC Planning"
) +
geom_sf(data = nabes %>% filter(NTAName == "Crown Heights (North)"), color = "blue", fill = NA)
west_indian_nabe_stats <- st_drop_geometry(ancestry_shp_nabes) %>%
group_by(NTAName) %>%
summarise(Borough = first(BoroName),
`Est. Total Population` = sum(total),
`Est. Total West Indian Population` = sum(west_indian)) %>%
mutate(`Est. Percent West Indian Ancestry` = percent(`Est. Total West Indian Population`/`Est. Total Population`, accuracy = 1L))
west_indian_nabe_stats
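Since the summary table is a regular data frame, you can sort it to surface the neighborhoods with the largest estimated West Indian populations (one option):
# top 10 neighborhoods by estimated West Indian population
west_indian_nabe_stats %>%
  arrange(desc(`Est. Total West Indian Population`)) %>%
  head(10)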
Assignment 9a: Reading, final project edition
Assignment 9: Spatial research question with R
Answer a simple research question about New York City using census data at the tract level and at least two shapefiles from NYC Open Data. You can build upon the maps that you created for assignment 8, or use completely different data.
Download American Community Survey data or 2020 Decennial Census data of your choice at the census tract level, with geography. Download two shapefiles from the NYC Open Data portal. Do a similar level of analysis to what we did in class with the West Indian population.
Assignment 9b: Research Journal, final project edition
For the last installment of the research journal assignment, submit a research journal idea related to your final project. Write a 1-2 sentence description of a research idea that you are considering for your final project, and what data you could use or create to study it. You may also want to include any additional information that would help explain your idea.
This doesn’t need to be your final project in the end, though it can be. If you haven’t selected a topic yet, use this assignment to explore possibilities. If you have, this should be a short intro to your research question.