I like to store my spatial data separately from my tabular data to keep things neat.
- na.rm = TRUE
- is.na(): tests if a value is NA
- is.nan(): tests if a value is NaN (Not a Number)
- st_join()
- st_crs()
- st_transform()
A missing value is a way to signal an absence of information in a dataset.
Common reasons for missing values:
Missing values are a part of messy, real-world data. Understanding how missing data are defined in R and how to perform operations with them will be a critical component of your data cleaning and analysis work.
In R, a missing value shows up as NA appearing in a variable, a vector, or a dataframe:
# create a vector of numeric values with one NA value
vector1 <- c(4, 6, 2, 8, NA, 9)
# view structure of vector1
str(vector1)
num [1:6] 4 6 2 8 NA 9
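As a small sketch in base R, reusing vector1 from above, is.na() can be used to count and locate the missing entries. The names missing_flags, n_missing, and which_missing are illustrative and not part of the original material.

```r
# same vector as above
vector1 <- c(4, 6, 2, 8, NA, 9)

# is.na() returns TRUE for each missing element
missing_flags <- is.na(vector1)

# count the missing values and find their positions
n_missing <- sum(missing_flags)
which_missing <- which(missing_flags)

print(n_missing)      # 1
print(which_missing)  # 5
```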
Missing values are not always coded as NA. They may appear as a number (such as .999) or a string of characters (such as "N/A" or "–").
District.Name | City | State | Year.Lifted | Year.Placed |
---|---|---|---|---|
Abbeville 60 | Abbeville | SC | 1984 | N/A |
Aberdeen School Dist | Aberdeen | MS | STILL OPEN | 1969 |
Acadia Parish | Crowley | LA | 1981 | N/A |
Affton 101 | St Louis | MO | 1999 | N/A |
Alabaster City | Alabaster | AL | STILL OPEN | 1963 |
Rows: 769
Columns: 5
$ District.Name <chr> "Abbeville 60", "Aberdeen School Dist", "Acadia Parish",…
$ City <chr> "Abbeville", "Aberdeen", "Crowley", "St Louis", "Alabast…
$ State <chr> "SC", "MS", "LA", "MO", "AL", "FL", "NC", "TN", "TX", "A…
$ Year.Lifted <chr> "1984", "STILL OPEN", "1981", "1999", "STILL OPEN", "197…
$ Year.Placed <chr> "N/A", "1969", "N/A", "N/A", "1963", "N/A", "N/A", "1966…
The Year.Placed column should probably be numeric, but "N/A" makes R treat every value in that column as a character.
As you begin to work with a new dataset, you should always investigate and document the following:
If missing values are not represented by NA, you can:
- Define the NA value while reading data into R
- Force NA conversion
- Use ifelse() to redefine the value to NA
Define the NA value during import:
District.Name | City | State | Year.Lifted | Year.Placed |
---|---|---|---|---|
Abbeville 60 | Abbeville | SC | 1984 | NA |
Aberdeen School Dist | Aberdeen | MS | STILL OPEN | 1969 |
Acadia Parish | Crowley | LA | 1981 | NA |
Affton 101 | St Louis | MO | 1999 | NA |
Alabaster City | Alabaster | AL | STILL OPEN | 1963 |
Rows: 769
Columns: 5
$ District.Name <chr> "Abbeville 60", "Aberdeen School Dist", "Acadia Parish",…
$ City <chr> "Abbeville", "Aberdeen", "Crowley", "St Louis", "Alabast…
$ State <chr> "SC", "MS", "LA", "MO", "AL", "FL", "NC", "TN", "TX", "A…
$ Year.Lifted <chr> "1984", "STILL OPEN", "1981", "1999", "STILL OPEN", "197…
$ Year.Placed <dbl> NA, 1969, NA, NA, 1963, NA, NA, 1966, NA, NA, NA, 1968, …
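A minimal sketch of defining the NA value at import, using base R's read.csv() and its na.strings argument. The three inlined rows and the name deseg_sample are stand-ins for the real file, which is not shown here; readr's read_csv() offers a similar na argument.

```r
# a few rows in the same shape as the table above, inlined as text
csv_text <- "District.Name,City,State,Year.Lifted,Year.Placed
Abbeville 60,Abbeville,SC,1984,N/A
Aberdeen School Dist,Aberdeen,MS,STILL OPEN,1969
Acadia Parish,Crowley,LA,1981,N/A"

# na.strings lists the text values that should be read as missing
deseg_sample <- read.csv(text = csv_text, na.strings = "N/A")

# Year.Placed is now numeric, with NA where the file said "N/A"
str(deseg_sample$Year.Placed)

# Year.Lifted is still character because of "STILL OPEN"
str(deseg_sample$Year.Lifted)
```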
District.Name | City | State | Year.Lifted | Year.Placed |
---|---|---|---|---|
Abbeville 60 | Abbeville | SC | 1984 | 1984 |
Aberdeen School Dist | Aberdeen | MS | STILL OPEN | NA |
Acadia Parish | Crowley | LA | 1981 | 1981 |
Affton 101 | St Louis | MO | 1999 | 1999 |
Alabaster City | Alabaster | AL | STILL OPEN | NA |
Rows: 769
Columns: 5
$ District.Name <chr> "Abbeville 60", "Aberdeen School Dist", "Acadia Parish",…
$ City <chr> "Abbeville", "Aberdeen", "Crowley", "St Louis", "Alabast…
$ State <chr> "SC", "MS", "LA", "MO", "AL", "FL", "NC", "TN", "TX", "A…
$ Year.Lifted <chr> "1984", "STILL OPEN", "1981", "1999", "STILL OPEN", "197…
$ Year.Placed <dbl> 1984, NA, 1981, 1999, NA, 1971, 2009, NA, 2002, 2002, NA…
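A hedged sketch of how a forced conversion produces NA values: as.numeric() returns NA (with a warning) for any text it cannot parse as a number, such as "STILL OPEN". The vector below is a hand-typed sample, not the full dataset, and the object names are illustrative.

```r
# hand-typed sample of a character column that mixes years and text
year_lifted <- c("1984", "STILL OPEN", "1981", "1999", "STILL OPEN")

# as.numeric() converts what it can; anything else becomes NA
# (suppressWarnings() hides the "NAs introduced by coercion" warning)
year_lifted_num <- suppressWarnings(as.numeric(year_lifted))
print(year_lifted_num)

# document how many values were turned into NA by the conversion
n_converted_to_na <- sum(is.na(year_lifted_num))
print(n_converted_to_na)  # 2
```

Because this conversion silently discards the original text, it is worth recording which values were lost before overwriting the column.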
Use ifelse() to redefine the value of NA values:
District.Name | City | State | Year.Lifted | Year.Placed |
---|---|---|---|---|
Abbeville 60 | Abbeville | SC | 1984 | NA |
Aberdeen School Dist | Aberdeen | MS | STILL OPEN | 1969 |
Acadia Parish | Crowley | LA | 1981 | NA |
Affton 101 | St Louis | MO | 1999 | NA |
Alabaster City | Alabaster | AL | STILL OPEN | 1963 |
Rows: 769
Columns: 5
$ District.Name <chr> "Abbeville 60", "Aberdeen School Dist", "Acadia Parish",…
$ City <chr> "Abbeville", "Aberdeen", "Crowley", "St Louis", "Alabast…
$ State <chr> "SC", "MS", "LA", "MO", "AL", "FL", "NC", "TN", "TX", "A…
$ Year.Lifted <chr> "1984", "STILL OPEN", "1981", "1999", "STILL OPEN", "197…
$ Year.Placed <chr> NA, "1969", NA, NA, "1963", NA, NA, "1966", NA, NA, NA, …
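The ifelse() approach can be sketched on a small character vector. The values are a hand-typed sample and the object names are illustrative. Using NA_character_ keeps the result a character vector, which matches the <chr> column type shown above.

```r
# hand-typed sample of a column where missing values are coded as "N/A"
year_placed <- c("N/A", "1969", "N/A", "N/A", "1963")

# where the value is "N/A", use NA; otherwise keep the original value
year_placed_clean <- ifelse(year_placed == "N/A", NA_character_, year_placed)
print(year_placed_clean)

# the column is still character; convert it if numbers are needed
year_placed_num <- as.numeric(year_placed_clean)
print(year_placed_num)
```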
It's important not to ignore missing values when you run calculations with your data. It's so important that R will not let you ignore them:
Calculate the median year a desegregation order was placed:
# attempt to calculate the median year deseg orders were placed
median(deseg_pp_clean_na$Year.Placed)
[1] NA
If there is an NA in the column, mathematical calculations return NA.
To run a calculation with NA values, you will need to include an optional argument found in many R functions:
na.rm = TRUE, which ignores NA values in the operation of that function.
Recalculate the median year a desegregation order was placed:
# calculate the median year deseg orders were placed, ignoring NA values
median(deseg_pp_clean_na$Year.Placed, na.rm = TRUE)
[1] 1969
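The same behavior can be checked on a small vector where the answer is easy to verify by hand. This is a sketch in base R; the object names are illustrative. It also reports the share of missing values, so the effect of na.rm = TRUE is documented rather than hidden.

```r
values <- c(4, 6, 2, 8, NA, 9)

# without na.rm, one NA is enough to make the result NA
median_with_na <- median(values)
print(median_with_na)  # NA

# check how much is missing before deciding to ignore it
share_missing <- mean(is.na(values))
print(share_missing)

# with na.rm = TRUE the NA is dropped and the median of 2, 4, 6, 8, 9 is returned
median_complete <- median(values, na.rm = TRUE)
print(median_complete)  # 6
```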
Don't add na.rm = TRUE to functions the first time you run them. Be sure to first understand the nature of your missing data before you start ignoring it in your calculations.
library(tidyverse)
library(tidycensus)
library(sf)
library(scales)
library(viridis)
# load all acs variables
acs201620 <- load_variables(2020, "acs5", cache = TRUE)
## Import table of PEOPLE REPORTING ANCESTRY: B04006
raw_ancestry <- get_acs(geography = "tract",
                        variables = c(ancestry_pop = "B04006_001",
                                      west_indian = "B04006_094"),
                        state = "NY",
                        county = "Kings",
                        geometry = TRUE,
                        year = 2020,
                        output = "wide")
west_indian <- raw_ancestry |>
  mutate(pct_west_indian = west_indianE / ancestry_popE)
GEOID | NAME | ancestry_popE | ancestry_popM | west_indianE | west_indianM | geometry | pct_west_indian |
---|---|---|---|---|---|---|---|
36047060600 | Census Tract 606, Kings County, New York | 2830 | 443 | 0 | 12 | MULTIPOLYGON (((-73.96035 4… | 0 |
36047005602 | Census Tract 56.02, Kings County, New York | 1787 | 386 | 0 | 12 | MULTIPOLYGON (((-74.03707 4… | 0 |
Explore the values to map them effectively
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.000000 0.003073 0.026961 0.117700 0.180517 0.645272 26
First, look at the rows with NA values.
Use the is.na() function
GEOID | NAME | ancestry_popE | ancestry_popM | west_indianE | west_indianM | geometry | pct_west_indian |
---|---|---|---|---|---|---|---|
36047001804 | Census Tract 18.04, Kings County, New York | 0 | 12 | 0 | 12 | MULTIPOLYGON (((-74.03297 4… | NaN |
36047070602 | Census Tract 706.02, Kings County, New York | 0 | 12 | 0 | 12 | MULTIPOLYGON (((-73.90467 4… | NaN |
36047008600 | Census Tract 86, Kings County, New York | 0 | 12 | 0 | 12 | MULTIPOLYGON (((-74.00566 4… | NaN |
36047040700 | Census Tract 407, Kings County, New York | 0 | 12 | 0 | 12 | MULTIPOLYGON (((-73.90449 4… | NaN |
36047031402 | Census Tract 314.02, Kings County, New York | 0 | 12 | 0 | 12 | MULTIPOLYGON (((-73.9985 40… | NaN |
In this case, the NAs are actually NaNs: these tracts have an estimated population of 0, and dividing 0 by 0 returns NaN.
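A short base R sketch of the difference between NA and NaN, using made-up tract counts that mirror the rows above. is.na() is TRUE for both NA and NaN, while is.nan() is TRUE only for NaN. The guarded version at the end is one possible way to record an explicit NA for empty tracts, not necessarily how the original analysis handled them.

```r
# made-up tract counts: the third tract has no population
ancestry_pop <- c(2830, 1787, 0)
west_indian_count <- c(0, 0, 0)

# 0 / 0 is undefined, so R returns NaN for the empty tract
pct <- west_indian_count / ancestry_pop
print(pct)          # 0 0 NaN

print(is.na(pct))   # FALSE FALSE TRUE  (NaN counts as missing)
print(is.nan(pct))  # FALSE FALSE TRUE
print(is.nan(NA))   # FALSE (a plain NA is not NaN)

# only divide where the denominator is positive; otherwise store NA
pct_guarded <- ifelse(ancestry_pop > 0, west_indian_count / ancestry_pop, NA_real_)
print(pct_guarded)  # 0 0 NA
```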
ggplot(data = west_indian, mapping = aes(fill = pct_west_indian)) +
  geom_sf(color = "#ffffff") +
  theme_void() +
  scale_fill_distiller(breaks = c(0, .2, .4, .6, .8, 1),
                       direction = 1,
                       na.value = "#fafafa",
                       name = "Percent West Indian Ancestry (%)",
                       labels = percent_format(accuracy = 1L)) +
  labs(
    title = "Brooklyn, West Indian Ancestry by Census Tract",
    caption = "Source: American Community Survey, 2016-20"
  )
Not all spatial data comes from the census. To import spatial data (usually shapefiles or GeoJSON files) from any source:
Use st_read() to read it in
NYC sources:
New York City Planning Department has many NYC shapefiles on their Bytes of the Big Apple site
NYC Open Data has tons of data, spatial and tabular
NHGIS, from IPUMS, is another source of spatial census data.
## import borough shapefiles from NYC Open Data
boros <- st_read("part2/data/raw/geo/BoroughBoundaries.geojson")
Reading layer `BoroughBoundaries' from data source
`/Users/sarahodges/Documents/spatial/NewSchool/methods1-materials-fall2024/methods1-slides/part2/data/raw/geo/BoroughBoundaries.geojson'
using driver `GeoJSON'
Simple feature collection with 5 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -74.25559 ymin: 40.49613 xmax: -73.70001 ymax: 40.91553
Geodetic CRS: WGS 84
## import Neighborhood Tabulation Areas for NYC
nabes <- st_read("part2/data/raw/geo/nynta2020_22b/nynta2020.shp")
Reading layer `nynta2020' from data source
`/Users/sarahodges/Documents/spatial/NewSchool/methods1-materials-fall2024/methods1-slides/part2/data/raw/geo/nynta2020_22b/nynta2020.shp'
using driver `ESRI Shapefile'
Simple feature collection with 262 features and 11 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 913175.1 ymin: 120128.4 xmax: 1067383 ymax: 272844.3
Projected CRS: NAD83 / New York Long Island (ftUS)
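Because the file paths above are specific to the course folder, a self-contained sketch can use the sample shapefile that ships with the sf package instead. The nc.shp file and the object names below are stand-ins for your own data. The checks at the end are ones that are often worth running on any newly imported layer.

```r
library(sf)

# path to the sample shapefile bundled with the sf package
nc_path <- system.file("shape/nc.shp", package = "sf")

# quiet = TRUE suppresses the "Reading layer ..." message
nc <- st_read(nc_path, quiet = TRUE)

# quick checks on a newly imported layer
n_features <- nrow(nc)             # number of features
print(n_features)
print(unique(st_geometry_type(nc)))  # geometry type(s) in the layer
print(st_crs(nc)$input)              # coordinate reference system as read
```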
ggplot(data = west_indian, mapping = aes(fill = pct_west_indian)) +
  geom_sf(color = "#ffffff",
          lwd = 0) + # removes the census tract outline
  theme_void() +
  scale_fill_distiller(breaks = c(0, .2, .4, .6, .8, 1),
                       direction = 1,
                       na.value = "transparent",
                       name = "Percent West Indian Ancestry (%)",
                       labels = percent_format(accuracy = 1L)) +
  labs(
    title = "Brooklyn, West Indian Ancestry by Census Tract",
    caption = "Source: American Community Survey, 2016-20"
  ) +
  geom_sf(data = boros, color = "black", fill = NA, lwd = .5)
ggplot(data = west_indian, mapping = aes(fill = pct_west_indian)) +
  geom_sf(color = "#ffffff",
          lwd = 0) +
  theme_void() +
  scale_fill_distiller(breaks = c(0, .2, .4, .6, .8, 1),
                       direction = 1,
                       na.value = "transparent",
                       name = "Percent West Indian Ancestry (%)",
                       labels = percent_format(accuracy = 1L)) +
  labs(
    title = "Brooklyn, West Indian Ancestry by Census Tract",
    caption = "Source: American Community Survey, 2016-20"
  ) +
  geom_sf(data = boros |> filter(boro_name == "Brooklyn"),
          color = "black", fill = NA, lwd = .5)
ggplot(data = west_indian, mapping = aes(fill = pct_west_indian)) +
  geom_sf(color = "#ffffff",
          lwd = 0) +
  theme_void() +
  scale_fill_distiller(breaks = c(0, .2, .4, .6, .8, 1),
                       direction = 1,
                       na.value = "transparent",
                       name = "Percent West Indian Ancestry (%)",
                       labels = percent_format(accuracy = 1L)) +
  labs(
    title = "Brooklyn, West Indian Ancestry by Census Tract",
    caption = "Source: American Community Survey, 2016-20"
  ) +
  geom_sf(data = nabes |> filter(BoroName == "Brooklyn"),
          color = "gray", fill = NA, lwd = 0.25) +
  geom_sf(data = boros |> filter(boro_name == "Brooklyn"),
          color = "black", fill = NA, lwd = .5)
You can use a spatial join from the sf package to identify which neighborhood each census tract is in so you can:
Spatial joins don't need a common ID; the operation joins data based on their spatial relationship.
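To make the idea concrete, here is a toy sketch of st_join() on hand-made geometry. Everything in it (the square() helper, the two "neighborhoods", the three "tracts" drawn as points, and all object names) is hypothetical and only meant to show that rows are matched by location, not by a shared ID column.

```r
library(sf)

# helper (hypothetical) that builds a square polygon from its lower-left corner
square <- function(xmin, ymin, size = 1) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmin + size, ymin),
    c(xmin + size, ymin + size), c(xmin, ymin + size),
    c(xmin, ymin)
  )))
}

# two square "neighborhoods" side by side
toy_nabes <- st_sf(
  nabe_name = c("West Side", "East Side"),
  geometry = st_sfc(square(0, 0), square(1, 0))
)

# three "tracts", drawn as points to keep the example small
toy_tracts <- st_sf(
  tract_id = c("A", "B", "C"),
  geometry = st_sfc(st_point(c(0.5, 0.5)),
                    st_point(c(1.5, 0.5)),
                    st_point(c(1.2, 0.8)))
)

# st_join() keeps every row of the first layer and attaches the attributes
# of the feature in the second layer that it intersects
toy_joined <- st_join(toy_tracts, toy_nabes)
print(st_drop_geometry(toy_joined))
```

With polygon-on-polygon joins, a tract can touch more than one neighborhood and appear in several rows; st_join() has a largest = TRUE argument that keeps only the match with the largest overlap.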
A projection is the set of equations used to translate the round earth onto a flat map.
- Use st_crs() to print each layer's projection in the console
- Use st_transform() to project them
Coordinate Reference System:
User input: NAD83
wkt:
GEOGCRS["NAD83",
DATUM["North American Datum 1983",
ELLIPSOID["GRS 1980",6378137,298.257222101,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["latitude",north,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433]],
AXIS["longitude",east,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4269]]
There is a lot of information about the projection. The key information is in the last line: the EPSG code (4269 here).
Coordinate Reference System:
User input: NAD83 / New York Long Island (ftUS)
wkt:
PROJCRS["NAD83 / New York Long Island (ftUS)",
BASEGEOGCRS["NAD83",
DATUM["North American Datum 1983",
ELLIPSOID["GRS 1980",6378137,298.257222101,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4269]],
CONVERSION["SPCS83 New York Long Island zone (US Survey feet)",
METHOD["Lambert Conic Conformal (2SP)",
ID["EPSG",9802]],
PARAMETER["Latitude of false origin",40.1666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8821]],
PARAMETER["Longitude of false origin",-74,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8822]],
PARAMETER["Latitude of 1st standard parallel",41.0333333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8823]],
PARAMETER["Latitude of 2nd standard parallel",40.6666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8824]],
PARAMETER["Easting at false origin",984250,
LENGTHUNIT["US survey foot",0.304800609601219],
ID["EPSG",8826]],
PARAMETER["Northing at false origin",0,
LENGTHUNIT["US survey foot",0.304800609601219],
ID["EPSG",8827]]],
CS[Cartesian,2],
AXIS["easting (X)",east,
ORDER[1],
LENGTHUNIT["US survey foot",0.304800609601219]],
AXIS["northing (Y)",north,
ORDER[2],
LENGTHUNIT["US survey foot",0.304800609601219]],
USAGE[
SCOPE["Engineering survey, topographic mapping."],
AREA["United States (USA) - New York - counties of Bronx; Kings; Nassau; New York; Queens; Richmond; Suffolk."],
BBOX[40.47,-74.26,41.3,-71.8]],
ID["EPSG",2263]]
If you are working with New York City data, you want the projection to be EPSG:2263. So we'll transform the west_indian census tract data to 2263.
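A minimal sketch of the transformation on a single hand-typed point, since the west_indian object requires a Census API download. The coordinates and the names pt_4269 and pt_2263 are illustrative; the same st_transform() call applies to a full sf data frame.

```r
library(sf)

# one point near Brooklyn in longitude/latitude (NAD83, EPSG:4269)
pt_4269 <- st_sfc(st_point(c(-73.95, 40.65)), crs = 4269)

# project it to NAD83 / New York Long Island (ftUS), EPSG:2263
pt_2263 <- st_transform(pt_4269, 2263)

# compare the EPSG codes before and after
print(st_crs(pt_4269)$epsg)  # 4269
print(st_crs(pt_2263)$epsg)  # 2263

# coordinates are now in feet rather than degrees
print(st_coordinates(pt_2263))
```

Two layers need the same coordinate reference system before a spatial join; `st_crs(a) == st_crs(b)` is a quick way to confirm that.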
Check the projections to make sure it worked!
Coordinate Reference System:
User input: EPSG:2263
wkt:
PROJCRS["NAD83 / New York Long Island (ftUS)",
BASEGEOGCRS["NAD83",
DATUM["North American Datum 1983",
ELLIPSOID["GRS 1980",6378137,298.257222101,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4269]],
CONVERSION["SPCS83 New York Long Island zone (US Survey feet)",
METHOD["Lambert Conic Conformal (2SP)",
ID["EPSG",9802]],
PARAMETER["Latitude of false origin",40.1666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8821]],
PARAMETER["Longitude of false origin",-74,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8822]],
PARAMETER["Latitude of 1st standard parallel",41.0333333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8823]],
PARAMETER["Latitude of 2nd standard parallel",40.6666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8824]],
PARAMETER["Easting at false origin",984250,
LENGTHUNIT["US survey foot",0.304800609601219],
ID["EPSG",8826]],
PARAMETER["Northing at false origin",0,
LENGTHUNIT["US survey foot",0.304800609601219],
ID["EPSG",8827]]],
CS[Cartesian,2],
AXIS["easting (X)",east,
ORDER[1],
LENGTHUNIT["US survey foot",0.304800609601219]],
AXIS["northing (Y)",north,
ORDER[2],
LENGTHUNIT["US survey foot",0.304800609601219]],
USAGE[
SCOPE["Engineering survey, topographic mapping."],
AREA["United States (USA) - New York - counties of Bronx; Kings; Nassau; New York; Queens; Richmond; Suffolk."],
BBOX[40.47,-74.26,41.3,-71.8]],
ID["EPSG",2263]]
GEOID | NAME | ancestry_popE | ancestry_popM | west_indianE | west_indianM | pct_west_indian | BoroCode | BoroName | NTA2020 | NTAName | geometry |
---|---|---|---|---|---|---|---|---|---|---|---|
36047060600 | Census Tract 606, Kings County, New York | 2830 | 443 | 0 | 12 | 0 | 3 | Brooklyn | BK1503 | Sheepshead Bay-Manhattan Beach-Gerritsen Beach | MULTIPOLYGON (((995262.8 15… |
36047005602 | Census Tract 56.02, Kings County, New York | 1787 | 386 | 0 | 12 | 0 | 3 | Brooklyn | BK1001 | Bay Ridge | MULTIPOLYGON (((973958.5 16… |
ggplot(data = west_indian_nabes |>
         filter(NTAName == "Crown Heights (North)"),
       mapping = aes(fill = pct_west_indian)) +
  geom_sf(color = "#ffffff",
          lwd = 0) +
  theme_void() +
  scale_fill_distiller(breaks = c(0, .2, .4, .6, .8, 1),
                       direction = 1,
                       na.value = "transparent",
                       name = "Percent West Indian Ancestry (%)",
                       labels = percent_format(accuracy = 1L)) +
  labs(
    title = "Brooklyn, West Indian Ancestry by Census Tract",
    caption = "Source: American Community Survey, 2016-20"
  ) +
  geom_sf(data = nabes |> filter(NTAName == "Crown Heights (North)"),
          color = "black", fill = NA, lwd = 0.5)
west_indian_nabe_stats <- st_drop_geometry(west_indian_nabes) |>
  group_by(NTAName) |>
  summarise(Borough = first(BoroName),
            `Est. Total Population` = sum(ancestry_popE),
            # sum the estimates (E), not the margins of error (M)
            `Est. Total West Indian Population` = sum(west_indianE)) |>
  mutate(`Est. Percent West Indian Ancestry` =
           percent(`Est. Total West Indian Population` / `Est. Total Population`,
                   accuracy = 1))
NTAName | Borough | Est. Total Population | Est. Total West Indian Population | Est. Percent West Indian Ancestry |
---|---|---|---|---|
Barren Island-Floyd Bennett Field | Brooklyn | 26 | 12 | 46% |
Bath Beach | Brooklyn | 32716 | 324 | 1% |
Bay Ridge | Brooklyn | 80183 | 1437 | 2% |
Bedford-Stuyvesant (East) | Brooklyn | 86869 | 5935 | 7% |
Bedford-Stuyvesant (West) | Brooklyn | 83717 | 4048 | 5% |
Bensonhurst | Brooklyn | 96331 | 756 | 1% |
Borough Park | Brooklyn | 78836 | 299 | 0% |
Brighton Beach | Brooklyn | 29819 | 120 | 0% |
Brooklyn Heights | Brooklyn | 23874 | 383 | 2% |
Brooklyn Navy Yard | Brooklyn | 0 | 12 | Inf |
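The Inf in the last row comes from dividing a positive number by a total population of 0. As a hedged sketch, is.finite() can flag such rows so they are set to NA before the table is formatted. The two rows below are copied from the table above; the column and object names are illustrative.

```r
# two rows copied from the summary table above
nabe_stats <- data.frame(
  NTAName = c("Bay Ridge", "Brooklyn Navy Yard"),
  total_pop = c(80183, 0),
  west_indian_pop = c(1437, 12)
)

# a positive number divided by 0 is Inf in R
nabe_stats$share <- nabe_stats$west_indian_pop / nabe_stats$total_pop
print(nabe_stats$share)

# is.finite() is FALSE for Inf, -Inf, NaN, and NA, so it catches every
# result that came from an unusable denominator
nabe_stats$share_clean <- nabe_stats$share
nabe_stats$share_clean[!is.finite(nabe_stats$share_clean)] <- NA
print(nabe_stats)
```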
Complete the in-class assignment and submit your script to CANVAS.
Make at least 3 census tract-level maps of one neighborhood in NYC. Along with each map, create formatted summary tables that compare your neighborhood with other neighborhoods in the same boro. Upload your script to CANVAS.