This vignette demonstrates a complete, reproducible workflow for analyzing NYC census tract demographics in R using spatial data. The analysis moves from importing a GeoPackage layer to producing tract-level racial composition measures (proportions) and concentration measures (densities). The workflow is designed to be defensible, meaning it includes checks that help prevent common sources of error such as missing columns, invalid numeric types, impossible values, and division-by-zero.
NYC is a strong case for tract-level demographic analysis because the city contains extreme variation over short distances. Tracts can shift quickly from high concentration to high fragmentation, which makes it possible to observe how demographic patterns become spatial patterns. Importantly, these patterns are not only descriptive. They reflect how urban space is produced through institutions, housing markets, infrastructure, and neighborhood change processes. Measuring proportions and densities at the tract scale provides a way to operationalize these spatial outcomes.
Before running the code, ensure the GeoPackage is available in the same folder as this project or update the path accordingly.
This workflow relies on a few core geographic and statistical ideas.
This workflow relies on several geographic and statistical concepts used to measure demographic patterns across space. Two key measures are used throughout the analysis: population proportion, which represents the share of a demographic group within a tract, and population density, which represents the number of people per unit of land area.
Proportion (composition): group count divided by total population.
Density (concentration): group count divided by land area.
These can tell different stories. A tract can have a high proportion of a group but low density if it has low population overall. A tract can have low proportion but high density if the tract is highly populated and the group count is still large.
Census tracts represent standardized geographic units designed to approximate neighborhood scale populations. They are widely used in demographic research because they provide consistent spatial units for statistical comparison across metropolitan regions.
However, tract boundaries also influence statistical results because they aggregate populations into fixed spatial containers. This phenomenon is related to the modifiable areal unit problem, where different boundary definitions can produce different statistical outcomes even when underlying patterns remain the same.
Urban demographic clustering is produced through housing markets, zoning regulations, historical settlement patterns, and economic inequality. When demographic differences appear as spatial clusters on a map, they reflect the cumulative outcome of these institutional and social processes.
This analysis does not directly explain those processes but instead generates standardized measurements of demographic distribution. These measurements can later be used to interpret spatial inequality or neighborhood change dynamics.
This document uses knitr chunk options to control what appears in the final report. T
echo = TRUE ensures that code is displayed alongside
results, which supports transparency and reproducibility.
message = FALSE suppresses package startup messages
to reduce visual clutter in the report.
warning = TRUE option ensures that potential issues
are still visible when the document is rendered.
include = TRUE option controls whether both code and
results appear in the knitted document.
These options ensure that the computational workflow remains visible and verifiable while maintaining a readable report structure.
The following chunk loads the packages required for spatial analysis and reproducible reporting. Successfully loading these libraries confirms that the computational environment contains the functions needed for the analysis.
library(sf) # ANALYSIS - Simple Features for R - Handling Geographic Data.
## Warning: package 'sf' was built under R version 4.5.2
library(dplyr) # ANALYSIS - A Grammar for Data Manipulation.
## Warning: package 'dplyr' was built under R version 4.5.2
library(sp) # ANALYSIS - Classes and Methods for Spatial Data.
library(rmarkdown) # DELIVERABLE - A Notebook Interface to Produce Elegantly Formatted Output.
library(knitr) # DELIVERABLE - Elegant, Flexible, and Fast Dynamic Report Generation.
The sf package provides tools for working with spatial
vector data. In this analysis it is used to read the GeoPackage dataset
and manage the geometry column representing census tract boundaries. The
st_read() function later converts the spatial file into an
sf object that behaves similarly to a data frame while
retaining geographic geometry.
The dplyr package supports structured data manipulation.
Although this workflow primarily uses base R operations,
dplyr enables consistent column handling when working with
large spatial attribute tables.
The knitr and rmarkdown packages allow the
analysis to be rendered as a reproducible report. These packages control
how code chunks, figures, and output are displayed when the document is
knitted into the final report.
Successful loading of these packages confirms that the required computational environment is available for the spatial analysis.
The Environmental Setup section verifies the file structure, defines the data source, and loads the tract layer. The goal is to stabilize file access during knitting and make the analysis portable across sessions. Setting the knit root directory ensures that relative file references resolve consistently when rendering.
knitr::opts_knit$set(root.dir = "D:/2026_GES_666_Critical_Maps/Lab_1/26.2.12_R_Data/Green_assignment_one", echo = TRUE, include= TRUE)
list.dirs()
## [1] "."
## [2] "./backups"
## [3] "./Data"
## [4] "./Deliverables"
## [5] "./images"
## [6] "./Index"
## [7] "./logs"
## [8] "./rsconnect"
## [9] "./rsconnect/documents"
## [10] "./rsconnect/documents/Basic_Statistics_With_Geography.Rmd"
## [11] "./rsconnect/documents/Basic_Statistics_With_Geography.Rmd/rpubs.com"
## [12] "./rsconnect/documents/Basic_Statistics_With_Geography.Rmd/rpubs.com/rpubs"
The function knitr::opts_knit$set() establishes a root
directory for the knitted document. This allows file paths to remain
stable regardless of where the script is executed.
The function list.dirs() prints the folder structure
contained within the project directory. This provides verification that
the expected directories exist before attempting to load the
dataset.
list.files()
## [1] "backups"
## [2] "Basic_Statistics_With_Geography.html"
## [3] "Basic_Statistics_With_Geography.Rmd"
## [4] "Data"
## [5] "Deliverables"
## [6] "Green_basic_stats_v5_student_broken.R.R"
## [7] "images"
## [8] "Index"
## [9] "logs"
## [10] "new_york_city_metro_2022_class.gpkg"
## [11] "rsconnect"
The function list.files() prints the contents of the
working directory. This output confirms that the GeoPackage file is
present where the script expects it.
Before importing the dataset, the workflow first identifies the GeoPackage path and inspects the layers available within it. GeoPackages can contain multiple spatial layers, so confirming the available layer names ensures that the correct dataset is selected.
gpkg_path<-file.path(getwd(), "new_york_city_metro_2022_class.gpkg")
st_layers(gpkg_path) # OPTIONAL
## Driver: GPKG
## Available layers:
## layer_name geometry_type features fields
## 1 cb_2022_us_tract510_500k_nyc_v2a Multi Polygon 2323 75
## 2 neighborhood_boundaries_nyc_2019 Multi Polygon 195 11
## 3 neighborhood_locations_nyc_2019 Point 299 20
## 4 cb_2022_us_county510_500k_nyc Polygon 7 10
## 5 cb_2022_us_tract510_500k_nyc_v2a_updated Multi Polygon 2323 75
## crs_name
## 1 North_America_Albers_Equal_Area_Conic
## 2 North_America_Albers_Equal_Area_Conic
## 3 North_America_Albers_Equal_Area_Conic
## 4 North_America_Albers_Equal_Area_Conic
## 5 North_America_Albers_Equal_Area_Conic
Layer inspection matters because GeoPackages can contain multiple
datasets and versions. The printed output also confirms the geometry
type and projection. The reported projection is an equal-area conic
projection, which is appropriate when area-based comparisons are
central, especially density computations. If area distortions were large
or inconsistent, density values could be biased. This is creating a full
file-path by placing new_york_city_metro_2022_class.gpkg at
the end of the
root.dir, .../Lab_1/26.2.12_R_Data/Green_assignment_one/new_york_city_metro_2022_class.gpkg.
A optional step to check the data here is to use
st_layer it can be used to verify what layers exist in the
geopakcage. This then prints the available layer names and properties.
This can be extremely helpful for the next step in which you input the
layer name you will be examining. This is a environmental check and does
not change the dataset.
After confirming the available layers, the census tract dataset is imported as a spatial object. In this dataset, the output confirms that 2,323 census tract features and 76 attribute fields were loaded. These values serve as a baseline verification that the dataset imported correctly.
layer_name <- "cb_2022_us_tract510_500k_nyc_v2a"
nyc_trct <- st_read(dsn = gpkg_path, layer = layer_name, quiet= FALSE)
## Reading layer `cb_2022_us_tract510_500k_nyc_v2a' from data source
## `D:\2026_GES_666_Critical_Maps\Lab_1\26.2.12_R_Data\Green_assignment_one\new_york_city_metro_2022_class.gpkg'
## using driver `GPKG'
## Simple feature collection with 2323 features and 75 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 1718547 ymin: 256711 xmax: 1756674 ymax: 311508
## Projected CRS: North_America_Albers_Equal_Area_Conic
The function st_read() imports spatial vector data into
an sf object. This object stores both the attribute table and the
polygon geometry representing census tract boundaries.
The printed output from st_read() verifies the number of
features, the number of fields, and the coordinate reference system. In
this dataset, the output confirms that thousands of census tracts and
Fdozens of variables were successfully loaded.
Before performing demographic calculations, the dataset must be inspected to confirm that required fields are present and that the spatial structure is valid.
Previewing rows and summarizing attributes allows the dataset to be examined without modifying it. This inspection step verifies that tract identifiers, land area measurements, and population counts exist and contain plausible values before any transformations are applied.
names(nyc_trct)
## [1] "OBJECTID_1" "fid_1" "STATEFP" "COUNTYFP" "TRACTCE"
## [6] "GEOID" "NAME" "NAMELSAD" "MTFCC" "FUNCSTAT"
## [11] "ALAND" "AWATER" "INTPTLAT" "INTPTLON" "OBJECTID"
## [16] "GEO_ID" "NAME_1" "STATE_FIPS" "COUNTY_FIPS" "TRACT_FIPS"
## [21] "FIPS_ID" "DP05_0033E" "DP05_0037E" "DP05_0038E" "DP05_0039E"
## [26] "DP05_0044E" "DP05_0052E" "DP05_0057E" "DP05_0058E" "DP05_0072E"
## [31] "DP05_0073E" "DP05_0074E" "DP05_0075E" "DP05_0076E" "DP05_0077E"
## [36] "DP05_0078E" "DP05_0079E" "DP05_0080E" "DP05_0081E" "DP05_0082E"
## [41] "DP05_0083E" "DP05_0084E" "DP05_0085E" "Shape_Length" "Shape_Area"
## [46] "ecmb" "DP05_3958E" "DP05_0037P" "DP05_0038P" "DP05_0044P"
## [51] "DP05_3958P" "DP05_3837R" "DP05_0077P" "DP05_0075P" "DP05_0076P"
## [56] "DP05_0074P" "DP05_3937R" "DP05_395837R" "DP05_7475R" "DP05_7675R"
## [61] "DP05_7775R" "ALAND_SQMI" "DP05_0037D" "DP05_0038D" "DP05_0044D"
## [66] "DP05_3958D" "DP05_0074D" "DP05_0075D" "DP05_0076D" "DP05_0077D"
## [71] "DP05_0037LQ" "DP05_0038LQ" "DP05_0044LQ" "DP05_3958LQ" "MISS_ZIP"
## [76] "geom"
The function names() prints the list of column names
contained in the dataset. This step verifies that the demographic and
geographic variables required for analysis are present.
For this workflow, important fields include:
GEOID : unique census tract identifier
ALAND : tract land area
DP05_0033E : total population
race-specific population counts
The output confirms that these variables exist in the dataset, ensuring the dataset contains the required inputs for the demographic calculations performed later in the workflow.
head(nyc_trct[, 1:5], 5)
## Simple feature collection with 5 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 1742899 ymin: 288963.7 xmax: 1749236 ymax: 291632.7
## Projected CRS: North_America_Albers_Equal_Area_Conic
## OBJECTID_1 fid_1 STATEFP COUNTYFP TRACTCE geom
## 1 1 1 36 081 045000 MULTIPOLYGON (((1748446 289...
## 2 2 2 36 081 045400 MULTIPOLYGON (((1749170 289...
## 3 3 3 36 081 045500 MULTIPOLYGON (((1743456 291...
## 4 4 4 36 081 045600 MULTIPOLYGON (((1749178 290...
## 5 5 5 36 081 044602 MULTIPOLYGON (((1748945 289...
tail(nyc_trct[, 20:25], 5)
## Simple feature collection with 5 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 1742511 ymin: 287890.7 xmax: 1751010 ymax: 292934.6
## Projected CRS: North_America_Albers_Equal_Area_Conic
## TRACT_FIPS FIPS_ID DP05_0033E DP05_0037E DP05_0038E DP05_0039E
## 2319 41100 36081041100 3850 988 50 0
## 2320 41400 36081041400 5054 171 3651 0
## 2321 41500 36081041500 4292 1049 320 66
## 2322 42700 36081042700 4536 550 460 0
## 2323 43400 36081043400 1823 90 1444 33
## geom
## 2319 MULTIPOLYGON (((1742958 292...
## 2320 MULTIPOLYGON (((1751010 288...
## 2321 MULTIPOLYGON (((1743636 292...
## 2322 MULTIPOLYGON (((1744002 292...
## 2323 MULTIPOLYGON (((1750861 288...
The functions head() and tail() print
subsets of the dataset. This allows inspection of sample rows to confirm
that identifiers, numeric values, and geometry columns appear valid.
Here we produce a limited result of [, 1:5] and
, 20:25], a simplified subset of 5 features of selected
fields. This does not change the dataset; it only displays a subset.
A optional step here is also to use the diagnostic tools
str() and view(nyc_trct).
str(nyc_trct) confirms there are
2323 features and 76 fields, similar to
View(nyc_trct), but this tool also shows data types
(numeric, character, sf object) and previews values without printing the
full dataset. This does not change the dataset.View(nyc_trct) is another observational display tool
that opens the entire dataset within RStudio. This allows manual
inspection of all 2323 census tract features and their
76 fields. This does not alter the database; it is strictly
observational.The summary() function provides descriptive statistics
for selected fields and verifies that numeric ranges appear plausible.
This means fields such as DP05_0044E will print their
minimum, maximum, mean, median, 1st quartile, and 3rd quartile. For
character fields such as NAME,
summary(nyc_trct) this provides information on length,
class, and mode. This tool is used to detect missing values and
potential abnormalities.
The range was limited to fields [6–11] to avoid
clutter.
Visualizing the tract boundaries confirms that the geometry column is valid and that the dataset represents the expected geographic area. Plotting the geometry also serves as an early quality check, ensuring that spatial features align properly before demographic calculations or mapping are performed.
plot(st_geometry(nyc_trct))
st_geometry() extracts the geometry column from the
sf object. Plotting geometry verifies that the layer
renders correctly and that the object is spatially coherent. This is a
basic integrity check for geometry. This process pulls the dedicated
geometry column, GEOID stored within the object database,
nyc_trct. The projection information
(e.g., + proj = aea) tells us this dataset is in Albers
Equal Area Conic projection.
plot(st_geometry(nyc_trct)) uses the geometry and its
coordinate reference system to render the map. The map example is NYC
census tracts, the attribute table of this projection is
nyc_trct.
Before performing demographic calculations, the dataset must contain the variables required for the analysis. This section explicitly defines the columns needed to compute population proportions and densities. Defining required variables improves reproducibility because the script fails immediately if a required column is missing.
The census tract dataset includes a large number of attributes, but this workflow only uses the fields necessary for population counts and land area measurements. The fields selected represent total population and race-specific population counts from the American Community Survey demographic tables. This section defines the required columns for the analysis and enforces structural completeness.
needed_vars <- c(
"GEOID", # Unique Geographic ID
"NAME", # Name Identifier
"ALAND", # Land Area
"DP05_0033E", # total population
"DP05_0037E", # white
"DP05_0038E", # black
"DP05_0044E", # asian
"DP05_0039E", # AIAN
"DP05_0052E", # NHPI
"DP05_0057E", # some other race
"DP05_0058E" # two or more races
)
missing <- setdiff(needed_vars, names(nyc_trct))
if (length(missing) > 0) {
stop("Missing required columns: ", paste(missing, collapse = ", "))
}
needed_vars<- needed_vars[needed_vars != "DP03_0062E"]
print(needed_vars) # OPTIONAL
## [1] "GEOID" "NAME" "ALAND" "DP05_0033E" "DP05_0037E"
## [6] "DP05_0038E" "DP05_0044E" "DP05_0039E" "DP05_0052E" "DP05_0057E"
## [11] "DP05_0058E"
These are the core variables we will use in this assignment.
The object creates a character vector that stores the exact field
names required for this analysis. Defining needed_vars
makes the analysis explicit and audible. These include
GEOID, NAME, land area ALAND,
total population DP05_0033E, and race-specific population
counts. Defining this vector ensures that only the intended fields are
used in later calculations of proportion and density. This does not
change the dataset; it simply defines a reference list of required
columns.
The setdiff() check prevents downstream failure by
stopping early if required columns are missing. The tool is looking for
the fields with missing values and then piping <- that
difference into missing.
This step is a structural validation check that is meant to ensure the dataset table is complete before analysis, while still not changing the actual dataset.
If(lengh(..) > 0 ) starts this line by ensuring that
all geometry lengths are over 0. If a column comes back as
>0
stop("Missing required columns: ", paste(missing, collapse = ", "))
will immediately halt execution and print the name of the missing
variables for analysis, this didn’t happen in this instant as the
dataset was complete. This step prevents calculations from running
without complete data. While this step does not change the dataset, it
protects it against errors.
needed_vars <- needed_vars[needed_vars != "DP03_0062E"]
removes the DP03_0062E form the needed_vars
(if listed above as it will not be used). The removal of
DP03_0062E ensures the reference vector matches what will
be used later.
It is optional to use print(needed_vars) here as it prints out
the list of identifiers used in the dataset, This is used to ensure that
DP03_0062E is removed.
Even if required variables exist, they may still contain missing values. Missing data can propagate into later calculations and produce incorrect results. This section counts the number of missing values within each required variable to confirm that the dataset is structurally complete.
na_counts <- c()
for (v in needed_vars) {
na_counts[v] <- sum(is.na(nyc_trct[[v]]))
}
na_counts
## GEOID NAME ALAND DP05_0033E DP05_0037E DP05_0038E DP05_0044E
## 0 0 0 0 0 0 0
## DP05_0039E DP05_0052E DP05_0057E DP05_0058E
## 0 0 0 0
The function is.na() tests whether a value is missing,
returning TRUE when a value is NA. The function sum() then
counts how many TRUE values exist in the vector. The loop iterates
across each variable listed in needed_vars, storing the
number of missing values for each column.
The printed output provides direct verification of data completeness. In the dataset used for this analysis, the result shows zero missing values across all required demographic variables, confirming that all census tract records contain valid population counts and land area values.
If any variables had contained missing values, the output would display the number of NA values, indicating where the dataset requires cleaning before continuing.
Many demographic variables should be stored as numeric values so mathematical calculations can be performed. However, spatial data files sometimes import numeric values as character strings. If this occurs, calculations such as division would fail or produce incorrect results.
This section defines which variables must be numeric and verifies that those variables exist in the dataset.
numeric_vars <- c(
"GEOID",
"ALAND",
"DP05_0033E",
"DP05_0037E",
"DP05_0038E",
"DP05_0039E",
"DP05_0044E",
"DP05_0052E",
"DP05_0057E",
"DP05_0058E"
)
setdiff(numeric_vars, names(nyc_trct))
## character(0)
The setdiff() function again compares the expected
numeric variables with the dataset columns. When the output returns
character(0), it confirms that all expected variables are
present and available for conversion.
Next, the variables are explicitly converted to numeric values.
for (v in numeric_vars) {
before_na <- sum(is.na(nyc_trct[[v]]))
nyc_trct[[v]] <- suppressWarnings(as.numeric(nyc_trct[[v]]))
after_na <- sum(is.na(nyc_trct[[v]]))
if (after_na > before_na) {
cat("NOTE: Converting", v, "to numeric increased NAs from", before_na, "to", after_na, "\n")
}
}
The function as.numeric() converts a variable into
numeric format. The function suppressWarnings() prevents
unnecessary warning messages from appearing during conversion. Before
and after conversion, the script counts the number of missing
values.
If additional missing values appear after conversion, a message is printed identifying the affected variable. In this dataset, no conversion warnings appear, which confirms that the numeric variables were already stored in appropriate numeric formats.
Before calculating demographic proportions and densities, several integrity checks are performed to ensure that the dataset does not contain impossible or logically inconsistent values. These checks act as safeguards against errors that could invalidate the statistical results. Because proportions and densities rely on population counts and land area measurements, any incorrect values in these fields would directly distort the derived indicators.
Therefore, the script verifies that key variables contain valid values before any calculations are performed.
bad_pop <- which(!is.na(nyc_trct$DP05_0033E) & nyc_trct$DP05_0033E < 0)
if (length(bad_pop) > 0) stop("Negative total population found.")
bad_aland <- which(!is.na(nyc_trct$ALAND) & nyc_trct$ALAND < 0)
if (length(bad_aland) > 0) stop("Negative ALAND found.")
if ("DP03_0062E" %in% names(nyc_trct)) {
bad_inc <- which(!is.na(nyc_trct$DP03_0062E) & nyc_trct$DP03_0062E < 0)
if (length(bad_inc) > 0) stop("Negative median household income found.")
}
dups <- nyc_trct$GEOID[duplicated(nyc_trct$GEOID)]
if (length(dups) > 0) {
cat("WARNING: Duplicate GEOID values found (first few):\n")
print(head(dups, 10))
}
The script first checks whether total population
(DP05_0033E) or land area (ALAND) contain
negative values, which would be logically impossible for these
variables. The function which() identifies rows where these
conditions occur, and the stop() function halts the script
if any invalid values are detected. The script also conditionally checks
median household income (DP03_0062E) to ensure that income
values are not negative.
Finally, the function dups() or duplicate()
identifies whether any census tract identifiers (GEOID)
appear more than once in the dataset. Each tract should have a unique
identifier, so duplicate values trigger a warning message. Together,
these checks confirm that the dataset is structurally valid before
demographic calculations are performed.
Population density requires land area measured in square miles.
However, the ALAND variable in the dataset is recorded in
square feet. The next step converts land area into square miles. Because
the dataset records land area in square feet, the workflow converts
these values into square miles.
This section also creates a combined race category labeled
General Other, which aggregates several smaller race
categories into a single variable. Aggregating these groups simplifies
interpretation while preserving the original population counts.
nyc_trct$ALAND_SQMI <- nyc_trct$ALAND / 27878400
nyc_trct$DP05_3958E <- nyc_trct$DP05_0039E +
nyc_trct$DP05_0052E +
nyc_trct$DP05_0057E +
nyc_trct$DP05_0058E
safe_div <- function(num, den) {
out <- rep(NA_real_, length(num))
ok <- !is.na(num) & !is.na(den) & den > 0
out[ok] <- num[ok] / den[ok]
out
}
The constant 27,878,400 represents the number of
square feet in one square mile. Dividing the land area field by this
value produces a new variable called ALAND_SQMI, which
represents the tract land area in square miles. This is the given
integer of the conversion given from the Attribute Table Directory.
This new variable aggregates American Indian/Alaska Native, Native Hawaiian/Pacific Islander, Some Other Race, and Two or More Races populations. Aggregating smaller categories simplifies later visualization and interpretation.
To prevent divide by zero errors during calculations, the script defines a helper function.
The function safe_div() performs division only when both
numerator and denominator are valid and when the denominator is greater
than zero. If the denominator is zero or missing, the function returns
NA instead of producing infinite values.
Population proportion measures the share of each racial group within a census tract. This metric is calculated by dividing the population count of each group by the total tract population. These calculations create new variables that describe the demographic composition of each tract, allowing comparisons between neighborhoods regardless of total population size.
Proportion = Individual Population / Total Population
nyc_trct$DP05_0037P <- safe_div(nyc_trct$DP05_0037E, nyc_trct$DP05_0033E)
nyc_trct$DP05_0038P <- safe_div(nyc_trct$DP05_0038E, nyc_trct$DP05_0033E)
nyc_trct$DP05_0044P <- safe_div(nyc_trct$DP05_0044E, nyc_trct$DP05_0033E)
nyc_trct$DP05_3958P <- safe_div(nyc_trct$DP05_3958E, nyc_trct$DP05_0033E)
Each calculation divides a race-specific population count by the
total tract population. Because the safe_div() function
prevents invalid division, tracts with zero population produce NA values
rather than infinite numbers.
0≤P≤1
These new fields represent population composition. Their values range from 0 to 1, representing the share of the tract population belonging to each group.
DP05_0037P represents the proportion of the
population identified as White.
DP05_0038P represents the proportion of the
population identified as Black.
DP05_0044P represents the proportion of the
population identified as Asian.
DP05_3958P represents the proportion of the combined
“General Other” population category
(AIAN + NHPI + Some Other Race + Two or More Races) created
previously.
These calculations create new proportion fields within the dataset while preserving the original population counts. The dataset structure is expanded for analysis, not overwritten.
Population density is calculated by dividing population counts by tract land area measured in square miles. Density measures provide insight into how concentrated different demographic groups are across space and complement the proportional measures calculated earlier.
Density = Individual Population / Area sq-mile
nyc_trct$DP05_0037D <- safe_div(nyc_trct$DP05_0037E, nyc_trct$ALAND_SQMI)
nyc_trct$DP05_0038D <- safe_div(nyc_trct$DP05_0038E, nyc_trct$ALAND_SQMI)
nyc_trct$DP05_0044D <- safe_div(nyc_trct$DP05_0044E, nyc_trct$ALAND_SQMI)
nyc_trct$DP05_3958D <- safe_div(nyc_trct$DP05_3958E, nyc_trct$ALAND_SQMI)
For example, a value of 5000 in DP05_0037D would
indicate that approximately 5,000 White residents live per square mile
within that tract.
DP05_0037D represents the density of the White population per square mile.
DP05_0038D represents the density of the Black population per square mile.
DP05_0044D represents the density of the Asian population per square mile.
DP05_3958D represents the density of the combined “General Other” category per square mile.
These calculations add new density fields to the dataset while preserving the original population and land area values. The dataset is expanded for spatial analysis rather than altered.
Histograms are used to visualize how racial composition varies across census tracts. These plots display the distribution of proportion values for each population group. Examining these distributions helps identify clustering patterns and confirms that demographic composition varies across neighborhoods within the city.
hist(nyc_trct$DP05_0037P, breaks = 50, main = "White Population", xlab = "Proportion")
summary(nyc_trct$DP05_0037P)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.1222 0.3180 0.3718 0.6096 1.0000 81
The histogram displays how White population proportions vary across tracts. The summary statistics provide numerical verification of the distribution.
For example, the summary output shows:
Minimum value near 0, indicating tracts with almost no White population
Maximum value near 1, indicating tracts that are almost entirely White
Mean proportion approximately 0.37
This confirms that demographic composition varies widely across NYC neighborhoods.
hist(nyc_trct$DP05_0038P, breaks = 50, main = "Black Population", xlab = "Proportion")
summary(nyc_trct$DP05_0038P)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00000 0.02158 0.10013 0.24031 0.41090 1.00000 81
The summary output here shows a median around 0.10 and a mean around 0.24, indicating that many tracts contain relatively small Black population shares while others contain large concentrations.
hist(nyc_trct$DP05_0044P, breaks = 50, main = "Asian Population", xlab = "Proportion")
summary(nyc_trct$DP05_0044P)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00000 0.02674 0.07944 0.15036 0.21546 0.92321 81
The Asian population distribution shows the median 0.079 is lower then the mean 0.15, indicating concentrations in specific tracts rather than even distribution. This also shows there are specific tracts have a very high proportion representative of clustering.
hist(nyc_trct$DP05_3958P, breaks = 50, main = "Other Population", xlab = "Proportion")
summary(nyc_trct$DP05_3958P)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.1056 0.1866 0.2376 0.3378 0.9104 81
The combined “General Other” category shows moderate clustering with proportions reaching approximately 0.91 in some tracts. The mean and median values indicate that these populations are unevenly distributed across the city.
After calculating new variables such as land area in square miles, racial proportions, and density measures, the dataset stored in memory now contains fields that were not present in the original GeoPackage.
This section writes the updated dataset back into the GeoPackage as a
new layer. The suffix _updated is appended to the layer
name so that the original dataset remains unchanged while the modified
dataset containing the derived variables is preserved.
layer_name_out <- paste0(layer_name, "_updated") # overwrite the original layer
# layer_name_out <- paste0(layer_name, "_updated") # <- use this instead to preserve original
st_write(
nyc_trct,
dsn = gpkg_path,
layer = layer_name_out,
delete_layer = TRUE,
quiet = TRUE
)
cat("\nDONE. Wrote updated layer to GeoPackage:\n", gpkg_path, "\nLayer:", layer_name_out, "\n")
##
## DONE. Wrote updated layer to GeoPackage:
## D:/2026_GES_666_Critical_Maps/Lab_1/26.2.12_R_Data/Green_assignment_one/new_york_city_metro_2022_class.gpkg
## Layer: cb_2022_us_tract510_500k_nyc_v2a_updated
To prevent accidental loss of the original dataset, a new layer name
is created by appending _updated to the original layer
name. This naming convention ensures that the original tract layer
remains unchanged while the modified dataset is stored as a separate
layer containing the additional derived variables. The process therefore
preserves the raw input data while clearly distinguishing the processed
analytical dataset.
The st_write() function then writes the
nyc_trct object to the GeoPackage using this new layer
name. Because a GeoPackage can contain multiple named layers, this step
alters the internal contents of the .gpkg file by either
creating or replacing a layer called *_updated. The
argument delete_layer = TRUE ensures that if a layer with
this name already exists, it will be removed before the new version is
written. This prevents conflicts that would otherwise occur when
attempting to write a layer that already exists in the file.
Finally, a confirmation message is printed to the console to document that the updated spatial layer was successfully written. This step does not change the data itself but provides verification during execution that the file path and layer name were correctly applied.
This workflow demonstrated a reproducible method for analyzing demographic patterns across New York City census tracts using spatial data in R. The analysis began by verifying the dataset structure, ensuring that required variables existed, that numeric values were properly formatted, and that no missing or invalid values were present. After confirming the integrity of the data, the workflow calculated population proportions and population densities for several racial groups. These measures describe different aspects of demographic distribution. Proportions represent the share of each tract’s population belonging to a specific group, while densities measure how concentrated those populations are across geographic space.
The results show that demographic composition varies widely across census tracts, with some tracts containing very high proportions of certain racial groups and others showing more mixed populations. These differences reflect the spatial clustering commonly observed in urban environments. By documenting each step of the workflow and verifying the dataset before performing calculations, the analysis ensures that the derived indicators are reliable and reproducible. The resulting measures provide a foundation for future spatial analysis, such as mapping demographic patterns or examining broader neighborhood level processes shaping urban population distributions.
This example map was produced using the updated spatial dataset
stored in new_york_city_metro_2022_class_updated.gpkg.
Earlier in the workflow, the census tract dataset was written back into
the GeoPackage as a new layer with the suffix _updated.
This step preserved the original dataset while storing a modified
version that included the newly derived variables created during the
analysis, such as land area in square miles, population proportions, and
population densities.
Because the _updated layer contains these derived
variables, it can be used directly for visualization without requiring
additional calculations. The example map uses the variable
nyc_trct$DP05_0038P, which represents the proportion of
Black residents within each census tract. Mapping this proportion allows
demographic composition to be visualized spatially, showing how the
relative share of the Black population varies across neighborhoods.
Using the _updated dataset ensures that the map is
produced from the validated and processed data generated during the
workflow. This approach also maintains reproducibility, because the
GeoPackage now stores both the original census tract layer and the
processed analytical layer used for mapping. The resulting map therefore
demonstrates how the derived demographic indicators can be applied to
produce spatial visualizations of population patterns across NYC census
tracts.