Introduction

This vignette demonstrates a complete, reproducible workflow for analyzing NYC census tract demographics in R using spatial data. The analysis moves from importing a GeoPackage layer to producing tract-level racial composition measures (proportions) and concentration measures (densities). The workflow is designed to be defensible, meaning it includes checks that help prevent common sources of error such as missing columns, invalid numeric types, impossible values, and division-by-zero.

NYC is a strong case for tract-level demographic analysis because the city contains extreme variation over short distances. Tracts can shift quickly from high concentration to high fragmentation, which makes it possible to observe how demographic patterns become spatial patterns. Importantly, these patterns are not only descriptive. They reflect how urban space is produced through institutions, housing markets, infrastructure, and neighborhood change processes. Measuring proportions and densities at the tract scale provides a way to operationalize these spatial outcomes.

Before running the code, ensure the GeoPackage is available in the same folder as this project or update the path accordingly.

Conceptual Foundations for This Workflow

This workflow relies on a few core geographic and statistical ideas.

Composition vs Concentration

This workflow relies on several geographic and statistical concepts used to measure demographic patterns across space. Two key measures are used throughout the analysis: population proportion, which represents the share of a demographic group within a tract, and population density, which represents the number of people per unit of land area.

Proportion (composition): group count divided by total population.
- This answers “what share of the tract is this group?”
Density (concentration): group count divided by land area.
- This answers “how many members of this group per square mile?”

These can tell different stories. A tract can have a high proportion of a group but low density if it has low population overall. A tract can have low proportion but high density if the tract is highly populated and the group count is still large.

Why Census Tracts Matter

Census tracts represent standardized geographic units designed to approximate neighborhood scale populations. They are widely used in demographic research because they provide consistent spatial units for statistical comparison across metropolitan regions.

However, tract boundaries also influence statistical results because they aggregate populations into fixed spatial containers. This phenomenon is related to the modifiable areal unit problem, where different boundary definitions can produce different statistical outcomes even when underlying patterns remain the same.

Why Urban Demographic Patterns Cluster

Urban demographic clustering is produced through housing markets, zoning regulations, historical settlement patterns, and economic inequality. When demographic differences appear as spatial clusters on a map, they reflect the cumulative outcome of these institutional and social processes.

This analysis does not directly explain those processes but instead generates standardized measurements of demographic distribution. These measurements can later be used to interpret spatial inequality or neighborhood change dynamics.

Reproducible Reporting and Chunk Behavior

This document uses knitr chunk options to control what appears in the final report. T

echo = TRUE ensures that code is displayed alongside results, which supports transparency and reproducibility.
message = FALSE suppresses package startup messages to reduce visual clutter in the report.
warning = TRUE option ensures that potential issues are still visible when the document is rendered.
include = TRUE option controls whether both code and results appear in the knitted document.

These options ensure that the computational workflow remains visible and verifiable while maintaining a readable report structure.

Load Libraries

The following chunk loads the packages required for spatial analysis and reproducible reporting. Successfully loading these libraries confirms that the computational environment contains the functions needed for the analysis.

library(sf) # ANALYSIS - Simple Features for R - Handling Geographic Data.

## Warning: package 'sf' was built under R version 4.5.2

library(dplyr) # ANALYSIS - A Grammar for Data Manipulation.

## Warning: package 'dplyr' was built under R version 4.5.2

library(sp) # ANALYSIS - Classes and Methods for Spatial Data.
library(rmarkdown) # DELIVERABLE - A Notebook Interface to Produce Elegantly Formatted Output.
library(knitr) # DELIVERABLE - Elegant, Flexible, and Fast Dynamic Report Generation.

The sf package provides tools for working with spatial vector data. In this analysis it is used to read the GeoPackage dataset and manage the geometry column representing census tract boundaries. The st_read() function later converts the spatial file into an sf object that behaves similarly to a data frame while retaining geographic geometry.

The dplyr package supports structured data manipulation. Although this workflow primarily uses base R operations, dplyr enables consistent column handling when working with large spatial attribute tables.

The knitr and rmarkdown packages allow the analysis to be rendered as a reproducible report. These packages control how code chunks, figures, and output are displayed when the document is knitted into the final report.

Successful loading of these packages confirms that the required computational environment is available for the spatial analysis.

Environmental Setup

The Environmental Setup section verifies the file structure, defines the data source, and loads the tract layer. The goal is to stabilize file access during knitting and make the analysis portable across sessions. Setting the knit root directory ensures that relative file references resolve consistently when rendering.

knitr::opts_knit$set(root.dir = "D:/2026_GES_666_Critical_Maps/Lab_1/26.2.12_R_Data/Green_assignment_one", echo = TRUE, include= TRUE)

list.dirs()

##  [1] "."                                                                        
##  [2] "./backups"                                                                
##  [3] "./Data"                                                                   
##  [4] "./Deliverables"                                                           
##  [5] "./images"                                                                 
##  [6] "./Index"                                                                  
##  [7] "./logs"                                                                   
##  [8] "./rsconnect"                                                              
##  [9] "./rsconnect/documents"                                                    
## [10] "./rsconnect/documents/Basic_Statistics_With_Geography.Rmd"                
## [11] "./rsconnect/documents/Basic_Statistics_With_Geography.Rmd/rpubs.com"      
## [12] "./rsconnect/documents/Basic_Statistics_With_Geography.Rmd/rpubs.com/rpubs"

The function knitr::opts_knit$set() establishes a root directory for the knitted document. This allows file paths to remain stable regardless of where the script is executed.

The function list.dirs() prints the folder structure contained within the project directory. This provides verification that the expected directories exist before attempting to load the dataset.

list.files()

##  [1] "backups"                                
##  [2] "Basic_Statistics_With_Geography.html"   
##  [3] "Basic_Statistics_With_Geography.Rmd"    
##  [4] "Data"                                   
##  [5] "Deliverables"                           
##  [6] "Green_basic_stats_v5_student_broken.R.R"
##  [7] "images"                                 
##  [8] "Index"                                  
##  [9] "logs"                                   
## [10] "new_york_city_metro_2022_class.gpkg"    
## [11] "rsconnect"

The function list.files() prints the contents of the working directory. This output confirms that the GeoPackage file is present where the script expects it.

Data Source Identification and Layer Discovery

Before importing the dataset, the workflow first identifies the GeoPackage path and inspects the layers available within it. GeoPackages can contain multiple spatial layers, so confirming the available layer names ensures that the correct dataset is selected.

gpkg_path<-file.path(getwd(), "new_york_city_metro_2022_class.gpkg") 

st_layers(gpkg_path) # OPTIONAL

## Driver: GPKG 
## Available layers:
##                                 layer_name geometry_type features fields
## 1         cb_2022_us_tract510_500k_nyc_v2a Multi Polygon     2323     75
## 2         neighborhood_boundaries_nyc_2019 Multi Polygon      195     11
## 3          neighborhood_locations_nyc_2019         Point      299     20
## 4            cb_2022_us_county510_500k_nyc       Polygon        7     10
## 5 cb_2022_us_tract510_500k_nyc_v2a_updated Multi Polygon     2323     75
##                                crs_name
## 1 North_America_Albers_Equal_Area_Conic
## 2 North_America_Albers_Equal_Area_Conic
## 3 North_America_Albers_Equal_Area_Conic
## 4 North_America_Albers_Equal_Area_Conic
## 5 North_America_Albers_Equal_Area_Conic

Layer inspection matters because GeoPackages can contain multiple datasets and versions. The printed output also confirms the geometry type and projection. The reported projection is an equal-area conic projection, which is appropriate when area-based comparisons are central, especially density computations. If area distortions were large or inconsistent, density values could be biased. This is creating a full file-path by placing new_york_city_metro_2022_class.gpkg at the end of the root.dir, .../Lab_1/26.2.12_R_Data/Green_assignment_one/new_york_city_metro_2022_class.gpkg.

A optional step to check the data here is to use st_layer it can be used to verify what layers exist in the geopakcage. This then prints the available layer names and properties. This can be extremely helpful for the next step in which you input the layer name you will be examining. This is a environmental check and does not change the dataset.

Read the Census Tract Layer

After confirming the available layers, the census tract dataset is imported as a spatial object. In this dataset, the output confirms that 2,323 census tract features and 76 attribute fields were loaded. These values serve as a baseline verification that the dataset imported correctly.

layer_name <- "cb_2022_us_tract510_500k_nyc_v2a"

nyc_trct <- st_read(dsn = gpkg_path, layer = layer_name, quiet= FALSE)

## Reading layer `cb_2022_us_tract510_500k_nyc_v2a' from data source 
##   `D:\2026_GES_666_Critical_Maps\Lab_1\26.2.12_R_Data\Green_assignment_one\new_york_city_metro_2022_class.gpkg' 
##   using driver `GPKG'
## Simple feature collection with 2323 features and 75 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 1718547 ymin: 256711 xmax: 1756674 ymax: 311508
## Projected CRS: North_America_Albers_Equal_Area_Conic

The function st_read() imports spatial vector data into an sf object. This object stores both the attribute table and the polygon geometry representing census tract boundaries.

The printed output from st_read() verifies the number of features, the number of fields, and the coordinate reference system. In this dataset, the output confirms that thousands of census tracts and Fdozens of variables were successfully loaded.

Data Preview and Basic Exploration

Before performing demographic calculations, the dataset must be inspected to confirm that required fields are present and that the spatial structure is valid.

Previewing rows and summarizing attributes allows the dataset to be examined without modifying it. This inspection step verifies that tract identifiers, land area measurements, and population counts exist and contain plausible values before any transformations are applied.

Field Inventory

names(nyc_trct)

##  [1] "OBJECTID_1"   "fid_1"        "STATEFP"      "COUNTYFP"     "TRACTCE"     
##  [6] "GEOID"        "NAME"         "NAMELSAD"     "MTFCC"        "FUNCSTAT"    
## [11] "ALAND"        "AWATER"       "INTPTLAT"     "INTPTLON"     "OBJECTID"    
## [16] "GEO_ID"       "NAME_1"       "STATE_FIPS"   "COUNTY_FIPS"  "TRACT_FIPS"  
## [21] "FIPS_ID"      "DP05_0033E"   "DP05_0037E"   "DP05_0038E"   "DP05_0039E"  
## [26] "DP05_0044E"   "DP05_0052E"   "DP05_0057E"   "DP05_0058E"   "DP05_0072E"  
## [31] "DP05_0073E"   "DP05_0074E"   "DP05_0075E"   "DP05_0076E"   "DP05_0077E"  
## [36] "DP05_0078E"   "DP05_0079E"   "DP05_0080E"   "DP05_0081E"   "DP05_0082E"  
## [41] "DP05_0083E"   "DP05_0084E"   "DP05_0085E"   "Shape_Length" "Shape_Area"  
## [46] "ecmb"         "DP05_3958E"   "DP05_0037P"   "DP05_0038P"   "DP05_0044P"  
## [51] "DP05_3958P"   "DP05_3837R"   "DP05_0077P"   "DP05_0075P"   "DP05_0076P"  
## [56] "DP05_0074P"   "DP05_3937R"   "DP05_395837R" "DP05_7475R"   "DP05_7675R"  
## [61] "DP05_7775R"   "ALAND_SQMI"   "DP05_0037D"   "DP05_0038D"   "DP05_0044D"  
## [66] "DP05_3958D"   "DP05_0074D"   "DP05_0075D"   "DP05_0076D"   "DP05_0077D"  
## [71] "DP05_0037LQ"  "DP05_0038LQ"  "DP05_0044LQ"  "DP05_3958LQ"  "MISS_ZIP"    
## [76] "geom"

The function names() prints the list of column names contained in the dataset. This step verifies that the demographic and geographic variables required for analysis are present.

For this workflow, important fields include:

GEOID : unique census tract identifier
ALAND : tract land area
DP05_0033E : total population
race-specific population counts

The output confirms that these variables exist in the dataset, ensuring the dataset contains the required inputs for the demographic calculations performed later in the workflow.

Row Sampling

head(nyc_trct[, 1:5], 5)

## Simple feature collection with 5 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 1742899 ymin: 288963.7 xmax: 1749236 ymax: 291632.7
## Projected CRS: North_America_Albers_Equal_Area_Conic
##   OBJECTID_1 fid_1 STATEFP COUNTYFP TRACTCE                           geom
## 1          1     1      36      081  045000 MULTIPOLYGON (((1748446 289...
## 2          2     2      36      081  045400 MULTIPOLYGON (((1749170 289...
## 3          3     3      36      081  045500 MULTIPOLYGON (((1743456 291...
## 4          4     4      36      081  045600 MULTIPOLYGON (((1749178 290...
## 5          5     5      36      081  044602 MULTIPOLYGON (((1748945 289...

tail(nyc_trct[, 20:25], 5)

## Simple feature collection with 5 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 1742511 ymin: 287890.7 xmax: 1751010 ymax: 292934.6
## Projected CRS: North_America_Albers_Equal_Area_Conic
##      TRACT_FIPS     FIPS_ID DP05_0033E DP05_0037E DP05_0038E DP05_0039E
## 2319      41100 36081041100       3850        988         50          0
## 2320      41400 36081041400       5054        171       3651          0
## 2321      41500 36081041500       4292       1049        320         66
## 2322      42700 36081042700       4536        550        460          0
## 2323      43400 36081043400       1823         90       1444         33
##                                geom
## 2319 MULTIPOLYGON (((1742958 292...
## 2320 MULTIPOLYGON (((1751010 288...
## 2321 MULTIPOLYGON (((1743636 292...
## 2322 MULTIPOLYGON (((1744002 292...
## 2323 MULTIPOLYGON (((1750861 288...

The functions head() and tail() print subsets of the dataset. This allows inspection of sample rows to confirm that identifiers, numeric values, and geometry columns appear valid. Here we produce a limited result of [, 1:5] and , 20:25], a simplified subset of 5 features of selected fields. This does not change the dataset; it only displays a subset.

A optional step here is also to use the diagnostic tools str() and view(nyc_trct).

This can be used to display the structure of the dataset within the console. str(nyc_trct) confirms there are 2323 features and 76 fields, similar to View(nyc_trct), but this tool also shows data types (numeric, character, sf object) and previews values without printing the full dataset. This does not change the dataset.
View(nyc_trct) is another observational display tool that opens the entire dataset within RStudio. This allows manual inspection of all 2323 census tract features and their 76 fields. This does not alter the database; it is strictly observational.

The summary() function provides descriptive statistics for selected fields and verifies that numeric ranges appear plausible. This means fields such as DP05_0044E will print their minimum, maximum, mean, median, 1st quartile, and 3rd quartile. For character fields such as NAME, summary(nyc_trct) this provides information on length, class, and mode. This tool is used to detect missing values and potential abnormalities.

The range was limited to fields [6–11] to avoid clutter.

Spatial Geometry Visualization

Visualizing the tract boundaries confirms that the geometry column is valid and that the dataset represents the expected geographic area. Plotting the geometry also serves as an early quality check, ensuring that spatial features align properly before demographic calculations or mapping are performed.

plot(st_geometry(nyc_trct))

st_geometry() extracts the geometry column from the sf object. Plotting geometry verifies that the layer renders correctly and that the object is spatially coherent. This is a basic integrity check for geometry. This process pulls the dedicated geometry column, GEOID stored within the object database, nyc_trct. The projection information (e.g., + proj = aea) tells us this dataset is in Albers Equal Area Conic projection.

plot(st_geometry(nyc_trct)) uses the geometry and its coordinate reference system to render the map. The map example is NYC census tracts, the attribute table of this projection is nyc_trct.

Variable Requirements and Data Integrity Checks

Before performing demographic calculations, the dataset must contain the variables required for the analysis. This section explicitly defines the columns needed to compute population proportions and densities. Defining required variables improves reproducibility because the script fails immediately if a required column is missing.

The census tract dataset includes a large number of attributes, but this workflow only uses the fields necessary for population counts and land area measurements. The fields selected represent total population and race-specific population counts from the American Community Survey demographic tables. This section defines the required columns for the analysis and enforces structural completeness.

needed_vars <- c(
  "GEOID",       # Unique Geographic ID
  "NAME",        # Name Identifier
  "ALAND",       # Land Area
  "DP05_0033E",  # total population
  "DP05_0037E",  # white
  "DP05_0038E",  # black
  "DP05_0044E",  # asian
  "DP05_0039E",  # AIAN
  "DP05_0052E",  # NHPI
  "DP05_0057E",  # some other race
  "DP05_0058E" # two or more races
)

missing <- setdiff(needed_vars, names(nyc_trct))

if (length(missing) > 0) {
  stop("Missing required columns: ", paste(missing, collapse = ", "))
}

needed_vars<- needed_vars[needed_vars != "DP03_0062E"]

print(needed_vars) # OPTIONAL

##  [1] "GEOID"      "NAME"       "ALAND"      "DP05_0033E" "DP05_0037E"
##  [6] "DP05_0038E" "DP05_0044E" "DP05_0039E" "DP05_0052E" "DP05_0057E"
## [11] "DP05_0058E"

These are the core variables we will use in this assignment.

The object creates a character vector that stores the exact field names required for this analysis. Defining needed_vars makes the analysis explicit and audible. These include GEOID, NAME, land area ALAND, total population DP05_0033E, and race-specific population counts. Defining this vector ensures that only the intended fields are used in later calculations of proportion and density. This does not change the dataset; it simply defines a reference list of required columns.

The setdiff() check prevents downstream failure by stopping early if required columns are missing. The tool is looking for the fields with missing values and then piping <- that difference into missing.

This step is a structural validation check that is meant to ensure the dataset table is complete before analysis, while still not changing the actual dataset.

If(lengh(..) > 0 ) starts this line by ensuring that all geometry lengths are over 0. If a column comes back as >0 stop("Missing required columns: ", paste(missing, collapse = ", ")) will immediately halt execution and print the name of the missing variables for analysis, this didn’t happen in this instant as the dataset was complete. This step prevents calculations from running without complete data. While this step does not change the dataset, it protects it against errors.

needed_vars <- needed_vars[needed_vars != "DP03_0062E"] removes the DP03_0062E form the needed_vars (if listed above as it will not be used). The removal of DP03_0062E ensures the reference vector matches what will be used later.

It is optional to use print(needed_vars) here as it prints out the list of identifiers used in the dataset, This is used to ensure that DP03_0062E is removed.

Missing Data Diagnostics

Even if required variables exist, they may still contain missing values. Missing data can propagate into later calculations and produce incorrect results. This section counts the number of missing values within each required variable to confirm that the dataset is structurally complete.

na_counts <- c()

for (v in needed_vars) { 
  na_counts[v] <- sum(is.na(nyc_trct[[v]]))
}

na_counts

##      GEOID       NAME      ALAND DP05_0033E DP05_0037E DP05_0038E DP05_0044E 
##          0          0          0          0          0          0          0 
## DP05_0039E DP05_0052E DP05_0057E DP05_0058E 
##          0          0          0          0

The function is.na() tests whether a value is missing, returning TRUE when a value is NA. The function sum() then counts how many TRUE values exist in the vector. The loop iterates across each variable listed in needed_vars, storing the number of missing values for each column.

The printed output provides direct verification of data completeness. In the dataset used for this analysis, the result shows zero missing values across all required demographic variables, confirming that all census tract records contain valid population counts and land area values.

If any variables had contained missing values, the output would display the number of NA values, indicating where the dataset requires cleaning before continuing.

Numeric Field Validation and Type Conversion

Many demographic variables should be stored as numeric values so mathematical calculations can be performed. However, spatial data files sometimes import numeric values as character strings. If this occurs, calculations such as division would fail or produce incorrect results.

This section defines which variables must be numeric and verifies that those variables exist in the dataset.

numeric_vars <- c(
  "GEOID",
  "ALAND",
  "DP05_0033E", 
  "DP05_0037E", 
  "DP05_0038E", 
  "DP05_0039E",
  "DP05_0044E",
  "DP05_0052E", 
  "DP05_0057E", 
  "DP05_0058E"
)

setdiff(numeric_vars, names(nyc_trct))

## character(0)

The setdiff() function again compares the expected numeric variables with the dataset columns. When the output returns character(0), it confirms that all expected variables are present and available for conversion.

Next, the variables are explicitly converted to numeric values.

for (v in numeric_vars) {
  before_na <- sum(is.na(nyc_trct[[v]]))
  nyc_trct[[v]] <- suppressWarnings(as.numeric(nyc_trct[[v]]))
  after_na <- sum(is.na(nyc_trct[[v]]))
    
  if (after_na > before_na) {
    cat("NOTE: Converting", v, "to numeric increased NAs from", before_na, "to", after_na, "\n")
  }
}

The function as.numeric() converts a variable into numeric format. The function suppressWarnings() prevents unnecessary warning messages from appearing during conversion. Before and after conversion, the script counts the number of missing values.

If additional missing values appear after conversion, a message is printed identifying the affected variable. In this dataset, no conversion warnings appear, which confirms that the numeric variables were already stored in appropriate numeric formats.

Integrity Checks

Before calculating demographic proportions and densities, several integrity checks are performed to ensure that the dataset does not contain impossible or logically inconsistent values. These checks act as safeguards against errors that could invalidate the statistical results. Because proportions and densities rely on population counts and land area measurements, any incorrect values in these fields would directly distort the derived indicators.

Therefore, the script verifies that key variables contain valid values before any calculations are performed.

bad_pop <- which(!is.na(nyc_trct$DP05_0033E) & nyc_trct$DP05_0033E < 0)
if (length(bad_pop) > 0) stop("Negative total population found.")

bad_aland <- which(!is.na(nyc_trct$ALAND) & nyc_trct$ALAND < 0)
if (length(bad_aland) > 0) stop("Negative ALAND found.")

if ("DP03_0062E" %in% names(nyc_trct)) {
  bad_inc <- which(!is.na(nyc_trct$DP03_0062E) & nyc_trct$DP03_0062E < 0)
  if (length(bad_inc) > 0) stop("Negative median household income found.")
}

dups <- nyc_trct$GEOID[duplicated(nyc_trct$GEOID)]
if (length(dups) > 0) {
  cat("WARNING: Duplicate GEOID values found (first few):\n")
  print(head(dups, 10))
}

The script first checks whether total population (DP05_0033E) or land area (ALAND) contain negative values, which would be logically impossible for these variables. The function which() identifies rows where these conditions occur, and the stop() function halts the script if any invalid values are detected. The script also conditionally checks median household income (DP03_0062E) to ensure that income values are not negative.

Finally, the function dups() or duplicate() identifies whether any census tract identifiers (GEOID) appear more than once in the dataset. Each tract should have a unique identifier, so duplicate values trigger a warning message. Together, these checks confirm that the dataset is structurally valid before demographic calculations are performed.

Deriving Land Area and Aggregated Race Category

Population density requires land area measured in square miles. However, the ALAND variable in the dataset is recorded in square feet. The next step converts land area into square miles. Because the dataset records land area in square feet, the workflow converts these values into square miles.

This section also creates a combined race category labeled General Other, which aggregates several smaller race categories into a single variable. Aggregating these groups simplifies interpretation while preserving the original population counts.

nyc_trct$ALAND_SQMI <- nyc_trct$ALAND / 27878400

nyc_trct$DP05_3958E <- nyc_trct$DP05_0039E +
  nyc_trct$DP05_0052E +
  nyc_trct$DP05_0057E +
  nyc_trct$DP05_0058E

safe_div <- function(num, den) {
  out <- rep(NA_real_, length(num))
  ok <- !is.na(num) & !is.na(den) & den > 0
  out[ok] <- num[ok] / den[ok]
  out
}

The constant 27,878,400 represents the number of square feet in one square mile. Dividing the land area field by this value produces a new variable called ALAND_SQMI, which represents the tract land area in square miles. This is the given integer of the conversion given from the Attribute Table Directory.

This new variable aggregates American Indian/Alaska Native, Native Hawaiian/Pacific Islander, Some Other Race, and Two or More Races populations. Aggregating smaller categories simplifies later visualization and interpretation.

To prevent divide by zero errors during calculations, the script defines a helper function.

The function safe_div() performs division only when both numerator and denominator are valid and when the denominator is greater than zero. If the denominator is zero or missing, the function returns NA instead of producing infinite values.

Population Proportion Calculations

Population proportion measures the share of each racial group within a census tract. This metric is calculated by dividing the population count of each group by the total tract population. These calculations create new variables that describe the demographic composition of each tract, allowing comparisons between neighborhoods regardless of total population size.

Proportion = Individual Population / Total Population

nyc_trct$DP05_0037P <- safe_div(nyc_trct$DP05_0037E, nyc_trct$DP05_0033E)
nyc_trct$DP05_0038P <- safe_div(nyc_trct$DP05_0038E, nyc_trct$DP05_0033E)
nyc_trct$DP05_0044P <- safe_div(nyc_trct$DP05_0044E, nyc_trct$DP05_0033E)
nyc_trct$DP05_3958P <- safe_div(nyc_trct$DP05_3958E, nyc_trct$DP05_0033E)

Each calculation divides a race-specific population count by the total tract population. Because the safe_div() function prevents invalid division, tracts with zero population produce NA values rather than infinite numbers.

0≤P≤1

These new fields represent population composition. Their values range from 0 to 1, representing the share of the tract population belonging to each group.

DP05_0037P represents the proportion of the population identified as White.
DP05_0038P represents the proportion of the population identified as Black.
DP05_0044P represents the proportion of the population identified as Asian.
DP05_3958P represents the proportion of the combined “General Other” population category (AIAN + NHPI + Some Other Race + Two or More Races) created previously.

These calculations create new proportion fields within the dataset while preserving the original population counts. The dataset structure is expanded for analysis, not overwritten.

Population Density Calculations

Population density is calculated by dividing population counts by tract land area measured in square miles. Density measures provide insight into how concentrated different demographic groups are across space and complement the proportional measures calculated earlier.

Density = Individual Population / Area sq-mile

nyc_trct$DP05_0037D <- safe_div(nyc_trct$DP05_0037E, nyc_trct$ALAND_SQMI)
nyc_trct$DP05_0038D <- safe_div(nyc_trct$DP05_0038E, nyc_trct$ALAND_SQMI)
nyc_trct$DP05_0044D <- safe_div(nyc_trct$DP05_0044E, nyc_trct$ALAND_SQMI)
nyc_trct$DP05_3958D <- safe_div(nyc_trct$DP05_3958E, nyc_trct$ALAND_SQMI)

For example, a value of 5000 in DP05_0037D would indicate that approximately 5,000 White residents live per square mile within that tract.

DP05_0037D represents the density of the White population per square mile.
DP05_0038D represents the density of the Black population per square mile.
DP05_0044D represents the density of the Asian population per square mile.
DP05_3958D represents the density of the combined “General Other” category per square mile.

These calculations add new density fields to the dataset while preserving the original population and land area values. The dataset is expanded for spatial analysis rather than altered.

Distribution of Population Proportions

Histograms are used to visualize how racial composition varies across census tracts. These plots display the distribution of proportion values for each population group. Examining these distributions helps identify clustering patterns and confirms that demographic composition varies across neighborhoods within the city.

hist(nyc_trct$DP05_0037P, breaks = 50, main = "White Population", xlab = "Proportion")

summary(nyc_trct$DP05_0037P)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.1222  0.3180  0.3718  0.6096  1.0000      81

The histogram displays how White population proportions vary across tracts. The summary statistics provide numerical verification of the distribution.

For example, the summary output shows:

Minimum value near 0, indicating tracts with almost no White population
Maximum value near 1, indicating tracts that are almost entirely White
Mean proportion approximately 0.37

This confirms that demographic composition varies widely across NYC neighborhoods.

hist(nyc_trct$DP05_0038P, breaks = 50, main = "Black Population", xlab = "Proportion")

summary(nyc_trct$DP05_0038P)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00000 0.02158 0.10013 0.24031 0.41090 1.00000      81

The summary output here shows a median around 0.10 and a mean around 0.24, indicating that many tracts contain relatively small Black population shares while others contain large concentrations.

hist(nyc_trct$DP05_0044P, breaks = 50, main = "Asian Population", xlab = "Proportion")

summary(nyc_trct$DP05_0044P)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00000 0.02674 0.07944 0.15036 0.21546 0.92321      81

The Asian population distribution shows the median 0.079 is lower then the mean 0.15, indicating concentrations in specific tracts rather than even distribution. This also shows there are specific tracts have a very high proportion representative of clustering.

hist(nyc_trct$DP05_3958P, breaks = 50, main = "Other Population", xlab = "Proportion")

summary(nyc_trct$DP05_3958P)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.1056  0.1866  0.2376  0.3378  0.9104      81

The combined “General Other” category shows moderate clustering with proportions reaching approximately 0.91 in some tracts. The mean and median values indicate that these populations are unevenly distributed across the city.

Updated Layer to the GeoPackage

After calculating new variables such as land area in square miles, racial proportions, and density measures, the dataset stored in memory now contains fields that were not present in the original GeoPackage.

This section writes the updated dataset back into the GeoPackage as a new layer. The suffix _updated is appended to the layer name so that the original dataset remains unchanged while the modified dataset containing the derived variables is preserved.

layer_name_out <- paste0(layer_name, "_updated") # overwrite the original layer
# layer_name_out <- paste0(layer_name, "_updated")  # <- use this instead to preserve original

st_write(
  nyc_trct,
  dsn = gpkg_path,
  layer = layer_name_out,
  delete_layer = TRUE,
  quiet = TRUE
)

cat("\nDONE. Wrote updated layer to GeoPackage:\n", gpkg_path, "\nLayer:", layer_name_out, "\n")

## 
## DONE. Wrote updated layer to GeoPackage:
##  D:/2026_GES_666_Critical_Maps/Lab_1/26.2.12_R_Data/Green_assignment_one/new_york_city_metro_2022_class.gpkg 
## Layer: cb_2022_us_tract510_500k_nyc_v2a_updated

To prevent accidental loss of the original dataset, a new layer name is created by appending _updated to the original layer name. This naming convention ensures that the original tract layer remains unchanged while the modified dataset is stored as a separate layer containing the additional derived variables. The process therefore preserves the raw input data while clearly distinguishing the processed analytical dataset.

The st_write() function then writes the nyc_trct object to the GeoPackage using this new layer name. Because a GeoPackage can contain multiple named layers, this step alters the internal contents of the .gpkg file by either creating or replacing a layer called *_updated. The argument delete_layer = TRUE ensures that if a layer with this name already exists, it will be removed before the new version is written. This prevents conflicts that would otherwise occur when attempting to write a layer that already exists in the file.

Finally, a confirmation message is printed to the console to document that the updated spatial layer was successfully written. This step does not change the data itself but provides verification during execution that the file path and layer name were correctly applied.

Conclusion

This workflow demonstrated a reproducible method for analyzing demographic patterns across New York City census tracts using spatial data in R. The analysis began by verifying the dataset structure, ensuring that required variables existed, that numeric values were properly formatted, and that no missing or invalid values were present. After confirming the integrity of the data, the workflow calculated population proportions and population densities for several racial groups. These measures describe different aspects of demographic distribution. Proportions represent the share of each tract’s population belonging to a specific group, while densities measure how concentrated those populations are across geographic space.

The results show that demographic composition varies widely across census tracts, with some tracts containing very high proportions of certain racial groups and others showing more mixed populations. These differences reflect the spatial clustering commonly observed in urban environments. By documenting each step of the workflow and verifying the dataset before performing calculations, the analysis ensures that the derived indicators are reliable and reproducible. The resulting measures provide a foundation for future spatial analysis, such as mapping demographic patterns or examining broader neighborhood level processes shaping urban population distributions.

Mapping Example

This example map was produced using the updated spatial dataset stored in new_york_city_metro_2022_class_updated.gpkg. Earlier in the workflow, the census tract dataset was written back into the GeoPackage as a new layer with the suffix _updated. This step preserved the original dataset while storing a modified version that included the newly derived variables created during the analysis, such as land area in square miles, population proportions, and population densities.

Because the _updated layer contains these derived variables, it can be used directly for visualization without requiring additional calculations. The example map uses the variable nyc_trct$DP05_0038P, which represents the proportion of Black residents within each census tract. Mapping this proportion allows demographic composition to be visualized spatially, showing how the relative share of the Black population varies across neighborhoods.

Using the _updated dataset ensures that the map is produced from the validated and processed data generated during the workflow. This approach also maintains reproducibility, because the GeoPackage now stores both the original census tract layer and the processed analytical layer used for mapping. The resulting map therefore demonstrates how the derived demographic indicators can be applied to produce spatial visualizations of population patterns across NYC census tracts.

Basic Statistics With Geography

Keaton Green

2026-03-04