Author

Victoria Campbell

Published

May 11, 2025

Explanation of the Project

The True Commuter’s Ride: Investigating the disparaging air quality hazard of traveling through the NYC’S MTA transit system

This project focuses on looking at New York City’s differing incomes and using it as a factor to look at traveling in the MTA. The data gathered by the end of this project overall aims to open a conversation discussing the responsibility of the MTA, the experiences of commuters and how equity in communities have played a role in the concerning status of the affordability of transportation.

Project work consists of:

• A comparison of income per capita to the spatial relation of census tracts to form conclusions on the economy in varying boroughs.

• Finding the stations with the most ridership for observations.

• Additional factor of air quality to further open up the question on equity based on socioeconomic background. Below are the packages used

MTA Station Data

Finding the stations with the most ridership for observations.

Cleaning & Arrangement:

Show the code
#Ridership Weekday 
wkday <- read_excel("D:/Gtech-331/wkdy_ridership.xlsx")

#  select, filter, arrange, slice and clean to get top 10 stations by 2022 ridership
wkday_clean <- wkday %>%
  filter(!is.na(`2022`)) %>%         # Remove rows with missing 2022 values
  filter(!is.na(Station)) %>%        # Remove rows where station name is missing
  filter(!grepl("total|sum|boro", Station, ignore.case = TRUE)) %>%  # Remove summary rows
  select(Station, Boro, `2022`) %>%
  arrange(desc(`2022`)) %>%          
  slice(1:10)                 

#Final view
print(wkday_clean)
# A tibble: 10 × 3
   Station                                                          Boro  `2022`
   <chr>                                                            <chr>  <dbl>
 1 Manhattan                                                        <NA>  1.70e6
 2 Brooklyn                                                         <NA>  7.34e5
 3 Queens                                                           <NA>  4.88e5
 4 Bronx                                                            <NA>  2.63e5
 5 Times Sq-42 St (N,Q,R,S,W,1,2,3,7)/42 St (A,C,E)/Bryant Pk (B,D… M     1.39e5
 6 Grand Central-42 St (S,4,5,6,7)                                  M     7.47e4
 7 34 St-Herald Sq (B,D,F,M,N,Q,R,W)                                M     6.41e4
 8 14 St-Union Sq (L,N,Q,R,W,4,5,6)                                 M     5.41e4
 9 Fulton St (A,C,J,Z,2,3,4,5)                                      M     4.80e4
10 34 St-Penn Station (A,C,E)                                       M     4.51e4

After finding the results, I looked at them separately.

Show the code
#cleaning and Loading in MTA Ridership transit data
RD2022 <- read.csv("RD2022.csv", stringsAsFactors = FALSE)
names(RD2022)<-c("SubwayStation","Lines","Ridership")
RD2022 <- RD2022[-c(1,12), ]
#Adding GeoIDs to the top subway stations
RD2022$GeoID<-c("36061011300","36061007600","36061009200",
                "36061005000","36061011100","36061001501","36081026700",
                "36061011300","36081087100","36061010100")
Show the code
#Adding tracts 
tractsZTA <- st_read("C:/Users/campb/Downloads/Modified Zip Code Tabulation Areas (MODZCTA)_20250516/geo_export_213dae92-dbda-4a0a-bf58-bfee89c37e04.shp")
Reading layer `geo_export_213dae92-dbda-4a0a-bf58-bfee89c37e04' from data source `C:\Users\campb\Downloads\Modified Zip Code Tabulation Areas (MODZCTA)_20250516\geo_export_213dae92-dbda-4a0a-bf58-bfee89c37e04.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 178 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -74.25559 ymin: 40.49612 xmax: -73.70001 ymax: 40.91553
Geodetic CRS:  WGS 84
Show the code
str(tractsZTA)
Classes 'sf' and 'data.frame':  178 obs. of  5 variables:
 $ modzcta : chr  "10001" "10002" "10003" "10026" ...
 $ label   : chr  "10001, 10118" "10002" "10003" "10026" ...
 $ zcta    : chr  "10001, 10119, 10199" "10002" "10003" "10026" ...
 $ pop_est : num  23072 74993 54682 39363 3028 ...
 $ geometry:sfc_MULTIPOLYGON of length 178; first list element: List of 1
  ..$ :List of 1
  .. ..$ : num [1:84, 1:2] -74 -74 -74 -74 -74 ...
  ..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
 - attr(*, "sf_column")= chr "geometry"
 - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA
  ..- attr(*, "names")= chr [1:4] "modzcta" "label" "zcta" "pop_est"

Per Capita:

Show the code
#Loading locational data of all stations.
MTA_SubStations <- st_read("D:/Gtech-331/geo_export_07ac367e-9e83-4387-8cc5-246e23fc1c93.shp")
Reading layer `geo_export_07ac367e-9e83-4387-8cc5-246e23fc1c93' from data source `D:\Gtech-331\geo_export_07ac367e-9e83-4387-8cc5-246e23fc1c93.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 496 features and 0 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -74.25196 ymin: 40.51276 xmax: -73.7554 ymax: 40.90313
CRS:           NA
Show the code
# pure geometry
plot(st_geometry(MTA_SubStations), main='Plotting of MTA Subway Stations')

Show the code
st_crs(MTA_SubStations)
Coordinate Reference System: NA
Show the code
# Load and clean the Income data
Population <- read.csv("IncomeRace.csv")
Population <- Population[, -c(3:5, 7:9, 11:13, 15:17, 19:21)]  # Remove unwanted columns

# Use readr for better parsing if needed
nyc_pop <- read_csv("IncomeRace.csv", lazy = FALSE)
Rows: 2201 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): GEO_ID, Name
dbl (11): GEOID, Total, White, Black, American Indian/Alaska Native, Asian, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Show the code
# Load NYC Census Tracts shapefile
nyc_census_tracts_sf <- st_read("D:/Gtech-331/nyct2020_25a/nyct2020.shp")
Reading layer `nyct2020' from data source `D:\Gtech-331\nyct2020_25a\nyct2020.shp' using driver `ESRI Shapefile'
Simple feature collection with 2325 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 913175.1 ymin: 120128.4 xmax: 1067383 ymax: 272844.3
Projected CRS: NAD83 / New York Long Island (ftUS)
Show the code
# Merge shapefile with population/income data
nyc_sf_merged <- merge(nyc_census_tracts_sf, nyc_pop, by = "GEOID")

# Check for missing or incorrect values in 'Total'
summary(nyc_sf_merged$Total)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0    3351    5261    5885    7705   28806 
Show the code
# Create an interactive map
mapview(nyc_sf_merged, zcol = "Total", legend = TRUE,
        layer.name = "Mean Income per Capita", 
        col.regions = viridisLite::viridis)
Show the code
m <- mapview(nyc_sf_merged, zcol = "Total", legend = TRUE)
saveWidget(m@map, "NYC_Income_Interactive_Map.html", selfcontained = TRUE)

Displaying results:

Show the code
#Finding What is the Mean Per Capita in Each Borough
library(ggplot2)
barplot(nyc_sf_merged$Total, names.arg = nyc_sf_merged$BoroName,
        main = "Total by Borough", xlab = "Borough", ylab = "Per Capita Income")

Show the code
# Specifically, Summarizing total by boroname
Boro_summary <- nyc_sf_merged %>%
  group_by(BoroName) %>%
  summarise(total_sum = sum(Total, na.rm = TRUE))
barplot(Boro_summary$total_sum, names.arg = Boro_summary$BoroName,
        main = "Income Summary by Borough", xlab = "Borough", ylab = "Total Per Capita Income")