MTA R Progress Report Assignment

Author

Victoria Campbell

Published

May 2, 2025

Project Thesis

The NYC MTA system prides itself in being the largest transportation system in North America, serving approximately 15 million people a day, meaning it serves as a general example that other developing cities around America may emulate.. As commuters travel to and through subway stations, there has been growing commentary on the extreme conditions experienced in many. New York City has begun taking legislative measures to address the overall influx of heat and debilitating air quality through infrastructure and development around many areas, but there has been little to no action being taken regarding such factors and the MTA. Research has shown that not only have many of the train stations experienced serious extreme heat, but the air quality has also played a part in creating such an unappealing experience for travelers. Some scientists have begun looking into how this is affecting the health of all who enter these stations. The data gathered by the end of this project overall aims to prove the longer you are traveling, the more you will be exposed to the hazardous air quality. R is used to find the trends of the air quality, relate it to specific MTA stations and using income as a variable to show a disparity through background.

Packages

The following packages were used for this project.

Part 1: MTA Data Analysis

The First part of this Project focused on cleaning and examining the MTA Data. It was cleaned and listed by the 2022 Ridership data, sourced from the MTA’s open data. Limitation: GeoIDs had to be manually added because it was not provided. I used the Census Geocorder to find for each.

###cleaning and Loading in MTA Ridership transit data

Show the code
RD2022 <- read.csv("D:/Gtech-331/RD2022.csv", stringsAsFactors = FALSE)
names(RD2022)<-c("SubwayStation","Lines","Ridership")
RD2022 <- RD2022[-c(1,12), ]
#Adding GeoIDs to the top subway stations
RD2022$GeoID<-c("36061011300","36061007600","36061009200",
                "36061005000","36061011100","36061001501","36081026700",
                "36061011300","36081087100","36061010100")

###Understanding and Visualizing the MTA Transit Originally, I hoped to create a chloropeth map to show that the busiest train systems had the worst surrounding air quality. To do this, I loaded in a shape file of all the MTA Subway Stations.

Limitations: Point Values of the Stations did not include GeoIDs, only their coordinates.

Show the code
#Loading locational data of all stations.
MTA_SubStations <- st_read("D:/Gtech-331/geo_export_07ac367e-9e83-4387-8cc5-246e23fc1c93.shp")
Reading layer `geo_export_07ac367e-9e83-4387-8cc5-246e23fc1c93' from data source `D:\Gtech-331\geo_export_07ac367e-9e83-4387-8cc5-246e23fc1c93.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 496 features and 0 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -74.25196 ymin: 40.51276 xmax: -73.7554 ymax: 40.90313
CRS:           NA
Show the code
# pure geometry
plot(st_geometry(MTA_SubStations), main='Plotting of MTA Subway Stations')

Show the code
png("mta_substations_geometry.png", width = 800, height = 600)
plot(st_geometry(MTA_SubStations), main = "Plotting of MTA Subway Stations")
dev.off()
png 
  2 

Part 2: Census Tract Data Manipulation

Once the MTA Data was cleaned, I decided to work with the census data, loading it in as it will be my main point to relate/load my dataframes into.

Show the code
#Loading and census tract in.
nyc_census_tracts_sf <-st_read("D:/Gtech-331/nyct2020_25a/nyct2020.shp")
Reading layer `nyct2020' from data source `D:\Gtech-331\nyct2020_25a\nyct2020.shp' using driver `ESRI Shapefile'
Simple feature collection with 2325 features and 14 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 913175.1 ymin: 120128.4 xmax: 1067383 ymax: 272844.3
Projected CRS: NAD83 / New York Long Island (ftUS)

Part 3: Demographic Data

In my project thesis, I discussed how I believe that people with a “worse: socioeconomic background would be experiencing bad air quality in the MTA longer because they have to travel farther but I did not know if I wanted to base this on income or race. As a result, I decided to look into both for better comparison. (This part is still ongoing as the data I worked with is incorrect. This data shows income per Capita based on race.)

NOTE: The darker the color, the lower the income of that census tract. Total refers to the total income per Capita of all races (combined)

Show the code
#Loading and Cleaning of Income (Per Mean Capita) data of New York City
Population <-read.csv("D:/Gtech-331/IncomeRace.csv")
Population <- Population[,-c(3:5,7:9,11:13,15:17,19:21)]
nyc_pop <- readr::read_csv("D:/Gtech-331/IncomeRace.csv",lazy = FALSE)
Rows: 2201 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): GEO_ID, Name
dbl (11): GEOID, Total, White, Black, American Indian/Alaska Native, Asian, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Show the code
#To join the Income data frame with Census Tracts
nyc_sf_merged <- base::merge(nyc_census_tracts_sf, nyc_pop, by.x = "GEOID", by.y = "GEOID")
names(nyc_sf_merged) 
 [1] "GEOID"                                     
 [2] "CTLabel"                                   
 [3] "BoroCode"                                  
 [4] "BoroName"                                  
 [5] "CT2020"                                    
 [6] "BoroCT2020"                                
 [7] "CDEligibil"                                
 [8] "NTAName"                                   
 [9] "NTA2020"                                   
[10] "CDTA2020"                                  
[11] "CDTANAME"                                  
[12] "PUMA"                                      
[13] "Shape_Leng"                                
[14] "Shape_Area"                                
[15] "GEO_ID"                                    
[16] "Name"                                      
[17] "Total"                                     
[18] "White"                                     
[19] "Black"                                     
[20] "American Indian/Alaska Native"             
[21] "Asian"                                     
[22] "Native Hawaiian and Other Pacific Islander"
[23] "Other"                                     
[24] "Two or more races"                         
[25] "Hispanic or Latino origin (of any race)"   
[26] "White alone, not Hispanic or Latino"       
[27] "geometry"                                  
Show the code
#Loading the Map
mapview(nyc_sf_merged, zcol=c('Total'))

I realized that this data would be better formatted to be discussed later on the disparity of income in relation to the subway stations, once I am able to successfully combine them. For now, I decided to create plots of the data to see the difference amongst the boroughs.

Show the code
#Finding What is the Mean Per Capita in Each Borough
library(ggplot2)
barplot0<-barplot(nyc_sf_merged$Total, names.arg = nyc_sf_merged$BoroName,
        main = "Total by Borough", xlab = "Borough", ylab = "Per Capita Income")

Show the code
png("barplot0_total_by_borough.png", width = 800, height = 600)
barplot0 <- barplot(nyc_sf_merged$Total, names.arg = nyc_sf_merged$BoroName,
        main = "Total by Borough", xlab = "Borough", ylab = "Per Capita Income")
dev.off()
png 
  2 
Show the code
# Specifically, Summarizing total by boroname
Boro_summary <- nyc_sf_merged %>%
  group_by(BoroName) %>%
  summarise(total_sum = sum(Total, na.rm = TRUE))
barplot1<- barplot(Boro_summary$total_sum, names.arg = Boro_summary$BoroName,
        main = "Income Summary by Borough", xlab = "Borough", ylab = "Total Per Capita Income")

Show the code
png("barplot1_income_summary_by_borough.png", width = 800, height = 600)
barplot1 <- barplot(Boro_summary$total_sum, names.arg = Boro_summary$BoroName,
        main = "Income Summary by Borough", xlab = "Borough", ylab = "Total Per Capita Income")
dev.off()
png 
  2 

Based on the graphs: I believe there is a difference being seen in the trend of the Data because Brooklyn has the highest population, followed by Queens etc.It shows the difference in trend which will be important to discuss later.

Part 4a: Air Quality Data

The air quality Data posed the hardest as it was raster data. I tried (unsuccessfully) to convert it using ArcGIS. This is ongoing but the results I have created so far is still promising.

Show the code
#Loading Air Quality Raster Data of New York City
#It is found that Manhattan has the strongest value of 2.5 pm.
class(volcano)
[1] "matrix" "array" 
Show the code
AirQ <- raster("D:/Gtech-331/aa14_pm300m")
plot((AirQ), main='Predicted annual avg. fine particulate matter 
<2.5 microns, Dec 2021-Dec 2022')

Show the code
#Histograms in Raster Data shows the strongest values and outliers.
#While not neccessarily giving specific locational data, it reflects the density data based on known geographic knowledge of NYC.
hist(AirQ)

Show the code
nlayers(AirQ)
[1] 1

Part 4b: Air Quality Data contd.

Below is another code I tried to use to get the Air Quality Data.

Show the code
#Loading in Air Quality Data
nyc_aq <- readr::read_csv("D:/Gtech-331/NYC_EH Data_Portal-Fine particles(PM 2.5).csv")
Rows: 135 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): GeoType, Geography, Borough
dbl (7): TimePeriod, Fips, GeoRank, Annual mean mcg/m3, Summer mean mcg/m3, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Show the code
str(nyc_aq)
spc_tbl_ [135 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ TimePeriod        : num [1:135] 2022 2022 2022 2022 2022 ...
 $ GeoType           : chr [1:135] "CD" "CD" "CD" "CD" ...
 $ Fips              : num [1:135] 101 102 103 104 105 106 107 108 109 110 ...
 $ GeoRank           : num [1:135] 6 6 6 6 6 6 6 6 6 6 ...
 $ Geography         : chr [1:135] "Financial District (CD1)" "Greenwich Village and Soho (CD2)" "Lower East Side and Chinatown (CD3)" "Clinton and Chelsea (CD4)" ...
 $ Annual mean mcg/m3: num [1:135] 7.2 8.4 7.2 7.8 9.1 7.5 6.1 6.4 6.2 6.2 ...
 $ Summer mean mcg/m3: num [1:135] 7.4 8.3 7.4 8 8.7 7.7 7 7.1 7.1 7.1 ...
 $ Winter mean mcg/m3: num [1:135] 8.2 9.4 8.1 8.7 10.1 8.3 6.8 7.1 7.1 7.2 ...
 $ GEOID             : num [1:135] 3.6e+10 3.6e+10 3.6e+10 3.6e+10 3.6e+10 ...
 $ Borough           : chr [1:135] "Bronx" "Bronx" "Bronx" "Bronx" ...
 - attr(*, "spec")=
  .. cols(
  ..   TimePeriod = col_double(),
  ..   GeoType = col_character(),
  ..   Fips = col_double(),
  ..   GeoRank = col_double(),
  ..   Geography = col_character(),
  ..   `Annual mean mcg/m3` = col_double(),
  ..   `Summer mean mcg/m3` = col_double(),
  ..   `Winter mean mcg/m3` = col_double(),
  ..   GEOID = col_double(),
  ..   Borough = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
Show the code
#Joining the Air Quality Data to the Census Tract
nyc_aq_merged <- base::merge(nyc_sf_merged,nyc_aq , by.x = "GEOID", by.y = "GEOID",show_col_types = FALSE)
names(nyc_aq_merged)
 [1] "GEOID"                                     
 [2] "CTLabel"                                   
 [3] "BoroCode"                                  
 [4] "BoroName"                                  
 [5] "CT2020"                                    
 [6] "BoroCT2020"                                
 [7] "CDEligibil"                                
 [8] "NTAName"                                   
 [9] "NTA2020"                                   
[10] "CDTA2020"                                  
[11] "CDTANAME"                                  
[12] "PUMA"                                      
[13] "Shape_Leng"                                
[14] "Shape_Area"                                
[15] "GEO_ID"                                    
[16] "Name"                                      
[17] "Total"                                     
[18] "White"                                     
[19] "Black"                                     
[20] "American Indian/Alaska Native"             
[21] "Asian"                                     
[22] "Native Hawaiian and Other Pacific Islander"
[23] "Other"                                     
[24] "Two or more races"                         
[25] "Hispanic or Latino origin (of any race)"   
[26] "White alone, not Hispanic or Latino"       
[27] "TimePeriod"                                
[28] "GeoType"                                   
[29] "Fips"                                      
[30] "GeoRank"                                   
[31] "Geography"                                 
[32] "Annual mean mcg/m3"                        
[33] "Summer mean mcg/m3"                        
[34] "Winter mean mcg/m3"                        
[35] "Borough"                                   
[36] "geometry"                                  
Show the code
#Loading in the Map
mapview(nyc_aq_merged, zcol=c('Total','Annual mean mcg/m3'))

Part 5: Chloropeth Map

Once the above data is properly cleaned, I hope to get a chloropleth map to show the relation between the air quality and transit system through income relation. It is important to note that this is not my final deliverable. I am using this to show that the farther areas, while not having the higher income, have to travel farther through these areas to get to the busiest stations. I will pick 1 station, with the most ridership: Time-Square 42nd street. And choose from the outer, poor income areas to show the routes and the trend of the air quality traveled. This has been the method of work discussed with the professor.

Limitation: As discussed previously, since the data is raster, I used arcgis to convert it to a feature. The raster used in this code is what is being referenced from export.

Show the code
#Joining the Air Quality using the converted ArcGIS Data To make a chloropleth map comparing
#Air Quality and Income Around New York City

AirQConvert <- st_read("D:/Gtech-331/RasterT_Reclass1.shp")
Reading layer `RasterT_Reclass1' from data source 
  `D:\Gtech-331\RasterT_Reclass1.shp' using driver `ESRI Shapefile'
Simple feature collection with 329 features and 22 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 912342.1 ymin: 119304.8 xmax: 1067741 ymax: 273290.9
Projected CRS: NAD83 / New York Long Island (ftUS)
Show the code
plot(AirQConvert)
Warning: plotting the first 9 out of 22 attributes; use max.plot = 22 to plot
all

Show the code
#Plot a single column (attribute)
plot(AirQConvert$RasterT__2) #plots outliers

Show the code
# this was most accurate to the ArcGIS map
plot(AirQConvert["RasterT__2"]) 

Show the code
AirQConvert_Income <- sf::st_read("D:/Gtech-331/RasterT_Reclass1.shp") %>%sf::st_transform(sf::st_crs(nyc_sf_merged))
Reading layer `RasterT_Reclass1' from data source 
  `D:\Gtech-331\RasterT_Reclass1.shp' using driver `ESRI Shapefile'
Simple feature collection with 329 features and 22 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 912342.1 ymin: 119304.8 xmax: 1067741 ymax: 273290.9
Projected CRS: NAD83 / New York Long Island (ftUS)