Part One: Introduction

Using adverse weather conditions and their impact on transit delays along New Jersey Transit’s Northeast Corridor Rail Line as a case study, this report seeks to demonstrate how data can be combined with visualization and narrative to tell a story. This will be accomplished by:

  1. Examining how weather could affect commuters.

  2. Considering whether train delays could disproportionately affect certain segments of the population.

However, before any meaningful analysis can take place, it is crucial to understand (a) why storytelling with data is important, (b) what it can show, and (c) what effective storytelling with data can accomplish.

For a more in-depth, immersive experience, please see the accompanying storymap for this analysis.

Part Two: The Importance of Storytelling with Data, a Brief Literature Review

Effective storytelling with data finds the “sweet spot” that balances narrative, data, and visual aids; it is designed to engage, enlighten, and explain (Nacheva, 2018). This method of analysis can examine a topic from multiple angles and can be applied to a wide range of issues, even ones as simple as where to allocate resources and services. For example, if an analyst were using data to tell a story that would help fight racial inequality, they could approach it through a criminal or social justice lens or choose to focus on health and socioeconomic inequalities (The GovLab, 2020). A focus on social justice could cover access to services, hate crimes, and voting rights (The GovLab, 2020). There are a variety of vehicles an analyst could use to tell their story, and they could employ a range of tools to do so. Some of these tools include (but are not limited to):

  1. Static maps or Story Maps

  2. Graphics and other visualizations

  3. Reports

  4. Dashboards and portals

Ideally, an accomplished storyteller would use a combination of these methods. No matter how the story is delivered, the prime objective of the storyteller is to present their audience–be it colleagues, a client, or a decisionmaker–with an easily digestible narrative whose message is clear (Nacheva, 2018). Using data to tell the story ultimately paves the way for more informed, data-driven decision-making.

If used judiciously, good data can give quantitative or qualitative evidence of observed trends or even someone’s lived experiences. This is perhaps best exemplified by the writings of Peter Saunders, a city planner and urban geographer who used family photos, public records, an inflation calculator and Google Street imagery to show inequalities in his own family’s old neighborhoods (Saunders, 2018). Saunders found that these places (which were and still are majority-minority) “appreciated at a rate less than the rate of inflation for fifty years” (Saunders, 2018). In other words, he was able to use data to illustrate how segregation and the accumulation and transfer of generational wealth affected him and his family on a personal level.

Therefore, this makes storytelling with data an important tool when analyzing equity concerns. This is especially true as it can be challenging to get some segments of the population to admit that inequality even exists; it can be a contentious subject (KelloggInsight, 2021). However, “ … if we know that our ideology is shaping our attention, we can learn to be more alert to things we don’t normally see—allowing for new common ground, or if nothing else, empathy.” (KelloggInsight, 2021)

Storytelling with data can shed light on often-overlooked blind spots, but it can also provide evidence of trends that have already been observed and perhaps seem obvious. A recent study used old land use records, Landsat imagery and public records of real estate investment over time to demonstrate how disinvestment and racist housing policies from the 1930s, such as redlining, exacerbated the Urban Heat Island (UHI) effect in communities of color in 108 cities (Anderson, 2020). The land use records were used to identify previously redlined areas, and the surface temperature and spectral radiance of the Landsat imagery were used to extrapolate changes in temperature, which revealed what analysts long suspected: in most cases, formerly redlined neighborhoods are hotter (Anderson, 2020). Moreover, most of these communities are lower-income, majority-minority areas to this day, meaning that these policies are still shaping the racial landscape of our cities nearly a century after they were first put into action (Anderson, 2020).

Although the analysts behind that study already knew of this phenomenon, the data they gathered to tell their story had one crucial aspect: it gave their claims teeth. Even when an observed trend seems obvious, data is all-important, for the simple reason that the communities being impacted are not the ones that need convincing. To fulfill its goal of fostering data-driven decision-making, the data must convince the decisionmakers. If done successfully, “responsible data access and analysis” can be an important tool for progress (The GovLab, 2020). Moreover, although data is just one tool we can use to solve problems, it “can have an impact by making inequalities…more quantifiable and inaction less excusable” (The GovLab, 2020).

Part Three: Case Study History and Context

The Beginning of the Story

The Northeast Corridor Line is a commuter rail line operated by NJ Transit. It is one of the busiest rail lines in the nation and runs through some of the most densely populated areas in the United States (NJ Transit, 2020). The line provides approximately 122,000 passenger trips daily (NJ Transit, 2020). Ridership dipped during the pandemic, but as it returns to pre-pandemic levels, NJ Transit has approved a $2.6 billion budget that includes modernizing the Northeast Corridor Line, underscoring its importance (Bloomberg, 2022).

The Weather’s Impact on Transit and Equity

Transportation infrastructure is a part of nearly every human being’s day-to-day life and plays a significant role in shaping the built environment (Allen, 2018). Both transportation and weather conditions are place-based; transit moves people from one place to another and weather influences how those populations deal with extreme events associated with a given geography (Allen, 2018).

Therefore, how weather impacts those relying on public transit is of significant interest.

The Populations Impacted by Weather Conditions along NJ’s Northeast Corridor Line

The Northeast Corridor Line runs through the following counties: Hudson, Essex, Mercer, Middlesex, and Union. Therefore, although the line serves all of New Jersey, these five counties were chosen as the study corridor, and the Census data for our analysis is drawn from them.

Part Four: The Analysis

Step One: Downloading the Demographic Data

The data used in this section comes from the Census Bureau and was used to visualize and observe trends in various segments of the population. Its vintage is the 2014-2018 ACS 5-year Estimates. This period was selected because of its highly variable weather patterns and its proliferation of storms.

First, let’s call some libraries.

library(tidycensus) 
library(sf) 
## Linking to GEOS 3.9.1, GDAL 3.4.3, PROJ 7.2.1; sf_use_s2() is TRUE
library(tmap) 
library(jsonlite) 
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()  masks stats::filter()
## ✖ purrr::flatten() masks jsonlite::flatten()
## ✖ dplyr::lag()     masks stats::lag()
library(httr) 
library(reshape2)
## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(here) 
## here() starts at C:/Users/kwells65
library(yelpr)
library(knitr)
library(ggplot2)
library(skimr)

Now we can grab our Census data using the API!

#To get the census data, you have to call your census API
suppressMessages(census_api_key(Sys.getenv("census_api")))

#Now, I'm going to use the API to download census data for the areas and variables I want. 
nj_tracts18 <- suppressMessages(
  get_acs(geography = "tract",  
          state = "NJ",
          county = c("Mercer", "Middlesex", "Essex", "Union", "Hudson"),
          variables = c(hhincome = "B19019_001", #median household income
                        population = "B01003_001", #raw population numbers
                        race.tot = "B02001_001", #total racial dataset
                        race.white = "B02001_002", #residents self-identifying as white
                        race.black = "B02001_003", #residents self-identifying as black or african american
                        poverty.ratio = "C17002_002", #people whose income is under 50% of the poverty level (ratio below 0.50)
                        unemployed = "C18120_006", #unemployed with or without disability
                        disability.status.emp = "C18120_004", #employed people with disability
                        no.vehic = "B08141_002", #workers (16 and over) in households with no vehicles available
                        real.est.taxes ="B25103_001"), #Median real estate taxes paid for owner-occupied housing units (with or without mortgage)
          year = 2018,
          survey = "acs5", 
          geometry = TRUE, 
          output = "wide"))
#Now let's keep the variables we want. 

nj_tracts18_tot <- nj_tracts18 %>% #I like giving my data frames individual names rather than writing over them because it is easier to find errors
  select(GEOID, 
         hhincome = hhincomeE, # New name = old name
         population = populationE,
         race.tot = race.totE, 
         race.white = race.whiteE, 
         race.black = race.blackE,
         poverty.ratio = poverty.ratioE,
         unemployed = unemployedE,
         disability.status.emp = disability.status.empE,
         no.vehic = no.vehicE,
         real.est.taxes = real.est.taxesE)

For this part of the analysis, I included several variables for various segments of the population. Households with no vehicles available, low-income households, residents who are a racial or ethnic minority, and residents who are disabled were all identified as groups that may rely more heavily on transit, even during adverse weather conditions.

In future iterations, I would probably express my variables as proportions of the population rather than raw counts. Supplementing the data with the CDC’s Social Vulnerability Index (SVI) may also prove useful. Moreover, New Jersey Transit expanded and redeveloped Morrisville Yard and Trenton Transit Station in 2007 and 2008 respectively (Progressive Railroading, 2006; NJ Transit, 2022). It would be interesting to download the same data for 2008 and calculate the changes in population after these improvements to see if any areas are in transition, gentrifying or struggling with concentrated poverty.
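The conversion from raw counts to proportions mentioned above could look roughly like the sketch below. It uses a small, made-up table rather than the ACS download, and the helper name share_of_pop() is hypothetical. It also assumes total population is an acceptable denominator; a more careful version would divide each count by the total of its own ACS table (for example, C17002_001 for the poverty variable).

```r
# Toy tract-level table; the GEOIDs and counts are invented for
# illustration and are NOT taken from the ACS download above.
tracts <- data.frame(
  GEOID         = c("toy_tract_1", "toy_tract_2", "toy_tract_3"),
  population    = c(4000, 2500, 0),
  race.black    = c(1000, 500, 0),
  poverty.ratio = c(400, 125, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: divide a count by the tract population, returning NA
# where the population is zero so empty tracts do not produce NaN or Inf.
share_of_pop <- function(count, pop) {
  ifelse(pop > 0, count / pop, NA_real_)
}

tracts$pct.black   <- share_of_pop(tracts$race.black, tracts$population)
tracts$pct.poverty <- share_of_pop(tracts$poverty.ratio, tracts$population)

print(tracts[, c("GEOID", "pct.black", "pct.poverty")])
```

Returning NA for zero-population tracts keeps them visible in the table instead of silently dropping them, which matters for tracts that are parks, water, or institutions.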

#Let's do some quick exploratory analyses to see what the data is telling us
skim(nj_tracts18_tot)
## Warning: Couldn't find skimmers for class: sfc_MULTIPOLYGON, sfc; No user-
## defined `sfl` provided. Falling back to `character`.
Data summary

Name                nj_tracts18_tot
Number of rows      736
Number of columns   12
Column type frequency:
  character         2
  numeric           10
Group variables     None

Variable type: character

skim_variable  n_missing  complete_rate  min   max  empty  n_unique  whitespace
GEOID                  0              1   11    11      0       736           0
geometry               0              1  150  2805      0       736           0

Variable type: numeric

skim_variable          n_missing  complete_rate      mean        sd     p0       p25      p50        p75    p100  hist
hhincome                       5           0.99  79232.00  43758.46  14729  45797.50  69154.0  104235.00  250001  ▇▇▃▁▁
population                     0           1.00   4362.38   1772.27      0   3103.75   4128.5    5457.50   14134  ▃▇▃▁▁
race.tot                       0           1.00   4362.38   1772.27      0   3103.75   4128.5    5457.50   14134  ▃▇▃▁▁
race.white                     0           1.00   2357.66   1581.96      0   1141.50   2248.0    3294.25    9334  ▇▇▂▁▁
race.black                     0           1.00    919.84   1026.18      0    161.75    494.0    1381.00    5452  ▇▂▁▁▁
poverty.ratio                  0           1.00    238.07    226.62      0     78.00    170.0     337.00    2075  ▇▁▁▁▁
unemployed                     0           1.00    147.64     92.58      0     82.00    130.0     196.25     738  ▇▅▁▁▁
disability.status.emp          0           1.00     77.01     59.78      0     36.00     64.5     102.00     517  ▇▂▁▁▁
no.vehic                       0           1.00    234.51    298.48      0     38.00    140.5     314.25    3379  ▇▁▁▁▁
real.est.taxes                15           0.98   7971.31   1833.61   2310   6623.00   8167.0    9950.00   10001  ▁▁▃▃▇

While there are some missing values, I elected to leave them, because a missing value can sometimes tell a story just as much as a number. For example, a census tract in DeKalb County has population data but no income data–this is because the tract includes a prison.
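Before deciding to keep missing values, it can help to see where they fall. The sketch below is one way to do that in base R. The table is a toy stand-in for nj_tracts18_tot with invented values, so the counts it prints are illustrative only.

```r
# Toy stand-in for nj_tracts18_tot; all values are invented for illustration.
tracts <- data.frame(
  GEOID      = c("A", "B", "C", "D"),
  population = c(4100, 0, 2900, 1800),
  hhincome   = c(69000, NA, 52000, NA),
  stringsAsFactors = FALSE
)

# Count missing values in every column.
na_counts <- colSums(is.na(tracts))
print(na_counts)

# Tracts that report residents but no median income deserve a closer look
# (for example, group quarters such as a prison or a dormitory).
unexplained <- tracts[is.na(tracts$hhincome) & tracts$population > 0, ]
print(unexplained$GEOID)
```

Separating "missing because nobody lives there" from "missing although people live there" is what turns a missing value into part of the story rather than a data-cleaning nuisance.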

Using the st_write() command, I wrote our output to a shapefile so we can easily map it on ArcGIS Online’s web platform later. As mentioned in prior sections, storytelling with data often involves a variety of methods and platforms, so analysts often export their data in this way to keep it flexible.
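The export call itself is not shown in this report, so the following is only a minimal sketch of what such a step might look like. The polygon, the attribute values, and the output path are all made up for illustration; only st_sf(), st_write(), and st_read() are real sf functions.

```r
library(sf)

# A single made-up square polygon standing in for the tract geometries.
square <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))
toy_tracts <- st_sf(
  GEOID    = "toy_tract",
  hhincome = 69154,
  no_vehic = 140,
  geometry = st_sfc(square, crs = 4326)
)

# Hypothetical output location; a real script would point at a project folder.
out_path <- file.path(tempdir(), "toy_tracts.shp")

# delete_dsn = TRUE overwrites an existing file of the same name.
st_write(toy_tracts, out_path, delete_dsn = TRUE, quiet = TRUE)

# Read the file back to confirm the attributes survived the round trip.
check <- st_read(out_path, quiet = TRUE)
print(names(check))
```

One practical caveat: the shapefile format limits field names to 10 characters, so a column such as disability.status.emp would be shortened on export. Renaming long columns beforehand, or writing to a GeoPackage instead, avoids surprises when the layer is opened elsewhere.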

Step Two: Weather Pattern Data and Analysis

The weather data used for this analysis was sourced from the National Centers for Environmental Information. Analysts used weather stations along the route to understand differences in precipitation and temperature and how they affect travel times and delays. The transit data was obtained from Kaggle’s NJ Transit + Amtrak (NEC) Rail Performance dataset, a cleaned-up version of NJT departure vision data, and was selected specifically to look at transit delays between station pairs along the line during the evening commute.

During this part of the analysis, the following steps were performed:

  1. The data was processed for analysis.

  2. Analysts selected four months of the year (March, June, September, December) based on seasonality as well as the varied weather conditions observed during these periods.

  3. Analysts identified the dates of significant northeasters (such as March 1-3 and March 20-22) during the study period. These were also used for analysis.

  4. An ANOVA test and various summary statistics were used to determine the significance of delays.
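Steps 2 and 3 above can be sketched in base R as follows. The delay table is hypothetical (invented dates and minutes), the column names are assumptions, and the year 2018 is assumed for the storm windows because the report gives only the days of the month.

```r
# Hypothetical daily delay table; dates and minutes are invented.
delays <- data.frame(
  date = as.Date(c("2018-02-14", "2018-03-02", "2018-03-21",
                   "2018-06-10", "2018-09-05", "2018-12-18")),
  delay_minutes = c(3900, 9800, 8700, 4400, 4200, 5300)
)

# Step 2: keep the four seasonal months.
seasonal_months <- c(3, 6, 9, 12)
delays$month <- as.integer(format(delays$date, "%m"))
seasonal <- delays[delays$month %in% seasonal_months, ]

# Step 3: flag the storm windows named in the text (year assumed to be 2018).
storm_days <- c(
  seq(as.Date("2018-03-01"), as.Date("2018-03-03"), by = "day"),
  seq(as.Date("2018-03-20"), as.Date("2018-03-22"), by = "day")
)
seasonal$storm <- seasonal$date %in% storm_days

print(seasonal)
```

Keeping the storm flag as its own column means the same table can feed both the monthly comparison and the storm-day comparison without being rebuilt.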

However, before this section of the analysis can begin, it is important to identify other causes for delays. Some other possibilities include (but are not limited to):

  1. Engineering and mechanical failure

  2. Programmed track work

  3. Transportation crew failures

  4. Passenger-related issues

  5. Cascades/late starts/other complications (Nelson & O’Neil, 2000)

At the local level, the Portal Bridge frequently gets stuck when it opens, which can cause massive disruptions on New Jersey’s rail lines–especially if the rails on the bridge misalign when the bridge closes again (Higgs, 2018). Bearing these phenomena in mind, it is now possible to move on to our analysis and results.

A full version of this part of the analysis can be read below.

knitr::include_url("https://rpubs.com/sbaghel3/981797")

Analysis Results

Analysts ran a series of ANOVA and Tukey tests to determine the significance of the variables. When looking at all months combined, our results don’t appear to have any statistical significance. However, the months that were selected for the seasonal dataset were a bit more promising.

The results for the seasonal dataset had a p-value of 0.0514, which falls just above the conventional 0.05 threshold and so suggests, at most, marginal significance. Additionally, consider the following summary statistics:

  1. Median delay for March: 4,207.7 minutes/70.1 hours

  2. Median delay for June: 4,441.633 minutes/74 hours

  3. Median delay for September: 4,255.417 minutes/70.9 hours

  4. Median delay for December: 5,335.083 minutes/88.9 hours
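For readers who want to reproduce this kind of monthly comparison, the sketch below shows the general shape of a one-way ANOVA and a per-month median table in base R. The data are random draws chosen only to resemble the magnitudes above; they are not the NJ Transit records, so the printed p-value and medians will not match the figures in this report.

```r
set.seed(42)

# Simulated daily delay totals (minutes) for the four seasonal months.
# These numbers are random draws, not the NJ Transit data.
sim <- data.frame(
  month = factor(rep(c("March", "June", "September", "December"), each = 30),
                 levels = c("March", "June", "September", "December")),
  delay_minutes = c(rnorm(30, 4200, 900), rnorm(30, 4400, 900),
                    rnorm(30, 4250, 900), rnorm(30, 5300, 900))
)

# One-way ANOVA: does the mean delay differ across months?
fit <- aov(delay_minutes ~ month, data = sim)
print(summary(fit))

# Pull the p-value for the month term out of the summary table.
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
print(p_value)

# Median delay per month, reported in minutes and in hours.
med <- tapply(sim$delay_minutes, sim$month, median)
print(round(cbind(minutes = med, hours = med / 60), 1))
```

Note that ANOVA compares means while the list above reports medians; with skewed delay data the two can tell slightly different stories, which is one reason to look at both.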

December clearly has the highest median delay, and when the data was visualized to gain more context, the corresponding plot showed that most delays for each month fall under 5,000 minutes (approximately 83 hours). Additionally, there are a few interesting-looking outliers in March, September and December. March had several notable storms, known as northeasters, that impacted the region. To that end, analysts elected to isolate outliers in the data and/or significant weather events to see what patterns emerged.

The ANOVA test for this dataset had a very small p-value (<2e-16), indicating that delays differed significantly in these cases. A Tukey test was then performed to see which pairs of means differed significantly. The results of this test found that several of the dates on which storms were hitting the areas along the Northeast Corridor showed a statistically significant relationship.
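The follow-up Tukey test can be sketched with base R's TukeyHSD(). The groups and values below are simulated, and the group labels (typical, storm_mar02, storm_mar21) are hypothetical names chosen for illustration, not the grouping used in the full analysis.

```r
set.seed(7)

# Simulated delays for three made-up day types; not the study data.
days <- data.frame(
  day_type = factor(rep(c("typical", "storm_mar02", "storm_mar21"), each = 20),
                    levels = c("typical", "storm_mar02", "storm_mar21")),
  delay_minutes = c(rnorm(20, 4300, 600), rnorm(20, 9500, 600),
                    rnorm(20, 8800, 600))
)

fit <- aov(delay_minutes ~ day_type, data = days)

# Tukey's HSD compares every pair of groups while controlling the
# family-wise error rate.
tukey <- TukeyHSD(fit, "day_type")
print(tukey)

# Pairs whose adjusted p-value falls below 0.05.
comparisons <- tukey$day_type
significant <- rownames(comparisons)[comparisons[, "p adj"] < 0.05]
print(significant)
```

The ANOVA alone only says that at least one group differs; the pairwise table is what identifies which storm dates stand apart from typical days.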

The visualization isolating days with specific weather delays (such as the northeasters identified in previous sections) made the influence of inclement weather on delays along the Northeast Corridor Line much clearer. March 2nd is especially interesting, as our data indicates that the storm also caused mechanical difficulties that day, showing how weather-related delays can compound with other causes.

This is especially important when one is considering who will be the most impacted by delays.

Step Three: Equity Analysis

For the equity analysis portion of this report, analysts struggled with an unforeseen limitation: there wasn’t a spatio-temporal component that easily connected the train delay data with the census tracts. However, they were still able to map the census data to show various segments of the population and perform a separate GTFS analysis.

First, using the Census data from earlier sections, a series of summary statistics and maps comparing certain segments of the population to vehicle ownership were created.

tmap_mode("view")
## tmap mode set to interactive viewing

tm_shape(nj_tracts18_tot) + tm_borders()+
   tm_fill(col = "disability.status.emp", title = 'Disability Status', palette = "Blues")
summary(nj_tracts18_tot$disability.status.emp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   36.00   64.50   77.01  102.00  517.00

The map of disabled people currently in the workforce revealed that the areas around Newark, Elizabeth, and New Brunswick all appear to have comparatively high numbers. This means that these residents may depend on transit more, even in cases of inclement weather.

tm_shape(nj_tracts18_tot) + tm_borders()+
  tm_fill(col = "hhincome", title = 'Median Household Income', palette = "Greens")   
summary(nj_tracts18_tot$hhincome)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   14729   45798   69154   79232  104235  250001       5

This map shows median household incomes. Once again, Newark, Elizabeth, New Brunswick and Trenton stand out, this time for lower median household incomes. The mean (79,232) sits well above the median (69,154), which indicates that the mean is being pulled upward by high-income outliers; this could point to a certain degree of income inequality, but more context is needed.

# tm_fill() (rather than tm_polygons()) avoids drawing a second border layer on top of tm_borders()
tm_shape(nj_tracts18_tot) + tm_borders()+
  tm_fill(col = "poverty.ratio", title = 'Poverty Status', palette = "Reds")
summary(nj_tracts18_tot$poverty.ratio)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    78.0   170.0   238.1   337.0  2075.0

This map shows the number of people below the poverty line at the census tract level. As with household income, the mean (238.1) is well above the median (170.0), which may indicate a certain level of concentrated poverty, but studying poverty rates relative to the population would yield more meaningful results.

In addition to this basic analysis, a separate GTFS analysis was also conducted for more context. The entire analysis can be read in the embedded webpage below:

knitr::include_url("https://rpubs.com/lmardv/981945")

The most relevant results are discussed below.

GTFS Results

The regression analysis on departures revealed that majority white neighborhoods had twice as many on-time departures as neighborhoods that were majority black or African American. Furthermore, there is a positive correlation between on-time departures and income, suggesting that higher-income (and lower-poverty) neighborhoods have more on-time departures and thus are less affected by delays. The results of the Pearson’s test suggest that this relationship not only exists, but is statistically significant.

Therefore, it is reasonable to assume that lower-income and/or majority black or African American neighborhoods along the Northeast Corridor Line are disproportionately impacted by delayed departures.
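The Pearson's test referred to in the GTFS results can be illustrated with base R's cor.test(). The sketch below uses simulated neighborhood values in which on-time share is assumed to rise with income. It shows the mechanics of the test only, not the GTFS analysis itself, and the variable names are hypothetical.

```r
set.seed(11)

# Simulated neighborhood-level values; not the GTFS or ACS data.
n <- 60
median_income <- runif(n, 30000, 150000)

# Assumed for illustration: on-time share rises with income, plus noise.
on_time_share <- 0.65 + 1.5e-6 * median_income + rnorm(n, 0, 0.03)

# Pearson's product-moment correlation with its significance test.
result <- cor.test(median_income, on_time_share, method = "pearson")
print(result$estimate)
print(result$p.value)
```

A significant correlation of this kind describes an association between neighborhoods, not a causal effect of income on departures, so it is best read alongside the regression results rather than on its own.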

Part Five: The Story

These analyses are all well and good, but what is the data telling us? What is the story here?

Although it is not as obvious at a monthly or annual level, adverse weather conditions do influence delays along New Jersey Transit’s Northeast Corridor Line. This is especially apparent when serious storms such as northeasters are considered in isolation. This is hardly news, but it does call attention to how someone who relies on transit and has no other way to commute can be affected by adverse weather in their day-to-day life. Communities facing historic disinvestment–such as majority-minority and low-income neighborhoods–are even more at risk of the adverse outcomes associated with delays.

This is particularly concerning in the context of climate change, which is associated with an increasing severity and frequency of storm activity (Environmental Protection Agency, 2021). A recent report found that, among the most socially vulnerable populations in the United States, those “least able to prepare and cope are disproportionately exposed” to the effects of climate change (Environmental Protection Agency, 2021). As storm activity increases, the most vulnerable residents along the Northeast Corridor Line will therefore likely face an increasing multitude of adverse outcomes that they cannot weather. Policy intervention is necessary to ensure more equitable outcomes.

Part Six: Conclusions and Next Steps

This analysis was a good first step toward investigating a deeper story. Based on the results in the previous section, the analysts identified the following key points:

  1. The Trenton and New Brunswick areas, as well as the stretch from Elizabeth to the end of the line, could make for potentially interesting corridor studies, as they appear to have comparatively higher numbers of disabled people in the workforce, households below the poverty line, and households with lower median incomes. More analysis is needed.

  2. When looking at the influence of inclement weather on transit delays along the Northeast Corridor Line by month, the results show only marginal significance.

  3. However, when major weather events are analyzed in isolation, it is easy to see how much delays are impacted by weather in this region, especially when compounded by other types of delays.

  4. This is important from an equity standpoint considering one of the trends associated with climate change is more intense and frequent natural disasters.

Next Steps

The scripts used to obtain and process the data were written with the idea of expanding on the groundwork laid here. This may include (but is not limited to):

  1. Expanding the Census data to include (a) proportions of the population to get more meaningful results and (b) multiple years to calculate changes over time. This may also lend itself to some predictive modeling, which would be useful in the context of climate change.

  2. Adapting future reports to include a more explicit focus on climate change in general and in any predictive models in particular.

  3. Analyzing the relationship between weather conditions and delays before and after significant expansions and improvements to the Northeast Corridor Line.

Works Cited

  1. Allen, Diane Jones. (2018). ‘Transit and Climate Adaptation = Transit and Equity’. Meeting of the Minds. Link

  2. Anderson, Meg. (2020). ‘Racist Housing Practices From The 1930s Linked To Hotter Neighborhoods Today’. NPR. Link

  3. Bloomberg. (2022). ‘NJ Transit Approves $2.6 Billion Budget as Ridership Climbs’. Link

  4. The Environmental Protection Agency. (2021). ‘EPA Report Shows Disproportionate Impacts of Climate Change on Socially Vulnerable Populations in the United States’. EPA. Link

  5. The GovLab. (2020). ‘How Data Can Map and Make Racial Inequality More Visible (If Done Responsibly)’. The GovLab. Link

  6. Higgs, Larry. (2018). ‘What the heck is the Portal Bridge and why does it keep getting stuck open?’. nj.com. Link

  7. KelloggInsight. (2021). ‘Why Do Some People See Inequality Where Others Don’t?’. Kellogg School of Management at Northwestern University. Link

  8. Nacheva, Mina. (2018). ‘Crunching Data Is Not Enough. You Need to Be Telling Stories.’ Adverity. Link

  9. Nelson, David, & O’Neil, Kay. (2000). ‘Commuter Rail Service Reliability: On-Time Performance and Causes for Delays’. Transportation Research Board Record. Volume 1704, Issue 1. Link

  10. NJ Transit. (2020). ‘NJT 2030: A 10-year Strategic Plan’. NJ Transit. Link

  11. NJ Transit. (2022). ‘NJ TRANSIT MOVES TO IMPROVE TRENTON TRANSIT CENTER’. NJ Transit. Link

  12. Progressive Railroading. (2006). ‘NJ Transit to expand 2-year-old Morrisville yard’. Progressive Railroading. Link

  13. Saunders, Peter. (2018). ‘A Personal Segregation Story’. New Geography. Link