##Part One: Introduction##
Using adverse weather conditions and their impact on transit delays along New Jersey Transit’s Northeast Corridor Rail Line as a case study, this report seeks to demonstrate how data can be combined with visualization and narrative to tell a story. This will be accomplished by:
* Examining how weather could affect commuters.
* Considering whether train delays could disproportionately affect certain segments of the population.

However, before any meaningful analysis can take place, it is crucial to first understand 1) why storytelling with data is important, 2) what it can show, and 3) what effective storytelling with data can accomplish.
For a more in-depth, immersive experience, please see the accompanying storymap for this analysis.
##Part Two: The Importance of Storytelling with Data, a Brief Literature Review##
Effective storytelling with data finds the “sweet spot” that strikes a balance between narrative, data, and visual aids; it is designed to engage, enlighten and explain things to its audience (Nacheva, 2018). This method of analysis examines a topic from multiple angles and can be applied to a wide range of issues, even things as simple as where to allocate resources and services. For example, if an analyst were using data to tell a story that would help fight racial inequality, they could approach it through a criminal or social justice lens or choose to focus on health and socioeconomic inequalities (The GovLab, 2020). A focus on social justice could cover the problems of access to services, hate crimes and voting rights (The GovLab, 2020). There are a variety of vehicles an analyst could use to tell their story, and they could employ a range of tools to do so. Some of these tools include (but are not limited to):
* Static maps or Story Maps
* Graphics and other visualizations
* Reports
* Dashboards and portals

Ideally, an accomplished storyteller would use a combination of these methods. No matter how the story is delivered, the prime objective of the storyteller is to present their audience (be it colleagues, a client, or a decision-maker) with an easily digestible narrative whose message is clear and simple to understand (Nacheva, 2018). Using data to tell the story therefore paves the way for more informed, data-driven decision-making.
If used judiciously, good data can give quantitative or qualitative evidence of observed trends or even someone’s lived experiences. This is perhaps best exemplified by the writings of Peter Saunders, a city planner and urban geographer who used family photos, public records, an inflation calculator and Google Street imagery to show inequalities in his own family’s old neighborhoods (Saunders, 2018). Saunders found that these places (which were and still are majority-minority) “appreciated at a rate less than the rate of inflation for fifty years” (Saunders, 2018). In other words, he was able to use data to illustrate how segregation and the accumulation and transfer of generational wealth impacted him and his family on a personal level.
Therefore, this makes storytelling with data an important tool when analyzing equity concerns. This is especially true as it can be challenging to get some segments of the population to admit that inequality even exists; it can be a contentious subject (KelloggInsight, 2021). However, “ … if we know that our ideology is shaping our attention, we can learn to be more alert to things we don’t normally see—allowing for new common ground, or if nothing else, empathy.” (KelloggInsight, 2021)
Storytelling with data can shed light on often-overlooked blind spots, but it can also provide evidence of trends that have already been observed and perhaps seem obvious. A recent study used old land use records, Landsat imagery and public records of real estate investment over time to demonstrate how disinvestment and redlining, racist housing policies dating from the 1930s, exacerbated the Urban Heat Island (UHI) effect in communities of color in 108 cities (Anderson, 2020). The land use records were used to identify previously redlined areas, and the surface temperature and spectral radiance of the Landsat imagery were used to extrapolate changes in temperature, which revealed what analysts long suspected: in most cases, formerly redlined neighborhoods are hotter (Anderson, 2020). Moreover, most of these communities are lower-income, majority-minority areas to this day, meaning that these policies are still shaping the racial landscape of our cities nearly a century after they were first put into action (Anderson, 2020).
Although the analysts who wrote this report already knew of this phenomenon, the data they gathered to tell their story had one crucial aspect: it gave their claims teeth. Even when an observed trend seems obvious, data is all-important, because the communities being impacted are not the ones that need convincing. To fulfill its goal of fostering data-driven decision-making, the data must convince the decision-makers. If done successfully, “responsible data access and analysis” can be an important tool for progress (The GovLab, 2020). Moreover, although data is just one tool we can use to solve problems, it “can have an impact by making inequalities…more quantifiable and inaction less excusable” (The GovLab, 2020).
##Part Three: Case Study History and Context##
The Beginning of the Story
The Northeast Corridor Line is a commuter rail line owned and operated by NJ Transit. It is one of the busiest rail lines in the nation and runs through some of the most densely populated areas in the United States (NJ Transit, 2020). The line provides approximately 122,000 passenger trips daily (NJ Transit, 2020). Ridership dipped during the pandemic, but as it returns to pre-pandemic levels, NJ Transit approved a $2.6 billion budget that includes modernizing the Northeast Corridor Line, underscoring its importance (Bloomberg, 2022).
The Weather’s Impact on Transit and Equity
Transportation infrastructure is a part of nearly every human being’s day-to-day life and plays a significant role in shaping the built environment (Allen, 2018). Both transportation and weather conditions are place-based; transit moves people from one place to another and weather influences how those populations deal with extreme events associated with a given geography (Allen, 2018).
Therefore, how weather impacts those relying on public transit is of significant interest.
The Populations Impacted by Weather Conditions along NJ’s Northeast Corridor Line
The Northeast Corridor Line runs through the following counties: Hudson, Essex, Mercer, Middlesex, and Union. Although the line serves all of New Jersey, these five counties were chosen as the study corridor, and the Census data for our analysis is drawn from them.
##Part Four: The Analysis##
#Step One: Downloading the Demographic Data#
The data used in this section comes from the Census Bureau and was used to visualize and observe trends in various segments of the population. Its vintage is the 2014-2018 ACS 5-Year Estimates. This period was selected due to its highly variable weather patterns and the proliferation of storms during these years.
The goal of this script is to:
* Download demographic characteristics for the study corridor counties using the Census API.
* Do some basic mapping and exploratory analyses.
* Convert the data to a shapefile for easy web mapping using the st_write() command.
First, let’s call some libraries.
Now we can grab our Census data using the API!
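The download step itself is not shown here, so the following is only a rough sketch of what such a call could look like with tidycensus. The helper name `download_corridor_tracts` is hypothetical, and the ACS variable codes are my assumptions about which tables sit behind the column names used later; they are not confirmed by the original script. Running the helper requires the tidycensus package, a Census API key, and an internet connection.

```r
# Counties the Northeast Corridor Line runs through (from the text above).
study_counties <- c("Hudson", "Essex", "Mercer", "Middlesex", "Union")

# ASSUMPTION: these ACS variable codes are illustrative guesses at the tables
# behind the column names that appear later in this report.
acs_vars <- c(
  hhincome   = "B19013_001",  # median household income
  population = "B01003_001",  # total population
  race.tot   = "B02001_001",  # race: total
  race.white = "B02001_002",  # race: White alone
  race.black = "B02001_003",  # race: Black or African American alone
  no.vehic   = "B08201_002"   # households with no vehicle available
)

# Hypothetical helper: wraps the API call so nothing is downloaded
# until the function is actually invoked.
download_corridor_tracts <- function(year = 2018) {
  tidycensus::get_acs(
    geography = "tract",
    variables = acs_vars,
    state     = "NJ",
    county    = study_counties,
    year      = year,
    survey    = "acs5",   # 5-year estimates; year = 2018 covers 2014-2018
    geometry  = TRUE      # return an sf object with tract polygons
  )
}

# Uncomment once an API key is configured:
# nj_tracts_long <- download_corridor_tracts()
```

By default this returns one row per tract and variable, so an extra reshaping step would be needed to reach a one-row-per-tract table like the one summarized below.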
For this part of the analysis, I included several variables for various segments of the population. Households with no vehicles available, low-income households, residents who are a racial or ethnic minority, and residents who are disabled were all identified as members of the population that may rely on transit more during inclement weather conditions.
In future iterations, I would probably calculate my variables to be a proportion of the population rather than raw numbers. Supplementing the data with the CDC’s Social Vulnerability Index (SVI) may also prove useful. Moreover, New Jersey Transit expanded and redeveloped Morrisville Yard and Trenton Transit Station in 2007 and 2008 respectively (Progressive Railroading, 2006 & NJ Transit 2020b). It would be interesting to download the same data for 2008 and calculate the changes in population after these improvements to see if any areas are in transition, gentrifying or struggling with concentrated poverty.
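As a minimal sketch of the proportion idea, the snippet below converts a raw count into a share of the tract population. The data frame is an invented toy table that only borrows the column names `race.tot` and `race.black`; the values are not from the Census download.

```r
# Toy tract table using the report's column names (values invented).
tracts <- data.frame(
  GEOID      = c("A", "B", "C"),
  race.tot   = c(4000, 0, 2500),
  race.black = c(1000, 0, 500)
)

# Share of residents who are Black. Tracts with zero population get NA
# instead of the NaN that 0 / 0 would otherwise produce.
tracts$share.black <- ifelse(tracts$race.tot > 0,
                             tracts$race.black / tracts$race.tot,
                             NA_real_)
tracts$share.black
```

The same pattern would apply to the other count variables, provided each one is divided by a matching denominator (for example, household counts by total households rather than by total population).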
#Let's do some quick exploratory analyses to see what the data is telling us
library(skimr)
skim(nj_tracts18_tot)
## Warning: Couldn't find skimmers for class: sfc_MULTIPOLYGON, sfc; No user-
## defined `sfl` provided. Falling back to `character`.
| Name | nj_tracts18_tot |
| Number of rows | 736 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 10 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| GEOID | 0 | 1 | 11 | 11 | 0 | 736 | 0 |
| geometry | 0 | 1 | 150 | 2805 | 0 | 736 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| hhincome | 5 | 0.99 | 79232.00 | 43758.46 | 14729 | 45797.50 | 69154.0 | 104235.00 | 250001 | ▇▇▃▁▁ |
| population | 0 | 1.00 | 4362.38 | 1772.27 | 0 | 3103.75 | 4128.5 | 5457.50 | 14134 | ▃▇▃▁▁ |
| race.tot | 0 | 1.00 | 4362.38 | 1772.27 | 0 | 3103.75 | 4128.5 | 5457.50 | 14134 | ▃▇▃▁▁ |
| race.white | 0 | 1.00 | 2357.66 | 1581.96 | 0 | 1141.50 | 2248.0 | 3294.25 | 9334 | ▇▇▂▁▁ |
| race.black | 0 | 1.00 | 919.84 | 1026.18 | 0 | 161.75 | 494.0 | 1381.00 | 5452 | ▇▂▁▁▁ |
| poverty.ratio | 0 | 1.00 | 238.07 | 226.62 | 0 | 78.00 | 170.0 | 337.00 | 2075 | ▇▁▁▁▁ |
| unemployed | 0 | 1.00 | 147.64 | 92.58 | 0 | 82.00 | 130.0 | 196.25 | 738 | ▇▅▁▁▁ |
| disability.status.emp | 0 | 1.00 | 77.01 | 59.78 | 0 | 36.00 | 64.5 | 102.00 | 517 | ▇▂▁▁▁ |
| no.vehic | 0 | 1.00 | 234.51 | 298.48 | 0 | 38.00 | 140.5 | 314.25 | 3379 | ▇▁▁▁▁ |
| real.est.taxes | 15 | 0.98 | 7971.31 | 1833.61 | 2310 | 6623.00 | 8167.0 | 9950.00 | 10001 | ▁▁▃▃▇ |
While there are some null values, I elected to leave them, because a null value can sometimes tell a story just as much as a number can. For example, a census tract in DeKalb County has population data but not income data; this is because the tract includes a prison.
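To decide which null values deserve a closer look, it can help to count them per column and pull out the affected rows. This is a small sketch on invented data; only the column names `hhincome` and `population` mirror the summary table above.

```r
# Toy tract table (values invented) with two missing income estimates.
tracts <- data.frame(
  GEOID      = c("A", "B", "C", "D"),
  hhincome   = c(52000, NA, 87000, NA),
  population = c(4100, 3900, 5200, 2800)
)

# Number of missing values in each column.
missing_per_column <- colSums(is.na(tracts))
missing_per_column

# Tracts that have residents but no income estimate; these are the rows
# worth investigating rather than silently dropping.
tracts[is.na(tracts$hhincome) & tracts$population > 0, ]
```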
#Let's map it
tmap_mode("view")
## tmap mode set to interactive viewing
#tm_polygons() draws both the fill and the borders, so a separate tm_borders() layer is not needed
tm_shape(nj_tracts18_tot) +
tm_polygons(col = "population", title = 'Population', palette = "Blues")
Not too shabby! Let’s make sure we save our output and write it to a shapefile so we can map it later. As mentioned in prior sections, storytelling with data often involves using a variety of methods and platforms, so analysts will often perform these steps so their data can be more flexible.
#Save to RDS
saveRDS(nj_tracts18_tot, file = "nj_tracts18_final.rds")
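#Write to shapefile: st_write() takes the sf object first, then the output path
#(note: the shapefile format shortens field names longer than 10 characters)
#st_write(nj_tracts18_tot, dsn = "nj_tracts18_shp.shp", driver = "ESRI Shapefile")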
#Step Two: Weather Pattern Data and Analysis#
The weather data used for this analysis was sourced from the National Centers for Environmental Information. Analysts used weather stations along the route to gain an understanding of differences in precipitation and temperature and how they impact travel times and delays. Additionally, the transit data was obtained from Kaggle’s NJ Transit + Amtrak (NEC) Rail Performance. This dataset is a cleaned-up version of NJT DepartureVision data and was selected specifically to look at transit delays between station pairs along the line during the evening commute.
During this part of the analysis, the following steps were performed:
* The data was processed for analysis.
* Analysts selected four months during the year (March, June, September, December) based on seasonality as well as the varied weather conditions observed during these periods.
* Analysts identified the dates of significant nor’easters (such as March 1-3 and 20-22) during the study period. These were also used for analysis.
* An ANOVA test and various summary statistics were used to determine whether differences in delays were significant.

However, before this section of the analysis can begin, it is important to identify other causes of delays. Some other possibilities include (but are not limited to):
* Engineering and mechanical failure
* Programmed track work
* Transportation crew failures
* Passenger-related issues
* Cascades/late starts/other complications (Nelson & O’Neil, 2000)
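The ANOVA step mentioned in the list above could be sketched as follows. This is an illustration on invented delay values, not the test that was run on the NJ Transit data; the column names `month` and `delay_minutes` are assumptions chosen to match the delay dataset used later.

```r
# Invented delay values (minutes) for three months; not real NJ Transit data.
delays <- data.frame(
  month = factor(rep(c("March", "June", "September"), each = 5)),
  delay_minutes = c(6.1, 5.4, 7.2, 6.8, 5.9,
                    3.0, 2.7, 3.5, 3.1, 2.9,
                    3.8, 4.1, 3.6, 4.4, 3.9)
)

# One-way ANOVA: does mean delay differ between months?
fit <- aov(delay_minutes ~ month, data = delays)
summary(fit)

# Extract the p-value for the month effect from the ANOVA table.
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
p_value
```

A small p-value would suggest that at least one month's mean delay differs from the others; it does not say which one, so a follow-up comparison such as `TukeyHSD(fit)` would be needed for that.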
At the local level, the Portal Bridge frequently gets stuck when it opens, which can cause massive disruptions on New Jersey’s rail lines–especially if the rails on the bridge misalign when the bridge closes again (Higgs, 2018). Bearing these phenomena in mind, it is now possible to move on to our analysis and results.
#For this part of the analysis, we will need the following Libraries
library(tidyverse)
library(tidycensus)
library(osmdata)
## Data (c) OpenStreetMap contributors, ODbL 1.0. https://www.openstreetmap.org/copyright
library(sfnetworks)
library(units)
## udunits database from C:/Program Files/R/R-4.2.1/library/units/share/udunits/udunits2.xml
library(sf)
library(tidygraph)
##
## Attaching package: 'tidygraph'
## The following object is masked from 'package:stats':
##
## filter
library(tmap)
library(dotenv)
library(viridis)
## Loading required package: viridisLite
library(dplyr)
library(ggplot2)
library(leaflet)
library(leafsync)
library(dbscan)
library(tigris)
## To enable caching of data, set `options(tigris_use_cache = TRUE)`
## in your R script or .Rprofile.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:httr':
##
## config
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(here)
library(patchwork)
library(rgdal)
## Loading required package: sp
## Please note that rgdal will be retired by the end of 2023,
## plan transition to sf/stars/terra functions using GDAL and PROJ
## at your earliest convenience.
##
## rgdal: version: 1.5-32, (SVN revision 1176)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 3.4.3, released 2022/04/22
## Path to GDAL shared files: C:/Program Files/R/R-4.2.1/library/rgdal/gdal
## GDAL binary built with GEOS: TRUE
## Loaded PROJ runtime: Rel. 7.2.1, January 1st, 2021, [PJ_VERSION: 721]
## Path to PROJ shared files: C:/Program Files/R/R-4.2.1/library/rgdal/proj
## PROJ CDN enabled: FALSE
## Linking to sp version:1.5-0
## To mute warnings of possible GDAL/OSR exportToProj4() degradation,
## use options("rgdal_show_exportToProj4_warnings"="none") before loading sp or rgdal.
library(mapview)
library(tidyr)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
Now, we can load in the dataset.
local3861Data <-
read.csv("3861.csv")
#Let's take a look at it before moving on
head(local3861Data)
## date train_id stop_sequence from from_id
## 1 3/1/2018 3861 1 New York Penn Station 105
## 2 3/1/2018 3861 2 New York Penn Station 105
## 3 3/1/2018 3861 3 Secaucus Upper Lvl 38187
## 4 3/1/2018 3861 4 Newark Penn Station 107
## 5 3/1/2018 3861 5 Newark Airport 37953
## 6 3/1/2018 3861 6 North Elizabeth 109
## to to_id scheduled_time actual_time delay_minutes
## 1 New York Penn Station 105 3/1/2018 16:29 3/1/2018 16:29 0.2833333
## 2 Secaucus Upper Lvl 38187 3/1/2018 16:38 3/1/2018 16:41 3.2833333
## 3 Newark Penn Station 107 3/1/2018 16:50 3/1/2018 16:50 0.2333333
## 4 Newark Airport 37953 3/1/2018 16:56 3/1/2018 16:56 0.3000000
## 5 North Elizabeth 109 3/1/2018 16:59 3/1/2018 16:59 0.3000000
## 6 Elizabeth 41 3/1/2018 17:02 3/1/2018 17:02 0.2000000
## status line type
## 1 departed Northeast Corrdr NJ Transit
## 2 departed Northeast Corrdr NJ Transit
## 3 departed Northeast Corrdr NJ Transit
## 4 departed Northeast Corrdr NJ Transit
## 5 departed Northeast Corrdr NJ Transit
## 6 departed Northeast Corrdr NJ Transit
Looks good! But first, let’s explore our dataset a little more before getting into our analysis.
typeof(local3861Data) #typeof() reports the underlying storage type; a data frame is stored as a list of columns, so this returns "list" (class() would return "data.frame")
## [1] "list"
local3861Data$delay_minutes #lets us see what format the numerical data is stored in
## [1] 0.28333333 3.28333333 0.23333333 0.30000000 0.30000000 0.20000000
## [7] 0.20000000 0.23333333 1.21666667 1.20000000 0.21666667 0.20000000
## [13] 0.00000000 0.00000000 0.00000000 0.00000000 0.16666667 0.00000000
## [19] 0.21666667 0.33333333 0.23333333 0.33333333 0.18333333 1.28333333
## [25] 1.28333333 0.33333333 0.23333333 0.00000000 0.00000000 0.00000000
## [31] 0.00000000 0.21666667 0.21666667 0.11666667 1.11666667 0.20000000
## [37] 1.18333333 0.06666667 2.13333333 1.18333333 0.10000000 1.15000000
## [43] 0.00000000 0.00000000 0.00000000 0.00000000 0.18333333 0.00000000
## [49] 0.00000000 0.13333333 0.20000000 0.15000000 0.23333333 1.15000000
## [55] 0.23333333 0.16666667 0.25000000 0.00000000 0.00000000 0.00000000
Before moving on to the bulk of our analysis, let’s run a quick test.
local3861DelayinMinutesforMarch1 <- with(local3861Data, sum(delay_minutes[date == '3/1/2018']))
local3861DelayinMinutesforMarch1
## [1] 7.866667
with(local3861Data, sum(delay_minutes[date == '3/1/2018' & (stop_sequence == '1' | stop_sequence == '2')]))
## [1] 3.566667
local3861DelayinMinutesforMarch15 <- with(local3861Data, sum(delay_minutes[date == '3/15/2018']))
local3861DelayinMinutesforMarch22 <- with(local3861Data, sum(delay_minutes[date == '3/22/2018']))
local3861DelayinMinutesforMarch29 <- with(local3861Data, sum(delay_minutes[date == '3/29/2018']))
delayinMinutesofTrain <- c(local3861DelayinMinutesforMarch1, local3861DelayinMinutesforMarch15, local3861DelayinMinutesforMarch22, local3861DelayinMinutesforMarch29)
dateOfTrain <- c("March 1", "March 15", "March 22", "March 29")
png(file = "barchart_delay3861Local.png")
barplot(delayinMinutesofTrain, names.arg=dateOfTrain, xlab="Date", ylab="Delay In Min", col="red", main="Delay Chart for Local 3861", border="blue")
dev.off() #the parentheses are required: they call the function and close the png device so the file is written
Now we can extract some data for various months and store them as objects.
#This chunk of code stores multiple csvs as objects so we can analyze things by month.
allTrainsMarch2018Data <- read.csv("march2018Delays.csv")
allTrainsApril2018Data <- read.csv("april2018Delays.csv")
allTrainsAugust2018Data <- read.csv("august2018Delays.csv")
allTrainsJune2018Data <- read.csv("june2018Delays.csv")
allTrainsSeptember2018Data <- read.csv("september2018Delays.csv")
allTrainsDecember2018Data <- read.csv("december2018Delays.csv")
#totalinMinutesforMarch <- "totalDelayinMinutesforMarch"
totalDelayperDayinMinutes <- c()
totalMarchDelaydf <- data.frame(totalDelayperDayinMinutes)
totalMarchDelaydf
## data frame with 0 columns and 0 rows
x = 1
paste0('3/',as.character(x),'/2018')
## [1] "3/1/2018"
with(allTrainsMarch2018Data, sum(delay_minutes[date == paste0('3/',as.character(x),'/2018')]))
## [1] 9190.117
#delayStore <- numeric(31)
sumofMarch1Delays <- with(allTrainsMarch2018Data, sum(delay_minutes[date == '3/1/2018']))
typeof(sumofMarch1Delays)
## [1] "double"
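The single-day pattern above (build the date string, then sum the delays for that date) generalizes to every day of the month. Here is a minimal sketch using an invented stand-in for `allTrainsMarch2018Data`; the three-day range and all values are made up for illustration.

```r
# Toy stand-in for allTrainsMarch2018Data (values invented).
march <- data.frame(
  date          = c("3/1/2018", "3/1/2018", "3/2/2018", "3/3/2018", "3/3/2018"),
  delay_minutes = c(1.5, 2.0, 0.5, 4.0, 1.0)
)

# Build one date label per day, in the same format as the `date` column.
days       <- 1:3
day_labels <- paste0("3/", days, "/2018")

# Total delay per day; the result is a numeric vector named by date.
delayStore <- sapply(day_labels, function(d) {
  sum(march$delay_minutes[march$date == d])
})
delayStore
```

For the full month, `days` would run from 1 to 31, and a day with no matching rows simply sums to 0.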
totalrowsMarch12018 <- allTrainsMarch2018Data %>% select(delay_minutes, date) %>% filter(date == "3/1/2018")
totalrowsMarch152018 <- allTrainsMarch2018Data %>% select(delay_minutes, date) %>% filter(date == "3/15/2018")
totalrowsMarch222018 <- allTrainsMarch2018Data %>% select(delay_minutes, date) %>% filter(date == "3/22/2018")
totalrowsMarch292018 <- allTrainsMarch2018Data %>% select(delay_minutes, date) %>% filter(date == "3/29/2018")
totalrowsMarch212018 <- allTrainsMarch2018Data %>% select(delay_minutes, date) %>% filter(date == "3/21/2018")
totalrowsMarch22018 <- allTrainsMarch2018Data %>% select(delay_minutes, date) %>% filter(date == "3/2/2018")
totalrowsMarch72018 <- allTrainsMarch2018Data %>% select(delay_minutes, date) %>% filter(date == "3/7/2018")
#totalrowsMarch12018
Now that our data is filtered, we can get some summary statistics.
mean(totalrowsMarch12018$delay_minutes)
## [1] 6.277402
mean(totalrowsMarch152018$delay_minutes)
## [1] 3.909953
mean(totalrowsMarch222018$delay_minutes)
## [1] 3.862648
mean(totalrowsMarch292018$delay_minutes)
## [1] 2.848815
#Now let's get the mean value for all the dates combined.
marchDaysCombined <- rbind(totalrowsMarch12018,totalrowsMarch152018,totalrowsMarch222018,totalrowsMarch292018,totalrowsMarch212018,totalrowsMarch22018, totalrowsMarch72018)
#marchDaysCombined
mean(marchDaysCombined$delay_minutes) #average over the combined dates, not a single day
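The per-date filtering and averaging above can also be done in one grouped step, which avoids creating a separate object for each date. The sketch below uses invented data and a hypothetical `selected_dates` vector; with the real data, that vector would hold the dates of interest.

```r
# Toy stand-in for allTrainsMarch2018Data (values invented).
march <- data.frame(
  date          = c("3/1/2018", "3/1/2018", "3/2/2018", "3/3/2018", "3/3/2018"),
  delay_minutes = c(1.5, 2.0, 0.5, 4.0, 1.0)
)

# Hypothetical set of dates to compare.
selected_dates <- c("3/1/2018", "3/2/2018")
selected <- march[march$date %in% selected_dates, ]

# Mean delay for each selected date, one row per date.
mean_by_date <- aggregate(delay_minutes ~ date, data = selected, FUN = mean)
mean_by_date

# Mean delay across all selected dates combined.
overall_mean <- mean(selected$delay_minutes)
overall_mean
```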
#Step Three: Equity Analysis#
This part of the analysis had a significant, unforeseen limitation: there was no spatio-temporal component that easily connected the train delay data with the census tracts. However, analysts were still able to run various summary statistics and map Census data to show various segments of the population against households with no vehicles available, as these groups may rely on transit more.
##Part Five: The Story##
##Part Six: Conclusions and Next Steps##
This analysis was a good first step into investigating a deeper story. Based on the results in the previous section, the analysts identified the following key points:
Next Steps
The scripts used to obtain and process the data were written with the idea of expanding the groundwork laid down. This may include (but not be limited to):
##Works Cited##
1. Allen, Diane Jones. (2018). ‘Transit and Climate Adaptation = Transit and Equity’. Meeting of the Minds.
2. Anderson, Meg. (2020). ‘Racist Housing Practices From The 1930s Linked To Hotter Neighborhoods Today’. NPR.
3. Bloomberg. (2022). ‘NJ Transit Approves $2.6 Billion Budget as Ridership Climbs’.
4. The GovLab. (2020). ‘How Data Can Map and Make Racial Inequality More Visible (If Done Responsibly)’. The GovLab.
5. Higgs, Larry. (2018). ‘What the heck is the Portal Bridge and why does it keep getting stuck open?’. nj.com.
6. KelloggInsight. (2021). ‘Why Do Some People See Inequality Where Others Don’t?’. Kellogg School of Management at Northwestern University.
7. Nacheva, Mina. (2018). ‘Crunching Data Is Not Enough. You Need to Be Telling Stories.’ Adverity.
8. Nelson, David, O’Neil, Kay. (2000). ‘Commuter Rail Service Reliability: On-Time Performance and Causes for Delays’. Transportation Research Board Record. Volume 1704, Issue 1.
9. NJ Transit. (2020). ‘NJT 2030: A 10-year Strategic Plan’. NJ Transit.
10. NJ Transit. (2022). ‘NJ TRANSIT MOVES TO IMPROVE TRENTON TRANSIT CENTER’. NJ Transit.
11. Progressive Railroading. (2006). ‘NJ Transit to expand 2-year-old Morrisville yard’. Progressive Railroading.
12. Saunders, Peter. (2018). ‘A Personal Segregation Story’. New Geography.