## Part One: Introduction
Using adverse weather conditions and their impact on transit delays along New Jersey Transit’s Northeast Corridor Rail Line as a case study, this report seeks to demonstrate how data can be combined with visualization and narrative to tell a story. This will be accomplished by:

* Examining how weather could affect commuters.
* Considering whether train delays could disproportionately affect certain segments of the population.
However, before any meaningful analysis can take place, it is crucial to first understand: 1) why storytelling with data is important, 2) what it can show, and 3) what effective storytelling with data can accomplish.
For a more in-depth, immersive experience, please see the accompanying story map for this analysis.
## Part Two: The Importance of Storytelling with Data, a Brief Literature Review
Effective storytelling with data finds the “sweet spot” that strikes a balance between narrative, data, and visual aids; it is designed to engage, enlighten, and explain things to its audience (Nacheva, 2018). This method of analysis examines a variety of topics from multiple angles to address a slew of issues, and it can be applied to a range of other subjects, even things as practical as where to allocate resources and services. For example, if an analyst were using data to tell a story that would help fight racial inequality, they could approach it through a criminal or social justice lens or choose to focus on health and socioeconomic inequalities (The GovLab, 2020). A focus on social justice could cover the problems of access to services, hate crimes, and voting rights (The GovLab, 2020). There are a variety of vehicles an analyst could use to tell their story, and they could employ a range of tools to do so. Some of these tools include (but are not limited to):

* Static maps or Story Maps
* Graphics and other visualizations
* Reports
* Dashboards and portals
Ideally, an accomplished storyteller would use a combination of these methods. No matter how the story is delivered, the prime objective of the storyteller is to present their audience, be it colleagues, a client, or a decision-maker, with an easily digestible narrative with a clear message that is simple to understand (Nacheva, 2018). Using data to tell the story ultimately paves the way for more informed, data-driven decision-making.
If used judiciously, good data can give quantitative or qualitative evidence of observed trends or even of someone’s lived experiences. This is perhaps best exemplified by the writings of Peter Saunders, a city planner and urban geographer who used family photos, public records, an inflation calculator, and Google Street View imagery to show inequalities in his own family’s old neighborhoods (Saunders, 2018). Saunders found that these places (which were and still are majority-minority) “appreciated at a rate less than the rate of inflation for fifty years” (Saunders, 2018). In other words, he was able to use data to add weight to his account and effectively illustrate how segregation and the accumulation and transfer of generational wealth impacted him and his family on a personal level.
This makes storytelling with data an important tool when analyzing equity concerns, especially since it can be challenging to get some segments of the population to admit that inequality even exists; it can be a contentious subject (KelloggInsight, 2021). However, “… if we know that our ideology is shaping our attention, we can learn to be more alert to things we don’t normally see—allowing for new common ground, or if nothing else, empathy” (KelloggInsight, 2021).
Storytelling with data can shed light on often-overlooked blind spots, but it can also provide evidence of trends that have already been observed and perhaps seem obvious. A recent study used old land use records, Landsat imagery, and public records of real estate investment over time to demonstrate how disinvestment and racist redlining policies from the 1930s exacerbated the Urban Heat Island (UHI) Effect in communities of color in 108 cities (Anderson, 2020). The land use records were used to identify previously redlined areas, and the surface temperature and spectral radiance of the Landsat imagery were used to extrapolate changes in temperature, which revealed what analysts long suspected: in most cases, formerly redlined neighborhoods are hotter (Anderson, 2020). Moreover, most of these communities are lower-income, majority-minority areas to this day, meaning that these policies were still shaping the racial landscape of our cities nearly a century after they were first put into action (Anderson, 2020).
Although the analysts who wrote this report already knew of this phenomenon, the data they gathered to tell their story had one crucial aspect: it gave their claims teeth. Even when an observed trend seems obvious, data is all-important, because the communities being impacted are not the ones that need convincing. In order to fulfill its goal of fostering data-driven decision-making, the data must convince the decision-makers. If done successfully, “responsible data access and analysis” can be an important tool for progress (The GovLab, 2020). Moreover, although data is just one tool we can use to solve problems, it “can have an impact by making inequalities…more quantifiable and inaction less excusable” (The GovLab, 2020).
## Part Three: Case Study History and Context
### The Beginning of the Story
The Northeast Corridor Line is a commuter rail line owned and operated by NJ Transit; it is one of the busiest rail lines in the nation and runs through some of the most densely populated areas in the United States (NJ Transit, 2020). The Northeast Corridor Line provides approximately 122,000 passenger trips daily (NJ Transit, 2020). The line saw a dip in ridership during the pandemic, but as ridership climbs back toward pre-pandemic levels, NJ Transit approved a $2.6 billion budget that includes modernizing the Northeast Corridor Line, underscoring its importance (Bloomberg, 2022).
### The Weather’s Impact on Transit and Equity
Transportation infrastructure is a part of nearly every human being’s day-to-day life and plays a significant role in shaping the built environment (Allen, 2018). Both transportation and weather conditions are place-based; transit moves people from one place to another and weather influences how those populations deal with extreme events associated with a given geography (Allen, 2018).
Therefore, how weather impacts those relying on public transit is of significant interest.
### The Populations Impacted by Weather Conditions along NJ’s Northeast Corridor Line
The Northeast Corridor Line runs through the following counties: Hudson, Essex, Mercer, Middlesex, and Union. Although NJ Transit services all of New Jersey, these five counties were chosen as the study corridor, and it is from these counties that the Census data for our analysis is derived.
## Part Four: The Analysis
### Step One: Downloading the Demographic Data
The data used in this section comes from the Census Bureau and was used to visualize and observe trends across various segments of the population. Its vintage is the 2014–2018 ACS 5-Year Estimates. This period was selected due to its highly variable weather patterns as well as a proliferation of storms.
The goals of this script are to:

* Do some basic mapping and exploratory analyses.
* Convert the data to a shapefile for easy web mapping using the st_write() command.
First, let’s call some libraries.
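A minimal sketch of the libraries this step relies on (skimr is an assumption here, inferred from the skim() summaries shown below):

#Libraries for this step (a sketch)
library(tidyverse)   #data wrangling and plotting
library(tidycensus)  #Census Bureau API access
library(sf)          #spatial data handling and st_write()
library(tmap)        #thematic mapping used in later steps
library(skimr)       #quick exploratory summaries via skim()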
Now we can grab our Census data using the API!
For this part of the analysis, I included several variables covering various segments of the population. Households with no vehicle available, low-income households, residents who are racial or ethnic minorities, and residents with disabilities were all identified as segments of the population that may rely more heavily on transit, even during adverse weather conditions.
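A hedged sketch of the tidycensus call for this step follows; the ACS table codes shown are illustrative assumptions standing in for the variables described above (the full set also covered poverty, unemployment, disability status, and real estate taxes):

#A sketch of the ACS download; the variable codes are illustrative assumptions
nj_tracts18_tot <- get_acs(
  geography = "tract",
  state = "NJ",
  county = c("Hudson", "Essex", "Mercer", "Middlesex", "Union"),
  variables = c(
    hhincome   = "B19013_001", #median household income
    no.vehic   = "B08201_002", #households with no vehicle available
    race.tot   = "B02001_001", #total population (race universe)
    race.white = "B02001_002",
    race.black = "B02001_003"
  ),
  year = 2018, survey = "acs5",
  geometry = TRUE, output = "wide"
)
#Note: wide output appends E/M suffixes (e.g., hhincomeE), so the columns
#would then be renamed to match the names used in the summaries below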
In future iterations, I would probably calculate my variables as proportions of the population rather than raw counts. Supplementing the data with the CDC’s Social Vulnerability Index (SVI) may also prove useful. Moreover, New Jersey Transit expanded and redeveloped Morrisville Yard and Trenton Transit Center in 2007 and 2008, respectively (Progressive Railroading, 2006; NJ Transit, 2022). It would be interesting to download the same data for 2008 and calculate the changes in population after these improvements to see if any areas are in transition, gentrifying, or struggling with concentrated poverty.
#Let's do some quick exploratory analyses to see what the data is telling us
skim(nj_tracts18_tot)
## Warning: Couldn't find skimmers for class: sfc_MULTIPOLYGON, sfc; No
## user-defined `sfl` provided. Falling back to `character`.
| Name | nj_tracts18_tot |
| Number of rows | 736 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 10 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| GEOID | 0 | 1 | 11 | 11 | 0 | 736 | 0 |
| geometry | 0 | 1 | 150 | 2805 | 0 | 736 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| hhincome | 5 | 0.99 | 79232.00 | 43758.46 | 14729 | 45797.50 | 69154.0 | 104235.00 | 250001 | ▇▇▃▁▁ |
| population | 0 | 1.00 | 4362.38 | 1772.27 | 0 | 3103.75 | 4128.5 | 5457.50 | 14134 | ▃▇▃▁▁ |
| race.tot | 0 | 1.00 | 4362.38 | 1772.27 | 0 | 3103.75 | 4128.5 | 5457.50 | 14134 | ▃▇▃▁▁ |
| race.white | 0 | 1.00 | 2357.66 | 1581.96 | 0 | 1141.50 | 2248.0 | 3294.25 | 9334 | ▇▇▂▁▁ |
| race.black | 0 | 1.00 | 919.84 | 1026.18 | 0 | 161.75 | 494.0 | 1381.00 | 5452 | ▇▂▁▁▁ |
| poverty.ratio | 0 | 1.00 | 238.07 | 226.62 | 0 | 78.00 | 170.0 | 337.00 | 2075 | ▇▁▁▁▁ |
| unemployed | 0 | 1.00 | 147.64 | 92.58 | 0 | 82.00 | 130.0 | 196.25 | 738 | ▇▅▁▁▁ |
| disability.status.emp | 0 | 1.00 | 77.01 | 59.78 | 0 | 36.00 | 64.5 | 102.00 | 517 | ▇▂▁▁▁ |
| no.vehic | 0 | 1.00 | 234.51 | 298.48 | 0 | 38.00 | 140.5 | 314.25 | 3379 | ▇▁▁▁▁ |
| real.est.taxes | 15 | 0.98 | 7971.31 | 1833.61 | 2310 | 6623.00 | 8167.0 | 9950.00 | 10001 | ▁▁▃▃▇ |
While there are some null values, I elected to leave them in place, because sometimes a null value can tell a story just as much as a number can. For example, a census tract in DeKalb County has population data but no income data; this is because the tract includes a prison.
Using the st_write() command, I wrote the output to a shapefile so it can easily be mapped on ArcGIS Online’s web platform later (see the sketch below). As mentioned in prior sections, storytelling with data often involves using a variety of methods and platforms, so analysts will often perform steps like this so their data can be more flexible.
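A minimal sketch of that export step (the output file name is an assumption):

#Write the sf object to a shapefile for upload to ArcGIS Online
st_write(nj_tracts18_tot, "nj_tracts18_tot.shp", delete_layer = TRUE)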
### Step Two: Weather Pattern Data and Analysis
The weather data used for this analysis was sourced from the National Centers for Environmental Information (NCEI). Analysts used weather stations along the route to gain an understanding of differences in precipitation and temperature and how they impact travel times and delays. The transit data was obtained from Kaggle’s NJ Transit + Amtrak (NEC) Rail Performance dataset, a cleaned-up version of NJ Transit’s DepartureVision data, and was selected specifically to look at transit delays between station pairs along the line during the evening commute.
During this part of the analysis, the following steps were performed:

* The data was processed for analysis.
* Analysts selected four months of the year (March, June, September, December) based on seasonality as well as the varied weather conditions observed during these periods.
* Analysts identified the dates of significant nor’easters (such as March 1–3 and 20–22) during the study period; these dates were also used for analysis (see the sketch after this list).
* An ANOVA test and various summary statistics were utilized to determine the significance of the delays.
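As a quick illustration of that third step, the storm dates called out above could be stored up front so that later filters can reference them (the object name here is an assumption):

#Nor'easter dates identified for the study period (from the list above)
noreasterDates2018 <- c("3/1/2018", "3/2/2018", "3/3/2018",
                        "3/20/2018", "3/21/2018", "3/22/2018")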
However, before this section of the analysis can begin, it is important to identify other causes of delays. Some other possibilities include (but are not limited to):

* Engineering and mechanical failure
* Programmed track work
* Transportation crew failures
* Passenger-related issues
* Cascades/late starts/other complications (Nelson & O’Neil, 2000)
At the local level, the Portal Bridge frequently gets stuck when it opens, which can cause massive disruptions on New Jersey’s rail lines, especially if the rails on the bridge misalign when the bridge closes again (Higgs, 2018). Bearing these phenomena in mind, it is now possible to move on to our analysis and results.
#For this part of the analysis, we will need the following Libraries
library(tidyverse)
library(tidycensus)
library(osmdata)
## Data (c) OpenStreetMap contributors, ODbL 1.0. https://www.openstreetmap.org/copyright
library(sfnetworks)
library(units)
## udunits database from C:/Users/Mary Jane Leach/AppData/Local/R/win-library/4.2/units/share/udunits/udunits2.xml
library(sf)
library(tidygraph)
##
## Attaching package: 'tidygraph'
## The following object is masked from 'package:stats':
##
## filter
library(tmap)
library(viridis)
## Loading required package: viridisLite
library(dplyr)
library(ggplot2)
library(units)
library(leaflet)
library(leafsync)
library(dbscan)
##
## Attaching package: 'dbscan'
## The following object is masked from 'package:stats':
##
## as.dendrogram
library(tigris)
## To enable caching of data, set `options(tigris_use_cache = TRUE)`
## in your R script or .Rprofile.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:httr':
##
## config
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(here)
library(rgdal)
## Loading required package: sp
## Please note that rgdal will be retired during 2023,
## plan transition to sf/stars/terra functions using GDAL and PROJ
## at your earliest convenience.
## See https://r-spatial.org/r/2022/04/12/evolution.html and https://github.com/r-spatial/evolution
## rgdal: version: 1.6-2, (SVN revision 1183)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 3.5.2, released 2022/09/02
## Path to GDAL shared files: C:/Users/Mary Jane Leach/AppData/Local/R/win-library/4.2/rgdal/gdal
## GDAL binary built with GEOS: TRUE
## Loaded PROJ runtime: Rel. 8.2.1, January 1st, 2022, [PJ_VERSION: 821]
## Path to PROJ shared files: C:/Users/Mary Jane Leach/AppData/Local/R/win-library/4.2/rgdal/proj
## PROJ CDN enabled: FALSE
## Linking to sp version:1.5-1
## To mute warnings of possible GDAL/OSR exportToProj4() degradation,
## use options("rgdal_show_exportToProj4_warnings"="none") before loading sp or rgdal.
library(mapview)
library(tidyr)
library(lubridate)
## Loading required package: timechange
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
Now, we can load in the dataset.
local3861Data <-
read.csv(here("3861.csv"))
#Let's take a look at it before moving on
head(local3861Data)
## date train_id stop_sequence from from_id
## 1 3/1/2018 3861 1 New York Penn Station 105
## 2 3/1/2018 3861 2 New York Penn Station 105
## 3 3/1/2018 3861 3 Secaucus Upper Lvl 38187
## 4 3/1/2018 3861 4 Newark Penn Station 107
## 5 3/1/2018 3861 5 Newark Airport 37953
## 6 3/1/2018 3861 6 North Elizabeth 109
## to to_id scheduled_time actual_time delay_minutes
## 1 New York Penn Station 105 3/1/2018 16:29 3/1/2018 16:29 0.2833333
## 2 Secaucus Upper Lvl 38187 3/1/2018 16:38 3/1/2018 16:41 3.2833333
## 3 Newark Penn Station 107 3/1/2018 16:50 3/1/2018 16:50 0.2333333
## 4 Newark Airport 37953 3/1/2018 16:56 3/1/2018 16:56 0.3000000
## 5 North Elizabeth 109 3/1/2018 16:59 3/1/2018 16:59 0.3000000
## 6 Elizabeth 41 3/1/2018 17:02 3/1/2018 17:02 0.2000000
## status line type
## 1 departed Northeast Corrdr NJ Transit
## 2 departed Northeast Corrdr NJ Transit
## 3 departed Northeast Corrdr NJ Transit
## 4 departed Northeast Corrdr NJ Transit
## 5 departed Northeast Corrdr NJ Transit
## 6 departed Northeast Corrdr NJ Transit
Looks good! Now let’s explore our dataset a little more before getting into our analysis.
typeof(local3861Data) #shows the underlying storage type; data frames are stored as lists, as seen here
## [1] "list"
local3861Data$delay_minutes #lets us see what format the numerical data is stored in
## [1] 0.28333333 3.28333333 0.23333333 0.30000000 0.30000000 0.20000000
## [7] 0.20000000 0.23333333 1.21666667 1.20000000 0.21666667 0.20000000
## [13] 0.00000000 0.00000000 0.00000000 0.00000000 0.16666667 0.00000000
## [19] 0.21666667 0.33333333 0.23333333 0.33333333 0.18333333 1.28333333
## [25] 1.28333333 0.33333333 0.23333333 0.00000000 0.00000000 0.00000000
## [31] 0.00000000 0.21666667 0.21666667 0.11666667 1.11666667 0.20000000
## [37] 1.18333333 0.06666667 2.13333333 1.18333333 0.10000000 1.15000000
## [43] 0.00000000 0.00000000 0.00000000 0.00000000 0.18333333 0.00000000
## [49] 0.00000000 0.13333333 0.20000000 0.15000000 0.23333333 1.15000000
## [55] 0.23333333 0.16666667 0.25000000 0.00000000 0.00000000 0.00000000
Before moving on to the bulk of our analysis, let’s run a quick test: summing the delay minutes for a single train (the 3861 local) on a few March dates and charting the results.
local3861DelayinMinutesforMarch1 <- with(local3861Data, sum(delay_minutes[date == '3/1/2018']))
local3861DelayinMinutesforMarch1
## [1] 7.866667
with(local3861Data, sum(delay_minutes[date == '3/1/2018' & (stop_sequence == '1' | stop_sequence == '2')]))
## [1] 3.566667
local3861DelayinMinutesforMarch15 <- with(local3861Data, sum(delay_minutes[date == '3/15/2018']))
local3861DelayinMinutesforMarch22 <- with(local3861Data, sum(delay_minutes[date == '3/22/2018']))
local3861DelayinMinutesforMarch29 <- with(local3861Data, sum(delay_minutes[date == '3/29/2018']))
delayinMinutesofTrain <- c(local3861DelayinMinutesforMarch1, local3861DelayinMinutesforMarch15, local3861DelayinMinutesforMarch22, local3861DelayinMinutesforMarch29)
dateOfTrain <- c("March 1", "March 15", "March 22", "March 29")
png(file = "barchart_delay3861Local.png")
barplot(delayinMinutesofTrain, names.arg=dateOfTrain, xlab="Date", ylab="Delay In Min", col="red", main="Delay Chart for Local 3861", border="blue")
dev.off()
Now we can extract some data for various months and store them as objects.
#This chunk of code reads multiple CSVs into objects so we can analyze delays by month.
allTrainsMarch2018Data <- read.csv(here("march2018Delays.csv"))
allTrainsApril2018Data <- read.csv(here("april2018Delays.csv"))
allTrainsAugust2018Data <- read.csv(here("august2018Delays.csv"))
allTrainsJune2018Data <- read.csv(here("june2018Delays.csv"))
allTrainsSeptember2018Data <- read.csv(here("september2018Delays.csv"))
allTrainsDecember2018Data <- read.csv(here("december2018Delays.csv"))
totalDelayperDayinMinutes <- c()
totalMarchDelaydf <- data.frame(totalDelayperDayinMinutes)
totalMarchDelaydf
## data frame with 0 columns and 0 rows
x = 1
paste0('3/',as.character(x),'/2018')
## [1] "3/1/2018"
with(allTrainsMarch2018Data, sum(delay_minutes[date == paste0('3/',as.character(x),'/2018')]))
## [1] 9190.117
#delayStore <- numeric(31)
sumofMarch1Delays <- with(allTrainsMarch2018Data, sum(delay_minutes[date == '3/1/2018']))
typeof(sumofMarch1Delays)
## [1] "double"
totalrowsMarch12018 <- allTrainsMarch2018Data %>% select(delay_minutes, date) %>% filter(date == "3/1/2018")
totalrowsMarch152018 <- allTrainsMarch2018Data %>% select(delay_minutes, date) %>% filter(date == "3/15/2018")
totalrowsMarch222018 <- allTrainsMarch2018Data %>% select(delay_minutes, date) %>% filter(date == "3/22/2018")
totalrowsMarch292018 <- allTrainsMarch2018Data %>% select(delay_minutes, date) %>% filter(date == "3/29/2018")
totalrowsMarch212018 <- allTrainsMarch2018Data %>% select(delay_minutes, date) %>% filter(date == "3/21/2018")
totalrowsMarch22018 <- allTrainsMarch2018Data %>% select(delay_minutes, date) %>% filter(date == "3/2/2018")
totalrowsMarch72018 <- allTrainsMarch2018Data %>% select(delay_minutes, date) %>% filter(date == "3/7/2018")
#totalrowsMarch12018
Now that our data is filtered, we can get some summary statistics.
mean(totalrowsMarch12018$delay_minutes)
## [1] 6.277402
mean(totalrowsMarch152018$delay_minutes)
## [1] 3.909953
mean(totalrowsMarch222018$delay_minutes)
## [1] 3.862648
mean(totalrowsMarch292018$delay_minutes)
## [1] 2.848815
#Now let's combine the selected dates into a single data frame for the tests below.
marchDaysCombined <- rbind(totalrowsMarch12018,totalrowsMarch152018,totalrowsMarch222018,totalrowsMarch292018,totalrowsMarch212018,totalrowsMarch22018, totalrowsMarch72018)
#marchDaysCombined
marchDaysCombined$date <- factor(marchDaysCombined$date, levels=c("3/1/2018", "3/2/2018", "3/7/2018", "3/15/2018", "3/21/2018", "3/22/2018", "3/29/2018"))
Now we can tally and plot the total delays for each day of the selected months.
#Empty lists to hold each month's daily delay totals
delayStore <- list()
delayStoreApril2018 <- list()
delayStoreAugust2018 <- list()
delayStoreJune2018 <- list()
delayStoreSeptember2018 <- list()
delayStoreDecember2018 <- list()
typeof(delayStore)
## [1] "list"
#Sum the delay minutes for each day of March 2018 (the loops below repeat
#this pattern for April, August, June, September, and December)
for( i in 1:31){
delayStore[i] <- with(allTrainsMarch2018Data, sum(delay_minutes[date == paste0('3/',as.character(i),'/2018')]))
}
for( i in 1:30){
delayStoreApril2018[i] <- with(allTrainsApril2018Data, sum(delay_minutes[date == paste0('4/',as.character(i),'/2018')]))
}
for( i in 1:31){
delayStoreAugust2018[i] <- with(allTrainsAugust2018Data, sum(delay_minutes[date == paste0('8/',as.character(i),'/2018')]))
}
for( i in 1:30){
delayStoreJune2018[i] <- with(allTrainsJune2018Data, sum(delay_minutes[date == paste0('6/',as.character(i),'/2018')]))
}
for( i in 1:30){
delayStoreSeptember2018[i] <- with(allTrainsSeptember2018Data, sum(delay_minutes[date == paste0('9/',as.character(i),'/2018')]))
}
for( i in 1:31){
delayStoreDecember2018[i] <- with(allTrainsDecember2018Data, sum(delay_minutes[date == paste0('12/',as.character(i),'/2018')]))
}
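As an aside, the six loops above could be collapsed into a single helper; the sketch below is an alternative under the same data format, not the code used to produce the results that follow:

#A compact alternative: tally daily delay totals for any month without
#hard-coding the number of days in each loop
monthlyDelayTotals <- function(df, month) {
  nDays <- lubridate::days_in_month(as.Date(sprintf("2018-%02d-01", month)))
  sapply(seq_len(nDays), function(d) {
    sum(df$delay_minutes[df$date == sprintf("%d/%d/2018", month, d)])
  })
}
#e.g., marchTotals <- monthlyDelayTotals(allTrainsMarch2018Data, 3)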
do.call(rbind, delayStoreApril2018)
## [,1]
## [1,] 1376.350
## [2,] 4502.483
## [3,] 4172.800
## [4,] 9157.350
## [5,] 8168.667
## [6,] 5657.267
## [7,] 3713.733
## [8,] 3779.683
## [9,] 5164.817
## [10,] 5214.100
## [11,] 3692.283
## [12,] 6017.033
## [13,] 7226.850
## [14,] 2645.267
## [15,] 2184.267
## [16,] 6047.333
## [17,] 5854.350
## [18,] 4680.267
## [19,] 4656.650
## [20,] 6193.667
## [21,] 2718.883
## [22,] 2766.917
## [23,] 10752.400
## [24,] 8108.917
## [25,] 5242.883
## [26,] 6436.033
## [27,] 4392.317
## [28,] 2875.817
## [29,] 3345.650
## [30,] 4844.867
#Now let's do some checks.
delayCheckApril2018 <- do.call(rbind, delayStoreApril2018)
colnames(delayCheckApril2018)[1] <- "DelayinMin"
data_delayCheckApril2018 <- as.data.frame(delayCheckApril2018)
data_delayCheckApril2018$Month <- ("April")
#head(data_delayCheckApril2018)
do.call(rbind, delayStoreAugust2018)
## [,1]
## [1,] 6116.917
## [2,] 5436.050
## [3,] 6485.200
## [4,] 3275.867
## [5,] 3538.000
## [6,] 5870.083
## [7,] 5115.717
## [8,] 4399.233
## [9,] 12797.483
## [10,] 7003.783
## [11,] 2866.017
## [12,] 2780.700
## [13,] 5663.867
## [14,] 4921.450
## [15,] 5689.933
## [16,] 4902.033
## [17,] 6486.000
## [18,] 2953.833
## [19,] 3068.433
## [20,] 4981.683
## [21,] 4919.650
## [22,] 9248.350
## [23,] 5448.533
## [24,] 6558.667
## [25,] 4684.500
## [26,] 5892.817
## [27,] 4131.767
## [28,] 5864.983
## [29,] 4347.300
## [30,] 3581.617
## [31,] 4588.233
delayCheckAugust2018 <- do.call(rbind, delayStoreAugust2018)
colnames(delayCheckAugust2018)[1] <- "DelayinMin"
data_delayCheckAugust2018 <- as.data.frame(delayCheckAugust2018)
data_delayCheckAugust2018$Month <- ("August")
#head(data_delayCheckAugust2018)
do.call(rbind, delayStoreJune2018)
## [,1]
## [1,] 7953.150
## [2,] 2422.250
## [3,] 2454.067
## [4,] 4954.817
## [5,] 4644.067
## [6,] 3649.200
## [7,] 3856.150
## [8,] 5093.917
## [9,] 2493.467
## [10,] 2082.733
## [11,] 4613.333
## [12,] 5453.117
## [13,] 3890.450
## [14,] 4442.817
## [15,] 4760.650
## [16,] 2698.733
## [17,] 2596.617
## [18,] 6079.967
## [19,] 7197.050
## [20,] 4570.867
## [21,] 4674.567
## [22,] 4440.450
## [23,] 2657.983
## [24,] 2669.500
## [25,] 4426.517
## [26,] 3421.717
## [27,] 5777.850
## [28,] 5488.583
## [29,] 6605.117
## [30,] 2647.583
delayCheckJune2018 <- do.call(rbind, delayStoreJune2018)
colnames(delayCheckJune2018)[1] <- "DelayinMin"
data_delayCheckJune2018 <- as.data.frame(delayCheckJune2018)
data_delayCheckJune2018$Month <- ("June")
#head(data_delayCheckJune2018)
do.call(rbind, delayStoreSeptember2018)
## [,1]
## [1,] 2184.150
## [2,] 1998.067
## [3,] 3131.483
## [4,] 4263.833
## [5,] 12103.000
## [6,] 4247.000
## [7,] 7552.167
## [8,] 3489.200
## [9,] 5826.100
## [10,] 7387.383
## [11,] 5296.333
## [12,] 4144.283
## [13,] 4897.000
## [14,] 4682.417
## [15,] 2322.367
## [16,] 2933.400
## [17,] 5439.917
## [18,] 4148.667
## [19,] 5102.317
## [20,] 5780.800
## [21,] 3868.450
## [22,] 3181.617
## [23,] 2300.283
## [24,] 5347.333
## [25,] 5182.967
## [26,] 4320.583
## [27,] 4180.833
## [28,] 5717.100
## [29,] 2377.333
## [30,] 2639.667
delayCheckSeptember2018 <- do.call(rbind, delayStoreSeptember2018)
colnames(delayCheckSeptember2018)[1] <- "DelayinMin"
data_delayCheckSeptember2018 <- as.data.frame(delayCheckSeptember2018)
data_delayCheckSeptember2018$Month <- ("September")
#head(data_delayCheckSeptember2018)
do.call(rbind, delayStoreDecember2018)
## [,1]
## [1,] 3099.517
## [2,] 3824.100
## [3,] 5335.083
## [4,] 5334.167
## [5,] 7374.283
## [6,] 6724.917
## [7,] 11893.800
## [8,] 7758.600
## [9,] 6290.250
## [10,] 14459.150
## [11,] 6025.633
## [12,] 6701.533
## [13,] 7259.517
## [14,] 6510.267
## [15,] 3588.083
## [16,] 5029.700
## [17,] 7328.017
## [18,] 7155.933
## [19,] 6740.667
## [20,] 6060.100
## [21,] 5634.783
## [22,] 4211.650
## [23,] 3184.667
## [24,] 4867.667
## [25,] 2876.150
## [26,] 4221.783
## [27,] 4178.767
## [28,] 4785.667
## [29,] 3875.267
## [30,] 2787.850
## [31,] 4293.333
delayCheckDecember2018 <- do.call(rbind, delayStoreDecember2018)
colnames(delayCheckDecember2018)[1] <- "DelayinMin"
data_delayCheckDecember2018 <- as.data.frame(delayCheckDecember2018)
data_delayCheckDecember2018$Month <- ("December")
#head(data_delayCheckDecember2018)
print(delayStore)
## [[1]]
## [1] 9190.117
##
## [[2]]
## [1] 12944.98
##
## [[3]]
## [1] 2235
##
## [[4]]
## [1] 3433.717
##
## [[5]]
## [1] 4316.183
##
## [[6]]
## [1] 5415.733
##
## [[7]]
## [1] 6528.033
##
## [[8]]
## [1] 1996.367
##
## [[9]]
## [1] 12790.55
##
## [[10]]
## [1] 3357.1
##
## [[11]]
## [1] 3524.217
##
## [[12]]
## [1] 7172.55
##
## [[13]]
## [1] 5945.95
##
## [[14]]
## [1] 3745.683
##
## [[15]]
## [1] 5775
##
## [[16]]
## [1] 11008.83
##
## [[17]]
## [1] 3255.317
##
## [[18]]
## [1] 2272.35
##
## [[19]]
## [1] 5044.833
##
## [[20]]
## [1] 5342.967
##
## [[21]]
## [1] 2817.983
##
## [[22]]
## [1] 5878.95
##
## [[23]]
## [1] 4142.2
##
## [[24]]
## [1] 4017.633
##
## [[25]]
## [1] 3999.833
##
## [[26]]
## [1] 3997.95
##
## [[27]]
## [1] 4286.5
##
## [[28]]
## [1] 3695.867
##
## [[29]]
## [1] 4207.7
##
## [[30]]
## [1] 5123.317
##
## [[31]]
## [1] 2721.417
do.call(rbind.data.frame, delayStore)
## c.9190.116666678..12944.983333328..2234.999999982..3433.71666668..
## 1 9190.117
## 2 12944.983
## 3 2235.000
## 4 3433.717
## 5 4316.183
## 6 5415.733
## 7 6528.033
## 8 1996.367
## 9 12790.550
## 10 3357.100
## 11 3524.217
## 12 7172.550
## 13 5945.950
## 14 3745.683
## 15 5775.000
## 16 11008.833
## 17 3255.317
## 18 2272.350
## 19 5044.833
## 20 5342.967
## 21 2817.983
## 22 5878.950
## 23 4142.200
## 24 4017.633
## 25 3999.833
## 26 3997.950
## 27 4286.500
## 28 3695.867
## 29 4207.700
## 30 5123.317
## 31 2721.417
do.call(rbind, delayStore)
## [,1]
## [1,] 9190.117
## [2,] 12944.983
## [3,] 2235.000
## [4,] 3433.717
## [5,] 4316.183
## [6,] 5415.733
## [7,] 6528.033
## [8,] 1996.367
## [9,] 12790.550
## [10,] 3357.100
## [11,] 3524.217
## [12,] 7172.550
## [13,] 5945.950
## [14,] 3745.683
## [15,] 5775.000
## [16,] 11008.833
## [17,] 3255.317
## [18,] 2272.350
## [19,] 5044.833
## [20,] 5342.967
## [21,] 2817.983
## [22,] 5878.950
## [23,] 4142.200
## [24,] 4017.633
## [25,] 3999.833
## [26,] 3997.950
## [27,] 4286.500
## [28,] 3695.867
## [29,] 4207.700
## [30,] 5123.317
## [31,] 2721.417
delayCheck <- do.call(rbind, delayStore)
colnames(delayCheck)[1] <- "DelayinMin"
data_delayCheck <- as.data.frame(delayCheck)
data_delayCheck
## DelayinMin
## 1 9190.117
## 2 12944.983
## 3 2235.000
## 4 3433.717
## 5 4316.183
## 6 5415.733
## 7 6528.033
## 8 1996.367
## 9 12790.550
## 10 3357.100
## 11 3524.217
## 12 7172.550
## 13 5945.950
## 14 3745.683
## 15 5775.000
## 16 11008.833
## 17 3255.317
## 18 2272.350
## 19 5044.833
## 20 5342.967
## 21 2817.983
## 22 5878.950
## 23 4142.200
## 24 4017.633
## 25 3999.833
## 26 3997.950
## 27 4286.500
## 28 3695.867
## 29 4207.700
## 30 5123.317
## 31 2721.417
Next, let’s run some summary statistics and tests.
data_delayCheck$Month <- ("March")
#head(data_delayCheck)
#head(data_delayCheckApril2018)
summarise(data_delayCheck, mean(DelayinMin))
## mean(DelayinMin)
## 1 5167.253
summarise(data_delayCheckApril2018, mean(DelayinMin))
## mean(DelayinMin)
## 1 5052.997
#head(delayCheck)
#delayCheck$DelayinMin
#summarise(delayCheck, mean(DelainMin))
median(data_delayCheck$DelayinMin)
## [1] 4207.7
#Mean values can tell us a lot about our dataset, but medians are also important as they are less influenced by outliers. Let's get the medians for our data.
median(data_delayCheckJune2018$DelayinMin)
## [1] 4441.633
median(data_delayCheckSeptember2018$DelayinMin)
## [1] 4255.417
median(data_delayCheckDecember2018$DelayinMin)
## [1] 5335.083
median(data_delayCheckApril2018$DelayinMin)
## [1] 4762.567
median(data_delayCheckAugust2018$DelayinMin)
## [1] 4981.683
#Now let's take a look at all the months combined by collapsing them into a single dataframe.
allMonthsCombined2018 <- rbind(data_delayCheck,data_delayCheckApril2018,data_delayCheckAugust2018)
allSeasonsCombined2018 <- rbind(data_delayCheck,data_delayCheckJune2018,data_delayCheckSeptember2018,data_delayCheckDecember2018)
#head(allMonthsCombined2018)
#allMonthsCombined2018
allMonthsCombined2018$Month <- factor(allMonthsCombined2018$Month, levels=c("March", "April", "August"))
allSeasonsCombined2018$Month <- factor(allSeasonsCombined2018$Month, levels=c("March", "June", "September", "December"))
Next, analysts ran a series of ANOVA and Tukey tests to determine whether mean daily delays differed significantly across the months.
allMonthsCombined2018 %>%
aov(DelayinMin ~ Month, data = .) %>%
summary()
## Df Sum Sq Mean Sq F value Pr(>F)
## Month 2 772002 386001 0.07 0.932
## Residuals 89 490102846 5506774
allMonthsCombined2018 %>%
aov(DelayinMin ~ Month, data = .) %>%
TukeyHSD() #%>%
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DelayinMin ~ Month, data = .)
##
## $Month
## diff lwr upr p adj
## April-March -114.2560 -1546.762 1318.250 0.9802769
## August-March 110.7699 -1309.945 1531.485 0.9811446
## August-April 225.0259 -1207.480 1657.532 0.9257036
#When looking at all months combined, our results don't appear to have any significance. Let's take a look at the seasonal data we selected.
allSeasonsCombined2018 %>%
aov(DelayinMin ~ Month, data = .) %>%
summary()
## Df Sum Sq Mean Sq F value Pr(>F)
## Month 3 41417572 13805857 2.66 0.0514 .
## Residuals 118 612368054 5189560
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
allSeasonsCombined2018 %>%
aov(DelayinMin ~ Month, data = .) %>%
TukeyHSD()
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = DelayinMin ~ Month, data = .)
##
## $Month
## diff lwr upr p adj
## June-March -876.6766 -2397.10687 643.7537 0.4391922
## September-March -632.3844 -2152.81465 888.0459 0.7000249
## December-March 620.1957 -887.72055 2128.1119 0.7073185
## September-June 244.2922 -1288.54995 1777.1344 0.9757542
## December-June 1496.8723 -23.55802 3017.3026 0.0553515
## December-September 1252.5801 -267.85024 2773.0103 0.1444953
These results are more promising, suggesting some level of significance when the seasonal data is considered: the ANOVA p-value (about 0.05) sits right at the conventional threshold, driven largely by the December–June contrast. Additionally, consider the following summary statistics:
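A quick sketch of how those per-month summaries can be computed from the combined data frame:

#Per-month mean and median of the daily delay totals
allSeasonsCombined2018 %>%
  group_by(Month) %>%
  summarise(meanDelay = mean(DelayinMin), medianDelay = median(DelayinMin))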
December clearly has the highest delays, but let’s visualize our data to gain more context.
ggplot(allSeasonsCombined2018, aes(x = Month, y = DelayinMin, fill = Month)) +
geom_violin(show.legend = FALSE)+
geom_boxplot(width = .5, show.legend = FALSE)
The plot shows that most daily delay totals for each month fall under 5,000 minutes (approximately 83 hours), but there are a few interesting-looking outliers in March, September, and December. The month of March had several notable storms known as nor’easters that impacted the region. To that end, analysts elected to isolate outliers in the data and/or significant weather events to see what patterns emerged.
marchDaysCombined %>%
aov(delay_minutes ~ date, data = .) %>%
summary()
## Df Sum Sq Mean Sq F value Pr(>F)
## date 6 38654 6442 69.87 <2e-16 ***
## Residuals 9273 855053 92
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
marchDaysCombined %>%
aov(delay_minutes ~ date, data = .) %>%
TukeyHSD()
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = delay_minutes ~ date, data = .)
##
## $date
## diff lwr upr p adj
## 3/2/2018-3/1/2018 2.11752873 1.0842002 3.15085723 0.0000000
## 3/7/2018-3/1/2018 1.04102092 -0.1617733 2.24381511 0.1410963
## 3/15/2018-3/1/2018 -2.36744949 -3.4117930 -1.32310598 0.0000000
## 3/21/2018-3/1/2018 -3.16704521 -4.3640508 -1.97003962 0.0000000
## 3/22/2018-3/1/2018 -2.41475426 -3.4513840 -1.37812448 0.0000000
## 3/29/2018-3/1/2018 -3.42858693 -4.4729304 -2.38424342 0.0000000
## 3/7/2018-3/2/2018 -1.07650781 -2.2677287 0.11471313 0.1073709
## 3/15/2018-3/2/2018 -4.48497822 -5.5159713 -3.45398510 0.0000000
## 3/21/2018-3/2/2018 -5.28457395 -6.4699498 -4.09919812 0.0000000
## 3/22/2018-3/2/2018 -4.53228299 -5.5554617 -3.50910424 0.0000000
## 3/29/2018-3/2/2018 -5.54611566 -6.5771088 -4.51512254 0.0000000
## 3/15/2018-3/7/2018 -3.40847041 -4.6092589 -2.20768197 0.0000000
## 3/21/2018-3/7/2018 -4.20806614 -5.5437539 -2.87237833 0.0000000
## 3/22/2018-3/7/2018 -3.45577519 -4.6498610 -2.26168942 0.0000000
## 3/29/2018-3/7/2018 -4.46960785 -5.6703963 -3.26881941 0.0000000
## 3/21/2018-3/15/2018 -0.79959573 -1.9945859 0.39539441 0.4318976
## 3/22/2018-3/15/2018 -0.04730477 -1.0816066 0.98699708 0.9999995
## 3/29/2018-3/15/2018 -1.06113744 -2.1031703 -0.01910463 0.0427170
## 3/22/2018-3/21/2018 0.75229095 -0.4359638 1.94054571 0.5024870
## 3/29/2018-3/21/2018 -0.26154171 -1.4565318 0.93344842 0.9952607
## 3/29/2018-3/22/2018 -1.01383267 -2.0481345 0.02046919 0.0590118
The ANOVA test for this dataset had a very small p-value, indicating a significant relationship between date and delays. A Tukey test was then performed to see which pairs of dates differed most. The results of this Tukey test found that several of the dates when storms were hitting the areas along the Northeast Corridor showed statistically significant differences.
ggplot(marchDaysCombined, aes(x = date, y = delay_minutes, fill = date)) +
geom_violin(show.legend = FALSE)+
geom_boxplot(width = .5, show.legend = FALSE)
When days with specific weather-related delays (such as the nor’easters identified in previous sections) were isolated, the influence of inclement weather conditions on delays along the Northeast Corridor Line was much clearer. March 2nd is especially interesting, as our data indicates that the storm also caused mechanical difficulties that day, showing how weather-related delays can compound with other causes.
This is especially important when one is considering who will be the most impacted by delays.
### Step Three: Equity Analysis
For the equity analysis portion of this report, analysts struggled with an unforeseen limitation: there was no spatio-temporal component that easily connected the train delay data with the census tracts. However, they were still able to map the census data to show various segments of the population and perform a separate GTFS analysis.
First, using the Census data from earlier sections, a series of summary statistics and maps comparing certain segments of the population to vehicle ownership were created.
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(nj_tracts18_tot) + tm_borders()+
tm_fill(col = "disability.status.emp", title = 'Disability Status', palette = "Blues")
summary(nj_tracts18_tot$disability.status.emp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 36.00 64.50 77.01 102.00 517.00
The map showing the number of disabled people currently in the workforce revealed that the areas around Newark, Elizabeth, and New Brunswick all have comparatively high numbers of disabled workers. These residents may depend on transit more, even in cases of inclement weather.
tm_shape(nj_tracts18_tot) + tm_borders()+
tm_fill(col = "hhincome", title = 'Median Household Income', palette = "Greens")
summary(nj_tracts18_tot$hhincome)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 14729 45798 69154 79232 104235 250001 5
This map shows median household income. Once again, Newark, Elizabeth, New Brunswick, and Trenton show comparatively low median household incomes. The gap between the median ($69,154) and the mean ($79,232) indicates that the mean is being pulled up by outliers, which could point to a certain degree of income inequality, but more context is needed.
tm_shape(nj_tracts18_tot) + tm_borders()+
tm_fill(col = "poverty.ratio", title = 'Poverty Status', palette = "Reds")
summary(nj_tracts18_tot$poverty.ratio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 78.0 170.0 238.1 337.0 2075.0
This map shows the number of households below the poverty line at the census tract level. As with the previous map, the mean (238) sits well above the median (170), which may indicate pockets of concentrated poverty, though studying poverty rates relative to the population would yield more meaningful results.
In addition to this basic analysis, a separate GTFS analysis, independent of the original, was also conducted for more context. Its results are summarized below.
### GTFS Results
The regression analysis on departures revealed that majority-white neighborhoods had twice as many on-time departures as neighborhoods with majority Black or African American populations. Furthermore, there is a positive correlation between on-time departures and income, suggesting that higher-income (and lower-poverty) neighborhoods have more on-time departures and are thus less affected by delays. The results of the Pearson’s test suggest that this relationship not only exists, but is statistically significant (see the sketch below).
Therefore, it is reasonable to assume that lower-income and/or majority black or African American neighborhoods along the Northeast Corridor Line are disproportionately impacted by delayed departures.
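A hedged sketch of the Pearson test described above; the object and column names (gtfsTracts, on_time_share, med_income) are assumptions standing in for the separate GTFS results table:

#Pearson correlation between the share of on-time departures and median income
#(the object and column names here are illustrative assumptions)
cor.test(gtfsTracts$on_time_share, gtfsTracts$med_income, method = "pearson")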
## Part Five: The Story
These analyses are all well and good, but what is the data telling us? What is the story here?
Although it is not as obvious at a monthly or annual level, adverse weather conditions do influence delays along New Jersey Transit’s Northeast Corridor Line. This is especially apparent when serious storms such as nor’easters are considered in isolation. This is hardly news, but it does call attention to how someone who relies on transit and does not have another choice of commute can be affected by adverse weather in their day-to-day life. Communities facing historic disinvestment, such as majority-minority and low-income neighborhoods, are even more at risk for the adverse outcomes associated with delays.
This is particularly concerning in the context of climate change, which is associated with increasing severity and frequency of storm activity (Environmental Protection Agency, 2021). A recent EPA report found that the most socially vulnerable populations in the United States, those least able to prepare and cope, are disproportionately exposed to the effects of climate change (Environmental Protection Agency, 2021). To that end, as storm activity increases, the most vulnerable residents along the Northeast Corridor Line will likely face an increasing multitude of adverse outcomes that they cannot weather. Therefore, policy intervention is necessary to ensure more equitable outcomes.
## Part Six: Conclusions and Next Steps
This analysis was a good first step toward investigating a deeper story, and the results in the previous sections gave the analysts several key points to build on.

### Next Steps

The scripts used to obtain and process the data were written with the idea of expanding the groundwork laid down here. This may include (but is not limited to) the future iterations described in Step One: calculating variables as proportions, supplementing the data with the CDC’s SVI, and comparing against 2008 data to track neighborhood change.
## Works Cited
Allen, Diane Jones. (2018). ‘Transit and Climate Adaptation = Transit and Equity’. Meeting of the Minds. Link
Anderson, Meg. (2020). ‘Racist Housing Practices From The 1930s Linked To Hotter Neighborhoods Today’. NPR. Link
Bloomberg. (2022). ‘NJ Transit Approves $2.6 Billion Budget as Ridership Climbs’. Link
The Environmental Protection Agency. (2021). ‘EPA Report Shows Disproportionate Impacts of Climate Change on Socially Vulnerable Populations in the United States’. EPA. Link
The GovLab. (2020). ‘How Data Can Map and Make Racial Inequality More Visible (If Done Responsibly)’. The GovLab. Link
Higgs, Larry. (2018). ‘What the heck is the Portal Bridge and why does it keep getting stuck open?’. nj.com. Link
KelloggInsight. (2021). ‘Why Do Some People See Inequality Where Others Don’t?’. Kellogg School of Management at Northwestern University. Link
Nacheva, Mina. (2018). ‘Crunching Data Is Not Enough. You Need to Be Telling Stories.’ Adverity. Link
Nelson, David & O’Neil, Kay. (2000). ‘Commuter Rail Service Reliability: On-Time Performance and Causes for Delays’. Transportation Research Record. Volume 1704, Issue 1. Link
NJ Transit. (2020). ‘NJT 2030: A 10-year Strategic Plan’. NJ Transit. Link
NJ Transit. (2022). ‘NJ Transit Moves to Improve Trenton Transit Center’. NJ Transit. Link
Progressive Railroading. (2006). ‘NJ Transit to expand 2-year-old Morrisville yard’. Progressive Railroading. Link
Saunders, Peter. (2018). ‘A Personal Segregation Story’. New Geography. Link