1 Overview: Railroad Casualties on the Long Island Railroad

The Long Island Railroad (LIRR) is one of the 4 major railroad systems serving the New York City metropolitan area. It serves the southeastern part of NY State from Manhattan to the eastern tip of Suffolk County on Long Island. Founded in 1834, the train system remains the busiest commuter railroad system in North America.
https://en.wikipedia.org/wiki/Long_Island_Rail_Road

This project will gather, wrangle and model the LIRR 2018 railroad casualties data from the Federal Railroad Administration (FRA) Office of Safety Analysis. In the next section, I outline the nature of the dataset gather from the FRA. The physical file and some of its complications are discussed in the following section. Finally, a data model and data wrangling is undertaken. A brief report and summary statistics round out the discussion.

1.1 Overview of Railroad Data

The Federal Railroad Administration has regulatory oversight of all railroads in the United States including the LIRR. FRA by Federal regulation requires all accidents, deaths and incidents to be reported through an incident filing process. There are several types of incident forms required. Our project focuses on one Form 6180.55A which is used for Railroad casualties.

Form 6180.55A Defines Data Collection of Human Casualties

One can gather the casualty data from an online portal where the data is publicly available.
The FRA attempts to assist both railroads to meet their reporting obligations and the public in its right to information.

https://safetydata.fra.dot.gov/OfficeofSafety/publicsite/on_the_fly_download.aspx

FRA Download Site for Incident Data

We filtered on just the LIRR for the 2018 calendar year. The file was downloaded in CSV format and follows a standard precanned format. The actual data will be parsed from an ASCII file named “142913142942030.csv” that was downloaded from the above link. Each row of the ASCII file describes the contents of a Form 6180.55A incident report.

The LIRR reported 462 incidents in 2018. The downloaded file has 50 columns.

There are several complications worth noting below

1.1.1 Choosing the Data Columns of Interest

The data file contains numerous columns but only a select number of those will be of interest for my purposes. So we are going to boil down the relevant data to a small subset of columns. The data descriptions below come from the reference materials provided by the FRA in the form of a data file specification for the output.

data_cols	data_desc
IMO	Incident Month
RAILROAD	Railroad Code
INCDTNO	Railroad assigned number
TYPPERS	Type of person whose injury is being reported (See Codes)
CASFATAL	Fatality: Yes or No
DAY	Day of incident
YEAR4	4 digit year of incident
COUNTY	County
EVENT	Event code for what caused the injury
NARR1	Narrative 1
NARR2	Narrative 2 continuation of 1
NARR3	Narrative 3 continuation of 2
LATITUDE	Latitude in decimal degrees WGS84
LONGITUD	Longitude in decimal degrees WGS84

1.1.2 Some Data Is Spread Out

Some issues are obvious. The incident date is not a field of the data file. Instead, the year, month, day are scattered from multiple non-adjacent fields. The incident narrative is often the most useful description of the event. This is split across 3 text columns called Narr1, Narr2, Narr3. The reason for this split is probably historical to allow backward compatibility. The Form is revised periodically to gather new data demands. This makes legacy Form 6180.55a filings obselete in format. The length of the narrative has expanded over time as computing power and storage has increased. Thus we see that narrative fields being used to split a single narrative into 3 components for storage.

1.1.3 Many Attributes are Foreign Keys

2 of the columns (TYPPERS, EVENT) will require cross reference to other FRA data tables to figure out what is meant Luckily, these other data tables are available to download in csv or Excel format to allow cross references.

1.2 Defining Tidy Data

Tidy data requires clearly defining an observation. It requires ensuring each column is a variable and each row is an observation.

In this case, we define each observation as being comprised of:

A filed incident report and the information it contains therein.

We also need to define supplementary data tables to fully construct the data model.

A data table of person type
A data table of events causing incidents

1.3 Location of the Data Files

There are several files associated with the study. * RMD File is on github * Raw data file is likewise. * 3 Output files are on github to represent normalized tidy data * HTML report is on RPubs.

library(readr)

urlFile = "https://raw.githubusercontent.com/completegraph/607_DATAAcquisition/master/1429413142942030.csv"
csv_data <- read_csv(urlFile)

1.4 Wrangling the Data File

Let us choose only the columns of interest from the raw data frame. These are going to be the ones we specified earlier.

csv_data %>% select( one_of( data_cols ) ) %>% # a clever way to specific column names in a vector of string
          unite("narrative", c("NARR1", "NARR2", "NARR3") ) %>%   # Combine the 3 narrative columns
          mutate( incident_date = make_date(YEAR4, IMO, DAY) ) %>%    # construct the incident date using lubridate
          select( -DAY, -YEAR4, -IMO) %>%   # drop the legacy date fragments
          select(incident_date, INCDTNO, everything()) %>%   # put the incident date and identifier in front of the rest
          arrange( incident_date) -> tidydata   # sort chronlogically

## Warning: package 'bindrcpp' was built under R version 3.4.4

str(tidydata)

## Classes 'tbl_df', 'tbl' and 'data.frame':    462 obs. of  10 variables:
##  $ incident_date: Date, format: "2018-01-02" "2018-01-02" ...
##  $ INCDTNO      : chr  "W20180101" "W20180102" "M20180101" "TE20180101" ...
##  $ RAILROAD     : chr  "LI" "LI" "LI" "LI" ...
##  $ TYPPERS      : chr  "A" "A" "A" "A" ...
##  $ CASFATAL     : chr  "N" "N" "N" "N" ...
##  $ COUNTY       : chr  "NASSAU" "NASSAU" "QUEENS" "NASSAU" ...
##  $ EVENT        : chr  "71" "38" "61" "52" ...
##  $ narrative    : chr  "WHILE LOADING SALT BINS FELT A PINCH IN BACK_NA_NA" "LOADING AND UNLOADING ICE MELT BAGS ONTO SIDE OF 91B BOOM TRUCK WHEN EMPLOYEE FELT BACK TIGHTEN UP_NA_NA" "EE WAS IN TELECOM JCC STOCKROOM AND HIT HEAD ON LIGHT FIXTURE ABOVE PALLETS. LACERATION, BLEEDING, S_WELLING. HEADACHES._NA" "EMPLOYEE SLIPPED ON ICE ON THE PLAFROM CAUSING HIMTO FALL AND INJURED HIS LEFT KNEE.EMPLOYEE SUSTAIN_ED SCRAPES"| __truncated__ ...
##  $ LATITUDE     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LONGITUD     : num  0 0 0 0 0 0 0 0 0 0 ...

1.5 Supplementary Codes and Files

Because the report is very large, we selectively illustrate the use of supplementary codes with two values: Event and Person Type.

They are really supporting data tables for the primary one involves incident reporting. However, they also represent a form of data standardization and governance over the problem domain. We will briefly describe each value and then wrangle those data sets to allow joins with the primary data set.

Event refers to the event which caused the injury. It is defined in Appendix F of the FRA Guide for preparing incident reports. The data dictionary for Events show 82 ranging from animal bites (Event 6) to horseplay and practical jokes (Event 33) to stabbing and knifing (Event 55). To develop effective reporting and analysis, the codes used in the tidy data set must be converted to human readable format by cross referencing all codes with the human readable counterparts.

library(readxl)

## Warning: package 'readxl' was built under R version 3.4.4

urlFileName = "https://raw.githubusercontent.com/completegraph/607_DATAAcquisition/master/appendix%20F%20-%20events.csv"

events_data = read_csv(urlFileName)

## Parsed with column specification:
## cols(
##   Type = col_character(),
##   ccode = col_character(),
##   Circumstance = col_character()
## )

There is no external data source for person codes. So we type the data manually from the PDF documentation. Luckily there are only 10 types of people.

persons = LETTERS[1:10]

person_desc = c("worker on duty-employee",
                "employee not on duty",
                "passenger on train" ,
                "nontrepassers-on railroad property" ,
                "trespassers" ,
                "worker on duty - contractor" ,
                "contractor - other" ,
                "worker on duty - volunteer" ,
                "volunteer - other" ,
                "nontrespassers-off railroad property" )

person_types = tibble( persons = persons, person_desc = person_desc )

Adding events information to replace the Events code giving the following set of transformations:

tidydata %>% left_join(events_data, by = c("EVENT" = "ccode")) %>%
            select(-Type)  %>%
            select(incident_date:EVENT, Circumstance, everything() ) %>%
            arrange( EVENT) %>%
            left_join(person_types, by = c("TYPPERS" = "persons") ) -> tidy_enhanced

tidy_enhanced %>% filter(incident_date == "2018-05-03") -> sample_rows

knitr::kable(sample_rows)

incident_date	INCDTNO	RAILROAD	TYPPERS	CASFATAL	COUNTY	EVENT	Circumstance	narrative	LATITUDE	LONGITUD	person_desc
2018-05-03	20138	LI	A	N	SUFFOLK	07	Bodily function/sudden movement, e.g., sneezing, twisting	WHILE CLEANING A PUKE JOB ON CAR 5012 LOWER LEVEL, EMPLOYEE FELT A PULL IN LOW BACK._NA_NA	0	0	worker on duty-employee
2018-05-03	67865	LI	C	N	QUEENS	99	Other (describe in narrative)	Aided had a seizure on board train #156 arriving on track #4. He passed out on the train for an unkn_own amount of time. He was sweating heavily._NA	0	0	passenger on train

1.6 Exporting the Data

To define the data in a tidy format, we require exporting each of the 3 normalized data frames into flat file.

write_csv(tidydata, "Tidy_LIRR_Railroad_Incidents_2018.csv" )
write_csv(person_types, "Tidy_Person_Types.csv")
write_csv(events_data, "Tidy_Events_data.csv")

1.7 Exploring the Data

We will illustrate three exploratory data analyses with the enhanced data set.

How are incidents distributed amongst the counties served by LIRR?
What incidents were fatal and how did the events contribute to fatality?
What was the most frequent type of incident by event?

tidy_enhanced %>% group_by( COUNTY ) %>% summarize( count= n() ) %>% arrange( desc(count))

From the above table, we see that most frequent incidents occurred in Queens.

tidy_enhanced %>% filter( CASFATAL == "Y") %>% group_by(Circumstance, COUNTY) %>% summarize( n() )

tidy_enhanced %>% filter( CASFATAL == "Y")%>% summarize( n() )

From the above, we see that there were 23 fatal incidents in 2018 distributed roughly evenly between Queens, Nassau and Suffolk County. Kings County had no fatalities.

tidy_enhanced %>% group_by( Circumstance, person_desc ) %>% summarize( count = n() )  %>% arrange( desc( count) ) %>% filter( count > 10 )-> top_event_types
  
knitr::kable(top_event_types)

Circumstance	person_desc	count
Slipped, fell, stumbled, other	nontrepassers-on railroad property	79
Overexertion	worker on duty-employee	41
Slipped, fell, stumbled, other	worker on duty-employee	36
Other (describe in narrative)	worker on duty-employee	28
Slipped, fell, stumbled, other	passenger on train	25
Struck by on-track equipment	trespassers	22
Struck by object	worker on duty-employee	21
Other (describe in narrative)	passenger on train	19
Lost balance	nontrepassers-on railroad property	17
Caught, crushed, pinched, other	worker on duty-employee	12
Lost balance	worker on duty-employee	12
Slipped, fell, stumbled, etc. due to object, e.g.,ballast, spike, material, etc.	worker on duty-employee	11

1.8 Conclusion

Tidy data allows the rapid and flexible analysis of the data. However, supplementary data is still required to do useful interpretation or joining. For the LIRR data example, we find a moderate number of data corrections needed but a significant amount of business analysis is still required to make sense of the information.

Project 2- LIRR Railroad Human Casualties in 2018

Alexander Ng

March 10, 2019