The Long Island Railroad (LIRR) is one of the 4 major railroad systems serving the New York City metropolitan area. It serves the southeastern part of NY State from Manhattan to the eastern tip of Suffolk County on Long Island. Founded in 1834, the train system remains the busiest commuter railroad system in North America.
https://en.wikipedia.org/wiki/Long_Island_Rail_Road
This project will gather, wrangle and model the LIRR 2018 railroad casualties data from the Federal Railroad Administration (FRA) Office of Safety Analysis. In the next section, I outline the nature of the dataset gather from the FRA. The physical file and some of its complications are discussed in the following section. Finally, a data model and data wrangling is undertaken. A brief report and summary statistics round out the discussion.
The Federal Railroad Administration has regulatory oversight of all railroads in the United States including the LIRR. FRA by Federal regulation requires all accidents, deaths and incidents to be reported through an incident filing process. There are several types of incident forms required. Our project focuses on one Form 6180.55A which is used for Railroad casualties.
Form 6180.55A Defines Data Collection of Human Casualties
One can gather the casualty data from an online portal where the data is publicly available.
The FRA attempts to assist both railroads to meet their reporting obligations and the public in its right to information.
https://safetydata.fra.dot.gov/OfficeofSafety/publicsite/on_the_fly_download.aspx
FRA Download Site for Incident Data
We filtered on just the LIRR for the 2018 calendar year. The file was downloaded in CSV format and follows a standard precanned format. The actual data will be parsed from an ASCII file named “142913142942030.csv” that was downloaded from the above link. Each row of the ASCII file describes the contents of a Form 6180.55A incident report.
The LIRR reported 462 incidents in 2018. The downloaded file has 50 columns.
There are several complications worth noting below
The data file contains numerous columns but only a select number of those will be of interest for my purposes. So we are going to boil down the relevant data to a small subset of columns. The data descriptions below come from the reference materials provided by the FRA in the form of a data file specification for the output.
data_cols | data_desc |
---|---|
IMO | Incident Month |
RAILROAD | Railroad Code |
INCDTNO | Railroad assigned number |
TYPPERS | Type of person whose injury is being reported (See Codes) |
CASFATAL | Fatality: Yes or No |
DAY | Day of incident |
YEAR4 | 4 digit year of incident |
COUNTY | County |
EVENT | Event code for what caused the injury |
NARR1 | Narrative 1 |
NARR2 | Narrative 2 continuation of 1 |
NARR3 | Narrative 3 continuation of 2 |
LATITUDE | Latitude in decimal degrees WGS84 |
LONGITUD | Longitude in decimal degrees WGS84 |
Some issues are obvious. The incident date is not a field of the data file. Instead, the year, month, day are scattered from multiple non-adjacent fields. The incident narrative is often the most useful description of the event. This is split across 3 text columns called Narr1, Narr2, Narr3. The reason for this split is probably historical to allow backward compatibility. The Form is revised periodically to gather new data demands. This makes legacy Form 6180.55a filings obselete in format. The length of the narrative has expanded over time as computing power and storage has increased. Thus we see that narrative fields being used to split a single narrative into 3 components for storage.
2 of the columns (TYPPERS, EVENT) will require cross reference to other FRA data tables to figure out what is meant Luckily, these other data tables are available to download in csv or Excel format to allow cross references.
Tidy data requires clearly defining an observation. It requires ensuring each column is a variable and each row is an observation.
In this case, we define each observation as being comprised of:
We also need to define supplementary data tables to fully construct the data model.
There are several files associated with the study. * RMD File is on github * Raw data file is likewise. * 3 Output files are on github to represent normalized tidy data * HTML report is on RPubs.
library(readr)
urlFile = "https://raw.githubusercontent.com/completegraph/607_DATAAcquisition/master/1429413142942030.csv"
csv_data <- read_csv(urlFile)
Let us choose only the columns of interest from the raw data frame. These are going to be the ones we specified earlier.
csv_data %>% select( one_of( data_cols ) ) %>% # a clever way to specific column names in a vector of string
unite("narrative", c("NARR1", "NARR2", "NARR3") ) %>% # Combine the 3 narrative columns
mutate( incident_date = make_date(YEAR4, IMO, DAY) ) %>% # construct the incident date using lubridate
select( -DAY, -YEAR4, -IMO) %>% # drop the legacy date fragments
select(incident_date, INCDTNO, everything()) %>% # put the incident date and identifier in front of the rest
arrange( incident_date) -> tidydata # sort chronlogically
## Warning: package 'bindrcpp' was built under R version 3.4.4
str(tidydata)
## Classes 'tbl_df', 'tbl' and 'data.frame': 462 obs. of 10 variables:
## $ incident_date: Date, format: "2018-01-02" "2018-01-02" ...
## $ INCDTNO : chr "W20180101" "W20180102" "M20180101" "TE20180101" ...
## $ RAILROAD : chr "LI" "LI" "LI" "LI" ...
## $ TYPPERS : chr "A" "A" "A" "A" ...
## $ CASFATAL : chr "N" "N" "N" "N" ...
## $ COUNTY : chr "NASSAU" "NASSAU" "QUEENS" "NASSAU" ...
## $ EVENT : chr "71" "38" "61" "52" ...
## $ narrative : chr "WHILE LOADING SALT BINS FELT A PINCH IN BACK_NA_NA" "LOADING AND UNLOADING ICE MELT BAGS ONTO SIDE OF 91B BOOM TRUCK WHEN EMPLOYEE FELT BACK TIGHTEN UP_NA_NA" "EE WAS IN TELECOM JCC STOCKROOM AND HIT HEAD ON LIGHT FIXTURE ABOVE PALLETS. LACERATION, BLEEDING, S_WELLING. HEADACHES._NA" "EMPLOYEE SLIPPED ON ICE ON THE PLAFROM CAUSING HIMTO FALL AND INJURED HIS LEFT KNEE.EMPLOYEE SUSTAIN_ED SCRAPES"| __truncated__ ...
## $ LATITUDE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LONGITUD : num 0 0 0 0 0 0 0 0 0 0 ...
Because the report is very large, we selectively illustrate the use of supplementary codes with two values: Event and Person Type.
They are really supporting data tables for the primary one involves incident reporting. However, they also represent a form of data standardization and governance over the problem domain. We will briefly describe each value and then wrangle those data sets to allow joins with the primary data set.
Event refers to the event which caused the injury. It is defined in Appendix F of the FRA Guide for preparing incident reports. The data dictionary for Events show 82 ranging from animal bites (Event 6) to horseplay and practical jokes (Event 33) to stabbing and knifing (Event 55). To develop effective reporting and analysis, the codes used in the tidy data set must be converted to human readable format by cross referencing all codes with the human readable counterparts.
library(readxl)
## Warning: package 'readxl' was built under R version 3.4.4
urlFileName = "https://raw.githubusercontent.com/completegraph/607_DATAAcquisition/master/appendix%20F%20-%20events.csv"
events_data = read_csv(urlFileName)
## Parsed with column specification:
## cols(
## Type = col_character(),
## ccode = col_character(),
## Circumstance = col_character()
## )
There is no external data source for person codes. So we type the data manually from the PDF documentation. Luckily there are only 10 types of people.
persons = LETTERS[1:10]
person_desc = c("worker on duty-employee",
"employee not on duty",
"passenger on train" ,
"nontrepassers-on railroad property" ,
"trespassers" ,
"worker on duty - contractor" ,
"contractor - other" ,
"worker on duty - volunteer" ,
"volunteer - other" ,
"nontrespassers-off railroad property" )
person_types = tibble( persons = persons, person_desc = person_desc )
Adding events information to replace the Events code giving the following set of transformations:
tidydata %>% left_join(events_data, by = c("EVENT" = "ccode")) %>%
select(-Type) %>%
select(incident_date:EVENT, Circumstance, everything() ) %>%
arrange( EVENT) %>%
left_join(person_types, by = c("TYPPERS" = "persons") ) -> tidy_enhanced
tidy_enhanced %>% filter(incident_date == "2018-05-03") -> sample_rows
knitr::kable(sample_rows)
incident_date | INCDTNO | RAILROAD | TYPPERS | CASFATAL | COUNTY | EVENT | Circumstance | narrative | LATITUDE | LONGITUD | person_desc |
---|---|---|---|---|---|---|---|---|---|---|---|
2018-05-03 | 20138 | LI | A | N | SUFFOLK | 07 | Bodily function/sudden movement, e.g., sneezing, twisting | WHILE CLEANING A PUKE JOB ON CAR 5012 LOWER LEVEL, EMPLOYEE FELT A PULL IN LOW BACK._NA_NA | 0 | 0 | worker on duty-employee |
2018-05-03 | 67865 | LI | C | N | QUEENS | 99 | Other (describe in narrative) | Aided had a seizure on board train #156 arriving on track #4. He passed out on the train for an unkn_own amount of time. He was sweating heavily._NA | 0 | 0 | passenger on train |
To define the data in a tidy format, we require exporting each of the 3 normalized data frames into flat file.
write_csv(tidydata, "Tidy_LIRR_Railroad_Incidents_2018.csv" )
write_csv(person_types, "Tidy_Person_Types.csv")
write_csv(events_data, "Tidy_Events_data.csv")
We will illustrate three exploratory data analyses with the enhanced data set.
tidy_enhanced %>% group_by( COUNTY ) %>% summarize( count= n() ) %>% arrange( desc(count))
From the above table, we see that most frequent incidents occurred in Queens.
tidy_enhanced %>% filter( CASFATAL == "Y") %>% group_by(Circumstance, COUNTY) %>% summarize( n() )
tidy_enhanced %>% filter( CASFATAL == "Y")%>% summarize( n() )
From the above, we see that there were 23 fatal incidents in 2018 distributed roughly evenly between Queens, Nassau and Suffolk County. Kings County had no fatalities.
tidy_enhanced %>% group_by( Circumstance, person_desc ) %>% summarize( count = n() ) %>% arrange( desc( count) ) %>% filter( count > 10 )-> top_event_types
knitr::kable(top_event_types)
Circumstance | person_desc | count |
---|---|---|
Slipped, fell, stumbled, other | nontrepassers-on railroad property | 79 |
Overexertion | worker on duty-employee | 41 |
Slipped, fell, stumbled, other | worker on duty-employee | 36 |
Other (describe in narrative) | worker on duty-employee | 28 |
Slipped, fell, stumbled, other | passenger on train | 25 |
Struck by on-track equipment | trespassers | 22 |
Struck by object | worker on duty-employee | 21 |
Other (describe in narrative) | passenger on train | 19 |
Lost balance | nontrepassers-on railroad property | 17 |
Caught, crushed, pinched, other | worker on duty-employee | 12 |
Lost balance | worker on duty-employee | 12 |
Slipped, fell, stumbled, etc. due to object, e.g.,ballast, spike, material, etc. | worker on duty-employee | 11 |
Tidy data allows the rapid and flexible analysis of the data. However, supplementary data is still required to do useful interpretation or joining. For the LIRR data example, we find a moderate number of data corrections needed but a significant amount of business analysis is still required to make sense of the information.