SET DATE_TIME VARS i.e. combine YEAR + MONTH + DAY + HOUR + MINUTE –> NEW POSIXlt value
combine NOT_HOUR, ARR_HOUR, etc.
COUNTY/CITY factors – worth the time?
MAKE_MOD combined factor – worth the time?
Create calculated values for DATE_TIME variables – is there a difference in lag time between regions/states?
Clean Up the Tableau Viz
My final Tableau Visualization can be found here
Note: This visualization will be ammended/cleaned up in the future.
We can find an article about the 10 most dangerous roads in the U.S. here
Data for this analysis can be found in zip file format here from the Fatal Analysis Reporting System’s (FARS) nationwide census. You may be asked to connect to the government server as a guest in order to access the data. For the sake of simplicity, I will only be looking at the 2015 FARS data, although data from 1975 to the present is available. This analysis will be limited to the ACCIDENT, PERSON and VEHICLE files, as they are the primary data files of the comprehensive FARS data set.
The FARS system is a nationwide census of fatal traffic collisions that are police-reported. The primary goal of the system is to help reduce fatalities on U.S. highways and to aid in legislation and regulation of traffic safety programs.
Here is an explanation of crash data collection methodology from the National Highway Traffic Safety Administration (NHTSA) in their 2010 congressional report.
Before diving into the data, we should reference the Analytical User’s Manual from the NHTSA. This user’s manual includes data directories for all 19 data files, but I will only be looking at 3 of the files. Specifically, I will be merging the ACCIDENT, PERSON and VEHICLE files.
General Service Administration’s (GSA) codings for the city and county level can be found in an Excel Spreadsheet that includes Geographic Locator Codes (GLCs) for the United State and District of Columbia. These codings are used in the ACCIDENT data file.
Detailed coding guides and variables labels can be found in the FARS/NASS GES Coding and Validation Manual.
The FARS/NASS GES Coding and Validation Manual includes vehicle make and model names for the VEHICLE data file found in pp.215-324. Specific Drug Listings for the PERSON data file can be found in pp.775-786. These listings are grouped both alphabetically and by drug type (i.e. narcotic, stimulant, etc.). For the sake of this analysis we will likely be grouping drugs into specific types and not individual drug names.
All other variable labels that will be used in final visualizations are included in the Analytical User’s Manual or FARS/NASS GES Coding and Validation Manual
Some things of note about this data set:
I primarily selected this data due to the inclusion of GPS coordinates. Mapping collisions should be an effective way to present the data and hopefully tell a compelling story that is accesible to a wide audience. A majority of people in the U.S. drive a vehicle, so traffic collisions are a relevant and interesting topic.
This data set provides a wide variety of variables to consider for analysis, while also covering a comprehensive sample of the United States. Government data is standardized and likely to be more accurate than other sources. Having data for individual accidents, vehicles and persons allows for greater detail than if we were limited to looking at accidents alone.
I would like to be able to merge these data files in order to retain GPS coordinates found in the ACCIDENT file while also looking at variables such as drug use and vehicle type found in the PERSON and VEHICLE files. The end goal in my visualization/story is to map out collisions and have action filters based upon categorical data. I would like to examine how alcohol and drug use is related to fatal collisions and if there is geographical variation in drug usage patterns among fatal collision drivers. Weather and road conditions, time of collision, vehicle make/model/type, EMS response time and hospital delivery time are also going to be considered for visualization.
How do we properly merge these 3 data files? Each file contains a different number of observations, primarily because the PERSON and VEHICLE data files contain rows for each person and vehicle that were in the collision. The accident data file only contains one row for each collision, while the other files may contain many rows for person and vehicle for each individual collision. For example, an individual collision may involve 3 vehicles and 5 persons. In this case, the ACCIDENT file would contain only one row for this collision, while the PERSON and VEHICLE files would contain 5 and 3 rows respectively.
Encoding of categorical variables is also going to be an issue. While some variables such as STATE can easily be recoded into 49 discrete groupings (48 contiguous states and D.C.), others will be more difficult (i.e. drug test results with individual drugs). Over 100 different individual drugs are encoded within the data, which could make labeling difficult and time consuming. I would like to be able to display individual drug names in the Tableau tooltip, but I forsee issues in loading Tableau dashboards/stories if this many discrete categorical values are to be placed on a map.
While I think there is ample information contained within the 2015 data, it would be interesting to see how the past year compares with previous years in the FARS collection system. Time series analysis is an option if I decide to merge data from previous years. For the time being, I am going to work with only 2015 data because of the changing nature of variable codings and names in previous data years and the time investment involved in adding many new data files. I may look at time series type graphs within the 2015 data to see if particular dates or seasons have notable rates of fatal accidents (i.e. holidays).
The ACCIDENT data file contains specific GPS coordinates that can easily be used to create Tableau mappings. Creating high level maps of accident locations is not difficult, but I would like to be able to present geographical detail outside of latitude and longitude within these mappings. This is possible with the GLC city and county codes, but over half of the data for the CITY variable are null due to the collision occuring in a rural area. There are 2,074 unique city names in the ACCIDENT data file, so labeling each of these for inclusion in the Tableau Tooltip may be difficult. Ideally I would be able to create summary statistics for each city to determine if certain cities had a disproportionate number of fatal collisions in 2015. This may require additional outside data sources to determine accident rates per capita.
The raw data files are not “Tableau friendly”, so I imported the ACCIDENT, PERSON and VEHICLE data files into R for cleanup.
I started by limiting the number of variables for analysis, as many of the variables will not be used in the final visualization. For example, the ACCIDENT data file contains 52 unique variables, but I have trimmed this data to 29 variable columns. This was done for reasons of simplicity and efficiency. Many of the variables contain potentially interesting information that I have decided to exclude, as I don’t think they would be well suited for visualization within Tableau. I also hope to limit load times within Tableau as a result of excluding variables.
In the raw FARS csv data, categorical variables are encoded in integer format and do not have specific labels indicating what the integer represents. I have recoded these integer values based upon their labels in the Analytical User’s Manual (i.e. days of the week coded as 1-7 now print as Sunday, Monday, etc.). This should simplify the process of creating dimensions within Tableau which we can use to filter mappings. Ideally categorical labels will be placed in the Tableau Tooltip for easy user reference, but load times for variables with many categories (i.e. drug names) may make this difficult.
For the sake of Tableau map clarity, only fatal accidents occuring in the 48 contiguous states and Washington D.C. are included in the amended data set. Alaska, Hawaii, Puerto Rico and the Virgin Islands have been excluded from the data sets.
Variables for hour and minute measurements have been combined within R as date time variables for analysis. I would like to create computed variables examining the time delays associated with accident reporting, EMS arrival and hospital arrival. These computed variables could potentially be used to map out geographic differences in EMS response and hospital delivery times. I have yet to work with dates and times in Tableau, so this formatting may change in the future.
While this visualization is intended to be submitted for a class project, it should be relevant for a wider audience. Traffic accidents are a common event that are relatable to a majority of people. I believe that my visualization will be insightful for users who may not have previously thought about how fatal collisions are distributed throughout the country. By providing further detail on drunk driving, vehicle types and EMS arrival and hospital delivery times, my hope is that the visualization will allow viewers to see factors related to accidents that they had not considered previously.