Analysis of winter road safety probabilities for Santiam Pass, Oregon

Summary

It’s hard for the average person to find historical information about road safety, such as accident rate information based on location, time of day, and weather conditions.

However, the Oregon Department of Transportation (ODOT) regularly publishes live road reports via Twitter as a public service. This creates, in effect, a record of conditions and traffic incidents that can be mined.

This analysis repurposes the ODOT Twitter feed to reconstruct historical events on a specific section of road. As an example, we look at accident rates in the last 3000 tweets from the Tripcheck Highway 20B Twitter account. The focus area is US Highway 20 at Santiam Pass (milepost 79), a 4800-foot (roughly 1460-meter) mountain pass in the Cascade Range of Oregon.

As the main route from the Central Oregon city of Bend to the Willamette Valley cities of Portland, Eugene, and Salem, Highway 20 carries heavy traffic year round and is the site of frequent accidents.

Getting the Tweets

The Twitter feed for this analysis is downloaded by a separate R program and stored locally in a .csv file.

Data Cleaning

The data analyzed cover the dates from 2012-01-19 to 2016-01-25. There are 3000 tweets during this period.

Data are cleaned by searching the tweet text for the strings “crash” and “snow” and then filtering for the location “Santiam Pass Summit”. Timestamps are converted to decimal hours (AM and PM) and are also assigned a time-of-day period (the “dayperiod” column). Below is an example of a very simple “text filter” function.

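trip_incident_filter <- function(data_df, incident = "crash") {
  ## FUNCTION: filter tweets for incidents
  ## accepts: data frame with a `text` column to search, and a search string
  ## returns: the rows whose text contains the search string (case-insensitive)

  ## keep rows whose lower-cased text matches the lower-cased search string;
  ## drop = FALSE keeps the result a data frame even if it has a single column
  hwy_inc <- data_df[grep(tolower(incident), tolower(data_df$text)), , drop = FALSE]

  return(hwy_inc)
}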
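The timestamp conversion mentioned above is not shown in the source, so the following is only a sketch of one way it might be done. The cut points for the day periods and the column name hour_decimal are assumptions chosen for illustration; the original analysis may use different boundaries or a local time zone.

```r
# Toy timestamps copied from the sample rows in this document.
tweets <- data.frame(
  created = as.POSIXct(
    c("2016-01-25 03:05:47", "2016-01-24 12:05:50", "2016-01-24 18:01:52"),
    tz = "UTC"
  )
)

# Decimal hour of the day, e.g. 18:30:00 becomes 18.5.
lt <- as.POSIXlt(tweets$created)
tweets$hour_decimal <- lt$hour + lt$min / 60 + lt$sec / 3600

# ASSUMPTION: illustrative cut points, not taken from the original analysis.
period_breaks <- c(0, 6, 14, 20, 24)
period_labels <- c("night", "morning", "midday", "evening")

# right = FALSE makes each interval closed on the left: [0, 6), [6, 14), ...
tweets$dayperiod <- cut(
  tweets$hour_decimal,
  breaks = period_breaks,
  labels = period_labels,
  right = FALSE
)

print(tweets)
```

Using cut() keeps the period boundaries in one place, so they are easy to adjust once the real definition is known.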

A few rows of the cleaned and engineered data are shown below.

created text dayperiod
1 2016-01-25 03:05:47 US20, 1 Mi N of Bend, Unconfirmed, An unconfirmed report of a crash has been received, use caution…. https://t.co/JQVp0bZ2Ik night
2 2016-01-24 18:01:52 US20 , tombstone, Slush or snow pack, Carry chains/Traction tires… https://t.co/6K2PHCZKLG midday
3 2016-01-24 18:01:52 US20 , santiam pass smt, Slush or snow pack, Carry chains/Traction tires… https://t.co/vuOylEMp31 midday
4 2016-01-24 12:05:50 US20 , santiam pass smt, Packed snow, Carry chains/Traction tires… https://t.co/vuOylEMp31 morning
5 2016-01-24 04:43:50 US20 , tombstone, Packed snow, Carry chains/Traction tires… https://t.co/6K2PHCZKLG night
6 2016-01-24 04:29:50 US20 , santiam pass smt, Slush or snow pack, Carry chains/Traction tires… https://t.co/vuOylEMp31 night
7 2016-01-23 18:07:49 US20 , santiam pass smt, Packed snow, Carry chains/Traction tires… https://t.co/vuOylEMp31 midday
8 2016-01-23 12:19:50 US20 , santiam pass smt, Slush or snow pack, Carry chains/Traction tires… https://t.co/vuOylEMp31 morning

Data Analysis

Highway 20

First let’s look at the accident data for the entire Highway 20B Twitter feed.
A histogram of accidents versus hour of the day shows what might be expected: low levels overnight, a sharp increase at the start of the morning commute, a steady rise throughout the workday to a peak at 6:00 PM, and then a sharp decline into late evening.

The accident rates above show little day-to-day variation. Another visualization makes variations from hour to hour and day to day clearer.

Morning commute hours tend to show higher accident rates, as do most evening hours between 4 and 8 pm.
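One way to tabulate the counts behind an hour-by-weekday view is sketched below. The toy timestamps and the names toy_crashes and counts are illustrative assumptions, not the author's data or code.

```r
# HYPOTHETICAL crash timestamps (two on a Monday morning, one on a Friday evening).
toy_crashes <- as.POSIXct(
  c("2016-01-25 07:15:00", "2016-01-25 07:45:00", "2016-01-22 17:30:00"),
  tz = "UTC"
)

lt <- as.POSIXlt(toy_crashes)

# wday runs from 0 (Sunday) to 6 (Saturday) and does not depend on the locale.
weekday <- factor(
  lt$wday,
  levels = 0:6,
  labels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
)
hour <- factor(lt$hour, levels = 0:23)

# 7 x 24 table of crash counts: rows are weekdays, columns are hours.
counts <- table(weekday = weekday, hour = hour)

print(counts["Mon", "7"])
print(counts["Fri", "17"])
```

Fixing the factor levels up front keeps empty hours and weekdays in the table, which matters when the counts feed a heat map or clock-style plot.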

Zooming in on Santiam Pass

We can also zoom in on Santiam Pass by applying the same filter function shown above to the crash data.

## Filter crash data for Santiam Pass
hwy_crash_santiam <- trip_incident_filter(hwy_crash, "santiam")
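To see how the two filter passes compose, here is a small self-contained sketch. It repeats a compact version of the filter so that it runs on its own; the toy data frame and its tweet text are made up for illustration and only imitate the Tripcheck format.

```r
# Compact copy of the filter so this example runs on its own.
trip_incident_filter <- function(data_df, incident = "crash") {
  data_df[grep(tolower(incident), tolower(data_df$text)), , drop = FALSE]
}

# HYPOTHETICAL tweets that imitate the feed format.
toy <- data.frame(
  id = 1:3,
  text = c(
    "US20, santiam pass smt, Packed snow, Carry chains/Traction tires",
    "US20, 2 Mi E of Santiam Pass Summit, Crash, use caution",
    "US20, 1 Mi N of Bend, Unconfirmed, An unconfirmed report of a crash has been received"
  ),
  stringsAsFactors = FALSE
)

# First pass keeps crash tweets; second pass narrows them to Santiam Pass.
toy_crash <- trip_incident_filter(toy, "crash")
toy_crash_santiam <- trip_incident_filter(toy_crash, "santiam")

print(toy_crash_santiam$id)
```

Because the match is case-insensitive, "santiam" catches both "santiam pass smt" and "Santiam Pass Summit".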

Here the histogram shows a different pattern than the overall highway, supporting the hypothesis that accident conditions are highly location specific.

And the familiar “clock” plot shows variations that are distinct from the overall highway pattern. For instance, early Monday mornings appear to have exceptionally high accident rates, whereas midday on Monday appears to have fewer accidents. Late Friday afternoon also stands out as a period of higher frequency.

The Role of Weather

To analyze the role of weather, we can create a data frame with one row per date from 19 Jan 2012 to 25 Jan 2016 and then add features for whether a crash occurred, whether snow was reported, the day of the week, and so on.

   date        crash snow  weekday   holiday weekend longweekend
1  25 Jan 2016 FALSE FALSE Monday    FALSE   FALSE   FALSE
2  24 Jan 2016 FALSE TRUE  Sunday    FALSE   TRUE    FALSE
3  23 Jan 2016 FALSE TRUE  Saturday  FALSE   TRUE    FALSE
4  22 Jan 2016 FALSE FALSE Friday    FALSE   FALSE   FALSE
5  21 Jan 2016 FALSE TRUE  Thursday  FALSE   FALSE   FALSE
6  20 Jan 2016 TRUE  TRUE  Wednesday FALSE   FALSE   FALSE
7  19 Jan 2016 FALSE TRUE  Tuesday   FALSE   FALSE   FALSE
8  18 Jan 2016 FALSE TRUE  Monday    FALSE   FALSE   FALSE
9  17 Jan 2016 TRUE  TRUE  Sunday    FALSE   TRUE    FALSE
10 16 Jan 2016 TRUE  TRUE  Saturday  FALSE   TRUE    FALSE
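A minimal sketch of how such a daily feature table might be built is shown here. The crash_dates and snow_dates vectors are stand-ins filled only from the ten sample rows above; in the real analysis they would come from the filtered tweets. The holiday and longweekend columns are left out because they need a holiday calendar that the source does not describe.

```r
# One row per calendar day in the analysis window.
days <- data.frame(
  date = seq(as.Date("2012-01-19"), as.Date("2016-01-25"), by = "day")
)

# STAND-IN inputs taken from the ten sample rows only.
crash_dates <- as.Date(c("2016-01-16", "2016-01-17", "2016-01-20"))
snow_dates <- as.Date(c(
  "2016-01-16", "2016-01-17", "2016-01-18", "2016-01-19",
  "2016-01-20", "2016-01-21", "2016-01-23", "2016-01-24"
))

# Flag each day by membership in the incident date vectors.
days$crash <- days$date %in% crash_dates
days$snow <- days$date %in% snow_dates

# Weekday name (locale dependent) and a locale-independent weekend flag:
# %u is the ISO weekday number, 1 for Monday through 7 for Sunday.
days$weekday <- weekdays(days$date)
days$weekend <- format(days$date, "%u") %in% c("6", "7")

print(tail(days, 3))
```

Building the full date sequence first, and then marking days, ensures that days with no tweets at all still appear as rows with crash and snow set to FALSE.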

Confusion Matrix looking at Snow Correlation to Crashes

The confusion table shows there is a substantial error rate in using snow as a predictor of crashes.

             crash
snow      FALSE  TRUE
  FALSE    1022    77
  TRUE      265   104

There are 369 snow days and 181 crashes.

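The classes are highly imbalanced: a crash was reported on only 181 of the 1468 days. With “no crash” treated as the positive class, the model (crash ~ snow) is therefore expected to have a very low negative predictive value (most snow days have no crash, so the model predicts a lot of false accidents) and a modest specificity (the fraction of actual accident cases predicted correctly).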

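Nevertheless, it does show that snow conditions contribute to a much higher accident rate: a crash was reported on 104 of 369 snow days (about 28%) but on only 77 of 1099 snow-free days (about 7%). Obviously other factors must be playing a role.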

Value
Sensitivity 0.79
Specificity 0.57
Pos Pred Value 0.93
Neg Pred Value 0.28
Prevalence 0.88
Detection Rate 0.70
Detection Prevalence 0.75
Balanced Accuracy 0.68
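The statistics above can be reproduced by hand from the confusion table, which makes the definitions explicit. The sketch below assumes that rows are the snow prediction, columns are the observed crash outcome, and “no crash” (FALSE) is the positive class; that reading is inferred from the numbers, since the source does not state which tool produced the table.

```r
# Confusion table: rows are snow (prediction), columns are crash (observed).
cm <- matrix(
  c(1022, 265, 77, 104),
  nrow = 2,
  dimnames = list(snow = c("FALSE", "TRUE"), crash = c("FALSE", "TRUE"))
)

# ASSUMPTION: the positive class is FALSE (no crash).
tp <- cm["FALSE", "FALSE"]  # no snow, no crash
fn <- cm["TRUE", "FALSE"]   # snow, but no crash
fp <- cm["FALSE", "TRUE"]   # no snow, but a crash
tn <- cm["TRUE", "TRUE"]    # snow and a crash
n <- sum(cm)

stats <- c(
  sensitivity = tp / (tp + fn),
  specificity = tn / (tn + fp),
  pos_pred_value = tp / (tp + fp),
  neg_pred_value = tn / (tn + fn),
  prevalence = (tp + fn) / n,
  detection_rate = tp / n,
  detection_prevalence = (tp + fp) / n
)
stats["balanced_accuracy"] <- unname((stats["sensitivity"] + stats["specificity"]) / 2)

print(round(stats, 2))
```

Read this way, the negative predictive value of 0.28 is simply the share of snow days on which a crash was actually reported.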

Decision Tree Analysis of multiple factors

The decision tree analysis adds several other factors to the model. Given the earlier analysis I chose to look at the variables:
- snow
- weekday
- holiday
- long weekend

We can see that snow is the largest factor. The tree analysis also captures the results of the simple confusion matrix approach above, correctly predicting 1211 true FALSE (no-crash) cases and producing 71 false positives.

A rough English interpretation of this tree analysis is fairly powerful. To avoid accidents you should:
- avoid snowy days.
- avoid driving on Friday, Saturday, and Monday, in that order.
- and if you must drive on a long weekend, drive on a holiday.

This seems like a relatively sane set of rules. Additionally, paying attention to the time of day would also contribute to added safety.
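For readers who want to try a tree of this kind, the sketch below fits one with the rpart package on simulated data. The source does not name the tree package or its settings, so rpart, the control values, and the simulated data frame sim are all assumptions; the simulated crash rates are borrowed from the confusion table (about 28% on snow days, 7% otherwise), and the resulting tree is not the author's.

```r
library(rpart)  # ASSUMPTION: the source does not say which tree package was used

set.seed(42)
n <- 1468
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

# SIMULATED predictors; the proportions are illustrative only.
sim <- data.frame(
  snow = factor(runif(n) < 0.25, levels = c(FALSE, TRUE)),
  weekday = factor(sample(day_names, n, replace = TRUE), levels = day_names),
  holiday = factor(runif(n) < 0.03, levels = c(FALSE, TRUE)),
  longweekend = factor(runif(n) < 0.08, levels = c(FALSE, TRUE))
)

# SIMULATED outcome: crash probability depends on snow alone.
p_crash <- ifelse(sim$snow == "TRUE", 0.28, 0.07)
sim$crash <- factor(runif(n) < p_crash, levels = c(FALSE, TRUE))

# cp = 0 keeps splits that the default complexity penalty would prune away,
# which can happen when crashes are rare; maxdepth keeps the tree readable.
fit <- rpart(
  crash ~ snow + weekday + holiday + longweekend,
  data = sim,
  method = "class",
  control = rpart.control(cp = 0, maxdepth = 3)
)

pred <- predict(fit, newdata = sim, type = "class")
print(table(predicted = pred, actual = sim$crash))
```

With an outcome this imbalanced, the predicted class is often “no crash” in every leaf, so the leaf probabilities (predict with type = "prob") are usually more informative than the hard class labels.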

Accident Location

The tweets also contain some rough location information (for example “2 Miles E of Santiam Pass Summit”). It makes sense to see if any additional insight, such as specifically dangerous sections of road, can be identified and correlated with specific road conditions.
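A sketch of how the location text might be turned into a signed distance from the summit follows. The function name parse_offset and the pattern are assumptions based on the two formats visible in this document (“2 Miles E of ...” and “1 Mi N of ...”); it treats west as positive, matching the timeline plot convention used later, and returns NA when no east or west offset is present.

```r
# HYPOTHETICAL helper: extract a signed mile offset (west positive, east negative).
parse_offset <- function(text) {
  # Capture groups: 1 = miles, 2 = optional "les" suffix, 3 = direction.
  pattern <- "([0-9]+) Mi(les)? ([EW]) of"
  matches <- regmatches(text, regexec(pattern, text))

  vapply(matches, function(m) {
    if (length(m) == 0) {
      return(NA_real_)  # no east/west offset in this text
    }
    miles <- as.numeric(m[2])
    if (m[4] == "W") miles else -miles
  }, numeric(1))
}

locations <- c(
  "2 Miles E of Santiam Pass Summit",
  "3 Mi W of Santiam Pass Summit",
  "santiam pass smt"
)
offsets <- parse_offset(locations)
print(offsets)

# Counting accidents per mile marker gives a first look at hot spots.
print(table(offsets, useNA = "ifany"))
```

Tweets located at the summit itself carry no offset; whether to code those as 0 instead of NA is a choice the real analysis would have to make.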

The raw data cover a date range from 2012-03-01 to 2016-01-20, and there were approximately 0.769 accidents per week over that period.

Since the distance over which the accidents occurred is 11 miles, the accident density is about 3.64 accidents per year per mile.
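The density figure follows from simple arithmetic, sketched here with the weekly rate rounded from the text. The 52 weeks per year conversion is an assumption, although it does reproduce the figure quoted above.

```r
accidents_per_week <- 0.769  # rounded from the rate given in the text
weeks_per_year <- 52         # ASSUMPTION: conversion factor
road_miles <- 11             # stretch of road over which accidents occurred

accidents_per_year <- accidents_per_week * weeks_per_year
density <- accidents_per_year / road_miles  # accidents per year per mile

print(round(accidents_per_year, 1))
print(round(density, 2))
```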

Timeline and Location

We can graph a timeline of accidents, with the location of the accident (measured in distance from the summit) represented by the y-axis. The positive direction is West.

In the above plot dots with blue centers occurred on days when snow was reported over the pass.
It’s significant to note that most accidents happen in the winter months. An even more surprising observation is that accidents east of the summit appear to occur predominantly in the winter months. Roughly 70% of accidents on the east side of the pass are correlated with snow, a rate much higher than average.

Accident Frequency Location Hot-Spots

We can also use the location information to look for accident “hot spots.” The location information provided on Twitter is only precise to the nearest mile, so finer resolution is impossible with these data. Nevertheless, as the graph below shows, a relatively short four-mile stretch of road accounts for a majority of accidents.

In the plot above, the dot area represents the number of accidents at each location.

Summary and Conclusions

This analysis shows that the Twitter feeds of ODOT Tripcheck can be used to analyze aspects of road safety, specifically in looking at the detailed behavior of “accident hot spots” as correlated to various events.

Data acquisition is easily accomplished through the Twitter API, and cleaning the data, once acquired, is straightforward.

Santiam Pass is an area with high density of accidents.

Snow is the strongest correlate of accidents, though weekday and time of day are also highly correlated with accident frequency.

A tree analysis shows that if one’s desire is to avoid accidents, following some simple rules like avoiding snow and driving on less busy days like Tuesday, Wednesday, Thursday, and Sunday is advisable.

We also found that certain sections of road are more prone to snow-related accidents. For instance, in the few miles east of Santiam Pass, roughly 70% of the accidents occur when there is snow on the road.

The Twitter feed from ODOT indeed contains significant information relevant to public safety, especially for drivers wanting to understand risk factors on roads outside their control. This analysis can easily be extended to other road areas around the state.