Weather and Crash Data

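#Load packages used below (dplyr, tidyr, ggplot2, stringr, forcats via tidyverse; lubridate for dates)
library(tidyverse)
library(lubridate)

#Read data into tables
monthly_summary_raw <- read.csv("~/data/monthly_crash_summary.csv", sep=",", header=TRUE)
oct_crash_raw <- read.csv("~/data/oct_crash_data.csv", sep=",", header=TRUE)
oct_weather_raw <- read.csv("~/data/weather_oct.csv", sep=",", header=TRUE)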

crash <- oct_crash_raw %>% 
  mutate(
    DATE_TIME = ymd_hms(OCCURRENCE_DATETIME), 
#Add DATE column to allow for join with weather data
    DATE = ymd(str_sub(OCCURRENCE_DATETIME, start=1, end = 10))
  )  %>%
  select(-OCCURRENCE_DATETIME)


weather <- oct_weather_raw %>%
  rename(precip = percip) %>%
#Add DATE column to allow for join with crash data
  mutate(DATE = ymd(date)) %>%
  select(-date)

# Join crash and weather data by DATE
df <- inner_join(crash, weather)

## Joining, by = "DATE"
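
Because an inner join silently drops rows that have no match, it can be worth confirming that every crash date has a weather record. The following is a minimal sketch in base R, using small made-up tables (crash_demo and weather_demo are hypothetical stand-ins, not the October 2009 data) and merge() in place of inner_join():

```r
# Hypothetical miniature versions of the crash and weather tables;
# column names follow the document, values are made up.
crash_demo <- data.frame(
  DATE = as.Date(c("2009-10-01", "2009-10-01", "2009-10-02", "2009-10-04")),
  TOTAL_INJ = c(1, 0, 2, 1)
)
weather_demo <- data.frame(
  DATE = as.Date(c("2009-10-01", "2009-10-02", "2009-10-03")),
  precip = c(0.00, 0.12, 0.00)
)

# Crash dates with no weather record; these rows would be lost in an inner join.
# Dates are compared as character strings because setdiff() drops the Date class.
unmatched_dates <- setdiff(as.character(crash_demo$DATE),
                           as.character(weather_demo$DATE))

# merge() with its default all = FALSE keeps only matching rows (an inner join).
joined_demo <- merge(crash_demo, weather_demo, by = "DATE")

# Number of crash rows lost in the join
rows_dropped <- nrow(crash_demo) - nrow(joined_demo)

print(unmatched_dates)
print(rows_dropped)
```

If rows_dropped is greater than zero for the real tables, the daily and incident-level counts would be understated for the affected dates.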

#reshape dataset to summarize by date
df.date <- df %>% group_by(DATE, max.temp, mean.temp, min.temp, precip) %>% 
  summarize(
    CRASHES = n(), 
    PEDS = sum(PED_INJ),
    BIKES = sum(BIKE_INJ),
    MV = sum(MV_INJ),
    ALL = sum(TOTAL_INJ)
    )
#reshape data to use ggplot for graphics
df_long <- df %>% 
  select(ends_with("INJ"), ends_with("temp"), precip, DATE_TIME) %>% 
  gather(key=INJ.TYPE, value=COUNT, ends_with("INJ")) %>% 
  gather(key=WEATHER, value=MEASURE, ends_with("temp"))

Overview

The data provided includes three tables: a weather table describing daily mean, maximum, and minimum temperature and precipitation for October 2009; a crash table for the same month, organized by incident, that includes the date and time of each incident, the number of injuries by mode, and an undefined age variable; and a monthly summary table describing total injuries by mode and total crashes for each month of 2009.

This data was used to perform a preliminary exploration of what predictive value, if any, weather has for crash and injury data when analyzed at the level of the incident, daily totals, and monthly totals. The age variable was excluded from the analysis because only one age was given for each incident, leaving it unclear whose age, or what property of the ages of those involved, the variable describes.

This analysis was performed in R. For clarity, a version that does not show the underlying R code used to manipulate data tables and create graphics is available here: http://rpubs.com/aliceafriedman/crash-weather

Key Findings

With a high degree of uncertainty, rainier days were associated with fewer injuries to passengers and drivers in cars and trucks. Although not statistically significant, a negative relationship was found between daily precipitation and daily total injuries to motor vehicle occupants. This is a counterintuitive finding which, if verified by further analysis using a larger dataset, could perhaps be explained by fewer trips being taken (i.e., lower vehicle volumes) and possibly by extra caution on the part of drivers who perceive dark and wet conditions as dangerous. However, because of the small sample size when aggregated by date and the very few days with significant rainfall in the sample, this finding is highly tentative.

df.date %>%
  ggplot (aes(x=precip, y=MV)) +
  geom_jitter() +
  geom_smooth(method="lm")+
  labs(title="Motor Vehicle Occupant Injuries by Date, Daily Precipitation", subtitle="NYC: October, 2009", x="Precipitation in Inches", y="Number of Injuries per Day")
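
The statement that the relationship is not statistically significant can be checked directly by fitting the same linear model that geom_smooth(method="lm") draws. The sketch below uses simulated daily counts (daily_demo is a hypothetical stand-in for df.date; the values are not taken from the October 2009 tables) and reads the slope, its p-value, and a 95% confidence interval from the fitted model:

```r
# Simulated stand-in for df.date: 31 days, mostly dry, a few wet days.
# These values are made up for illustration only.
set.seed(1)
daily_demo <- data.frame(
  precip = c(rep(0, 24), 0.05, 0.1, 0.2, 0.4, 0.6, 0.9, 1.3)
)
daily_demo$MV <- rpois(nrow(daily_demo), lambda = 120)

# Ordinary least squares fit of daily MVO injuries on daily precipitation
fit <- lm(MV ~ precip, data = daily_demo)

# Coefficient table: estimate, standard error, t value, p-value
coefs <- summary(fit)$coefficients
slope <- coefs["precip", "Estimate"]
p_value <- coefs["precip", "Pr(>|t|)"]

# 95% confidence interval for the precipitation slope
conf_int <- confint(fit, "precip", level = 0.95)

print(coefs)
print(conf_int)
```

With the real data, a confidence interval for the slope that spans zero would correspond to the "not statistically significant" reading given above.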

Ordinary variance in weather may not be predictive of crashes with injuries to pedestrians, cyclists, or overall number of crashes. No relationship at any level was found between weather variables and either pedestrian or cyclist injuries or total number of crashes. Weather for the sample month did not appear to include any extreme weather events. Pedestrian injuries per month were very consistent across each month in 2009, further indicating that weather may not be a significant factor in predicting pedestrian injuries.

boxplot(monthly_summary_raw$PED_INJ,monthly_summary_raw$BIKE_INJ,
main="2009 Monthly Total Pedestrian and Bicycle Injuries",
names = c("Pedestrian", "Bicycle"),
ylab="Number",
border="darkgreen"
)

More data is needed to meaningfully evaluate the relationship between weather, crashes, and injuries. All conclusions reached are subject to significant error due to the sample size and format of the data. As there are 31 days in October, analysis of the data aggregated by date produced only 31 observations. Additionally, the precipitation data may not be specific enough to generate useful insights, as precipitation levels can vary significantly within a day and between neighborhoods across the city. Precipitation data was given as a daily, citywide total, while crashes were reported with a specific date and time but no location information. Further analysis using weather data at finer temporal and geographic resolution is recommended. For example, crash and weather data broken down by hour and borough could provide better insight into whether precipitation is actually predictive of fewer crashes with injuries to motor vehicle occupants.
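
As a rough illustration of the hourly breakdown suggested above, incident timestamps can be grouped by date and hour before joining to hourly weather. This is a minimal base R sketch with made-up incidents (the incidents table is hypothetical, and the time zone is set to UTC only to keep the example reproducible); a borough column, if one were available, would simply be added as another grouping variable:

```r
# Hypothetical incident records; real timestamps would come from DATE_TIME.
incidents <- data.frame(
  DATE_TIME = as.POSIXct(
    c("2009-10-01 08:15:00", "2009-10-01 08:40:00",
      "2009-10-01 17:05:00", "2009-10-02 08:30:00"),
    tz = "UTC"
  ),
  MV_INJ = c(1, 0, 2, 1)
)

# Split each timestamp into a calendar date and an hour of day (0-23)
incidents$DATE <- format(incidents$DATE_TIME, "%Y-%m-%d")
incidents$HOUR <- as.integer(format(incidents$DATE_TIME, "%H"))

# Total MVO injuries per date-hour cell
hourly <- aggregate(MV_INJ ~ DATE + HOUR, data = incidents, FUN = sum)

# Number of crashes per date-hour cell (same grouping, so rows align)
hourly$CRASHES <- aggregate(MV_INJ ~ DATE + HOUR, data = incidents,
                            FUN = length)$MV_INJ

print(hourly)
```

Aggregating this way yields up to 24 observations per day instead of one, provided weather is also available at hourly resolution.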

Policy Implications

Bad weather, e.g. rain, was tentatively found to be associated with fewer crashes resulting in injuries to motor vehicle occupants (MVO injuries), especially in the dark. This finding, if corroborated through additional data analysis as recommended below, aligns with an earlier DOT finding that crashes increase on warm-weather weekend days. Pleasant, dry weather may encourage more people to take optional trips, increasing vehicular traffic, or it may reduce the caution drivers take in what they perceive to be safe driving conditions. This finding, then, could provide additional support for continuing a recent Vision Zero campaign to increase enforcement on warmer days, perhaps with additional emphasis on days predicted to be both warm and dry.

Analysis

Monthly Summary Data

Aggregated by month, the variance of injuries and total number of crashes can mostly be attributed to variance in motor vehicle occupant (MVO) injuries, which are higher in summer. Bike injuries, while a much smaller fraction of the whole, follow a similar pattern. Because weather data is not available within the sample set at the monthly level, the weather-related explanations will be explored for data aggregated by date, in October, in the following section.

Monthly pedestrian injuries, on the other hand, hew quite closely to the mean. Because weather varies significantly by month, this first-pass analysis alone would seem to indicate that weather does not strongly affect the number of pedestrians injured. However, if the number of pedestrians on the streets differs significantly under different weather conditions (a plausible scenario), the same result could indicate a higher risk to individual pedestrians when pedestrian volumes are lower. In other words, if there are many fewer pedestrians out in cold weather, but cold-weather months see the same number of pedestrian injuries, then cold-weather pedestrians are at a higher risk of injury.
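
The exposure argument above can be made concrete with a small calculation. All numbers in this sketch are hypothetical and chosen only to illustrate the arithmetic; the dataset contains no pedestrian volume data:

```r
# Hypothetical figures: identical injury counts, different pedestrian volumes.
ped_demo <- data.frame(
  month = c("January", "July"),
  ped_injuries = c(900, 900),       # same monthly injury count
  ped_trips_millions = c(60, 90)    # assumed pedestrian exposure
)

# Injuries per million pedestrian trips
ped_demo$injuries_per_million_trips <-
  ped_demo$ped_injuries / ped_demo$ped_trips_millions

print(ped_demo)
# January: 900 / 60 = 15 injuries per million trips
# July:    900 / 90 = 10 injuries per million trips
```

Under these assumed volumes, a flat monthly injury count would correspond to a 50% higher per-trip risk in the colder month.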

monthly_summary_raw %>% 
  mutate(`Crashes` = CRASHES, 
         `Ped. Injuries` = PED_INJ, 
         `Bike Injuries` = BIKE_INJ,
         `MVO Injuries` = MV_INJ,
         `Total Injuries` = TOTAL_INJ) %>%
  gather(key=`Count Type`, 
    value=Count, 
    `Crashes`,   
    `Ped. Injuries`, 
    `Bike Injuries`,
    `MVO Injuries`,
    `Total Injuries`) %>%
  mutate(`Count Type`= fct_reorder(`Count Type`, Count, .desc = TRUE))%>%
ggplot(aes(x=as.factor(MONTH), y=Count, fill=`Count Type`)) +
  geom_col(position = "dodge")+
  labs( 
    title="2009 Total Crashes and Injuries, and Injuries by Mode and Month",
     x="Month",
     y="Monthly Total")

Daily Data

The following scatterplots show a linear regression, with uncertainty bands in grey, for each of the injury variables as well as total number of crashes by date plotted against each of the weather variables.

#create exploratory graphics
df.date %>% 
  gather(key=COUNT.TYPE, value=COUNT, BIKES, PEDS, ALL, MV, CRASHES) %>% 
  gather(key=WEATHER, value=MEASURE, ends_with("temp")) %>%
  ggplot(aes(x=MEASURE, y=COUNT, col=COUNT.TYPE))+
  geom_point()+
#adds linear regression line with 95% confidence band
  geom_smooth(method = lm)+
  facet_wrap(~WEATHER)+
  labs(title="Daily Temperature vs. Crashes and Injuries", 
       subtitle="By Mode, Oct. 2009", 
       x="Temperature in Degrees Fahrenheit",
       y="Daily Injury/Crash Count")

df.date %>% 
  gather(key=COUNT.TYPE, value=COUNT, BIKES, PEDS, ALL, MV, CRASHES) %>% 
  ggplot(aes(x=precip, y=COUNT, col=COUNT.TYPE))+
  geom_point()+
#adds linear regression line with 95% confidence band
  geom_smooth(method = lm)+
  labs(title="Daily Precipitation vs. Crashes and Injuries", 
       subtitle="By Mode, Oct. 2009", 
       x="Precipitation in Inches",
       y="Daily Injury/Crash Count")

The data shows, with a high level of uncertainty, a negative relationship between each of the weather variables and total injuries, number of crashes, and total motor vehicle occupant injuries. A very slightly negative relationship was found between these variables and injuries to pedestrians and cyclists. However, because of the very small sample size of the aggregated data, each of these findings should be considered an interesting direction for further analysis rather than a firm conclusion. The relationship between temperature and crashes is particularly suspect because, in the monthly data, warmer months are associated with more rather than fewer crashes with MVO injuries.

No statistical significance was found for any of these relationships. Besides the small sample size, at least two additional aspects of this analysis limit its reliability:

Weather data is given at the level of the date while incidents occur at a specific date and time.
Weather data is given for NYC as a whole, while incidents occur at a single location, which may have been subject to different local weather (e.g., if it rains in Brooklyn, it may still be sunny in the Bronx).
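
One way to test each weather-count pairing for significance is a correlation test per pair, followed by an adjustment for the number of tests run. The sketch below is an illustration only: daily_demo is simulated, and the Holm adjustment is a suggested addition rather than something performed in this analysis.

```r
# Simulated stand-in for df.date (31 days); values are made up.
set.seed(2)
n_days <- 31
daily_demo <- data.frame(
  mean.temp = round(rnorm(n_days, mean = 55, sd = 6)),
  precip = round(rexp(n_days, rate = 5), 2),
  PEDS = rpois(n_days, 30),
  BIKES = rpois(n_days, 10),
  MV = rpois(n_days, 120)
)

# Every combination of one weather variable and one count variable
results <- expand.grid(
  weather = c("mean.temp", "precip"),
  count = c("PEDS", "BIKES", "MV"),
  stringsAsFactors = FALSE
)
results$r <- NA_real_
results$p_value <- NA_real_

# Pearson correlation test for each pairing
for (i in seq_len(nrow(results))) {
  test <- cor.test(daily_demo[[results$weather[i]]],
                   daily_demo[[results$count[i]]])
  results$r[i] <- unname(test$estimate)
  results$p_value[i] <- test$p.value
}

# Holm adjustment guards against false positives from running several tests
results$p_adjusted <- p.adjust(results$p_value, method = "holm")

print(results)
```

With only 31 daily observations, even moderate correlations are unlikely to survive the adjustment, which is consistent with the caution expressed above.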

Incident-Level Data

At the level of the incident, weather variables have no discernible predictive value on the number or mode of injuries per crash.

#scatterplots with regression lines and 95% confidence bands to explore relationships
df_long %>% 
  #mutate(`Count Type`= fct_reorder(`Count Type`, Count, .desc = TRUE)) %>%
  ggplot(mapping = aes(x=MEASURE, y=COUNT, col=INJ.TYPE)) + 
  geom_jitter(alpha=.5) + 
  facet_wrap(~WEATHER) +
  geom_smooth(method=lm)+
  labs(title="Daily Temperature vs. Crashes and Injuries", 
       subtitle="By Incident and Mode, Oct. 2009", 
       x="Temperature in Degrees Fahrenheit",
       y="Number of Injuries Per Crash")

On the one hand, the sample size for this level of analysis is quite large (3689 observations); on the other hand, weather data is given only at the level of the date and for the city as a whole. This means that an incident-level analysis assumes that daily weather has a constant effect, over the course of the day, on how many injuries occur in a given crash and to which mode, and that total precipitation affects incidents in all 5 boroughs equally. The reality is that it could very well rain in one location and not another. This “noise” in the data could be obscuring real relationships.
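
The limitation can be seen by counting how many distinct weather values the joined incident table actually contains. The sketch below uses a made-up five-row table (incident_demo is hypothetical) to show that the weather columns take one value per day, however many incidents that day has:

```r
# Hypothetical incident-level rows after joining daily weather by DATE
incident_demo <- data.frame(
  DATE = as.Date(c("2009-10-01", "2009-10-01", "2009-10-01",
                   "2009-10-02", "2009-10-02")),
  precip = c(0, 0, 0, 0.3, 0.3),
  TOTAL_INJ = c(1, 0, 2, 1, 1)
)

# Number of incident rows versus number of distinct weather observations
n_incidents <- nrow(incident_demo)
n_weather_days <- nrow(unique(incident_demo[, c("DATE", "precip")]))

# Standard deviation of precipitation within each day: zero by construction
within_day_sd <- tapply(incident_demo$precip, incident_demo$DATE, sd)

print(c(incidents = n_incidents, weather_days = n_weather_days))
print(within_day_sd)
```

Applied to the real table, the same check would show 3689 incident rows but at most 31 distinct weather observations, so the weather predictors carry far less information than the row count suggests.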

Recommendations for further study

Increase the date range of the study to include more years, with dates as observations
Consider normalizing against traffic counts (if possible) to determine whether the effect is due directly to weather or to weather-related changes in traffic counts
Evaluate crash and weather data (if available) at the borough level, to minimize the error associated with overgeneralizing data, especially precipitation data. If borough-level weather data is not available, limit locations included in the analysis to within some distance of the weather station data source (e.g. 1 mile from Central Park), and increase the date range to see whether the effect, particularly on precipitation, holds.
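
The traffic-count normalization recommended above could be sketched as follows. The vehicle_count column and all values are hypothetical, since no traffic volume data was part of this analysis:

```r
# Hypothetical daily table: two dry days followed by two wet days.
daily_demo <- data.frame(
  DATE = as.Date("2009-10-01") + 0:3,
  precip = c(0, 0, 0.8, 1.1),
  MV = c(120, 130, 90, 80),                          # raw MVO injuries
  vehicle_count = c(400000, 420000, 300000, 260000)  # assumed traffic volumes
)

# MVO injuries per 100,000 vehicles on the road
daily_demo$MV_per_100k <- daily_demo$MV / daily_demo$vehicle_count * 1e5

print(daily_demo)
```

In this made-up example the raw injury counts fall on the wet days, but the rate per 100,000 vehicles stays near 30 throughout, which is the pattern that would suggest the effect comes from reduced traffic rather than from weather acting directly on crash risk.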