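#Load packages: tidyverse (dplyr, tidyr, ggplot2, stringr, forcats) and lubridate for date parsing
library(tidyverse)
library(lubridate)
#Read data into tables
monthly_summary_raw <- read.csv("~/data/monthly_crash_summary.csv", sep=",", header=TRUE)
oct_crash_raw <- read.csv("~/data/oct_crash_data.csv", sep=",", header=TRUE)
oct_weather_raw <- read.csv("~/data/weather_oct.csv", sep=",", header=TRUE)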
crash <- oct_crash_raw %>%
mutate(
DATE_TIME = ymd_hms(OCCURRENCE_DATETIME),
#Add DATE column to allow for join with weather data
DATE = ymd(str_sub(OCCURRENCE_DATETIME, start=1, end = 10))
) %>%
select(-OCCURRENCE_DATETIME)
weather <- oct_weather_raw %>%
rename(precip = percip) %>%
#Add DATE column to allow for join with crash data
mutate(DATE = ymd(date)) %>%
select(-date)
# Join crash and weather data by DATE
df <- inner_join(crash, weather, by = "DATE")
#reshape dataset to summarize by date
df.date <- df %>% group_by(DATE, max.temp, mean.temp, min.temp, precip) %>%
summarize(
CRASHES = n(),
PEDS = sum(PED_INJ),
BIKES = sum(BIKE_INJ),
MV = sum(MV_INJ),
ALL = sum(TOTAL_INJ)
)
#reshape data to use ggplot for graphics
df_long <- df %>%
select(ends_with("INJ"), ends_with("temp"), precip, DATE_TIME) %>%
gather(key=INJ.TYPE, value=COUNT, ends_with("INJ")) %>%
gather(key=WEATHER, value=MEASURE, ends_with("temp"))
The data provided includes three tables: a weather table giving daily mean, maximum, and minimum temperature and precipitation for October 2009; a crash table for the same month, organized by incident, giving the date and time of each incident, the number of injuries by mode, and an undefined age variable; and a monthly summary table giving total injuries by mode and total crashes for each month of 2009.
This data was used for a preliminary exploration of what, if any, predictive value weather has for crash and injury data when analyzed at the level of the incident, daily totals, and monthly totals. The age variable was excluded from the analysis: only one age is given per incident, so it is unclear whose age, or what summary of the ages of those involved, the variable represents.
This analysis was performed in R. For clarity, a version that hides the underlying R code used to manipulate data tables and create graphics is available here: http://rpubs.com/aliceafriedman/crash-weather
df.date %>%
ggplot (aes(x=precip, y=MV)) +
geom_jitter() +
geom_smooth(method="lm")+
labs(title="Motor Vehicle Occupant Injuries by Date, Daily Precipitation", subtitle="NYC: October, 2009", x="Precipitation in Inches", y="Number of Injuries per Day")
boxplot(monthly_summary_raw$PED_INJ,monthly_summary_raw$BIKE_INJ,
main="2009 Monthly Total Pedestrian and Bicycle Injuries",
names = c("Pedestrian", "Bicycle"),
ylab="Number",
border="darkgreen"
)
Bad weather, e.g. rain, was tentatively found to be associated with fewer crashes resulting in injuries to motor vehicle occupants (MVO injuries), especially in the dark. This finding, if corroborated through additional data analysis as recommended below, aligns with an earlier DOT finding that crashes increase on warm-weather weekend days. Pleasant, dry weather may encourage more people to take optional trips, increasing vehicular traffic on the road, or it may reduce the caution drivers take in what they perceive to be safe driving conditions. This finding, then, could provide additional support for continuing a recent Vision Zero campaign to increase enforcement on warmer days, perhaps with additional emphasis on days predicted to be both warm and dry.
Aggregated by month, the variance of injuries and total number of crashes can mostly be attributed to variance in motor vehicle occupant (MVO) injuries, which are higher in summer. Bike injuries, while a much smaller fraction of the whole, follow a similar pattern.
Monthly pedestrian injuries, on the other hand, hew quite closely to the mean. Because weather varies significantly by month, this first-pass analysis alone would seem to indicate that weather does not strongly affect the number of pedestrians injured. However, if significantly different numbers of pedestrians are on the streets under different weather conditions (a plausible scenario), this could instead indicate that the risk to individual pedestrians is higher when pedestrian volumes are lower. In other words, if there are many fewer pedestrians out in cold weather, but cold-weather months see the same number of pedestrian injuries, then cold-weather pedestrians are at a higher risk of injury.
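The exposure argument above can be made concrete with a small calculation. The sketch below uses invented pedestrian volumes (the data provided contains no pedestrian counts), so the column `ped_volume` and every number in it are assumptions for illustration only. It shows how the same injury count implies a higher per-trip risk when fewer pedestrians are out.

```r
# Hypothetical figures for illustration only; the source data has no
# pedestrian volume counts.
exposure <- data.frame(
  season     = c("warm", "cold"),
  ped_inj    = c(900, 900),          # same number of injuries in both seasons
  ped_volume = c(3000000, 1500000)   # assumed pedestrian trips per month
)

# Injuries per 100,000 pedestrian trips
exposure$rate_per_100k <- exposure$ped_inj / exposure$ped_volume * 1e5
exposure
```

With half as many assumed trips in the cold season, the injury rate per trip doubles even though the monthly injury total is unchanged.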
Because weather data is not available within the sample set at the monthly level, the weather-related explanations will be explored for data aggregated by date, in October, in the following section.
monthly_summary_raw %>%
mutate(`Crashes` = CRASHES,
`Ped. Injuries` = PED_INJ,
`Bike Injuries` = BIKE_INJ,
`MVO Injuries` = MV_INJ,
`Total Injuries` = TOTAL_INJ) %>%
gather(key=`Count Type`,
value=Count,
`Crashes`,
`Ped. Injuries`,
`Bike Injuries`,
`MVO Injuries`,
`Total Injuries`) %>%
mutate(`Count Type`= fct_reorder(`Count Type`, Count, .desc = TRUE))%>%
ggplot(aes(x=as.factor(MONTH), y=Count, fill=`Count Type`)) +
geom_col(position = "dodge")+
labs(
title="2009 Total Crashes and Injuries, and Injuries by Mode and Month",
x="Month",
y="Monthly Total")
The following scatterplots show a linear regression, with uncertainty bands in grey, for each of the injury variables as well as total number of crashes by date plotted against each of the weather variables.
#create exploratory graphics
df.date %>%
gather(key=COUNT.TYPE, value=COUNT, BIKES, PEDS, ALL, MV, CRASHES) %>%
gather(key=WEATHER, value=MEASURE, ends_with("temp")) %>%
ggplot(aes(x=MEASURE, y=COUNT, col=COUNT.TYPE))+
geom_point()+
#adds linear regression line with 95% confidence band
geom_smooth(method = lm)+
facet_wrap(~WEATHER)+
labs(title="Daily Temperature vs. Crashes and Injuries",
subtitle="By Mode, Oct. 2009",
x="Temperature in Degrees Fahrenheit",
y="Daily Injury/Crash Count")
df.date %>%
gather(key=COUNT.TYPE, value=COUNT, BIKES, PEDS, ALL, MV, CRASHES) %>%
ggplot(aes(x=precip, y=COUNT, col=COUNT.TYPE))+
geom_point()+
#adds linear regression line with 95% confidence band
geom_smooth(method = lm)+
labs(title="Daily Precipitation vs. Crashes and Injuries",
subtitle="By Mode, Oct. 2009",
x="Precipitation in Inches",
y="Daily Injury/Crash Count")
The data shows, with a high level of uncertainty, a negative relationship between each of the weather variables and total injuries, number of crashes, and total motor vehicle occupant injuries. A very slightly negative relationship was found between these variables and injuries to pedestrians and cyclists. However, because of the very small sample size of the aggregated data, each of these findings should be considered an interesting direction for further analysis rather than a firm conclusion. The relationship between temperature data and crashes is, in particular, suspect because warmer months are associated with more rather than fewer crashes with MVO injuries.
No statistical significance was found for any of these relationships. Besides the small sample size, at least two additional aspects of this analysis limit its reliability:
Weather data is given at the level of the date while incidents occur at a specific date and time.
Weather data is given for NYC as a whole, while incidents occur at a single location, which may have been subject to different local weather (e.g., if it rains in Brooklyn, it may still be sunny in the Bronx).
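As an illustration of how significance could be checked for one of these relationships, the sketch below fits a simple linear model and extracts the p-value of the slope. It is not the code used for the analysis above: the helper `slope_p_value` is hypothetical, and the `toy` data frame holds invented values standing in for the daily table.

```r
# Returns the p-value of the slope in a simple linear model y ~ x.
# `y` and `x` are column names given as strings.
slope_p_value <- function(data, y, x) {
  fit <- lm(reformulate(x, response = y), data = data)
  coef(summary(fit))[x, "Pr(>|t|)"]
}

# Toy stand-in for the daily table (values are invented, not crash data)
toy <- data.frame(
  precip = c(0, 0, 0.1, 0.5, 0, 1.2, 0, 0.3),
  MV     = c(110, 98, 105, 90, 120, 85, 101, 99)
)

p <- slope_p_value(toy, y = "MV", x = "precip")
p
```

Assuming the daily table built earlier is in scope, the same helper could be called as `slope_p_value(df.date, "MV", "precip")`. With roughly 31 daily observations, a large p-value would be unsurprising even if a real effect exists.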
At the level of the incident, weather variables have no discernible predictive value on the number or mode of injuries per crash.
#scatterplots with regression lines and 95% confidence bands, to explore relationships
df_long %>%
#mutate(`Count Type`= fct_reorder(`Count Type`, Count, .desc = TRUE)) %>%
ggplot(mapping = aes(x=MEASURE, y=COUNT, col=INJ.TYPE)) +
geom_jitter(alpha=.5) +
facet_wrap(~WEATHER) +
geom_smooth(method=lm)+
labs(title="Daily Temperature vs. Crashes and Injuries",
subtitle="By Incident and Mode, Oct. 2009",
x="Temperature in Degrees Fahrenheit",
y="Number of Injuries Per Crash")
On the one hand, the sample size of the data for this level of analysis is quite large (3689 observations); on the other hand, weather data is given only at the level of the date and for the city as a whole. This means that an incident-level analysis assumes that daily weather has a constant effect over the course of the day on how many injuries occur in a given crash, and to which mode, and that total precipitation affects incidents in all 5 boroughs equally. In reality, it could very well rain in one location and not another. This “noise” in the data could be obscuring real relationships.
Increase the date range of the study to include more years, and therefore more dates as observations.
Consider normalizing against traffic counts (if possible) to determine whether the effect is due directly to weather or to weather-related changes in traffic counts.
Evaluate crash and weather data (if available) at the borough level, to minimize the error associated with overgeneralizing data, especially precipitation data. If borough-level weather data is not available, limit locations included in the analysis to within some distance of the weather station data source (e.g. 1 mile from Central Park), and increase the date range to see whether the effect, particularly of precipitation, holds.
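The normalization suggested above might look like the following sketch. No traffic counts are included in the data provided, so the `traffic_count` column and all values below are assumptions chosen to illustrate the idea: if injuries fall on rainy days only because fewer vehicles are on the road, the injury rate per vehicle stays flat.

```r
library(dplyr)

# Hypothetical daily table; `traffic_count` and all values are invented
daily <- data.frame(
  DATE          = as.Date("2009-10-01") + 0:3,
  precip        = c(0, 0.4, 0, 1.1),
  MV            = c(100, 80, 110, 60),
  traffic_count = c(500000, 400000, 550000, 300000)
)

# MVO injuries per 100,000 vehicles counted that day
daily_rate <- daily %>%
  mutate(MV_per_100k = MV / traffic_count * 1e5)

daily_rate
```

In this constructed example the raw injury counts drop on the wetter days, but the normalized rate is identical on every day, which is the pattern one would expect if the weather effect operated entirely through traffic volume.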