options(warn=-1)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
#All collision by date
CollisionByDate <- read.csv(file="109PoliceMotorCollision.csv", header=TRUE, sep=",")
CollisionByDate$DATE <- format(as.Date(CollisionByDate$DATE, "%m/%d/%Y"), "%Y-%m-%d" )
# Some accident records don't specify a cause, so the vehicle type is used instead
#Weather By Date
WeatherByDate <- read.csv("NYCWeather.csv", header=TRUE, sep=",")
WeatherByDate$DATE <- format(as.Date(WeatherByDate$DATE, "%m/%d/%Y"), "%Y-%m-%d" )
#NOAA weather data
#SNOW - Snowfall
#PRCP - Precipitation
Are car accidents predictable from weather conditions?
The 109th Precinct in Flushing, NY (https://goo.gl/maps/9aPSt9yqDp42) has the most car accidents in all of New York City, as I found in one of my previous projects, “NYPD Motor Vehicle Collisions” (http://rpubs.com/nyjon2k/316465/). That raises the question: do weather conditions lead to more car accidents?
Research questions:
How frequent are car accidents near the 109th Precinct police station?
How do car accident frequencies compare between good and bad weather?
Is there any correlation between the number of car accidents and weather conditions?
Can we predict the number of car accidents from given weather conditions?
What are the cases, and how many are there?
Each case represents one day of car accidents together with that day's weather conditions. The data set contains about 3 years of daily observations.
The NYPD Motor Vehicle Collisions data set is maintained by the NYPD (https://catalog.data.gov/dataset/nypd-motor-vehicle-collisions-07420)
National Oceanic and Atmospheric Administration (NOAA) historical data (https://www.climate.gov/maps-data/dataset/past-weather-zip-code-data-table)
This is an observational study.
The 2 data sets are merged by date before analysis.
Both data sets contain daily data.
NYPD Motor Vehicle Collisions data can be queried with a URL/JSON request (https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95/data) or downloaded from Google Cloud (https://cloud.google.com/bigquery/public-data/)
NOAA historical weather data: weather data for the area near the 109th Precinct police station (https://www.climate.gov/maps-data/dataset/past-weather-zip-code-data-table). Data can be retrieved by zip code or county.
- Click the link above, then click “Data Access” and choose “CDO Search”
- Go to the Climate Data Online Search page
- Choose “Daily Summaries”, choose “Date Range”, and choose “Counties” for station
- Enter “Queens, New York” and then click search
- Filter data by weather station “USW00094789” or “JFK INTERNATIONAL AIRPORT, NY US”
- Once on the result page, click “Add to cart” in the Queens County, NY section
- Click View Cart and choose “Custom GHCN-Daily CSV” as the output format
- Confirm the date range once more and click “continue”
- In the custom output format, click “Select All” and “continue”
- Enter an email address and confirm the subscription; the data file will be sent to that address
Data definitions can be found at:
https://catalog.data.gov/harvest/object/c3268bdd-a141-4ab6-808d-e0d1f23b7201
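The date-based merge described above can be sketched on toy data. This is only an illustration: the dates and weather values below are made up, and `to_iso` is a hypothetical helper name; the column names DATE, PRCP and SNOW follow the ones used in this report.

```r
# Toy data only: dates and values are made up for illustration
collisions <- data.frame(
  DATE = c("01/02/2015", "01/02/2015", "01/03/2015"),
  stringsAsFactors = FALSE
)
weather <- data.frame(
  DATE = c("01/02/2015", "01/03/2015", "01/04/2015"),
  PRCP = c(0.00, 0.25, 0.10),
  SNOW = c(0.0, 0.0, 1.2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: normalise "MM/DD/YYYY" strings to ISO "YYYY-MM-DD"
to_iso <- function(x) format(as.Date(x, "%m/%d/%Y"), "%Y-%m-%d")
collisions$DATE <- to_iso(collisions$DATE)
weather$DATE <- to_iso(weather$DATE)

# One row per day with the number of collisions (columns DATE and Freq)
daily <- as.data.frame(table(DATE = collisions$DATE), stringsAsFactors = FALSE)

# Inner join: keeps only dates present in both tables
merged <- merge(daily, weather, by = "DATE")
print(merged)
```

With an inner join like this, days without any recorded accident do not appear in the merged table, which matters when per-day averages are computed later.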
What is the response variable, and what type is it (numerical/categorical)?
The response variable is car accident frequency per day and is numerical.
What is the explanatory variable, and what type is it (numerical/categorical)?
The explanatory variable is the weather condition, measured numerically as rain and snow amounts. It is also converted to a categorical variable for later discussion.
A short description of each accident is found in CONTRIBUTING.FACTOR.VEHICLE.1. If it is “Unspecified” or blank, VEHICLE.TYPE.CODE.1 is used instead. The basic accident data are then summarized by date.
AccidentByDate <- CollisionByDate %>%
  select(DATE, CONTRIBUTING.FACTOR.VEHICLE.1, VEHICLE.TYPE.CODE.1) %>%
  mutate(
    # Fall back to the vehicle type when the contributing factor is "Unspecified" or blank
    Type = ifelse(
      CONTRIBUTING.FACTOR.VEHICLE.1 %in% c("Unspecified", ""),
      paste0("Unspecified-", VEHICLE.TYPE.CODE.1),
      as.character(CONTRIBUTING.FACTOR.VEHICLE.1)
    )
  ) %>%
  group_by(DATE, Type) %>%
  summarize(
    Freq = n()
  )
# Total number of accidents per type across all dates
AccidentFreq <- AccidentByDate %>%
  group_by(Type) %>%
  summarize(
    Total = sum(Freq)
  )
# Refer to columns by name inside aes(); the data frame is already passed to ggplot()
ggplot(AccidentFreq, aes(x = Type, y = Total)) +
  geom_bar(stat = "identity", fill = "lightblue", color = "lightblue") +
  xlab("Types") +
  ylab("Frequency") +
  theme(axis.text.x = element_text(angle = 90, hjust = 0.2, vjust = 0.2))
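The fallback rule used to build the accident Type (use the vehicle type when the contributing factor is “Unspecified” or blank) can be checked on a few toy records. This is a minimal sketch: the three rows are hypothetical, and `if_else` from dplyr is swapped in for the nested `ifelse` because it enforces that both branches have the same type.

```r
library(dplyr)

# Toy records: hypothetical values in the same columns as the collision file
toy <- data.frame(
  CONTRIBUTING.FACTOR.VEHICLE.1 = c("Unspecified", "", "Driver Inattention/Distraction"),
  VEHICLE.TYPE.CODE.1 = c("PASSENGER VEHICLE", "TAXI", "PASSENGER VEHICLE"),
  stringsAsFactors = FALSE
)

toy <- toy %>%
  mutate(
    Type = if_else(
      CONTRIBUTING.FACTOR.VEHICLE.1 %in% c("Unspecified", ""),
      paste0("Unspecified-", VEHICLE.TYPE.CODE.1),
      as.character(CONTRIBUTING.FACTOR.VEHICLE.1)
    )
  )
print(toy$Type)
```

The first two rows fall back to the vehicle type, while the third keeps its stated contributing factor.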
From 2013 to 2017, the most frequent accident type near the 109th Precinct police station is “Unspecified-PASSENGER VEHICLE”.
Passenger vehicles have the highest accident frequency in this area. Does that frequency correlate with weather conditions?
For simplicity, create a “Water” level by adding the precipitation and snowfall amounts.
Then determine whether the accident frequency is correlated with weather conditions.
The accident frequency data for Unspecified-PASSENGER VEHICLE are summarized by date.
#create water column
WeatherByDate$Water <- WeatherByDate$PRCP + WeatherByDate$SNOW
Passenger_Vehicle_AccidentByDate <- AccidentByDate %>%
  filter(Type == "Unspecified-PASSENGER VEHICLE") %>%
  select(DATE, Freq)
pv_Accident_Weather_ByDate <- merge(Passenger_Vehicle_AccidentByDate, WeatherByDate, by = "DATE")
Num_of_Passenger_Vehicle_Accident <- pv_Accident_Weather_ByDate %>%
  summarize(
    Total = sum(Freq)
  )
Num_of_Passenger_Vehicle_Accident_GoodWeather <- pv_Accident_Weather_ByDate %>%
  filter(Water == 0) %>%
  summarize(
    Total = sum(Freq)
  )
Num_of_Passenger_Vehicle_Accident_BadWeather <- pv_Accident_Weather_ByDate %>%
  filter(Water > 0) %>%
  summarize(
    Total = sum(Freq)
  )
The total number of accidents (Unspecified-Passenger Vehicle):
(Num_of_Passenger_Vehicle_Accident)
## Total
## 1 2346
The total number of accidents (Unspecified-Passenger Vehicle) in good weather:
(Num_of_Passenger_Vehicle_Accident_GoodWeather)
## Total
## 1 1548
The total number of accidents (Unspecified-Passenger Vehicle) in bad weather:
(Num_of_Passenger_Vehicle_Accident_BadWeather)
## Total
## 1 798
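The two subtotals add up to the overall total (1548 + 798 = 2346). Totals alone depend on how many good-weather and bad-weather days there are, so a per-day average is a fairer comparison. The sketch below shows that calculation on made-up data; `toy_days` and its values are hypothetical and are not taken from the merged data set.

```r
# Toy data: one row per day, values are made up
toy_days <- data.frame(
  Freq  = c(3, 2, 4, 1, 2, 3),
  Water = c(0, 0, 0, 0, 0.4, 1.1)
)
toy_days$Weather <- ifelse(toy_days$Water > 0, "Bad", "Good")

# Total accidents, number of days, and mean accidents per day in each group
totals  <- tapply(toy_days$Freq, toy_days$Weather, sum)
days    <- tapply(toy_days$Freq, toy_days$Weather, length)
per_day <- totals / days
print(rbind(totals, days, per_day))
```

In this toy example the good-weather total is twice the bad-weather total, yet both groups average 2.5 accidents per day, because there are twice as many good-weather days.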
# keep only bad-weather days (Water > 0)
pv_Accident_Weather_ByDate <- pv_Accident_Weather_ByDate %>% filter(Water >0)
pv_Accident_Weather_noDate <- pv_Accident_Weather_ByDate[,2:dim(pv_Accident_Weather_ByDate)[2]]
pv_cor_accident_Weather <- cor(pv_Accident_Weather_noDate, use = "complete.obs")
(pv_cor_accident_Weather)
## Freq PRCP SNOW Water
## Freq 1.000000000 0.01203250 -0.01317857 -0.004761645
## PRCP 0.012032497 1.00000000 0.02839568 0.546011469
## SNOW -0.013178571 0.02839568 1.00000000 0.852944244
## Water -0.004761645 0.54601147 0.85294424 1.000000000
From the correlation matrix above for vehicle accidents versus weather, accident frequency has no meaningful correlation with rain, snow, or their sum: all three coefficients are within ±0.02 of zero.
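A correlation matrix gives point estimates only. One way to attach a p-value and confidence interval to a single pair is base R's `cor.test`. The sketch below runs it on simulated stand-ins (`freq` and `water` are generated here and are not the real data), so the printed numbers say nothing about the actual accidents.

```r
set.seed(1)
# Simulated stand-ins for daily accident counts and water amounts (not the real data)
freq  <- rpois(100, lambda = 2)
water <- rexp(100, rate = 2)

res <- cor.test(freq, water)  # Pearson correlation by default
print(res$estimate)           # sample correlation coefficient
print(res$p.value)            # p-value for H0: true correlation is 0
print(res$conf.int)           # 95% confidence interval for the correlation
```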
lm(pv_Accident_Weather_noDate$Water~ pv_Accident_Weather_noDate$Freq)
##
## Call:
## lm(formula = pv_Accident_Weather_noDate$Water ~ pv_Accident_Weather_noDate$Freq)
##
## Coefficients:
## (Intercept) pv_Accident_Weather_noDate$Freq
## 0.574624 -0.002981
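The call above has Water on the left of `~`, so it models the water amount as a function of accident frequency. If the aim is to predict accident frequency from the water amount, the roles would be swapped. A minimal sketch on simulated data follows; `sim` is a made-up data frame that only reuses the column names, and passing it through the `data` argument keeps the coefficient names short.

```r
set.seed(42)
# Simulated data with the same column names as the merged table (values are not real)
sim <- data.frame(Water = rexp(200, rate = 2))
sim$Freq <- rpois(200, lambda = 2)

# Response on the left of ~, predictor on the right
fit <- lm(Freq ~ Water, data = sim)
print(coef(fit))

# Predicted accident frequency for a day with Water = 1
print(predict(fit, newdata = data.frame(Water = 1)))
```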
plot(pv_Accident_Weather_noDate$Water,pv_Accident_Weather_noDate$Freq, main="Rain or Snow vs Accident Frequency (Unspecified Passenger Car)",
xlab="Water(Rain or snow) amount", ylab="Accident Frequency on 'bad' weather day" )
lines(lowess(pv_Accident_Weather_noDate$Water , pv_Accident_Weather_noDate$Freq), col="blue") # lowess line (x,y)
From the chart above, the accidents are concentrated at Water (rain or snow amount) values from 0.01 to 2. This suggests that drivers are probably less cautious when the weather starts to get bad but visibility is still good.
When the Water amount is higher, there are fewer days (observations) with accidents, and the frequency is around 1-5. This suggests that drivers are more cautious, or that fewer drivers are on the road, when visibility is getting low. The small number of accident observations could also result from the NYC government enforcing an “emergency vehicles only” order on days with extremely heavy snow.
Based on the correlation matrix and the frequency observations above, linear regression does not yield a useful conclusion.
For bad weather (Water >= 0.01) and accident type “Unspecified-PASSENGER VEHICLE”, the boundary between the two groups of observations is around Water = 2.
We may define the conditions as follows: Water <= 2 is “Visible”; Water > 2 is “Low Visible”.
Are the mean accident frequencies under the 2 conditions equal?
#clone dataset
pv_AccidentV_Weather_noDate <- pv_Accident_Weather_noDate
#create a new column for visibility
pv_AccidentV_Weather_noDate$Visible <- ifelse(pv_Accident_Weather_noDate$Water <= 2, "Visible","Low Visible")
#plot the box chart
boxplot(pv_AccidentV_Weather_noDate$Freq ~ pv_AccidentV_Weather_noDate$Visible, data = pv_AccidentV_Weather_noDate, xlab = "Visible",
ylab = "Freq", main = "Unspecified - Passenger Vehicle")
acct_lm <- lm(pv_AccidentV_Weather_noDate$Freq ~ pv_AccidentV_Weather_noDate$Visible, data = pv_AccidentV_Weather_noDate)
summary(acct_lm)
##
## Call:
## lm(formula = pv_AccidentV_Weather_noDate$Freq ~ pv_AccidentV_Weather_noDate$Visible,
## data = pv_AccidentV_Weather_noDate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3854 -1.3854 -0.3854 0.6146 7.6146
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 2.33333 0.37379 6.242
## pv_AccidentV_Weather_noDate$VisibleVisible 0.05202 0.38608 0.135
## Pr(>|t|)
## (Intercept) 1.31e-09 ***
## pv_AccidentV_Weather_noDate$VisibleVisible 0.893
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.713 on 333 degrees of freedom
## Multiple R-squared: 5.451e-05, Adjusted R-squared: -0.002948
## F-statistic: 0.01815 on 1 and 333 DF, p-value: 0.8929
anova(acct_lm)
## Analysis of Variance Table
##
## Response: pv_AccidentV_Weather_noDate$Freq
## Df Sum Sq Mean Sq F value Pr(>F)
## pv_AccidentV_Weather_noDate$Visible 1 0.05 0.05326 0.0182 0.8929
## Residuals 333 977.04 2.93405
Since the p-value (0.893) is far above 0.05, we cannot reject the null hypothesis that the mean accident frequencies are equal. There is no evidence that the means differ between the 2 conditions.
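A linear model with a single two-level predictor tests the same hypothesis as a two-sample t-test with pooled variance, so the two p-values should agree. The sketch below checks this on simulated counts; `sim`, the group sizes, and the Poisson rates are assumptions chosen for illustration and are not the real data.

```r
set.seed(7)
# Simulated counts for two groups (not the real data)
sim <- data.frame(
  Freq    = c(rpois(30, 2.3), rpois(300, 2.4)),
  Visible = rep(c("Low Visible", "Visible"), times = c(30, 300))
)

# Linear model: the slope compares "Visible" against the baseline "Low Visible"
fit  <- lm(Freq ~ Visible, data = sim)
p_lm <- summary(fit)$coefficients["VisibleVisible", "Pr(>|t|)"]

# Two-sample t-test with pooled (equal) variance
tt  <- t.test(Freq ~ Visible, data = sim, var.equal = TRUE)
p_t <- tt$p.value

print(c(lm = p_lm, t_test = p_t))
```

Dropping `var.equal = TRUE` gives Welch's t-test, which does not assume equal variances and may be preferable when the two groups differ greatly in size.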
We may not be able to predict the number of vehicle accidents from weather conditions such as the amount of rain. But we can tell that more car accidents occur when the weather is bad but visibility is still good. Drivers may be less cautious when the rain or snow level is low. This finding suggests that when the weather forecast shows a chance of light rain or snow, police departments, hospitals, and even insurance and car towing companies may need to arrange their staff for the expected car accidents and their victims according to the weather conditions.