For my final project I’d like to use data on pedestrian and motor vehicle accidents published by the City of New York on the NYC OpenData website. As a resident of Queens, NY, it concerned me that Queens Boulevard was at one point labeled the “Boulevard of Death”, being one of the widest, most dangerous, and most challenging roadways for pedestrians to cross. I have witnessed numerous improvements made along this roadway over the years, so I’d like to apply what I’ve learned in Data Science so far (data acquisition, transformation, visualization, and statistics) to determine whether there was a return on investment (ROI), that is, a significant drop in pedestrian-related accidents.
Below is an outline of the data sources I plan to use to extract accident-related data, draw statistics, and reach my conclusions:
library(RSocrata)
library(tidyverse)
library(httr)
library(magrittr)
library(lubridate)
library(kableExtra)
# Local cache location for the extracted data
work.dir <- "~/R/Final Project/"
work.file <- "qb_data.csv"
# Only hit the API if a cached copy does not already exist
isFirstRun <- !file.exists(file.path(work.dir, work.file))
The NYPD Motor Vehicle Collisions site offers the complete collision dataset, covering July 1st, 2012 to the present, as a downloadable CSV file (its columns are described in the accompanying data dictionary, NYPD_Collision_DataDictionary.xlsx). The file is currently over 270MB in size and can therefore be problematic to work with, so I decided to look for other ways to get the data I needed. Since I only needed to focus on the data for “Queens Boulevard”, I looked into the API offered by the site to see whether it had a filtering mechanism that would let me extract and download just the subset of the data I was after.
NOTE: The API requires an API token. Use “Knit with Parameters…” to supply a value for the api_token parameter, or set the api_key option directly, e.g. options(api_key = "xyz123").
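A minimal sketch of the second approach, assuming the token is stored in an environment variable (SOCRATA_APP_TOKEN is a placeholder name of my choosing):

# Read the app token from an environment variable and expose it as an R option;
# read.socrata.by.starts_with() defined below picks it up via getOption("api_key")
options(api_key = Sys.getenv("SOCRATA_APP_TOKEN"))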
Declare a function, read.socrata.by.starts_with, that queries the API given a field name and a text value the field should start with. This function is useful because the street name, “QUEENS BOULEVARD”, can be found in the following three different fields of the dataset:
read.socrata.by.starts_with <- function(colname, starts_with_value) {
  base.url <- "https://data.cityofnewyork.us/resource/qiz3-axqb.json"
  # Build the SoQL $where clause with the starts_with() filter and URL-encode it
  param <- str_c("?$where=starts_with(", colname, ", '", starts_with_value, "')") %>% URLencode()
  qry.url <- str_c(base.url, param)
  return(read.socrata(qry.url, app_token = getOption("api_key")))
}
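For reference, a call such as read.socrata.by.starts_with("on_street_name", "QUEENS BOULEVARD") should produce a SoQL query URL roughly like the one below (URLencode() percent-encodes the spaces):

# https://data.cityofnewyork.us/resource/qiz3-axqb.json?$where=starts_with(on_street_name,%20'QUEENS%20BOULEVARD')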
Make the calls to the API and combine the results into one data frame.
if (isFirstRun) {
  df.on <- read.socrata.by.starts_with("on_street_name", "QUEENS BOULEVARD")
  df.off <- read.socrata.by.starts_with("off_street_name", "QUEENS BOULEVARD")
  df.cross <- read.socrata.by.starts_with("cross_street_name", "QUEENS BOULEVARD")
  df.all <- bind_rows(df.on, df.off, df.cross)
  glimpse(df.all)
}
## Observations: 10,514
## Variables: 30
## $ borough <chr> "QUEENS", "QUEENS", "QUEENS", "Q...
## $ contributing_factor_vehicle_1 <chr> "Backing Unsafely", "Unspecified...
## $ contributing_factor_vehicle_2 <chr> "Unspecified", "Unspecified", "U...
## $ date <dttm> 2012-10-21, 2014-01-28, 2012-10...
## $ latitude <chr> "40.7435926", "40.7271816", "40....
## $ location.type <chr> "Point", "Point", "Point", "Poin...
## $ location.coordinates <list> [<-73.92241, 40.74359>, <-73.85...
## $ longitude <chr> "-73.9224105", "-73.8543445", "-...
## $ number_of_cyclist_injured <chr> "0", "0", "0", "0", "0", "0", "0...
## $ number_of_cyclist_killed <chr> "0", "0", "0", "0", "0", "0", "0...
## $ number_of_motorist_injured <chr> "0", "1", "0", "0", "0", "0", "0...
## $ number_of_motorist_killed <chr> "0", "0", "0", "0", "0", "0", "0...
## $ number_of_pedestrians_injured <chr> "0", "0", "0", "0", "0", "0", "0...
## $ number_of_pedestrians_killed <chr> "0", "0", "0", "0", "0", "0", "0...
## $ number_of_persons_injured <chr> "0", "1", "0", "0", "0", "0", "0...
## $ number_of_persons_killed <chr> "0", "0", "0", "0", "0", "0", "0...
## $ off_street_name <chr> "42 STREET ...
## $ on_street_name <chr> "QUEENS BOULEVARD ...
## $ time <chr> "9:00", "10:15", "2:40", "9:00",...
## $ unique_key <chr> "239896", "267817", "239878", "2...
## $ vehicle_type_code1 <chr> "UNKNOWN", "SPORT UTILITY / STAT...
## $ vehicle_type_code2 <chr> "UNKNOWN", "PASSENGER VEHICLE", ...
## $ zip_code <chr> "11104", "11374", "11104", "1137...
## $ contributing_factor_vehicle_3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ vehicle_type_code_3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ contributing_factor_vehicle_4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ contributing_factor_vehicle_5 <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ vehicle_type_code_4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ vehicle_type_code_5 <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cross_street_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
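One caveat before moving on: because the three queries run independently, the same collision could in principle be returned by more than one of them. A defensive de-duplication on unique_key (shown here only as a sketch; it was not applied to the results above) would be:

# Hypothetical safeguard: keep a single row per collision id
df.all <- distinct(df.all, unique_key, .keep_all = TRUE)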
NOTE: The glimpse into the attributes of the data frame shows that the “location.coordinates” column is a <list> of coordinate values. It will need to be converted to a simpler representation, such as a string, in order to save the data frame to a CSV file for subsequent processing.
if (isFirstRun) {
  # Convert the location.coordinates list into a comma-separated string,
  # mapping empty values to NA
  df.all$location.coordinates %<>% lapply(str_c, collapse = ",") %>%
    lapply(function(x) if (identical(x, character(0))) NA_character_ else x) %>% as.character()
  write.csv(df.all, file = file.path(work.dir, work.file))
}
Load the data back from the CSV file prior to analysis:
df.all <- read.csv(file = file.path(work.dir, work.file), stringsAsFactors = FALSE)
glimpse(df.all)
## Observations: 10,514
## Variables: 31
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1...
## $ borough <chr> "QUEENS", "QUEENS", "QUEENS", "Q...
## $ contributing_factor_vehicle_1 <chr> "Backing Unsafely", "Unspecified...
## $ contributing_factor_vehicle_2 <chr> "Unspecified", "Unspecified", "U...
## $ date <chr> "2012-10-21", "2014-01-28", "201...
## $ latitude <dbl> 40.74359, 40.72718, 40.74359, 40...
## $ location.type <chr> "Point", "Point", "Point", "Poin...
## $ location.coordinates <chr> "-73.9224105,40.7435926", "-73.8...
## $ longitude <dbl> -73.92241, -73.85434, -73.92241,...
## $ number_of_cyclist_injured <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ number_of_cyclist_killed <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ number_of_motorist_injured <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ number_of_motorist_killed <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ number_of_pedestrians_injured <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ number_of_pedestrians_killed <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ number_of_persons_injured <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ number_of_persons_killed <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ off_street_name <chr> "42 STREET ...
## $ on_street_name <chr> "QUEENS BOULEVARD ...
## $ time <chr> "9:00", "10:15", "2:40", "9:00",...
## $ unique_key <int> 239896, 267817, 239878, 264778, ...
## $ vehicle_type_code1 <chr> "UNKNOWN", "SPORT UTILITY / STAT...
## $ vehicle_type_code2 <chr> "UNKNOWN", "PASSENGER VEHICLE", ...
## $ zip_code <int> 11104, 11374, 11104, 11375, 1143...
## $ contributing_factor_vehicle_3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ vehicle_type_code_3 <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ contributing_factor_vehicle_4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ contributing_factor_vehicle_5 <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ vehicle_type_code_4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ vehicle_type_code_5 <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ cross_street_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
Filter the data down to accidents in which pedestrians were either injured or killed.
df <- df.all %>% filter(number_of_pedestrians_injured > 0 | number_of_pedestrians_killed > 0)
Convert the date column to the Date type so it can be processed with the lubridate package.
#Convert "date" column from <chr> to <Date> type
df$date %<>% as.Date()
glimpse(df$date)
## Date[1:489], format: "2012-10-13" "2012-10-04" "2012-09-24" "2012-09-20" "2012-09-10" ...
Here, I group the data by year to see the trend in the total number of pedestrian-related accidents. The data for 2012 and 2018 can be ignored, since we don’t have complete data for those years.
df.y <- df %>% group_by(year = year(date)) %>%
summarise(ped_killed = sum(number_of_pedestrians_killed),
ped_accidents = sum(number_of_pedestrians_injured + number_of_pedestrians_killed))
#For every year, find percentage difference from the year before
df.y %<>% mutate(pct_diff = round(((ped_accidents - lag(ped_accidents)) / lag(ped_accidents)) * 100))
kable_styling(knitr::kable(df.y, "html", caption = "Accidents by Year"), bootstrap_options = "striped")
| year | ped_killed | ped_accidents | pct_diff |
|---|---|---|---|
| 2012 | 0 | 52 | NA |
| 2013 | 6 | 116 | 123 |
| 2014 | 2 | 100 | -14 |
| 2015 | 0 | 83 | -17 |
| 2016 | 0 | 68 | -18 |
| 2017 | 0 | 70 | 3 |
| 2018 | 0 | 31 | -56 |
ggplot(df.y, mapping = aes(x = year, y = ped_accidents)) +
geom_col() +
geom_text(aes(label = ped_accidents, y = ped_accidents), vjust = -0.5) +
labs(title = "Accidents by Year", x = "Years", y = "Total Pedestrian Accidents")
The data shows that there was indeed a significant reduction (nearly by half) in pedestrian-related accidents. There were no fatalities after 2014, and the year-over-year drop in total pedestrian accidents was in the double digits for three consecutive years (2014–2016). This shows that the measures taken by the city to improve safety on Queens Boulevard paid off, especially considering the growing population of residents and drivers on the road.
After answering my main question, I decided to do some additional exploration and see what else the data can reveal if I zoom in on the data a bit more.
I start off with some date-based analysis, grouping the data by quarter using the lubridate::floor_date(date, "quarter") function.
df.q <- df %>% group_by(year = year(date), quarter = floor_date(date, "quarter")) %>%
summarise(ped_accidents = sum(number_of_pedestrians_injured + number_of_pedestrians_killed))
df.q
## # A tibble: 24 x 3
## # Groups: year [?]
## year quarter ped_accidents
## <dbl> <date> <int>
## 1 2012. 2012-07-01 24
## 2 2012. 2012-10-01 28
## 3 2013. 2013-01-01 31
## 4 2013. 2013-04-01 23
## 5 2013. 2013-07-01 29
## 6 2013. 2013-10-01 33
## 7 2014. 2014-01-01 32
## 8 2014. 2014-04-01 18
## 9 2014. 2014-07-01 16
## 10 2014. 2014-10-01 34
## # ... with 14 more rows
ggplot(df.q, mapping = aes(x = quarter, y = ped_accidents)) +
geom_line() +
geom_text(aes(label = ped_accidents, y = ped_accidents), vjust = -0.5) +
labs(title = "Accidents by Quarter", x = "Quarters", y = "Total Pedestrian Accidents")
The data seems to show a somewhat downward trend, but overall it doesn’t appear to reveal much.
The next step is to summarize and plot the data by month (again using the floor_date() function).
df.m <- df %>% group_by(month = floor_date(date, "month")) %>%
summarise(ped_accidents = sum(number_of_pedestrians_injured + number_of_pedestrians_killed))
df.m
## # A tibble: 71 x 2
## month ped_accidents
## <date> <int>
## 1 2012-07-01 9
## 2 2012-08-01 7
## 3 2012-09-01 8
## 4 2012-10-01 8
## 5 2012-11-01 8
## 6 2012-12-01 12
## 7 2013-01-01 12
## 8 2013-02-01 15
## 9 2013-03-01 4
## 10 2013-04-01 5
## # ... with 61 more rows
ggplot(df.m, mapping = aes(x = month, y = ped_accidents)) +
geom_line() +
#geom_text(aes(label = ped_accidents, y = ped_accidents), vjust = -0.5) +
labs(title = "Accidents by Month", x = "Months", y = "Total Pedestrian Accidents")
Now the data reveals a slightly clearer pattern: the number of accidents spikes around the end and the beginning of each year. This led me to do one more date-based analysis by grouping the dates into seasons.
I wrote a function called season which takes a date (or a vector of dates) and translates it into one of the four seasons (WINTER, SPRING, SUMMER, or AUTUMN):
season <- function(dt) {
  unlist(
    lapply(dt, function(d) {
      # Map the month of each date to its season
      if (month(d) %in% c(12, 1, 2)) {"WINTER"}
      else if (month(d) %in% c(3, 4, 5)) {"SPRING"}
      else if (month(d) %in% c(6, 7, 8)) {"SUMMER"}
      else if (month(d) %in% c(9, 10, 11)) {"AUTUMN"}
    })
  )
}
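A quick sanity check of the helper on a few illustrative dates:

season(as.Date(c("2013-01-15", "2013-04-10", "2013-07-04", "2013-10-31")))
# expected: "WINTER" "SPRING" "SUMMER" "AUTUMN"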
Here’s what the data looks like when grouped by season:
df.s <- df %>% group_by(season = season(date)) %>%
summarise(ped_accidents = sum(number_of_pedestrians_injured + number_of_pedestrians_killed))
df.s
## # A tibble: 4 x 2
## season ped_accidents
## <chr> <int>
## 1 AUTUMN 166
## 2 SPRING 99
## 3 SUMMER 93
## 4 WINTER 162
ggplot(df.s, mapping = aes(x = season, y = ped_accidents)) +
geom_col() +
geom_text(aes(label = ped_accidents, y = ped_accidents), vjust = -0.5) +
labs(title = "Accidents by Season", x = "Seasons", y = "Total Pedestrian Accidents")
This now strongly suggests that “bad” weather such as rain or snow is a contributing factor to the accidents.
Next I decided to do a time-based analysis and group the data by time-of-day period. I broke the 24 hours of the day into the following 4 categories:
1. AM RUSH - Morning Rush Hours (6:00AM - 9:59AM)
2. MIDDAY - (10:00AM - 4:59PM)
3. PM RUSH - Evening Rush Hours (5:00PM - 8:59PM)
4. NIGHT - (9:00PM - 5:59AM)
I wrote a function called hour_of_day which takes an integer between 0 and 23 (or a vector of such values) and translates it into one of the 4 categories above. A companion function, time_of_day, takes a date-time value, extracts the hour from it, and calls hour_of_day to do the same translation.
hour_of_day <- function(hr) {
  unlist(
    lapply(hr, function(h) {
      if (between(h, 6, 9)) {"AM RUSH"}
      else if (between(h, 10, 16)) {"MIDDAY"}
      else if (between(h, 17, 20)) {"PM RUSH"}
      else if (between(h, 21, 23) || between(h, 0, 5)) {"NIGHT"}
    })
  )
}

time_of_day <- function(dt) {
  # Extract the hour component (0-23) and delegate to hour_of_day()
  hour_of_day(hour(dt))
}
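A quick sanity check of the two helpers on a few illustrative hours:

hour_of_day(c(7, 12, 18, 23))
# expected: "AM RUSH" "MIDDAY" "PM RUSH" "NIGHT"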
The data contained separate columns for date and time values. I decided to combine them into one column using the lubridate::ymd_hm() function inside the mutate() function call.
Then, using the time-of-day classification on the new date-time column, here’s what the data looks like:
df.h <- df %>% mutate(date_time = ymd_hm(str_c(date, time, sep = " "))) %>%
group_by(time_of_day = time_of_day(date_time)) %>%
summarise(ped_accidents = sum(number_of_pedestrians_injured + number_of_pedestrians_killed))
df.h
## # A tibble: 4 x 2
## time_of_day ped_accidents
## <chr> <int>
## 1 AM RUSH 83
## 2 MIDDAY 211
## 3 NIGHT 68
## 4 PM RUSH 158
ggplot(df.h, mapping = aes(x = time_of_day, y = ped_accidents)) +
geom_col() +
geom_text(aes(label = ped_accidents, y = ped_accidents), vjust = -0.5) +
labs(title = "Accidents by Time of Day", x = "Day Periods", y = "Total Pedestrian Accidents")
The results appeared somewhat counterintuitive to me. First, I wouldn’t have expected the morning rush hour to have nearly half as many accidents as the evening rush hour; I would have expected the two periods to be more evenly matched, if anything with the morning having more accidents rather than fewer. Even more surprising was that most accidents fall into the midday category. This led me to hypothesize that perhaps most accidents involved the elderly and/or students. Unfortunately, the data does not include the ages of the pedestrians, which would help confirm the hypothesis.
For my final analysis of the data, I wanted to see which factors contributed most to the pedestrian accidents. The dataset contains 5 “contributing_factor_vehicle” columns; however, the first one, “contributing_factor_vehicle_1”, had the most information, while the others were mostly blank or unspecified. Here’s a plot of the data sorted by the most influential (top 14) factors.
df %>% group_by(contributing_factor_vehicle_1) %>%
summarize(cnt = sum(number_of_pedestrians_injured + number_of_pedestrians_killed)) %>%
top_n(14, cnt) %>%
ggplot(mapping = aes(x = reorder(contributing_factor_vehicle_1, cnt), y = cnt)) +
geom_col() +
coord_flip() +
geom_text(aes(label = cnt, y = cnt + 12), vjust = 0.25) +
labs(title = "Most Contributing Factors", x = "", y = "Total Pedestrian Accidents")
The chart shows that many of the factors were unspecified. However, the two most frequent causes of the accidents were the driver’s “Failure to Yield” and “Inattention or Distraction”.
New skills and experiences