Bike lanes are controversial! Complaints about bike lanes are common in local media, but how much do New York residents actually object?
Using publicly available data on the location and timing of installation of new bike lanes in NYC and messages from the public sent to the NYC Department of Transportation, we will explore whether there is any correlation between public dissatisfaction with bike lanes and the amount of new bike lanes installed in a given calendar year.
This project will pull from three types of data:
Bike lane data will be drawn from published records of bike lane installation available through NYC Open Data. This data is stored as a shapefile (.SHP) and will be scraped from the web.
The message data will be drawn from the DOT’s Commissioner’s Correspondence Unit (CCU), which consists of tagged message metadata of many types (including emails, letters, phone calls, social media, and webforms) sent to the NYC Department of Transportation. The data pulled for this analysis is all tagged as “Bicycle Lanes and Programs,” and will be used as a proxy for public sentiment. This data is stored as JSON and will be retrieved from an API.
Maps pulled from the Google API using ggmap.
The first dataset needed is the bicycle route data available from NYC Open Data. This data set is given as a shapefile–a special file structure for storing projected spatial data, which appears to be multiple files of various types (.sbn, .sbx, .dbf, etc.), but can be read as a single file (the shapefile) by specialized mapping programs and packages.
To acquire the data, first we need to download and unzip the relevant files from the web. We will find and download the correct links using the XML and RCurl packages. Then we can read is as a shapefile using the rgdal package.
Libraries needed include XML, RCurl, and stringr.
library(XML)
library(RCurl)
library(stringr)
We will parse the content of the website by creating and then running a function to get a list of desired filenames from a specified URL matching a given regular expression–in this case we are looking for a .zip file.
getFileNamesList <- function(url, regX) {
https_doc <- getURLContent(url) #necessary to use getURL from RCurl because of https
links <- getHTMLLinks(https_doc) #scrape url for links
filenames <- links[str_detect(links, regX)] #store links as file names, limiting results to those of interest
as.list(filenames) #convert to list in case there is more than one version
}
#set arguments for funciton
url <- "https://data.cityofnewyork.us/Transportation/Bicycle-Routes/7vsa-caz7"
regX <- ".+[.]zip"
#run function
filenames_list <- getFileNamesList(url, regX)
filenames_list
## [[1]]
## [1] "/api/views/7vsa-caz7/files/8206b964-ce0a-49a7-9b68-1eca8f6adbce?download=true&filename=nyc-bike-routes.zip"
In the next step, we will create and run a function to download the files identified in the previous step.
This program will not download files if they already exist in the specified folder. As it turns out, only one file needs to be downloaded, but it’s useful to have the function in case more than one link turns out to be needed.
#create function to download file to specified folder
download <- function(filename, baseurl, folder, reg = ".*") {
if(!dir.exists(folder)){dir.create(folder, recursive = TRUE)} #create folder(s) not there
fileurl <- str_c(baseurl, filename) #URL to download
fileout <- str_extract(filename, reg) #allows for more general use with different paths
if(!file.exists(str_c(folder, "/", fileout))) {
download.file(fileurl,
destfile = str_c(folder, "/", fileout))
Sys.sleep(1)
}
}
#set arguments to function
baseurl <- "https://data.cityofnewyork.us" #note: this requires inspection of the website to determine how the links work
folder <- "data/temp"
reg <- "[[:alnum:]]+[.]zip$" # accounts for path inside filename
#run function
download(filenames_list[[1]], baseurl, folder, reg)
Next, we will check our “data/temp” folder to see what files have been downloaded, and then unzip them into a “bike” subfolder in a data folder.
#list files
filelist <- list.files("data/temp")
filelist
## [1] "routes.zip"
if(!dir.exists("data/bike")){dir.create("data/bike", recursive = TRUE)} #create folder(s) not there
unzip(str_c("data/temp/", "routes.zip"), exdir = "data/bike") #unzip tp data folder
list.files("data/bike") #read data folder to see what's there now
## [1] "nyc_bike_routes_2017_METADATA.pdf" "nyc_bike_routes_2017.cpg"
## [3] "nyc_bike_routes_2017.dbf" "nyc_bike_routes_2017.prj"
## [5] "nyc_bike_routes_2017.sbn" "nyc_bike_routes_2017.sbx"
## [7] "nyc_bike_routes_2017.shp" "nyc_bike_routes_2017.shp.xml"
## [9] "nyc_bike_routes_2017.shx"
rgdalNext, we will load the downloaded files into a single R object using the rgdal:readOGR function.
library(rgdal)
filepath <- "data/bike"
bikeSHP <- readOGR(filepath, "nyc_bike_routes_2017")
## OGR data source with driver: ESRI Shapefile
## Source: "/Users/alice/Documents/GitHub/DATA607_Final/data/bike", layer: "nyc_bike_routes_2017"
## with 14980 features
## It has 15 fields
## Integer64 fields read as strings: OBJECTID_1
summary(bikeSHP)
## Object of class SpatialLinesDataFrame
## Coordinates:
## min max
## x 914209.9 1062052.0
## y 120826.2 271349.1
## Is projected: TRUE
## proj4string :
## [+proj=lcc +lat_1=40.66666666666666 +lat_2=41.03333333333333
## +lat_0=40.16666666666666 +lon_0=-74 +x_0=300000 +y_0=0
## +datum=NAD83 +units=us-ft +no_defs +ellps=GRS80 +towgs84=0,0,0]
## Data attributes:
## OBJECTID_1 SegmentID street
## 1 : 1 0 : 15 GRAND CONCOURSE : 289
## 10 : 1 0017193: 1 QUEENS BLVD : 206
## 100 : 1 0017292: 1 1 AV : 197
## 1000 : 1 0020080: 1 34 AV : 155
## 10000 : 1 0020082: 1 ADAM C POWELL BLVD: 154
## 10001 : 1 (Other):14943 5 AV : 129
## (Other):14974 NA's : 18 (Other) :13850
## boro fromstreet tostreet
## Min. :1.000 E 161 ST : 337 VAN CORTLANDT AVE E: 284
## 1st Qu.:1.000 MYRTLE AV : 127 WOODHAVEN BLVD : 165
## Median :3.000 E 125 ST : 126 DIVISION AV : 119
## Mean :2.606 59 ST : 108 RALPH AV : 117
## 3rd Qu.:4.000 SHORE BLVD: 107 113 ST : 107
## Max. :5.000 (Other) :14172 (Other) :14185
## NA's : 3 NA's : 3
## onoffst allclasses instdate moddate
## OFF: 2512 II :6562 1980/07/01: 331 2009/05/01: 317
## ON :12468 I :4138 1979/05/01: 321 2003/06/01: 269
## III :4002 2009/05/01: 315 2010/06/01: 266
## II,III : 143 2000/01/01: 307 2008/05/01: 262
## III,II : 73 2006/09/01: 292 2006/09/01: 253
## I,III : 35 2003/06/01: 275 2002/06/01: 237
## (Other): 27 (Other) :13139 (Other) :13376
## comments
## Bridge : 224
## Central Park Auto-Free Hours: Closed to Cars : 88
## Boardwalk open 6am-10am: Bicycles allowed at all times except Weekends 10am-6pm May-September: 85
## Bicycling Permitted on West Mall Only : 64
## Bicycles permitted 5-10am only : 55
## (Other) : 572
## NA's :13892
## bikedir lanecount ft_facilit
## 2:7744 Min. :0.000 Standard :3927
## L:3573 1st Qu.:1.000 Greenway :2192
## R:3645 Median :2.000 Sharrows :2029
## X: 18 Mean :1.519 Protected Path :1135
## 3rd Qu.:2.000 Bike-Friendly Parking: 708
## Max. :3.000 (Other) :1416
## NA's :3573
## tf_facilit
## Standard :3903
## Greenway :2127
## Sharrows :2084
## Protected Path :1035
## Bike-Friendly Parking: 717
## (Other) :1469
## NA's :3645
Lastly, we will delete the downloaded files, now that they have been sucessfully loaded into R, to save memory. Those were some very large files!
unlink("data", recursive = TRUE)
This data, in JSON format, will be retrieved from the NYC Open Data API, filtering for bike related messages in the query.
Load necessary libraries, which include jsonlite, dplyr, and httr.
library(httr)
library(jsonlite)
library(dplyr)
After inspecting the data on NYC’s Open Data visualizer I determined that there are ~7,000 records that match the initial query, which is to locate records where the Case Topic is “Bicycle Lanes & Programs”. The Socrata API returns records 1,000 at a time, so multiple calls are necessary to retrieve all records.
One note: Because JSON data does not have “nulls” per se (rather, the records are simply not there if not explicitly entered as “NA”), there are some fields present in each API call that are missing in the other two. For this reason, I have used full_join instead of union to merge the datasets into one.
JSON data from an API is a lot easier to load (2 steps instead of 6) than parsing a website to find a zipped file containing a shapefile!
baseurl <- "https://data.cityofnewyork.us/resource/paw9-9kar.json?casetopic=Bicycle%20Lanes%20%26%20Programs" # includes the query casetopic = Bicycle Lanes, encoded as URL
allSearch <- fromJSON(baseurl) #creates a data frame from JSON retrieved through API
for(n in 1:6){
i <- n*1000
searchCCU <- fromJSON(paste0(baseurl,"&$offset=",i)) #searches for pages in chunks of 1000
allSearch <- full_join(allSearch, searchCCU) #combines data into original data frame
}
glimpse(allSearch)
## Observations: 6,820
## Variables: 42
## $ borough <chr> "Brooklyn", "Queens", "Brooklyn", NA, "...
## $ caseaddressedto <chr> "Customer Service", "BC-Queens", "311",...
## $ casechannel <chr> "Web Form", "Web Form", "Customer Comme...
## $ casenumber <chr> "DOT-391803-G9M7", "DOT-323519-H9K3", "...
## $ casetopic <chr> "Bicycle Lanes & Programs", "Bicycle La...
## $ clienttype <chr> "Citizen", "Citizen", "Citizen", "Citiz...
## $ createddate <chr> "2018-10-02T13:46:20.000", "2017-01-13T...
## $ crossstreet <chr> "Prospect Park", "Yellowstone", NA, NA,...
## $ fromstreet <chr> "8 Avenue", "Eliot", "7th Street", NA, ...
## $ latitude <chr> "40.665749", "40.7325068549", "40.67038...
## $ locationdetail <chr> "Multiple Block - Stretch", "Multiple B...
## $ locationtype <chr> "Street or Sidewalk", "Street or Sidewa...
## $ longitude <chr> "-73.979224", "-73.8682252485", "-73.98...
## $ maplocation <chr> "40.6657490,-73.9792240", "40.732506854...
## $ pressname <chr> "No", "No", "No", "No", "No", "No", "No...
## $ resolutionreason <chr> "Open - New", "Information Provided", "...
## $ status <chr> "Active", "Resolved", "Resolved", "Reso...
## $ streetname <chr> "9 Street", "Queens Blvd", "5th Avenue"...
## $ translationneeded <chr> "No", "No", "No", "No", "No", "No", "No...
## $ acknowledgesent <chr> NA, "2017-01-23T12:34:35.000", "2016-08...
## $ caseissue <chr> NA, "Comment", "Remove or Relocate", NA...
## $ closeddate <chr> NA, "2018-07-08T11:48:32.000", "2017-11...
## $ communitydistrict <chr> NA, "06", "06", NA, "01", NA, "11", NA,...
## $ policeprecinct <chr> NA, "112", "078", NA, "114", NA, "111",...
## $ xcoordinate <chr> NA, "1020819", "988329", NA, "1003102",...
## $ ycoordinate <chr> NA, "206204", "183516", NA, "213248", N...
## $ externaltrackingnumber <chr> NA, NA, "3914.02", NA, NA, NA, NA, NA, ...
## $ seibelnumber <chr> NA, NA, "1-1-1296826294", "1-1-12725743...
## $ mayorsnumber <chr> NA, NA, NA, "854615", NA, NA, NA, NA, N...
## $ buildingnumber <chr> NA, NA, NA, NA, NA, NA, NA, NA, "2233",...
## $ category <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "Ex...
## $ inspector <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "Ke...
## $ clientcompanyname <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ bridgename <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ zipcode <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ otherbridgename <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ highwayname <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ ferrydatetime <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ ferrydirection <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ directionname <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ segmentorexit <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ bridgedirection <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
First, we will create a simple map showing the bicycle route data by facility type, using the object, bikeSHP, which is an R object containing shapefile information.
spplotThe spatial mapping package spplot produces high quality maps with very few lines of code by reading in shapefiles. Graduated color palettes and attractive formatting come standard, for example.
spplot(bikeSHP, z = "ft_facilit")
dplyr and broomThe downside of the spplot package is that shapefiles cannot be easily manipulated in R, making the addition of other data types difficult. For this reason, we will transform the shapefile into a data frame so that it can be read by ggplot and mapped using ggmap. The transformation to a data frame is done using the broom::tidy function.
Note: Code in this section is adapted from Making Maps in R by Claudia A. Engel.
#code adapted from https://cengel.github.io/rspatial/4_Mapping.nb.html
library(broom)
bikes_raw <- tidy(bikeSHP)
bikeSHP$lineID <- sapply(slot(bikeSHP, "lines"), function(x) slot(x, "ID"))
bikes_df <- merge(bikes_raw, bikeSHP, by.x = "id", by.y="lineID")
head(bikes_df)
## id long lat order piece group OBJECTID_1 SegmentID
## 1 0 984139.6 211708.7 1 1 0.1 1 33547
## 2 0 984267.5 211938.7 2 1 0.1 1 33547
## 3 1 987723.2 185455.1 1 1 1.1 2 22630
## 4 1 987577.5 185225.4 2 1 1.1 2 22630
## 5 10 914903.0 121334.9 1 1 10.1 11 59
## 6 10 914808.9 121744.4 2 1 10.1 11 59
## street boro fromstreet tostreet onoffst
## 1 9 AV 1 W 16 ST W 31 ST ON
## 2 9 AV 1 W 16 ST W 31 ST ON
## 3 3 AV 3 DEAN ST 15 ST ON
## 4 3 AV 3 DEAN ST 15 ST ON
## 5 CONFERENCE HOUSE PARK GREENWAY 5 HYLAN BLVD SWINNERTON ST OFF
## 6 CONFERENCE HOUSE PARK GREENWAY 5 HYLAN BLVD SWINNERTON ST OFF
## allclasses instdate moddate comments bikedir lanecount ft_facilit
## 1 I 2008/09/25 2008/09/25 <NA> L 1 <NA>
## 2 I 2008/09/25 2008/09/25 <NA> L 1 <NA>
## 3 II 1980/07/01 1980/07/01 <NA> R 1 Standard
## 4 II 1980/07/01 1980/07/01 <NA> R 1 Standard
## 5 I 2005/06/01 2005/06/01 <NA> 2 2 Greenway
## 6 I 2005/06/01 2005/06/01 <NA> 2 2 Greenway
## tf_facilit
## 1 Protected Path
## 2 Protected Path
## 3 <NA>
## 4 <NA>
## 5 Greenway
## 6 Greenway
Next, we will transform the data using dplyr to obtain the year each segment was installed.
library(lubridate)
bikes_installed_df <- bikes_df %>%
mutate(yearInstalled = lubridate::year(as.Date(instdate))) %>%
glimpse()
## Observations: 49,216
## Variables: 22
## $ id <chr> "0", "0", "1", "1", "10", "10", "100", "100", "1...
## $ long <dbl> 984139.6, 984267.5, 987723.2, 987577.5, 914903.0...
## $ lat <dbl> 211708.7, 211938.7, 185455.1, 185225.4, 121334.9...
## $ order <int> 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 3, 4, 5, 6, ...
## $ piece <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ group <chr> "0.1", "0.1", "1.1", "1.1", "10.1", "10.1", "100...
## $ OBJECTID_1 <fct> 1, 1, 2, 2, 11, 11, 101, 101, 1001, 1001, 10010,...
## $ SegmentID <fct> 33547, 33547, 22630, 22630, 59, 59, 10544, 10544...
## $ street <fct> 9 AV, 9 AV, 3 AV, 3 AV, CONFERENCE HOUSE PARK GR...
## $ boro <dbl> 1, 1, 3, 3, 5, 5, 5, 5, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ fromstreet <fct> W 16 ST, W 16 ST, DEAN ST, DEAN ST, HYLAN BLVD, ...
## $ tostreet <fct> W 31 ST, W 31 ST, 15 ST, 15 ST, SWINNERTON ST, S...
## $ onoffst <fct> ON, ON, ON, ON, OFF, OFF, ON, ON, ON, ON, OFF, O...
## $ allclasses <fct> I, I, II, II, I, I, II, II, III, III, I, I, I, I...
## $ instdate <fct> 2008/09/25, 2008/09/25, 1980/07/01, 1980/07/01, ...
## $ moddate <fct> 2008/09/25, 2008/09/25, 1980/07/01, 1980/07/01, ...
## $ comments <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ bikedir <fct> L, L, R, R, 2, 2, L, L, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ lanecount <dbl> 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ ft_facilit <fct> NA, NA, Standard, Standard, Greenway, Greenway, ...
## $ tf_facilit <fct> Protected Path, Protected Path, NA, NA, Greenway...
## $ yearInstalled <dbl> 2008, 2008, 1980, 1980, 2005, 2005, 2009, 2009, ...
Finally, we can map the data using ggplot2.
#all code in this chunk adapted from https://cengel.github.io/rspatial/4_Mapping.nb.html
library(ggplot2)
#Plot the NYC Bike Network by year each segment was installed
ggplot() +
geom_line( # make a line
data = bikes_installed_df, # data frame
aes(x = long, y = lat, group = group, # group by lines
col = yearInstalled)) + # color by **yearInstalled**
ggtitle("NYC Bike Map, 2017, by Installation Date") + # add title
theme(line = element_blank(), # remove axis lines ..
axis.text=element_blank(), # .. tickmarks..
axis.title=element_blank(), # .. axis labels..
panel.background = element_blank()) + # .. background gridlines
coord_equal() # both axes the same scale
#Plot the NYC Bike Network by segment type
ggplot() +
geom_line( # make a line
data = bikes_installed_df, # data frame
aes(x = long, y = lat, group = group, # coordinates, and group them by lines
col = tf_facilit)) + # color by **tf_facilit**, which is the segment type
ggtitle("NYC Bike Map, 2017, by Segment Type") + # add title
theme(line = element_blank(), # remove axis lines ..
axis.text=element_blank(), # .. tickmarks..
axis.title=element_blank(), # .. axis labels..
panel.background = element_blank()) + # .. background gridlines
coord_equal() # both axes the same scale
How many messages of each type are received in a given year? We will first create a data frame from the JSON query data and transform it to obtain a “year” field using the lubridate::year function. We will also reclassify and rename the latitude and longitude data so that it can be more easily mapped in a later step using ggplot.
CCU_df <- allSearch %>%
mutate(
year = lubridate::year(as.Date(createddate)),
lat = as.numeric(latitude),
lon = as.numeric(longitude)) %>%
glimpse()
## Observations: 6,820
## Variables: 45
## $ borough <chr> "Brooklyn", "Queens", "Brooklyn", NA, "...
## $ caseaddressedto <chr> "Customer Service", "BC-Queens", "311",...
## $ casechannel <chr> "Web Form", "Web Form", "Customer Comme...
## $ casenumber <chr> "DOT-391803-G9M7", "DOT-323519-H9K3", "...
## $ casetopic <chr> "Bicycle Lanes & Programs", "Bicycle La...
## $ clienttype <chr> "Citizen", "Citizen", "Citizen", "Citiz...
## $ createddate <chr> "2018-10-02T13:46:20.000", "2017-01-13T...
## $ crossstreet <chr> "Prospect Park", "Yellowstone", NA, NA,...
## $ fromstreet <chr> "8 Avenue", "Eliot", "7th Street", NA, ...
## $ latitude <chr> "40.665749", "40.7325068549", "40.67038...
## $ locationdetail <chr> "Multiple Block - Stretch", "Multiple B...
## $ locationtype <chr> "Street or Sidewalk", "Street or Sidewa...
## $ longitude <chr> "-73.979224", "-73.8682252485", "-73.98...
## $ maplocation <chr> "40.6657490,-73.9792240", "40.732506854...
## $ pressname <chr> "No", "No", "No", "No", "No", "No", "No...
## $ resolutionreason <chr> "Open - New", "Information Provided", "...
## $ status <chr> "Active", "Resolved", "Resolved", "Reso...
## $ streetname <chr> "9 Street", "Queens Blvd", "5th Avenue"...
## $ translationneeded <chr> "No", "No", "No", "No", "No", "No", "No...
## $ acknowledgesent <chr> NA, "2017-01-23T12:34:35.000", "2016-08...
## $ caseissue <chr> NA, "Comment", "Remove or Relocate", NA...
## $ closeddate <chr> NA, "2018-07-08T11:48:32.000", "2017-11...
## $ communitydistrict <chr> NA, "06", "06", NA, "01", NA, "11", NA,...
## $ policeprecinct <chr> NA, "112", "078", NA, "114", NA, "111",...
## $ xcoordinate <chr> NA, "1020819", "988329", NA, "1003102",...
## $ ycoordinate <chr> NA, "206204", "183516", NA, "213248", N...
## $ externaltrackingnumber <chr> NA, NA, "3914.02", NA, NA, NA, NA, NA, ...
## $ seibelnumber <chr> NA, NA, "1-1-1296826294", "1-1-12725743...
## $ mayorsnumber <chr> NA, NA, NA, "854615", NA, NA, NA, NA, N...
## $ buildingnumber <chr> NA, NA, NA, NA, NA, NA, NA, NA, "2233",...
## $ category <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "Ex...
## $ inspector <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "Ke...
## $ clientcompanyname <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ bridgename <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ zipcode <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ otherbridgename <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ highwayname <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ ferrydatetime <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ ferrydirection <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ directionname <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ segmentorexit <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ bridgedirection <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ year <dbl> 2018, 2017, 2016, 2016, 2016, 2017, 201...
## $ lat <dbl> 40.66575, 40.73251, 40.67039, NA, 40.75...
## $ lon <dbl> -73.97922, -73.86823, -73.98529, NA, -7...
library(kableExtra) #for formatting table
CCU_df %>% select(caseissue) %>%
group_by(caseissue) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| caseissue | count |
|---|---|
| Bicycle Lane | 2678 |
| Safety | 1564 |
| NA | 1180 |
| Bicycle Rack | 383 |
| Concern | 347 |
| Comment | 283 |
| Remove or Relocate | 121 |
| Regulations | 76 |
| Study | 50 |
| Outreach Campaigns | 44 |
| Safety Material/Helmet Related | 40 |
| Bicycle Map | 19 |
| Repair | 17 |
| Commercial Cyclist - Rider Identification | 7 |
| Commercial Cyclist - General | 5 |
| Commercial Cyclist - Bicycle Safety Devices | 2 |
| Commercial Cyclist - Safety Poster | 1 |
| Commercial Cyclist - Sign on Bike | 1 |
| General Information | 1 |
| No Parking | 1 |
This chart will be easier to read if the categories are collapsed into fewer, still meaningful buckets. We can perform this transformation using the forcats package.
library(forcats)
library(tidyr)
#Use fct_collapse to group multiple factors levels into fewer buckets
CCUsCollapse <- CCU_df %>% mutate(caseissueFCT = as.factor(caseissue)) %>%
mutate(caseIssueType = fct_collapse(caseissueFCT,
"Likely Supportive" = c(
"Bicycle Rack",
"Bicycle Map",
"Safety Material/Helmet Related",
"Repair"
),
"Likely Opposed" = c(
"Remove or Relocate",
"Concern",
"Safety",
"No Parking"),
"Neutral/Unknown" = c(
"Commercial Cyclist - Rider Identification",
"Commercial Cyclist - General",
"Commercial Cyclist - Bicycle Safety Devices",
"Commercial Cyclist - Safety Poster",
"Commercial Cyclist - Sign on Bike",
"Study",
"Outreach Campaigns",
"General Information",
"Regulations",
"Comment",
"Bicycle Lane"
)
))
#Use fct_explicit_na to relevel missing values as "Neutral/Unknown"
CCUsCollapse$caseIssueType <- CCUsCollapse$caseIssueType %>% fct_explicit_na(na_level = "Neutral/Unknown")
#show last 5 years of data as table
CCUsCollapse %>% select(caseIssueType, year) %>%
group_by(year, caseIssueType) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
filter(year > 2013) %>%
spread(year, count) %>%
rename(`Message Type` = caseIssueType) %>%
kable(caption = "Commissioner's Correspondence by Likely Sentiment, Year") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Message Type | 2014 | 2015 | 2016 | 2017 | 2018 |
|---|---|---|---|---|---|
| Neutral/Unknown | 377 | 505 | 608 | 723 | 588 |
| Likely Supportive | 51 | 54 | 68 | 117 | 49 |
| Likely Opposed | 47 | 39 | 108 | 15 | NA |
#show last 5 years of data as chart
CCUsCollapse %>% select(caseIssueType, year) %>%
group_by(year, caseIssueType) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
filter(year > 2012) %>%
filter(year < 2018) %>%
rename(`Message Type` = caseIssueType) %>%
ggplot + geom_col(aes(year, count, fill=`Message Type`), position= "dodge") +
ggtitle("Correspondence to DOT's Bicycle Lanes and Program, 2012-20127") +
xlab("Year") + ylab("Count")
What if we look at just requests to remove or relocate a bike lane?
CCU_df %>% filter(caseissue=="Remove or Relocate") %>%
rename(`Case Issue`=caseissue) %>%
ggplot() + geom_bar(aes(year, fill=`Case Issue`)) +
ggtitle("Correspondence to NYC DOT Bicycle Lanes and Programs, 2013-2017")+
theme(plot.title = element_text(size = rel(1.4)),
legend.position = "top") +
xlab("Year") + ylab("Count")
2016 was a high point for requests to remove or relocate bike lanes! What if we map these?
library(googleway)
library(ggmap)
key <- myKey
register_google(key = key)
#note: requires dev version of ggmap
#devtools::install_github("dkahle/ggmap", ref = "tidyup")
#note: I have stored my private Google API key. To get your own, visit: https://developers.google.com/maps/documentation/javascript/get-api-key
myMap <- get_map(location = c(lon = -74.0060, lat = 40.7128), zoom = 11)
CCUdata2016 <- CCU_df %>% ungroup() %>%
rename(`Case Issue` = caseissue) %>%
filter(year==2016) %>%
filter(!is.na(latitude), !is.na(longitude)) %>%
mutate(lon = as.numeric(longitude), lat = as.numeric(latitude))
myMap %>% ggmap() +
geom_point(data = CCUdata2016,
aes(x = lon, y = lat, col=`Case Issue`),
alpha = 0.5,
size = 1) +
ggtitle("NYC DOT Correspondence by Case Issue (2016)") +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
Finally, we will create a map using both sets of data so that we can begin to explore the question of whether there is a correlation between new bike segments (stored in the bikeSHP object) and correspondence to the DOT (stored in the CCU_df object).
To do this, we first have to re-project the spatial data stored in bikeSHP to the coordinate system used by ggmap, whose source is Google maps. (Otherwise the maps will not line up correctly.) This can be done with the spTransform function. Then, we will repeat the previous data transformation steps with broom::tidy to transform the shapefile back into a data frame. This section of code draws heavily from Mapping in R.
# All code in this chunk adapted from https://cengel.github.io/rspatial/4_Mapping.nb.html
bikesGoogleMaps <- spTransform(bikeSHP, CRS("+init=epsg:4326"))
bikes_dfGoogleMaps <- tidy(bikesGoogleMaps)
bikesGoogleMaps$lineID <- sapply(slot(bikesGoogleMaps, "lines"), function(x) slot(x, "ID"))
bikes_dfGoogleMaps <- merge(bikes_dfGoogleMaps, bikesGoogleMaps, by.x = "id", by.y="lineID")
head(bikes_dfGoogleMaps)
## id long lat order piece group OBJECTID_1 SegmentID
## 1 0 -74.00040 40.74777 1 1 0.1 1 33547
## 2 0 -73.99994 40.74840 2 1 0.1 1 33547
## 3 1 -73.98748 40.67571 1 1 1.1 2 22630
## 4 1 -73.98800 40.67508 2 1 1.1 2 22630
## 5 10 -74.24935 40.49944 1 1 10.1 11 59
## 6 10 -74.24969 40.50056 2 1 10.1 11 59
## street boro fromstreet tostreet onoffst
## 1 9 AV 1 W 16 ST W 31 ST ON
## 2 9 AV 1 W 16 ST W 31 ST ON
## 3 3 AV 3 DEAN ST 15 ST ON
## 4 3 AV 3 DEAN ST 15 ST ON
## 5 CONFERENCE HOUSE PARK GREENWAY 5 HYLAN BLVD SWINNERTON ST OFF
## 6 CONFERENCE HOUSE PARK GREENWAY 5 HYLAN BLVD SWINNERTON ST OFF
## allclasses instdate moddate comments bikedir lanecount ft_facilit
## 1 I 2008/09/25 2008/09/25 <NA> L 1 <NA>
## 2 I 2008/09/25 2008/09/25 <NA> L 1 <NA>
## 3 II 1980/07/01 1980/07/01 <NA> R 1 Standard
## 4 II 1980/07/01 1980/07/01 <NA> R 1 Standard
## 5 I 2005/06/01 2005/06/01 <NA> 2 2 Greenway
## 6 I 2005/06/01 2005/06/01 <NA> 2 2 Greenway
## tf_facilit
## 1 Protected Path
## 2 Protected Path
## 3 <NA>
## 4 <NA>
## 5 Greenway
## 6 Greenway
Repeating our earlier plot, we can see how ggplot and spplot handle mapping differently.
register_google(key = key)
#Plot the NYC Bike Network by segment type
bikesYearGoogle <- bikes_dfGoogleMaps %>%
mutate(year = lubridate::year(as.Date(instdate)))
basemap <- get_map("New York", zoom = 11)
ggmap(basemap) +
geom_line(data = bikesYearGoogle,
aes(x = long, y = lat, group = group,
col = year)) +
geom_point(data = CCU_df,
aes(x = lon, y = lat, col=year), size = 1) +
ggtitle("NYC Bike Map, 2017, by Year")
Now that we have explored the data a litle bit, we can begin to tackle our main question: What is the relationship between CCU messages and bike network development? We will perform this analysis using a map, graphs, and a statistical test.
Metadata about each message, or the total number of messages, is sufficient to measure public sentiment. A challenge with the CCU dataset is that messages tagged as “concerns” can be either concerns that are in favor of the expansion of the bike lane network, or messages that are against the installation of bike lanes. Therefore it is necessary to make assumptions about which categories of messages indicate the sender is in favor of bike lanes (requests for bike racks, for example) and which are opposed (requests to remove or relocate bike lanes). Ultimately, after exploring the data, it seems that the best proxy for public sentiment is simply the total number of messages, which are likely to be complaints.
CCUs in response to new infrastructure will occur in the same calendar year as the installation of the bike lane. A second assumption is that message writers will send their message in the same calendar year as the infrastructure they are writing about was built–if, in fact, they are writing in response to new bike lane segments.
Number of segments is a good proxy for amount of construction and overall segment length. A third assumption (based on my experience with working with this dataset in ArcMAP) is that each segment is roughly the same length.
Data is coded correctly and consistently. We cannot see the underlying messages, and so must assume the data was coded correctly and consistently tagged in each category. This analysis assumes that all messages are coded consistently by different reviewers and from one year to the next–in reality both of these assumptions are likely to be incorrect, at least to a point. For example, one reviewer may code a complaint about a new bike lane as “Concern” while a different reviewer in a different year might code a very similar message as “Remove or Relocate.” My experience in working with these messages is that, in fact, messages are very often mis-coded, with concerns about bike share directed to the bicycle lane unit, and vice versa.
Location data is coded related to the message, rather then the customer address. By mapping the data, we also assume that the location data refers to the location of the concern rather than the writer. This is an assumption I have more confidence in being true, at least most of the time.
The map below shows some correlation between the year a segment of bike lane was installed and the requests to remove them; however, the correlation is not great.
register_google(key = key)
#Overall map, showing requests to remove bike lanes and year segments were installed
ggmap(get_map("New York", maptype = "terrain-background", zoom = 11)) +
geom_line(data = bikesYearGoogle, #add bike map
aes(x = long, y = lat, group = group),
alpha = .25) +
geom_line(data = filter(bikesYearGoogle, year > 2011), #add highlight for lines 2012-2017
aes(x = long, y = lat, group = group,
col = as.factor(year))) + #show year installed
geom_point(data = filter(CCU_df, #add CCUs, filtered
caseissue=="Remove or Relocate",
year < 2017),
aes(x = lon, y = lat, col=as.factor(year)),
size = 1,
alpha = 0.5) +
ggtitle("Requests to Remove or Relocate NYC Bike Lanes\n(2011-2016)")
Zooming in on a high request-density area, we can see a cluster of requests to remove or relocate bike lanes in areas and years where no bike lanes were installed in Lower Manhattan and the Brooklyn neighborhoods of Park Slope and Cobble Hill. This seems to indicate that the relationship may not be so strong.
#Zoom to high density request area (inner Brooklyn/Lower Manhattan)
ggmap(get_map(location = c(lon = -73.99, lat = 40.70),
maptype = "terrain-background", zoom = 13)) +
geom_line(data = bikesYearGoogle, #add bike map
aes(x = long, y = lat, group = group),
alpha = .25) +
geom_line(data = filter(bikesYearGoogle, year > 2011), #add highlight for lines 2012-2017
aes(x = long, y = lat, group = group,
col = as.factor(year))) + #show year installed
geom_point(data = filter(CCU_df, #add CCUs, filtered
caseissue=="Remove or Relocate",
year < 2017),
aes(x = lon, y = lat, col=as.factor(year)),
size = 2) +
ggtitle("High Density Requests to Remove or Relocate NYC Bike Lanes\n(2012-2016)")
Because the vast majority of messages are tagged as issue types which cannot be parsed more than “neutral unknown,” we will assume (reasonably, based on my experience) that messages are mostly complaints. In other words, the number of CCUs received in a given year can be a proxy for public sentiment against the expansion of the bike network.
To see if bike network expansion is driving complaints, we can first see if the number of CCU messages received correlates to the number of segments installed when the data sets are joined by “year” and “streets”.
If no correlation is found, we go one step back to see if the total number of CCU messages received in a calendar year has any correlation with the number of new segments of bike lanes installed in the same year.
Does the number of new segments constructed in a given time period correlate with the number of “Bicycle Lanes and Programs” correspondence items received by the NYC Department of Transportation?
A few data cleaning steps will be needed before we can run this analysis:
Because off-street bicycle lane segments are typically not under the jurisdiction of the DOT, we will exclude segments coded as “OFF” in the field onoffst.
Street names are coded differently in the two data sets! CCUs are usually but not always coded in CamelCase with the full street name (e.g. “Queens Boulevard”), which the bike network data is all caps and abbreviations (e.g. “QUEENS BLVD”). Therefore we will need to run some regular expressions and data transformations in order to correctly join the data and standardize the format.
We will need to create a merged version of the two data sets that correctly counts both bike lane segments installed (“Segs”) and items of correspondence received by the DOT (“CCUs”) by year and street.
library(stringr)
#CCUsBikesMerge
CCUsByStreetYear <- CCU_df %>%
select(year, streetname) %>%
rename(street = streetname) %>%
mutate(street = toupper(street)) %>%
mutate(street = str_trim(street)) %>%
mutate(street = str_remove_all(street, "[[:punct:]]")) %>%
#replace all abbreviations with full words
mutate(street = gsub(" ST$", "STREET", street)) %>%
mutate(street = gsub("^ST ", "SAINT", street)) %>%
mutate(street = gsub(" DR$", "DRIVE", street)) %>%
mutate(street = gsub(" BLVD$", "BOULEVARD", street)) %>%
mutate(street = gsub(" AV(E)?$", "AVENUE", street)) %>% #4 Av, e.g.
mutate(street = gsub("AV(E)? ", "AVENUE", street)) %>% #Ave C, e.g.
mutate(street = gsub(" RD$", "ROAD", street)) %>%
mutate(street = gsub("^N ", "NORTH", street)) %>%
mutate(street = gsub("^S ", "SOUTH", street)) %>%
mutate(street = gsub("^E ", "EAST", street)) %>%
mutate(street = gsub("^W ", "WEST", street)) %>%
mutate(street = gsub(" N$", " NORTH", street)) %>%
mutate(street = gsub(" S$ ", " SOUTH", street)) %>%
mutate(street = gsub(" E$", " EAST", street)) %>%
mutate(street = gsub(" W$", " WEST", street)) %>%
#remove all variations of "nth" using look-behind regex
mutate(street = str_remove_all(street, "(?<=[0-9])(?:ST|ND|RD|TH)")) %>%
group_by(year, street) %>%
arrange(street) %>%
summarize(CCUs=n())%>%
glimpse()
## Observations: 2,615
## Variables: 3
## $ year <dbl> 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2003, 2...
## $ street <chr> "6 AVENUE 32 TO 42 STREET", "AVENUE C", "DELANCY STREET...
## $ CCUs <int> 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 3, 5, 3, 2, 1, 2, 1, 1...
SegsByStreetYear <- bikesYearGoogle %>% select(year, street, onoffst) %>%
filter(onoffst=="ON") %>%
select(-onoffst) %>%
mutate(street = toupper(street)) %>%
mutate(street = str_trim(street)) %>%
mutate(street = str_remove_all(street, "[[:punct:]]")) %>%
#replace all abbreviations with full words
mutate(street = gsub(" ST$", " STREET", street)) %>%
mutate(street = gsub("^ST ", "SAINT ", street)) %>%
mutate(street = gsub(" DR$", " DRIVE", street)) %>%
mutate(street = gsub(" BLVD$", " BOULEVARD", street)) %>%
mutate(street = gsub(" AV(E)?$", " AVENUE", street)) %>% #4 Av, e.g.
mutate(street = gsub("^AV(E)? ", "AVENUE ", street)) %>% #Ave C, e.g.
mutate(street = gsub(" RD$", " ROAD", street)) %>%
mutate(street = gsub("^N ", "NORTH ", street)) %>%
mutate(street = gsub("^S ", "SOUTH ", street)) %>%
mutate(street = gsub("^E ", "EAST ", street)) %>%
mutate(street = gsub("^W ", "WEST ", street)) %>%
mutate(street = gsub(" N$", " NORTH", street)) %>%
mutate(street = gsub(" S$ ", " SOUTH", street)) %>%
mutate(street = gsub(" E$", " EAST", street)) %>%
mutate(street = gsub(" W$", " WEST", street)) %>%
#remove all variations of "nth" using look-behind regex
mutate(street = str_remove_all(street, "(?<=[0-9])(?:ST|ND|RD|TH)")) %>%
group_by(year, street) %>%
arrange(street) %>%
summarize(Segs=n()) %>% glimpse()
## Observations: 1,027
## Variables: 3
## $ year <dbl> 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1904, 1909, 1...
## $ street <chr> "ASTORIA PARK GREENWAY", "EAST 175 STREET", "FERRIS STR...
## $ Segs <int> 23, 4, 10, 21, 8, 14, 2, 80, 80, 78, 30, 17, 83, 19, 62...
CCUsLanesMerge <- full_join(CCUsByStreetYear, SegsByStreetYear) %>% arrange(desc(Segs))
head(CCUsLanesMerge)
## # A tibble: 6 x 4
## # Groups: year [4]
## year street CCUs Segs
## <dbl> <chr> <int> <int>
## 1 2009 GRAND CONCOURSE NA 471
## 2 1979 WEST DRIVE NA 278
## 3 1979 EAST DRIVE NA 257
## 4 1997 CROSS BAY BOULEVARD NA 223
## 5 2015 QUEENS BOULEVARD 4 220
## 6 1997 34 AVENUE NA 215
CCUsLanesMerge %>% ungroup() %>%
filter(year > 2011) %>%
filter(year < 2017) %>%
mutate(year = as.factor(year)) %>%
ggplot+geom_jitter(aes(x=Segs, y=CCUs, col=year)) +
scale_x_log10() +
scale_y_log10() +
ggtitle("DOT Correspondence (CCUs) per Street per Year\nby Number of Segments New Bike Lane (2013-2017)")+
xlab("Number of Segments of Bike Lane Installed")
In the above analysis, no correlation is apparent between the numbere of segments installed on a given street and CCUs received regarding that street in the same year.
As a next step, we will re-run the analysis without joining by street–in other words, we will look to see if the total number of new bike lane segments installed correlates to the total number of Bicycle Lanes and Programs CCUs received.
CCUsByYear <- CCU_df %>%
select(year) %>%
filter(year < 2017) %>%
group_by(year) %>%
summarise(CCUs = n()) %>%
glimpse()
## Observations: 15
## Variables: 2
## $ year <dbl> 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 201...
## $ CCUs <int> 9, 49, 55, 139, 254, 264, 377, 405, 406, 417, 393, 703, 4...
LanesByYear <- bikesYearGoogle %>%
select(year) %>%
filter(year < 2017) %>%
group_by(year) %>%
summarise(Lanes = n()) %>%
glimpse()
## Observations: 56
## Variables: 2
## $ year <dbl> 1900, 1904, 1909, 1920, 1923, 1932, 1935, 1936, 1938, 19...
## $ Lanes <int> 1045, 80, 80, 125, 70, 53, 51, 380, 175, 261, 137, 1147,...
MergeByYear <- full_join(CCUsByYear, LanesByYear) %>% arrange(desc(year)) %>% filter(year < 2017)
glimpse(MergeByYear)
## Observations: 56
## Variables: 3
## $ year <dbl> 2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 20...
## $ CCUs <int> 784, 598, 475, 703, 393, 417, 406, 405, 377, 264, 254, 1...
## $ Lanes <int> 2269, 2168, 1668, 3484, 1118, 1068, 1548, 2806, 3824, 56...
MergeByYear %>%
filter(year > 2000) %>%
mutate(year = as.factor(year)) %>%
ggplot + geom_jitter(aes(x=Lanes, y=CCUs, col=year, size = CCUs)) +
geom_smooth(aes(x=Lanes, y=CCUs), method='lm') +
ggtitle("Correlation of Number of CCUs received by Bicycle Lane Segments\n(2000-2016)")
Now it looks as though there may be a trend! We can confirm with a statistical analysis.
model <- lm(CCUs ~ Lanes, MergeByYear)
summary(model)
##
## Call:
## lm(formula = CCUs ~ Lanes, data = MergeByYear)
##
## Residuals:
## Min 1Q Median 3Q Max
## -315.81 -188.40 24.43 125.07 425.70
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 264.20259 115.84309 2.281 0.0401 *
## Lanes 0.04147 0.04505 0.921 0.3740
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 234 on 13 degrees of freedom
## (41 observations deleted due to missingness)
## Multiple R-squared: 0.0612, Adjusted R-squared: -0.01101
## F-statistic: 0.8475 on 1 and 13 DF, p-value: 0.374
As we can see from the model, while there is a slight trend in the graph, it has no statistical significance.
Commissioner’s Correspondence items (CCUs) received by the Department of Transportation and coded as “Bicycle Lanes & Programs” do not seem to vary significantly by the number of segments of new bicycle lanes installed. This is an interesting finding as the number of segments installed by year have varied quite a bit!
Some data was unavailable publicly. My initial idea was to perform a sentiment analysis on the content of the messages themselves. This was based on a similar analysis I had performed on NYC Parks Department data some years ago; however, it turns out that this level of information is not available through open data.
JSON data is much easier to work with than CSVs.Initially, I had tried to load a static CSV containing all the 311 data from NYC. This was a large (10+MB) file, and it was very challenging to load. Once loaded, it was very challenging to parse, as it was a comma-delimited CSV whose values sometimes also contained commas. Ultimately, I was only able to load and correctly parse the data by using the JSON-based API. As it turned out, this dataset didn’t have the right kinds of information anyway! Luckily the same code was applicable to the correct dataset (the CCU information), and having learned the significant advantages of reading the JSON data (namely, correct assignment of values to fields, reduced memory use, and much faster load times), loading this dataset into an R data frame was a snap.
R packages for spatial analysis are great! Using the various mapping packages available in R turned out to be surprisingly straight-forward, and runs significantly faster than software used for spatial analysis in the world of urban planning. I would definitely use R for spatial analysis again!
Data source for Commissioner’s Correspondence on Bike Lanes: Open Data
Data source Commissioner’s Correspondence on Bike Lanes (JSON data, API)
ggmap CheatSheet: https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/ggmap/ggmapCheatsheet.pdf
Making Maps in R by Claudia A. Engel