Introduction

Project purpose and motivation

Bike lanes are controversial! Complaints about bike lanes are common in local media, but how much do New York residents actually object?

Using publicly available data on the location and timing of installation of new bike lanes in NYC and messages from the public sent to the NYC Department of Transportation, we will explore whether there is any correlation between public dissatisfaction with bike lanes and the amount of new bike lanes installed in a given calendar year.

This project will pull from three types of data:

Bike lane data will be drawn from published records of bike lane installation available through NYC Open Data. This data is stored as a shapefile (.SHP) and will be scraped from the web.
The message data will be drawn from the DOT’s Commissioner’s Correspondence Unit (CCU), which consists of tagged message metadata of many types (including emails, letters, phone calls, social media, and webforms) sent to the NYC Department of Transportation. The data pulled for this analysis is all tagged as “Bicycle Lanes and Programs,” and will be used as a proxy for public sentiment. This data is stored as JSON and will be retrieved from an API.
Maps pulled from the Google API using ggmap.

Part 1: Data Acquisition

Dataset 1: Bicycle Route Data

The first dataset needed is the bicycle route data available from NYC Open Data. This data set is given as a shapefile–a special file structure for storing projected spatial data, which appears to be multiple files of various types (.sbn, .sbx, .dbf, etc.), but can be read as a single file (the shapefile) by specialized mapping programs and packages.

To acquire the data, first we need to download and unzip the relevant files from the web. We will find and download the correct links using the XML and RCurl packages. Then we can read is as a shapefile using the rgdal package.

Step 1: Load needed libraries to scrape the website for links to download.

Libraries needed include XML, RCurl, and stringr.

library(XML)
library(RCurl)
library(stringr)

Step 2: Parse website to find links

We will parse the content of the website by creating and then running a function to get a list of desired filenames from a specified URL matching a given regular expression–in this case we are looking for a .zip file.

getFileNamesList <- function(url, regX) {
  https_doc <- getURLContent(url) #necessary to use getURL from RCurl because of https
  links <- getHTMLLinks(https_doc) #scrape url for links
  filenames <- links[str_detect(links, regX)] #store links as file names, limiting results to those of interest
  as.list(filenames) #convert to list in case there is more than one version 
}

#set arguments for funciton
url <- "https://data.cityofnewyork.us/Transportation/Bicycle-Routes/7vsa-caz7"
regX <- ".+[.]zip"

#run function
filenames_list <- getFileNamesList(url, regX)
filenames_list

## [[1]]
## [1] "/api/views/7vsa-caz7/files/8206b964-ce0a-49a7-9b68-1eca8f6adbce?download=true&filename=nyc-bike-routes.zip"

Step 3: Create and run function to download relevant files

In the next step, we will create and run a function to download the files identified in the previous step.

This program will not download files if they already exist in the specified folder. As it turns out, only one file needs to be downloaded, but it’s useful to have the function in case more than one link turns out to be needed.

#create function to download file to specified folder
download <- function(filename, baseurl, folder, reg = ".*") {
  if(!dir.exists(folder)){dir.create(folder, recursive = TRUE)} #create folder(s) not there
  fileurl <- str_c(baseurl, filename) #URL to download
  fileout <- str_extract(filename, reg) #allows for more general use with different paths
  if(!file.exists(str_c(folder, "/", fileout))) {
    download.file(fileurl, 
                  destfile = str_c(folder, "/", fileout))
    Sys.sleep(1)
  }
}

#set arguments to function
baseurl <- "https://data.cityofnewyork.us" #note: this requires inspection of the website to determine how the links work
folder <- "data/temp"
reg <- "[[:alnum:]]+[.]zip$" # accounts for path inside filename
  
#run function
download(filenames_list[[1]], baseurl, folder, reg)

Step 4: Unzip the file to access and load the CSV into a data frame

Next, we will check our “data/temp” folder to see what files have been downloaded, and then unzip them into a “bike” subfolder in a data folder.

#list files
filelist <- list.files("data/temp")
filelist

## [1] "routes.zip"

if(!dir.exists("data/bike")){dir.create("data/bike", recursive = TRUE)} #create folder(s) not there

unzip(str_c("data/temp/", "routes.zip"), exdir = "data/bike") #unzip tp data folder

list.files("data/bike") #read data folder to see what's there now

## [1] "nyc_bike_routes_2017_METADATA.pdf" "nyc_bike_routes_2017.cpg"         
## [3] "nyc_bike_routes_2017.dbf"          "nyc_bike_routes_2017.prj"         
## [5] "nyc_bike_routes_2017.sbn"          "nyc_bike_routes_2017.sbx"         
## [7] "nyc_bike_routes_2017.shp"          "nyc_bike_routes_2017.shp.xml"     
## [9] "nyc_bike_routes_2017.shx"

Step 5: Read shapefile using `rgdal`

Next, we will load the downloaded files into a single R object using the rgdal:readOGR function.

library(rgdal)
filepath <- "data/bike"
bikeSHP <- readOGR(filepath, "nyc_bike_routes_2017")

## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/alice/Documents/GitHub/DATA607_Final/data/bike", layer: "nyc_bike_routes_2017"
## with 14980 features
## It has 15 fields
## Integer64 fields read as strings:  OBJECTID_1

summary(bikeSHP)

## Object of class SpatialLinesDataFrame
## Coordinates:
##        min       max
## x 914209.9 1062052.0
## y 120826.2  271349.1
## Is projected: TRUE 
## proj4string :
## [+proj=lcc +lat_1=40.66666666666666 +lat_2=41.03333333333333
## +lat_0=40.16666666666666 +lon_0=-74 +x_0=300000 +y_0=0
## +datum=NAD83 +units=us-ft +no_defs +ellps=GRS80 +towgs84=0,0,0]
## Data attributes:
##    OBJECTID_1      SegmentID                    street     
##  1      :    1   0      :   15   GRAND CONCOURSE   :  289  
##  10     :    1   0017193:    1   QUEENS BLVD       :  206  
##  100    :    1   0017292:    1   1 AV              :  197  
##  1000   :    1   0020080:    1   34 AV             :  155  
##  10000  :    1   0020082:    1   ADAM C POWELL BLVD:  154  
##  10001  :    1   (Other):14943   5 AV              :  129  
##  (Other):14974   NA's   :   18   (Other)           :13850  
##       boro            fromstreet                   tostreet    
##  Min.   :1.000   E 161 ST  :  337   VAN CORTLANDT AVE E:  284  
##  1st Qu.:1.000   MYRTLE AV :  127   WOODHAVEN BLVD     :  165  
##  Median :3.000   E 125 ST  :  126   DIVISION AV        :  119  
##  Mean   :2.606   59 ST     :  108   RALPH AV           :  117  
##  3rd Qu.:4.000   SHORE BLVD:  107   113 ST             :  107  
##  Max.   :5.000   (Other)   :14172   (Other)            :14185  
##                  NA's      :    3   NA's               :    3  
##  onoffst       allclasses         instdate           moddate     
##  OFF: 2512   II     :6562   1980/07/01:  331   2009/05/01:  317  
##  ON :12468   I      :4138   1979/05/01:  321   2003/06/01:  269  
##              III    :4002   2009/05/01:  315   2010/06/01:  266  
##              II,III : 143   2000/01/01:  307   2008/05/01:  262  
##              III,II :  73   2006/09/01:  292   2006/09/01:  253  
##              I,III  :  35   2003/06/01:  275   2002/06/01:  237  
##              (Other):  27   (Other)   :13139   (Other)   :13376  
##                                                                                           comments    
##  Bridge                                                                                       :  224  
##  Central Park Auto-Free Hours: Closed to Cars                                                 :   88  
##  Boardwalk open 6am-10am: Bicycles allowed at all times except Weekends 10am-6pm May-September:   85  
##  Bicycling Permitted on West Mall Only                                                        :   64  
##  Bicycles permitted 5-10am only                                                               :   55  
##  (Other)                                                                                      :  572  
##  NA's                                                                                         :13892  
##  bikedir    lanecount                     ft_facilit  
##  2:7744   Min.   :0.000   Standard             :3927  
##  L:3573   1st Qu.:1.000   Greenway             :2192  
##  R:3645   Median :2.000   Sharrows             :2029  
##  X:  18   Mean   :1.519   Protected Path       :1135  
##           3rd Qu.:2.000   Bike-Friendly Parking: 708  
##           Max.   :3.000   (Other)              :1416  
##                           NA's                 :3573  
##                  tf_facilit  
##  Standard             :3903  
##  Greenway             :2127  
##  Sharrows             :2084  
##  Protected Path       :1035  
##  Bike-Friendly Parking: 717  
##  (Other)              :1469  
##  NA's                 :3645

Step 6: Delete the downloaded files

Lastly, we will delete the downloaded files, now that they have been sucessfully loaded into R, to save memory. Those were some very large files!

unlink("data", recursive = TRUE)

Dataset 2: Commissioner’s Correspondence Unit messages (JSON)

This data, in JSON format, will be retrieved from the NYC Open Data API, filtering for bike related messages in the query.

Step 1: Setup

Load necessary libraries, which include jsonlite, dplyr, and httr.

library(httr)
library(jsonlite)
library(dplyr)

Step 2: Download JSON data from API

After inspecting the data on NYC’s Open Data visualizer I determined that there are ~7,000 records that match the initial query, which is to locate records where the Case Topic is “Bicycle Lanes & Programs”. The Socrata API returns records 1,000 at a time, so multiple calls are necessary to retrieve all records.

One note: Because JSON data does not have “nulls” per se (rather, the records are simply not there if not explicitly entered as “NA”), there are some fields present in each API call that are missing in the other two. For this reason, I have used full_join instead of union to merge the datasets into one.

JSON data from an API is a lot easier to load (2 steps instead of 6) than parsing a website to find a zipped file containing a shapefile!

baseurl <- "https://data.cityofnewyork.us/resource/paw9-9kar.json?casetopic=Bicycle%20Lanes%20%26%20Programs" # includes the query casetopic = Bicycle Lanes, encoded as URL

allSearch <- fromJSON(baseurl) #creates a data frame from JSON retrieved through API

for(n in 1:6){
  i <- n*1000
  searchCCU <- fromJSON(paste0(baseurl,"&$offset=",i)) #searches for pages in chunks of 1000
  allSearch <- full_join(allSearch, searchCCU) #combines data into original data frame
}

glimpse(allSearch)

## Observations: 6,820
## Variables: 42
## $ borough                <chr> "Brooklyn", "Queens", "Brooklyn", NA, "...
## $ caseaddressedto        <chr> "Customer Service", "BC-Queens", "311",...
## $ casechannel            <chr> "Web Form", "Web Form", "Customer Comme...
## $ casenumber             <chr> "DOT-391803-G9M7", "DOT-323519-H9K3", "...
## $ casetopic              <chr> "Bicycle Lanes & Programs", "Bicycle La...
## $ clienttype             <chr> "Citizen", "Citizen", "Citizen", "Citiz...
## $ createddate            <chr> "2018-10-02T13:46:20.000", "2017-01-13T...
## $ crossstreet            <chr> "Prospect Park", "Yellowstone", NA, NA,...
## $ fromstreet             <chr> "8 Avenue", "Eliot", "7th Street", NA, ...
## $ latitude               <chr> "40.665749", "40.7325068549", "40.67038...
## $ locationdetail         <chr> "Multiple Block - Stretch", "Multiple B...
## $ locationtype           <chr> "Street or Sidewalk", "Street or Sidewa...
## $ longitude              <chr> "-73.979224", "-73.8682252485", "-73.98...
## $ maplocation            <chr> "40.6657490,-73.9792240", "40.732506854...
## $ pressname              <chr> "No", "No", "No", "No", "No", "No", "No...
## $ resolutionreason       <chr> "Open - New", "Information Provided", "...
## $ status                 <chr> "Active", "Resolved", "Resolved", "Reso...
## $ streetname             <chr> "9 Street", "Queens Blvd", "5th Avenue"...
## $ translationneeded      <chr> "No", "No", "No", "No", "No", "No", "No...
## $ acknowledgesent        <chr> NA, "2017-01-23T12:34:35.000", "2016-08...
## $ caseissue              <chr> NA, "Comment", "Remove or Relocate", NA...
## $ closeddate             <chr> NA, "2018-07-08T11:48:32.000", "2017-11...
## $ communitydistrict      <chr> NA, "06", "06", NA, "01", NA, "11", NA,...
## $ policeprecinct         <chr> NA, "112", "078", NA, "114", NA, "111",...
## $ xcoordinate            <chr> NA, "1020819", "988329", NA, "1003102",...
## $ ycoordinate            <chr> NA, "206204", "183516", NA, "213248", N...
## $ externaltrackingnumber <chr> NA, NA, "3914.02", NA, NA, NA, NA, NA, ...
## $ seibelnumber           <chr> NA, NA, "1-1-1296826294", "1-1-12725743...
## $ mayorsnumber           <chr> NA, NA, NA, "854615", NA, NA, NA, NA, N...
## $ buildingnumber         <chr> NA, NA, NA, NA, NA, NA, NA, NA, "2233",...
## $ category               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "Ex...
## $ inspector              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "Ke...
## $ clientcompanyname      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ bridgename             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ zipcode                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ otherbridgename        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ highwayname            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ ferrydatetime          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ ferrydirection         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ directionname          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ segmentorexit          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ bridgedirection        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...

Part 2: Data Exploration

First, we will create a simple map showing the bicycle route data by facility type, using the object, bikeSHP, which is an R object containing shapefile information.

Map bicycle network data using `spplot`

The spatial mapping package spplot produces high quality maps with very few lines of code by reading in shapefiles. Graduated color palettes and attractive formatting come standard, for example.

spplot(bikeSHP, z = "ft_facilit")

Transform the data using `dplyr` and `broom`

The downside of the spplot package is that shapefiles cannot be easily manipulated in R, making the addition of other data types difficult. For this reason, we will transform the shapefile into a data frame so that it can be read by ggplot and mapped using ggmap. The transformation to a data frame is done using the broom::tidy function.

Note: Code in this section is adapted from Making Maps in R by Claudia A. Engel.

#code adapted from https://cengel.github.io/rspatial/4_Mapping.nb.html
library(broom)
bikes_raw <- tidy(bikeSHP)
bikeSHP$lineID <- sapply(slot(bikeSHP, "lines"), function(x) slot(x, "ID"))
bikes_df <- merge(bikes_raw, bikeSHP, by.x = "id", by.y="lineID")
head(bikes_df)

##   id     long      lat order piece group OBJECTID_1 SegmentID
## 1  0 984139.6 211708.7     1     1   0.1          1     33547
## 2  0 984267.5 211938.7     2     1   0.1          1     33547
## 3  1 987723.2 185455.1     1     1   1.1          2     22630
## 4  1 987577.5 185225.4     2     1   1.1          2     22630
## 5 10 914903.0 121334.9     1     1  10.1         11        59
## 6 10 914808.9 121744.4     2     1  10.1         11        59
##                           street boro fromstreet      tostreet onoffst
## 1                           9 AV    1    W 16 ST       W 31 ST      ON
## 2                           9 AV    1    W 16 ST       W 31 ST      ON
## 3                           3 AV    3    DEAN ST         15 ST      ON
## 4                           3 AV    3    DEAN ST         15 ST      ON
## 5 CONFERENCE HOUSE PARK GREENWAY    5 HYLAN BLVD SWINNERTON ST     OFF
## 6 CONFERENCE HOUSE PARK GREENWAY    5 HYLAN BLVD SWINNERTON ST     OFF
##   allclasses   instdate    moddate comments bikedir lanecount ft_facilit
## 1          I 2008/09/25 2008/09/25     <NA>       L         1       <NA>
## 2          I 2008/09/25 2008/09/25     <NA>       L         1       <NA>
## 3         II 1980/07/01 1980/07/01     <NA>       R         1   Standard
## 4         II 1980/07/01 1980/07/01     <NA>       R         1   Standard
## 5          I 2005/06/01 2005/06/01     <NA>       2         2   Greenway
## 6          I 2005/06/01 2005/06/01     <NA>       2         2   Greenway
##       tf_facilit
## 1 Protected Path
## 2 Protected Path
## 3           <NA>
## 4           <NA>
## 5       Greenway
## 6       Greenway

Next, we will transform the data using dplyr to obtain the year each segment was installed.

library(lubridate)
bikes_installed_df <- bikes_df %>% 
  mutate(yearInstalled = lubridate::year(as.Date(instdate))) %>% 
  glimpse()

## Observations: 49,216
## Variables: 22
## $ id            <chr> "0", "0", "1", "1", "10", "10", "100", "100", "1...
## $ long          <dbl> 984139.6, 984267.5, 987723.2, 987577.5, 914903.0...
## $ lat           <dbl> 211708.7, 211938.7, 185455.1, 185225.4, 121334.9...
## $ order         <int> 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 3, 4, 5, 6, ...
## $ piece         <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ group         <chr> "0.1", "0.1", "1.1", "1.1", "10.1", "10.1", "100...
## $ OBJECTID_1    <fct> 1, 1, 2, 2, 11, 11, 101, 101, 1001, 1001, 10010,...
## $ SegmentID     <fct> 33547, 33547, 22630, 22630, 59, 59, 10544, 10544...
## $ street        <fct> 9 AV, 9 AV, 3 AV, 3 AV, CONFERENCE HOUSE PARK GR...
## $ boro          <dbl> 1, 1, 3, 3, 5, 5, 5, 5, 3, 3, 3, 3, 3, 3, 3, 3, ...
## $ fromstreet    <fct> W 16 ST, W 16 ST, DEAN ST, DEAN ST, HYLAN BLVD, ...
## $ tostreet      <fct> W 31 ST, W 31 ST, 15 ST, 15 ST, SWINNERTON ST, S...
## $ onoffst       <fct> ON, ON, ON, ON, OFF, OFF, ON, ON, ON, ON, OFF, O...
## $ allclasses    <fct> I, I, II, II, I, I, II, II, III, III, I, I, I, I...
## $ instdate      <fct> 2008/09/25, 2008/09/25, 1980/07/01, 1980/07/01, ...
## $ moddate       <fct> 2008/09/25, 2008/09/25, 1980/07/01, 1980/07/01, ...
## $ comments      <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ bikedir       <fct> L, L, R, R, 2, 2, L, L, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ lanecount     <dbl> 1, 1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, ...
## $ ft_facilit    <fct> NA, NA, Standard, Standard, Greenway, Greenway, ...
## $ tf_facilit    <fct> Protected Path, Protected Path, NA, NA, Greenway...
## $ yearInstalled <dbl> 2008, 2008, 1980, 1980, 2005, 2005, 2009, 2009, ...

Finally, we can map the data using ggplot2.

#all code in this chunk adapted from https://cengel.github.io/rspatial/4_Mapping.nb.html
library(ggplot2)

#Plot the NYC Bike Network by year each segment was installed
ggplot() +                                               
  geom_line(                                             # make a line
    data = bikes_installed_df,                           # data frame
    aes(x = long, y = lat, group = group,                # group by lines
    col = yearInstalled)) +                              # color by **yearInstalled**
  ggtitle("NYC Bike Map, 2017, by Installation Date") +  # add title
  theme(line = element_blank(),                          # remove axis lines .. 
        axis.text=element_blank(),                       # .. tickmarks..
        axis.title=element_blank(),                      # .. axis labels..
        panel.background = element_blank()) +            # .. background gridlines
  coord_equal()                                          # both axes the same scale

#Plot the NYC Bike Network by segment type
ggplot() +                                               
  geom_line(                                             # make a line
    data = bikes_installed_df,                           # data frame
    aes(x = long, y = lat, group = group,                # coordinates, and group them by lines
    col = tf_facilit)) +                                 # color by **tf_facilit**, which is the segment type
  ggtitle("NYC Bike Map, 2017, by Segment Type") +       # add title
  theme(line = element_blank(),                          # remove axis lines .. 
        axis.text=element_blank(),                       # .. tickmarks..
        axis.title=element_blank(),                      # .. axis labels..
        panel.background = element_blank()) +            # .. background gridlines
  coord_equal()                                          # both axes the same scale

Exploration of the CCU data

How many messages of each type are received in a given year? We will first create a data frame from the JSON query data and transform it to obtain a “year” field using the lubridate::year function. We will also reclassify and rename the latitude and longitude data so that it can be more easily mapped in a later step using ggplot.

CCU_df <- allSearch %>%
  mutate(
    year = lubridate::year(as.Date(createddate)),
    lat = as.numeric(latitude),
    lon = as.numeric(longitude)) %>% 
  glimpse()

## Observations: 6,820
## Variables: 45
## $ borough                <chr> "Brooklyn", "Queens", "Brooklyn", NA, "...
## $ caseaddressedto        <chr> "Customer Service", "BC-Queens", "311",...
## $ casechannel            <chr> "Web Form", "Web Form", "Customer Comme...
## $ casenumber             <chr> "DOT-391803-G9M7", "DOT-323519-H9K3", "...
## $ casetopic              <chr> "Bicycle Lanes & Programs", "Bicycle La...
## $ clienttype             <chr> "Citizen", "Citizen", "Citizen", "Citiz...
## $ createddate            <chr> "2018-10-02T13:46:20.000", "2017-01-13T...
## $ crossstreet            <chr> "Prospect Park", "Yellowstone", NA, NA,...
## $ fromstreet             <chr> "8 Avenue", "Eliot", "7th Street", NA, ...
## $ latitude               <chr> "40.665749", "40.7325068549", "40.67038...
## $ locationdetail         <chr> "Multiple Block - Stretch", "Multiple B...
## $ locationtype           <chr> "Street or Sidewalk", "Street or Sidewa...
## $ longitude              <chr> "-73.979224", "-73.8682252485", "-73.98...
## $ maplocation            <chr> "40.6657490,-73.9792240", "40.732506854...
## $ pressname              <chr> "No", "No", "No", "No", "No", "No", "No...
## $ resolutionreason       <chr> "Open - New", "Information Provided", "...
## $ status                 <chr> "Active", "Resolved", "Resolved", "Reso...
## $ streetname             <chr> "9 Street", "Queens Blvd", "5th Avenue"...
## $ translationneeded      <chr> "No", "No", "No", "No", "No", "No", "No...
## $ acknowledgesent        <chr> NA, "2017-01-23T12:34:35.000", "2016-08...
## $ caseissue              <chr> NA, "Comment", "Remove or Relocate", NA...
## $ closeddate             <chr> NA, "2018-07-08T11:48:32.000", "2017-11...
## $ communitydistrict      <chr> NA, "06", "06", NA, "01", NA, "11", NA,...
## $ policeprecinct         <chr> NA, "112", "078", NA, "114", NA, "111",...
## $ xcoordinate            <chr> NA, "1020819", "988329", NA, "1003102",...
## $ ycoordinate            <chr> NA, "206204", "183516", NA, "213248", N...
## $ externaltrackingnumber <chr> NA, NA, "3914.02", NA, NA, NA, NA, NA, ...
## $ seibelnumber           <chr> NA, NA, "1-1-1296826294", "1-1-12725743...
## $ mayorsnumber           <chr> NA, NA, NA, "854615", NA, NA, NA, NA, N...
## $ buildingnumber         <chr> NA, NA, NA, NA, NA, NA, NA, NA, "2233",...
## $ category               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "Ex...
## $ inspector              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "Ke...
## $ clientcompanyname      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ bridgename             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ zipcode                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ otherbridgename        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ highwayname            <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ ferrydatetime          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ ferrydirection         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ directionname          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ segmentorexit          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ bridgedirection        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ year                   <dbl> 2018, 2017, 2016, 2016, 2016, 2017, 201...
## $ lat                    <dbl> 40.66575, 40.73251, 40.67039, NA, 40.75...
## $ lon                    <dbl> -73.97922, -73.86823, -73.98529, NA, -7...

library(kableExtra) #for formatting table

CCU_df %>% select(caseissue) %>% 
  group_by(caseissue) %>% 
  summarise(count = n()) %>% 
  arrange(desc(count)) %>% 
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

caseissue	count
Bicycle Lane	2678
Safety	1564
NA	1180
Bicycle Rack	383
Concern	347
Comment	283
Remove or Relocate	121
Regulations	76
Study	50
Outreach Campaigns	44
Safety Material/Helmet Related	40
Bicycle Map	19
Repair	17
Commercial Cyclist - Rider Identification	7
Commercial Cyclist - General	5
Commercial Cyclist - Bicycle Safety Devices	2
Commercial Cyclist - Safety Poster	1
Commercial Cyclist - Sign on Bike	1
General Information	1
No Parking	1

This chart will be easier to read if the categories are collapsed into fewer, still meaningful buckets. We can perform this transformation using the forcats package.

library(forcats) 
library(tidyr)

#Use fct_collapse to group multiple factors levels into fewer buckets
CCUsCollapse <- CCU_df %>% mutate(caseissueFCT = as.factor(caseissue)) %>% 
  mutate(caseIssueType = fct_collapse(caseissueFCT,
  "Likely Supportive" = c(
    "Bicycle Rack",
    "Bicycle Map",
    "Safety Material/Helmet Related",
    "Repair"
  ),
  "Likely Opposed" = c(
    "Remove or Relocate", 
    "Concern", 
    "Safety",
    "No Parking"),
  "Neutral/Unknown" = c(
    "Commercial Cyclist - Rider Identification", 
    "Commercial Cyclist - General",
    "Commercial Cyclist - Bicycle Safety Devices",
    "Commercial Cyclist - Safety Poster",
    "Commercial Cyclist - Sign on Bike", 
    "Study",
    "Outreach Campaigns",
    "General Information",
    "Regulations",
    "Comment",
    "Bicycle Lane"
    )
))

#Use fct_explicit_na to relevel missing values as "Neutral/Unknown"
CCUsCollapse$caseIssueType <- CCUsCollapse$caseIssueType %>% fct_explicit_na(na_level = "Neutral/Unknown")

#show last 5 years of data as table  
CCUsCollapse %>% select(caseIssueType, year) %>% 
  group_by(year, caseIssueType) %>%
  summarise(count = n()) %>% 
  arrange(desc(count)) %>%
  filter(year > 2013) %>% 
  spread(year, count) %>% 
  rename(`Message Type` = caseIssueType) %>% 
  kable(caption = "Commissioner's Correspondence by Likely Sentiment, Year") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Commissioner’s Correspondence by Likely Sentiment, Year
Message Type	2014	2015	2016	2017	2018
Neutral/Unknown	377	505	608	723	588
Likely Supportive	51	54	68	117	49
Likely Opposed	47	39	108	15	NA

#show last 5 years of data as chart
CCUsCollapse %>% select(caseIssueType, year) %>% 
  group_by(year, caseIssueType) %>%
  summarise(count = n()) %>% 
  arrange(desc(count)) %>%
  filter(year > 2012) %>% 
  filter(year < 2018) %>% 
  rename(`Message Type` = caseIssueType) %>% 
  ggplot + geom_col(aes(year, count, fill=`Message Type`), position= "dodge") +
  ggtitle("Correspondence to DOT's Bicycle Lanes and Program, 2012-20127") +
  xlab("Year") + ylab("Count")

What if we look at just requests to remove or relocate a bike lane?

CCU_df %>% filter(caseissue=="Remove or Relocate") %>% 
  rename(`Case Issue`=caseissue) %>% 
  ggplot() + geom_bar(aes(year, fill=`Case Issue`)) +
  ggtitle("Correspondence to NYC DOT Bicycle Lanes and Programs, 2013-2017")+
  theme(plot.title = element_text(size = rel(1.4)),
           legend.position = "top") +
  xlab("Year") + ylab("Count")

2016 was a high point for requests to remove or relocate bike lanes! What if we map these?

library(googleway)
library(ggmap)
key <- myKey
register_google(key = key)

#note: requires dev version of ggmap
#devtools::install_github("dkahle/ggmap", ref = "tidyup")
#note: I have stored my private Google API key. To get your own, visit: https://developers.google.com/maps/documentation/javascript/get-api-key

myMap <- get_map(location = c(lon = -74.0060, lat = 40.7128), zoom = 11)

CCUdata2016 <- CCU_df %>% ungroup() %>% 
  rename(`Case Issue` = caseissue) %>% 
  filter(year==2016) %>% 
  filter(!is.na(latitude), !is.na(longitude)) %>% 
  mutate(lon = as.numeric(longitude), lat = as.numeric(latitude))

myMap %>% ggmap() +
  geom_point(data = CCUdata2016, 
             aes(x = lon, y = lat, col=`Case Issue`), 
             alpha = 0.5, 
             size = 1) +
  ggtitle("NYC DOT Correspondence by Case Issue (2016)") +
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

Combining the datasets

Finally, we will create a map using both sets of data so that we can begin to explore the question of whether there is a correlation between new bike segments (stored in the bikeSHP object) and correspondence to the DOT (stored in the CCU_df object).

To do this, we first have to re-project the spatial data stored in bikeSHP to the coordinate system used by ggmap, whose source is Google maps. (Otherwise the maps will not line up correctly.) This can be done with the spTransform function. Then, we will repeat the previous data transformation steps with broom::tidy to transform the shapefile back into a data frame. This section of code draws heavily from Mapping in R.

# All code in this chunk adapted from https://cengel.github.io/rspatial/4_Mapping.nb.html
bikesGoogleMaps <- spTransform(bikeSHP, CRS("+init=epsg:4326"))
bikes_dfGoogleMaps <- tidy(bikesGoogleMaps)
bikesGoogleMaps$lineID <- sapply(slot(bikesGoogleMaps, "lines"), function(x) slot(x, "ID"))
bikes_dfGoogleMaps <- merge(bikes_dfGoogleMaps, bikesGoogleMaps, by.x = "id", by.y="lineID")
head(bikes_dfGoogleMaps)

##   id      long      lat order piece group OBJECTID_1 SegmentID
## 1  0 -74.00040 40.74777     1     1   0.1          1     33547
## 2  0 -73.99994 40.74840     2     1   0.1          1     33547
## 3  1 -73.98748 40.67571     1     1   1.1          2     22630
## 4  1 -73.98800 40.67508     2     1   1.1          2     22630
## 5 10 -74.24935 40.49944     1     1  10.1         11        59
## 6 10 -74.24969 40.50056     2     1  10.1         11        59
##                           street boro fromstreet      tostreet onoffst
## 1                           9 AV    1    W 16 ST       W 31 ST      ON
## 2                           9 AV    1    W 16 ST       W 31 ST      ON
## 3                           3 AV    3    DEAN ST         15 ST      ON
## 4                           3 AV    3    DEAN ST         15 ST      ON
## 5 CONFERENCE HOUSE PARK GREENWAY    5 HYLAN BLVD SWINNERTON ST     OFF
## 6 CONFERENCE HOUSE PARK GREENWAY    5 HYLAN BLVD SWINNERTON ST     OFF
##   allclasses   instdate    moddate comments bikedir lanecount ft_facilit
## 1          I 2008/09/25 2008/09/25     <NA>       L         1       <NA>
## 2          I 2008/09/25 2008/09/25     <NA>       L         1       <NA>
## 3         II 1980/07/01 1980/07/01     <NA>       R         1   Standard
## 4         II 1980/07/01 1980/07/01     <NA>       R         1   Standard
## 5          I 2005/06/01 2005/06/01     <NA>       2         2   Greenway
## 6          I 2005/06/01 2005/06/01     <NA>       2         2   Greenway
##       tf_facilit
## 1 Protected Path
## 2 Protected Path
## 3           <NA>
## 4           <NA>
## 5       Greenway
## 6       Greenway

Repeating our earlier plot, we can see how ggplot and spplot handle mapping differently.

register_google(key = key)

#Plot the NYC Bike Network by segment type
bikesYearGoogle <- bikes_dfGoogleMaps %>% 
  mutate(year = lubridate::year(as.Date(instdate)))

basemap <- get_map("New York", zoom = 11)


ggmap(basemap) +                                               
  geom_line(data = bikesYearGoogle,                                             
    aes(x = long, y = lat, group = group,                
    col = year)) +
  geom_point(data = CCU_df,
    aes(x = lon, y = lat, col=year), size = 1) +
  ggtitle("NYC Bike Map, 2017, by Year")

Part 3: Analysis

Now that we have explored the data a litle bit, we can begin to tackle our main question: What is the relationship between CCU messages and bike network development? We will perform this analysis using a map, graphs, and a statistical test.

Assumptions and Limitations of the Analysis

Data content

Metadata about each message, or the total number of messages, is sufficient to measure public sentiment. A challenge with the CCU dataset is that messages tagged as “concerns” can be either concerns that are in favor of the expansion of the bike lane network, or messages that are against the installation of bike lanes. Therefore it is necessary to make assumptions about which categories of messages indicate the sender is in favor of bike lanes (requests for bike racks, for example) and which are opposed (requests to remove or relocate bike lanes). Ultimately, after exploring the data, it seems that the best proxy for public sentiment is simply the total number of messages, which are likely to be complaints.

CCUs in response to new infrastructure will occur in the same calendar year as the installation of the bike lane. A second assumption is that message writers will send their message in the same calendar year as the infrastructure they are writing about was built–if, in fact, they are writing in response to new bike lane segments.

Data structure

Number of segments is a good proxy for amount of construction and overall segment length. A third assumption (based on my experience with working with this dataset in ArcMAP) is that each segment is roughly the same length.

Data coding

Data is coded correctly and consistently. We cannot see the underlying messages, and so must assume the data was coded correctly and consistently tagged in each category. This analysis assumes that all messages are coded consistently by different reviewers and from one year to the next–in reality both of these assumptions are likely to be incorrect, at least to a point. For example, one reviewer may code a complaint about a new bike lane as “Concern” while a different reviewer in a different year might code a very similar message as “Remove or Relocate.” My experience in working with these messages is that, in fact, messages are very often mis-coded, with concerns about bike share directed to the bicycle lane unit, and vice versa.

Location data is coded related to the message, rather then the customer address. By mapping the data, we also assume that the location data refers to the location of the concern rather than the writer. This is an assumption I have more confidence in being true, at least most of the time.

Spatial analysis

The map below shows some correlation between the year a segment of bike lane was installed and the requests to remove them; however, the correlation is not great.

register_google(key = key)

#Overall map, showing requests to remove bike lanes and year segments were installed
ggmap(get_map("New York", maptype = "terrain-background", zoom = 11)) +
  geom_line(data = bikesYearGoogle,                      #add bike map
            aes(x = long, y = lat, group = group),
            alpha = .25) +           
  geom_line(data = filter(bikesYearGoogle, year > 2011), #add highlight for lines 2012-2017
            aes(x = long, y = lat, group = group, 
                col = as.factor(year))) +                #show year installed
  geom_point(data = filter(CCU_df,                       #add CCUs, filtered 
                           caseissue=="Remove or Relocate",
                           year < 2017), 
             aes(x = lon, y = lat, col=as.factor(year)), 
             size = 1,
             alpha = 0.5) +
  ggtitle("Requests to Remove or Relocate NYC Bike Lanes\n(2011-2016)")

Zooming in on a high request-density area, we can see a cluster of requests to remove or relocate bike lanes in areas and years where no bike lanes were installed in Lower Manhattan and the Brooklyn neighborhoods of Park Slope and Cobble Hill. This seems to indicate that the relationship may not be so strong.

#Zoom to high density request area (inner Brooklyn/Lower Manhattan)
ggmap(get_map(location = c(lon = -73.99, lat = 40.70), 
              maptype = "terrain-background", zoom = 13)) +
  geom_line(data = bikesYearGoogle,                      #add bike map
            aes(x = long, y = lat, group = group),
            alpha = .25) +           
  geom_line(data = filter(bikesYearGoogle, year > 2011), #add highlight for lines 2012-2017
            aes(x = long, y = lat, group = group, 
                col = as.factor(year))) +                #show year installed
  geom_point(data = filter(CCU_df,                       #add CCUs, filtered 
                           caseissue=="Remove or Relocate",
                           year < 2017), 
             aes(x = lon, y = lat, col=as.factor(year)), 
             size = 2) +
  ggtitle("High Density Requests to Remove or Relocate NYC Bike Lanes\n(2012-2016)")

Graphic Analysis

Because the vast majority of messages are tagged as issue types which cannot be parsed more than “neutral unknown,” we will assume (reasonably, based on my experience) that messages are mostly complaints. In other words, the number of CCUs received in a given year can be a proxy for public sentiment against the expansion of the bike network.

To see if bike network expansion is driving complaints, we can first see if the number of CCU messages received correlates to the number of segments installed when the data sets are joined by “year” and “streets”.

If no correlation is found, we go one step back to see if the total number of CCU messages received in a calendar year has any correlation with the number of new segments of bike lanes installed in the same year.

New Segments vs. Number of Messages

Does the number of new segments constructed in a given time period correlate with the number of “Bicycle Lanes and Programs” correspondence items received by the NYC Department of Transportation?

A few data cleaning steps will be needed before we can run this analysis:

Because off-street bicycle lane segments are typically not under the jurisdiction of the DOT, we will exclude segments coded as “OFF” in the field onoffst.
Street names are coded differently in the two data sets! CCUs are usually but not always coded in CamelCase with the full street name (e.g. “Queens Boulevard”), which the bike network data is all caps and abbreviations (e.g. “QUEENS BLVD”). Therefore we will need to run some regular expressions and data transformations in order to correctly join the data and standardize the format.
We will need to create a merged version of the two data sets that correctly counts both bike lane segments installed (“Segs”) and items of correspondence received by the DOT (“CCUs”) by year and street.

library(stringr)
#CCUsBikesMerge 

CCUsByStreetYear <- CCU_df %>% 
  select(year, streetname) %>% 
  rename(street = streetname) %>% 
  mutate(street = toupper(street)) %>% 
  mutate(street = str_trim(street)) %>% 
  mutate(street = str_remove_all(street, "[[:punct:]]")) %>%
  #replace all abbreviations with full words
    mutate(street = gsub(" ST$", "STREET", street)) %>%     
    mutate(street = gsub("^ST ", "SAINT", street)) %>%     
    mutate(street = gsub(" DR$", "DRIVE", street)) %>% 
    mutate(street = gsub(" BLVD$", "BOULEVARD", street)) %>% 
    mutate(street = gsub(" AV(E)?$", "AVENUE", street)) %>%  #4 Av, e.g.
    mutate(street = gsub("AV(E)? ", "AVENUE", street)) %>%  #Ave C, e.g.
    mutate(street = gsub(" RD$", "ROAD", street)) %>%
    mutate(street = gsub("^N ", "NORTH", street)) %>% 
    mutate(street = gsub("^S ", "SOUTH", street)) %>% 
    mutate(street = gsub("^E ", "EAST", street)) %>% 
    mutate(street = gsub("^W ", "WEST", street)) %>%
    mutate(street = gsub(" N$", " NORTH", street)) %>% 
    mutate(street = gsub(" S$ ", " SOUTH", street)) %>% 
    mutate(street = gsub(" E$", " EAST", street)) %>% 
    mutate(street = gsub(" W$", " WEST", street)) %>%
  #remove all variations of "nth" using look-behind regex
    mutate(street = str_remove_all(street, "(?<=[0-9])(?:ST|ND|RD|TH)")) %>% 
  group_by(year, street) %>% 
  arrange(street) %>% 
  summarize(CCUs=n())%>% 
  glimpse()

## Observations: 2,615
## Variables: 3
## $ year   <dbl> 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2003, 2...
## $ street <chr> "6 AVENUE 32 TO 42 STREET", "AVENUE C", "DELANCY STREET...
## $ CCUs   <int> 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 3, 5, 3, 2, 1, 2, 1, 1...

SegsByStreetYear <- bikesYearGoogle %>% select(year, street, onoffst) %>% 
  filter(onoffst=="ON") %>% 
  select(-onoffst) %>%
  mutate(street = toupper(street)) %>% 
  mutate(street = str_trim(street)) %>% 
  mutate(street = str_remove_all(street, "[[:punct:]]")) %>%
  #replace all abbreviations with full words
    mutate(street = gsub(" ST$", " STREET", street)) %>%     
    mutate(street = gsub("^ST ", "SAINT ", street)) %>%     
    mutate(street = gsub(" DR$", " DRIVE", street)) %>% 
    mutate(street = gsub(" BLVD$", " BOULEVARD", street)) %>% 
    mutate(street = gsub(" AV(E)?$", " AVENUE", street)) %>%  #4 Av, e.g.
    mutate(street = gsub("^AV(E)? ", "AVENUE ", street)) %>%  #Ave C, e.g.
    mutate(street = gsub(" RD$", " ROAD", street)) %>%
    mutate(street = gsub("^N ", "NORTH ", street)) %>% 
    mutate(street = gsub("^S ", "SOUTH ", street)) %>% 
    mutate(street = gsub("^E ", "EAST ", street)) %>% 
    mutate(street = gsub("^W ", "WEST ", street)) %>%    
    mutate(street = gsub(" N$", " NORTH", street)) %>% 
    mutate(street = gsub(" S$ ", " SOUTH", street)) %>% 
    mutate(street = gsub(" E$", " EAST", street)) %>% 
    mutate(street = gsub(" W$", " WEST", street)) %>%
  #remove all variations of "nth" using look-behind regex
    mutate(street = str_remove_all(street, "(?<=[0-9])(?:ST|ND|RD|TH)")) %>% 
  group_by(year, street) %>%
  arrange(street) %>% 
  summarize(Segs=n()) %>% glimpse()

## Observations: 1,027
## Variables: 3
## $ year   <dbl> 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1904, 1909, 1...
## $ street <chr> "ASTORIA PARK GREENWAY", "EAST 175 STREET", "FERRIS STR...
## $ Segs   <int> 23, 4, 10, 21, 8, 14, 2, 80, 80, 78, 30, 17, 83, 19, 62...

CCUsLanesMerge <- full_join(CCUsByStreetYear, SegsByStreetYear) %>% arrange(desc(Segs))
head(CCUsLanesMerge)

## # A tibble: 6 x 4
## # Groups:   year [4]
##    year street               CCUs  Segs
##   <dbl> <chr>               <int> <int>
## 1  2009 GRAND CONCOURSE        NA   471
## 2  1979 WEST DRIVE             NA   278
## 3  1979 EAST DRIVE             NA   257
## 4  1997 CROSS BAY BOULEVARD    NA   223
## 5  2015 QUEENS BOULEVARD        4   220
## 6  1997 34 AVENUE              NA   215

CCUsLanesMerge %>% ungroup() %>% 
  filter(year > 2011) %>% 
  filter(year < 2017) %>% 
  mutate(year = as.factor(year)) %>% 
  ggplot+geom_jitter(aes(x=Segs, y=CCUs, col=year)) +
  scale_x_log10() +
  scale_y_log10() +
  ggtitle("DOT Correspondence (CCUs) per Street per Year\nby Number of Segments New Bike Lane (2013-2017)")+
  xlab("Number of Segments of Bike Lane Installed")

In the above analysis, no correlation is apparent between the numbere of segments installed on a given street and CCUs received regarding that street in the same year.

As a next step, we will re-run the analysis without joining by street–in other words, we will look to see if the total number of new bike lane segments installed correlates to the total number of Bicycle Lanes and Programs CCUs received.

CCUsByYear <- CCU_df %>% 
  select(year) %>% 
  filter(year < 2017) %>% 
  group_by(year) %>% 
  summarise(CCUs = n()) %>% 
  glimpse()

## Observations: 15
## Variables: 2
## $ year <dbl> 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 201...
## $ CCUs <int> 9, 49, 55, 139, 254, 264, 377, 405, 406, 417, 393, 703, 4...

LanesByYear <- bikesYearGoogle %>% 
  select(year) %>% 
  filter(year < 2017) %>% 
  group_by(year) %>% 
  summarise(Lanes = n()) %>% 
  glimpse()

## Observations: 56
## Variables: 2
## $ year  <dbl> 1900, 1904, 1909, 1920, 1923, 1932, 1935, 1936, 1938, 19...
## $ Lanes <int> 1045, 80, 80, 125, 70, 53, 51, 380, 175, 261, 137, 1147,...

MergeByYear <- full_join(CCUsByYear, LanesByYear) %>% arrange(desc(year)) %>% filter(year < 2017)
glimpse(MergeByYear)

## Observations: 56
## Variables: 3
## $ year  <dbl> 2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 20...
## $ CCUs  <int> 784, 598, 475, 703, 393, 417, 406, 405, 377, 264, 254, 1...
## $ Lanes <int> 2269, 2168, 1668, 3484, 1118, 1068, 1548, 2806, 3824, 56...

MergeByYear %>% 
  filter(year > 2000) %>% 
  mutate(year = as.factor(year)) %>% 
  ggplot + geom_jitter(aes(x=Lanes, y=CCUs, col=year, size = CCUs)) +
  geom_smooth(aes(x=Lanes, y=CCUs), method='lm') +
  ggtitle("Correlation of Number of CCUs received by Bicycle Lane Segments\n(2000-2016)")

Now it looks as though there may be a trend! We can confirm with a statistical analysis.

Statistical Analysis

model <- lm(CCUs ~ Lanes, MergeByYear)

summary(model)

## 
## Call:
## lm(formula = CCUs ~ Lanes, data = MergeByYear)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -315.81 -188.40   24.43  125.07  425.70 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 264.20259  115.84309   2.281   0.0401 *
## Lanes         0.04147    0.04505   0.921   0.3740  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 234 on 13 degrees of freedom
##   (41 observations deleted due to missingness)
## Multiple R-squared:  0.0612, Adjusted R-squared:  -0.01101 
## F-statistic: 0.8475 on 1 and 13 DF,  p-value: 0.374

As we can see from the model, while there is a slight trend in the graph, it has no statistical significance.

Conclusion

Commissioner’s Correspondence items (CCUs) received by the Department of Transportation and coded as “Bicycle Lanes & Programs” do not seem to vary significantly by the number of segments of new bicycle lanes installed. This is an interesting finding as the number of segments installed by year have varied quite a bit!

What worked and what didn’t

Data Availability

Some data was unavailable publicly. My initial idea was to perform a sentiment analysis on the content of the messages themselves. This was based on a similar analysis I had performed on NYC Parks Department data some years ago; however, it turns out that this level of information is not available through open data.

Socrata API and JSON data

JSON data is much easier to work with than CSVs.Initially, I had tried to load a static CSV containing all the 311 data from NYC. This was a large (10+MB) file, and it was very challenging to load. Once loaded, it was very challenging to parse, as it was a comma-delimited CSV whose values sometimes also contained commas. Ultimately, I was only able to load and correctly parse the data by using the JSON-based API. As it turned out, this dataset didn’t have the right kinds of information anyway! Luckily the same code was applicable to the correct dataset (the CCU information), and having learned the significant advantages of reading the JSON data (namely, correct assignment of values to fields, reduced memory use, and much faster load times), loading this dataset into an R data frame was a snap.

Mapping the data

R packages for spatial analysis are great! Using the various mapping packages available in R turned out to be surprisingly straight-forward, and runs significantly faster than software used for spatial analysis in the world of urban planning. I would definitely use R for spatial analysis again!

References

Data source for Commissioner’s Correspondence on Bike Lanes: Open Data

Data source Commissioner’s Correspondence on Bike Lanes (JSON data, API)

ggmap CheatSheet: https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/ggmap/ggmapCheatsheet.pdf

Making Maps in R by Claudia A. Engel

Final Project

Alice Friedman

12/9/2018

Introduction

Project purpose and motivation

Part 1: Data Acquisition

Dataset 1: Bicycle Route Data

Step 1: Load needed libraries to scrape the website for links to download.

Step 2: Parse website to find links

Step 3: Create and run function to download relevant files

Step 4: Unzip the file to access and load the CSV into a data frame

Step 5: Read shapefile using `rgdal`

Step 6: Delete the downloaded files

Dataset 2: Commissioner’s Correspondence Unit messages (JSON)

Step 1: Setup

Step 2: Download JSON data from API

Part 2: Data Exploration

Map bicycle network data using `spplot`

Transform the data using `dplyr` and `broom`

Exploration of the CCU data

Combining the datasets

Part 3: Analysis

Assumptions and Limitations of the Analysis

Data content

Data structure

Data coding

Spatial analysis

Graphic Analysis

New Segments vs. Number of Messages

Statistical Analysis

Conclusion

What worked and what didn’t

Data Availability

Socrata API and JSON data

Mapping the data

References

Final Project

Alice Friedman

12/9/2018

Introduction

Project purpose and motivation

Part 1: Data Acquisition

Dataset 1: Bicycle Route Data

Step 1: Load needed libraries to scrape the website for links to download.

Step 2: Parse website to find links

Step 3: Create and run function to download relevant files

Step 4: Unzip the file to access and load the CSV into a data frame

Step 5: Read shapefile using rgdal

Step 6: Delete the downloaded files

Dataset 2: Commissioner’s Correspondence Unit messages (JSON)

Step 1: Setup

Step 2: Download JSON data from API

Part 2: Data Exploration

Map bicycle network data using spplot

Transform the data using dplyr and broom

Exploration of the CCU data

Combining the datasets

Part 3: Analysis

Assumptions and Limitations of the Analysis

Data content

Data structure

Data coding

Spatial analysis

Graphic Analysis

New Segments vs. Number of Messages

Statistical Analysis

Conclusion

What worked and what didn’t

Data Availability

Socrata API and JSON data

Mapping the data

References

Step 5: Read shapefile using `rgdal`

Map bicycle network data using `spplot`

Transform the data using `dplyr` and `broom`