September 2014 Uber Pickups
New York City Zip Codes
Reverse Geocoding Output
The NYC Uber data set of pickup locations includes longitude and latitude, but it does not include zip code, borough, or neighborhood. One of my projects for Data 607 is to take the geolocation data and convert it into zip code, borough, and neighborhood. A detailed, step-by-step description of how these were estimated is given here.
The CSV version of this file is over 100 MB, so I did not save it to GitHub.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
The code below loads the data from the file sources mentioned above into objects.
file_NYC_zipcodes <- "https://raw.githubusercontent.com/Shetura36/Data-607-Assignments/master/FinalProject/NYC_Borough_Neighborhood_ZipCodes.csv"
file_reverseGeo_output <- "https://raw.githubusercontent.com/Shetura36/Data-607-Assignments/master/FinalProject/Pickup_Zipcodes.csv"
#column names for the reverse geocoding output
col_names <- c("row_id","pickup_datetime","latitude","longitude","base","location_index","distance","zip")
NYC_pickup_sept2014 <- readr::read_csv(file_reverseGeo_output, col_names = col_names, skip = 1)
## Parsed with column specification:
## cols(
## row_id = col_double(),
## pickup_datetime = col_character(),
## latitude = col_double(),
## longitude = col_double(),
## base = col_character(),
## location_index = col_double(),
## distance = col_double(),
## zip = col_double()
## )
#mapping of NYC zip codes to borough and neighborhoods
NYC_zipcodes <- readr::read_csv(file_NYC_zipcodes)
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_double(),
## Borough = col_character(),
## Neighborhood = col_character(),
## zip = col_double()
## )
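In the first read above, skip = 1 drops the file's own header row and col_names supplies the replacement names. The sketch below shows the same idea on a small, made-up CSV string. It uses base R's read.csv as a stand-in for readr::read_csv so that it runs without extra packages. The toy data and the names csv_text and new_names are assumptions for illustration only.

```r
# Toy CSV text with its own header row (hypothetical values).
csv_text <- "id,when,lat\n1,9/1/2014 0:01:00,40.75\n2,9/1/2014 0:03:00,40.76\n"

# Replacement column names, analogous to col_names in the report.
new_names <- c("row_id", "pickup_datetime", "latitude")

# skip = 1 discards the original header; header = FALSE stops read.csv
# from treating the first data row as a header.
toy <- read.csv(text = csv_text, skip = 1, header = FALSE,
                col.names = new_names, stringsAsFactors = FALSE)

names(toy)
nrow(toy)
```

Without skip = 1, the original header line would be read as a data row.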
The data frame NYC_pickup_sept2014 maps each (longitude, latitude) pair to an estimated zip code. A detailed, step-by-step description of how the estimate was made is given here.
Rows were dropped when the distance between the pickup location and the (longitude, latitude) that the zipcode library publishes for the estimated zip code exceeds 2,000 meters (around 1.24 miles).
The zipcode library contains over 44,000 mappings of zip codes to (longitude, latitude) in the United States.
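For readers who want to see how a distance in meters between a pickup point and a zip code's published coordinates could be computed, here is a minimal sketch using the haversine (great-circle) formula. This is an assumption for illustration and not necessarily the method used to produce the distance column. The function name haversine_m and the sample coordinates are hypothetical.

```r
# Great-circle (haversine) distance in meters between two points given in
# decimal degrees. Vectorized over its arguments.
haversine_m <- function(lat1, lon1, lat2, lon2, radius_m = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin guards against rounding pushing sqrt(a) slightly above 1
  2 * radius_m * asin(pmin(1, sqrt(a)))
}

# Hypothetical pickup point and hypothetical zip code coordinates.
pickup_lat <- 40.7500
pickup_lon <- -73.9900
zip_lat <- 40.7506
zip_lon <- -73.9972

d <- haversine_m(pickup_lat, pickup_lon, zip_lat, zip_lon)

# TRUE when the pair would pass a 2,001-meter cutoff
d < 2001
```

The mean Earth radius of 6,371,000 meters is a common approximation; a different radius or an ellipsoidal formula would give slightly different distances.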
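The code below splits the rows on the distance column: rows at 2,001 meters or more go to pickup_info_far and are removed, and the rest are kept in pickup_info_nyc.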
pickup_info_far <- NYC_pickup_sept2014[NYC_pickup_sept2014$distance >= 2001,]
pickup_info_nyc <- NYC_pickup_sept2014[NYC_pickup_sept2014$distance < 2001,]
50,636 rows were removed from 1,028,136 pickup locations.
nrow(pickup_info_far)
## [1] 50636
The remaining 977,500 rows have estimated zip codes within about 1.24 miles of the zip code to (longitude, latitude) mapping published in the zipcode library.
nrow(pickup_info_nyc)
## [1] 977500
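The two counts add up to the original total (50,636 + 977,500 = 1,028,136), which is what a clean split should produce. The sketch below repeats the same kind of split on a made-up data frame and checks that the two pieces partition the rows. The toy values and the names toy, cutoff_m, and is_far are assumptions for illustration.

```r
# Toy distances in meters (hypothetical values).
toy <- data.frame(
  row_id   = 1:5,
  distance = c(120.5, 1999.9, 2000.4, 2001.0, 5300.2)
)

cutoff_m <- 2001

# Base R subsetting with a logical vector returns NA-filled rows when the
# condition is NA, so confirm there are no missing distances first.
stopifnot(!anyNA(toy$distance))

is_far   <- toy$distance >= cutoff_m
toy_far  <- toy[is_far, ]
toy_near <- toy[!is_far, ]

# Every row lands in exactly one of the two subsets.
stopifnot(nrow(toy_far) + nrow(toy_near) == nrow(toy))

nrow(toy_far)
nrow(toy_near)
```

Note that with a cutoff of 2,001, a distance such as 2,000.4 meters is kept.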
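Join the pickup data to the NYC borough and neighborhood mapping on the estimated zip code. Because this is an inner join, pickups whose estimated zip code is not in the NYC mapping are dropped.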
NYC_Uber_pickup_Sept2014 <- dplyr::inner_join(pickup_info_nyc, NYC_zipcodes, by = "zip")
#remove columns that are not necessary
NYC_Uber_pickup_Sept2014$X1 <- NULL
NYC_Uber_pickup_Sept2014$location_index <- NULL
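To see which rows an inner join on zip leaves out, dplyr::anti_join can list the pickups that have no match in the mapping. The sketch below uses small made-up data frames; toy_pickups, toy_zipcodes, and their values are assumptions for illustration, not rows from the real files.

```r
library(dplyr)

# Hypothetical pickups; 7030 stands in for a zip code outside the NYC mapping.
toy_pickups <- data.frame(
  row_id = 1:4,
  zip    = c(10001, 10036, 7030, 11215)
)

# Hypothetical zip to borough mapping.
toy_zipcodes <- data.frame(
  zip     = c(10001, 10036, 11215),
  Borough = c("Manhattan", "Manhattan", "Brooklyn"),
  stringsAsFactors = FALSE
)

# Rows with a matching zip, with Borough attached.
matched <- inner_join(toy_pickups, toy_zipcodes, by = "zip")

# Rows that the inner join drops because their zip has no match.
unmatched <- anti_join(toy_pickups, toy_zipcodes, by = "zip")

nrow(matched)
unmatched
```

Comparing nrow() before and after the join is a quick way to spot dropped rows, or duplicated rows if a zip code appears more than once in the mapping.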
Preview data
head(NYC_Uber_pickup_Sept2014)
## # A tibble: 6 x 9
## row_id pickup_datetime latitude longitude base distance zip Borough
## <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
## 1 2 9/1/2014 0:01:~ 40.8 -74.0 B025~ 528. 10001 Manhat~
## 2 3 9/1/2014 0:03:~ 40.8 -74.0 B025~ 513. 10036 Manhat~
## 3 4 9/1/2014 0:06:~ 40.7 -74.0 B025~ 881. 10010 Manhat~
## 4 5 9/1/2014 0:11:~ 40.8 -73.9 B025~ 412. 10030 Manhat~
## 5 6 9/1/2014 0:12:~ 40.7 -74.0 B025~ 847. 11215 Brookl~
## 6 10 9/1/2014 0:33:~ 40.8 -74.0 B025~ 571. 10020 Manhat~
## # ... with 1 more variable: Neighborhood <chr>
Save the data as both a CSV file and an RDS file.
The Shiny application will use this data.
saveRDS(NYC_Uber_pickup_Sept2014, "NYC_Uber_pickup_Sept2014.rds")
write.csv(NYC_Uber_pickup_Sept2014, file='NYC_Uber_pickup_Sept2014.csv')
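As a quick check that saved files can be read back, the sketch below round-trips a made-up data frame through both formats using temporary files. The names toy, rds_path, and csv_path are assumptions for illustration. It also passes row.names = FALSE to write.csv, which is a choice made for this sketch: by default write.csv adds an unnamed row-name column, which is probably the kind of column that produces an 'X1' warning when such a file is read with readr::read_csv.

```r
# Toy data (hypothetical values).
toy <- data.frame(
  zip     = c(10001, 11215),
  Borough = c("Manhattan", "Brooklyn"),
  stringsAsFactors = FALSE
)

rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

# RDS keeps the R object as is, including column types.
saveRDS(toy, rds_path)

# row.names = FALSE avoids writing an extra unnamed first column.
write.csv(toy, csv_path, row.names = FALSE)

from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

identical(from_rds, toy)
names(from_csv)
```

The RDS copy comes back identical to the original object, while the CSV copy has its column types inferred again on read.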
S. Tinapunan, 5/16/2019