Data 608 - Visual Analytics

Final Project by S. Tinapunan
This file prepares the data used by the Shiny application for this project.

Data set

September 2014 Uber Pickups

This is taken from https://www.kaggle.com/fivethirtyeight/uber-pickups-in-new-york-city/version/2.
This is a list of Uber pickups for the month of September 2014 only.
This file has 1,028,136 rows.

New York City Zip Codes

This is taken from https://www.health.ny.gov/statistics/cancer/registry/appendix/neighborhoods.htm.
This is a list of 178 zip codes with neighborhood and borough descriptions.
The transformed data is available in CSV format:
https://raw.githubusercontent.com/Shetura36/Data-607-Assignments/master/FinalProject/NYC_Borough_Neighborhood_ZipCodes.csv

Reverse Geocoding Output

The NYC Uber data set of pickup locations includes longitude and latitude information; however, it does not contain zip codes, boroughs, or neighborhoods. One of my projects for Data 607 was to take the geolocation data and convert it into zip code, borough, and neighborhood. A detailed step-by-step description of how this was estimated is given here.

Output file from reverse geocoding the NYC Uber September 2014 pickup locations:
https://raw.githubusercontent.com/Shetura36/Data-607-Assignments/master/FinalProject/Pickup_Zipcodes.csv
This file has 1,028,136 rows.

Output file generated by this job

The CSV version of this file is over 100 MB, so I did not save it on GitHub. The RDS version is available here:

https://github.com/Shetura36/Data-608/blob/master/Final%20Project/NYC_Uber_pickup_Sept2014_output.rds?raw=true

Load libraries

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Prepare data

The code below loads the data from the file sources mentioned above into objects.

file_NYC_zipcodes <- "https://raw.githubusercontent.com/Shetura36/Data-607-Assignments/master/FinalProject/NYC_Borough_Neighborhood_ZipCodes.csv"
file_reverseGeo_output <- "https://raw.githubusercontent.com/Shetura36/Data-607-Assignments/master/FinalProject/Pickup_Zipcodes.csv"

# column names for the reverse geocoding output
col_names <- c("row_id","pickup_datetime","latitude","longitude","base","location_index","distance","zip")
# skip = 1 discards the header line of the file; col_names supplies the names used instead
NYC_pickup_sept2014 <- readr::read_csv(file_reverseGeo_output, col_names = col_names, skip = 1)

## Parsed with column specification:
## cols(
##   row_id = col_double(),
##   pickup_datetime = col_character(),
##   latitude = col_double(),
##   longitude = col_double(),
##   base = col_character(),
##   location_index = col_double(),
##   distance = col_double(),
##   zip = col_double()
## )
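
As a minimal sketch of what `skip = 1` combined with `col_names` does, the example below writes a tiny, made-up CSV to a temporary file and reads it back with replacement column names. The file contents and the shortened name vector `toy_names` are hypothetical; only `readr::read_csv()` and its `col_names` and `skip` arguments come from the code above.

```r
# Toy CSV with a header line we want to discard (values are made up).
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,Lat,Lon",
             "1,40.75,-73.99",
             "2,40.68,-73.98"), tmp)

# skip = 1 drops the original header line; col_names supplies the new names.
toy_names <- c("row_id", "latitude", "longitude")
toy <- readr::read_csv(tmp, col_names = toy_names, skip = 1)

names(toy)  # "row_id" "latitude" "longitude"
nrow(toy)   # 2
```

Without `skip = 1`, the original header line would be read as a data row, because supplying `col_names` tells readr that the file has no header of its own.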

#mapping of NYC zip codes to borough and neighborhoods
NYC_zipcodes <- readr::read_csv(file_NYC_zipcodes)

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   Borough = col_character(),
##   Neighborhood = col_character(),
##   zip = col_double()
## )

The data frame NYC_pickup_sept2014 maps each (longitude, latitude) pair to an estimated zip code. A detailed step-by-step description of how the estimate was made is given here.

Rows whose estimate is more than 2,000 meters (around 1.24 miles) from the published zip code to (longitude, latitude) mapping in the zipcode library are dropped.

The zipcode library contains over 44,000 mappings of zip codes to (longitude, latitude) in the United States.

The code below removes these rows, using a cutoff of 2,001 meters on the distance column.

pickup_info_far <- NYC_pickup_sept2014[NYC_pickup_sept2014$distance >= 2001,]
pickup_info_nyc <- NYC_pickup_sept2014[NYC_pickup_sept2014$distance < 2001,]
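
A small illustration of this split, using made-up distances. It also checks the unit conversion quoted above (2,000 meters is about 1.24 miles) and shows how a missing distance behaves under base R logical indexing, an edge case worth knowing about with this style of subsetting. The `toy` data frame and all of its values are hypothetical.

```r
toy <- data.frame(row_id = 1:5,
                  distance = c(150, 2000.5, 2001, 5400, NA))

# 1 mile = 1609.344 meters
round(2000 / 1609.344, 2)  # 1.24

# Same cutoff as the code above. A distance of 2000.5 m is kept,
# because the comparison is against 2001 rather than 2000.
toy_far <- toy[toy$distance >= 2001, ]
toy_nyc <- toy[toy$distance < 2001, ]

# Logical indexing with NA returns an all-NA row in both subsets;
# which() keeps only the rows where the comparison is TRUE.
toy_nyc_clean <- toy[which(toy$distance < 2001), ]

nrow(toy_nyc)        # 3 (includes one all-NA row)
nrow(toy_nyc_clean)  # 2
```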

50,636 rows were removed from 1,028,136 pickup locations.

nrow(pickup_info_far)

## [1] 50636

The remaining 977,500 rows have estimated zip codes within about 1.24 miles of the zip code to (longitude, latitude) mapping published in the zipcode library.

nrow(pickup_info_nyc)

## [1] 977500
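
The two counts can be cross-checked against the total reported earlier. A quick sketch of that arithmetic, using only the numbers printed in this document (the variable names are illustrative):

```r
total_rows <- 1028136  # rows in the reverse geocoding output
far_rows   <- 50636    # nrow(pickup_info_far)
kept_rows  <- 977500   # nrow(pickup_info_nyc)

# The two subsets should partition the original rows.
far_rows + kept_rows == total_rows     # TRUE

# Share of pickups dropped, in percent.
round(100 * far_rows / total_rows, 1)  # 4.9
```

Because the two subsets add up to the total exactly, no row appears to have a missing distance; base R logical indexing would otherwise return such a row in both subsets and the sum would exceed the total.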

Join the pickup data with estimated zip codes to the NYC borough and neighborhood mapping.

NYC_Uber_pickup_Sept2014 <- dplyr::inner_join(pickup_info_nyc, NYC_zipcodes, by = "zip")

#remove columns that are not necessary
NYC_Uber_pickup_Sept2014$X1 <- NULL
NYC_Uber_pickup_Sept2014$location_index <- NULL
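
An inner join keeps only the pickups whose estimated zip code appears in the borough and neighborhood mapping, so any pickup with a zip outside that list is dropped without a message. The sketch below illustrates this on made-up data; `toy_pickups`, `toy_zips`, and their values are hypothetical, and `dplyr::anti_join()` is used here only as one way to inspect what the join leaves out.

```r
toy_pickups <- data.frame(row_id = 1:4,
                          zip = c(10001, 10036, 11215, 7030))
toy_zips <- data.frame(zip = c(10001, 10036, 11215),
                       Borough = c("Manhattan", "Manhattan", "Brooklyn"))

# inner_join keeps only pickups whose zip appears in the lookup table.
joined <- dplyr::inner_join(toy_pickups, toy_zips, by = "zip")

# anti_join lists the pickups that the inner join drops.
unmatched <- dplyr::anti_join(toy_pickups, toy_zips, by = "zip")

nrow(joined)   # 3
unmatched$zip  # 7030
```

Comparing `nrow()` before and after the join on the real data would show how many pickups fall outside the mapped zip codes.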

Preview data

head(NYC_Uber_pickup_Sept2014)

## # A tibble: 6 x 9
##   row_id pickup_datetime latitude longitude base  distance   zip Borough
##    <dbl> <chr>              <dbl>     <dbl> <chr>    <dbl> <dbl> <chr>  
## 1      2 9/1/2014 0:01:~     40.8     -74.0 B025~     528. 10001 Manhat~
## 2      3 9/1/2014 0:03:~     40.8     -74.0 B025~     513. 10036 Manhat~
## 3      4 9/1/2014 0:06:~     40.7     -74.0 B025~     881. 10010 Manhat~
## 4      5 9/1/2014 0:11:~     40.8     -73.9 B025~     412. 10030 Manhat~
## 5      6 9/1/2014 0:12:~     40.7     -74.0 B025~     847. 11215 Brookl~
## 6     10 9/1/2014 0:33:~     40.8     -74.0 B025~     571. 10020 Manhat~
## # ... with 1 more variable: Neighborhood <chr>
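
The preview shows pickup_datetime stored as character. If the Shiny application needs to group pickups by hour or day, the column would have to be parsed first. The sketch below assumes the full values look like `9/1/2014 0:01:00` (the preview truncates them, so the seconds part is an assumption), and it uses UTC only to keep the example deterministic.

```r
# Hypothetical sample values in the assumed month/day/year format.
raw <- c("9/1/2014 0:01:00", "9/1/2014 13:45:00", "9/15/2014 23:59:00")

parsed <- as.POSIXct(raw, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# Derive fields that are convenient for grouping.
pickup_hour <- as.integer(format(parsed, "%H"))
pickup_day  <- as.integer(format(parsed, "%d"))

pickup_hour  # 0 13 23
pickup_day   # 1 1 15
```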

Save this data as CSV and RDS files.

This data is going to be used by the Shiny application.

saveRDS(NYC_Uber_pickup_Sept2014, "NYC_Uber_pickup_Sept2014.rds")
write.csv(NYC_Uber_pickup_Sept2014, file='NYC_Uber_pickup_Sept2014.csv')
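
A short sketch of how the two formats differ when read back, using a made-up two-row data frame and temporary files (all names here are hypothetical). It also shows that `write.csv()` writes row names by default, which adds an unnamed first column to the CSV. That may be the origin of the `X1` column reported in the warning for the zip code file earlier, although that is an assumption.

```r
toy <- data.frame(zip = c(10001, 11215),
                  Borough = c("Manhattan", "Brooklyn"),
                  stringsAsFactors = FALSE)

rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

saveRDS(toy, rds_path)
write.csv(toy, file = csv_path)  # row names are written by default

# RDS restores the object as saved, including column types.
identical(readRDS(rds_path), toy)  # TRUE

# The CSV gains an unnamed first column of row names, read back as "X".
names(read.csv(csv_path))          # "X" "zip" "Borough"

# row.names = FALSE avoids the extra column.
write.csv(toy, file = csv_path, row.names = FALSE)
names(read.csv(csv_path))          # "zip" "Borough"
```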

S. Tinapunan, 5/16/2019