Executative Summary : New York Taxi Ride Map

The dataset is obtained from the 2016 NYC Yellow Cab trip record data made available in Big Query on Google Cloud Platform. The data was originally published by the NYC Taxi and Limousine Commission (TLC). This report will be demostrating on what are the popular pickup spots of two different time frame (Morning, Evening) during a day.

Exploratory Data Analysis

The training dataset is downloaded and unzip from Kaggle website, we can see that there are totally 1.45 million rows and 11 columns.

# The URL to download the train.zip is from the kaggle website: 
# website:https://www.kaggle.com/c/nyc-taxi-trip-duration/data/
unzip("train.zip")
trainDS<-read.csv2("train.csv", header=TRUE, sep=",")
dim(trainDS)

## [1] 1458644      11

The columns available are shown below. We will be using the following columns (pickup_datetime, pickup_longitute, pickup_latitude) to demostrate the popular pickup location for the taxi ride during morning time (7 am to 9 am), and evening time (5 pm to 7 pm).

names(trainDS)

##  [1] "id"                 "vendor_id"          "pickup_datetime"   
##  [4] "dropoff_datetime"   "passenger_count"    "pickup_longitude"  
##  [7] "pickup_latitude"    "dropoff_longitude"  "dropoff_latitude"  
## [10] "store_and_fwd_flag" "trip_duration"

Map for Popular Morning Pickup Location

We first preprocess the data to categorize the pickup time frame of taxi ride. A new column named (“pickup_timeframe”) is added to dataset, we also change the pickup_longtitude and pickup_latitude to numeric for better processing later.

finalDS <- trainDS %>%
    select(id, pickup_datetime, pickup_longitude, pickup_latitude) %>%
    mutate(pt = as.POSIXct(strptime(pickup_datetime, format="%Y-%m-%d %H:%M:%S"))) %>%
    mutate(pickup_timeframe=case_when(
                            hour(pt) >= 7 & hour(pt) <=9 ~ "morning",
                            hour(pt) >= 17 & hour(pt) <= 19 ~ "evening",
                            TRUE ~ "other")) %>%
    mutate(pickup_timeframe=as.factor(pickup_timeframe)) %>%
    mutate(pickup_longitude=as.numeric(as.character(pickup_longitude)))%>%
    mutate(pickup_latitude=as.numeric(as.character(pickup_latitude)))

## Warning: package 'bindrcpp' was built under R version 3.4.2

To draw a map of popular pickup location for morning, we subset the data for morning ride to a new data frame named morningDS. We can see there are around 190K of morning taxi rides.

morningDS <- finalDS %>%
    subset(pickup_timeframe=="morning") %>%
    select(pickup_longitude, pickup_latitude, id)
dim(morningDS)

## [1] 190316      3

To show the most popular pickup location for the taxi ride, we will use two different kind of approaches. The first method is to calculate the clustered center using kmeans for longitude and latitude, and we will display the marker on the map with popup represents the count of rides.

kms <- kmeans(cbind(morningDS$pickup_longitude, morningDS$pickup_latitude), centers=10)
morningCluster <- cbind(morningDS, kms$cluster)
names(morningCluster) <- c("lng", "lat", "id", "cluster_id")
c <- morningCluster %>%
    group_by(cluster_id) %>%
    summarise_at(vars(id), n_distinct) %>%
    mutate(lng=kms$centers[,1], lat=kms$centers[,2])

kmeansMap <- leaflet(c) %>% 
    addTiles() %>% 
    addMarkers(lng=~lng, lat=~lat,
               popup=as.character(c$id),
               clusterOptions = markerClusterOptions())
kmeansMap

The second map will display the raw data of the ride. This method is slower since we have many data points (190K). But upon zooming the raw data, we can clearly see the most popular pickup location is in lower and middle Manhattan.

rawMap <- leaflet(morningCluster) %>%
    addTiles() %>%
    setView(lng=-73.978270, lat=40.752949, zoom=12) %>%
    addMarkers(lng=~lng, lat=~lat,
               clusterOptions = markerClusterOptions())

rawMap

Map for Popular Evening Pickup Location

The evening taxi ride data has 257K rows. We subset the data and prodices the eveningDS data frame.

eveningDS <- finalDS %>%
    subset(pickup_timeframe=="evening") %>%
    select(pickup_longitude, pickup_latitude, id)
dim(eveningDS)

## [1] 257391      3

Using kMeans method, we can add a circle marker to the map to show the most popular location.

kmsEvening <- kmeans(cbind(eveningDS$pickup_longitude, eveningDS$pickup_latitude), centers=10)
eveningCluster <- cbind(eveningDS, kmsEvening$cluster)
names(eveningCluster) <- c("lng", "lat", "id", "cluster_id")
c2 <- eveningCluster %>%
    group_by(cluster_id) %>%
    summarise_at(vars(id), n_distinct) %>%
    mutate(lng=kmsEvening$centers[,1], lat=kmsEvening$centers[,2])

kmeansMapEvening <- leaflet(c2) %>% 
    addTiles() %>% 
    addCircleMarkers(lng=~lng, lat=~lat,
               popup=as.character(c2$id),
               radius=(c2$id/1000),
               clusterOptions = markerClusterOptions())
kmeansMapEvening

Using the rawData to show the markers, we can see the popular pickup location for the evening. It’s again near the lower and middle mahatten.

rawMapEvening <- leaflet(eveningCluster) %>%
    addTiles() %>%
    setView(lng=-73.978270, lat=40.752949, zoom=12) %>%
    addMarkers(lng=~lng, lat=~lat,
               clusterOptions = markerClusterOptions())

rawMapEvening

New York Taxi Popular Pickup Location Map

Jane

January 10, 2018

Executative Summary : New York Taxi Ride Map

Exploratory Data Analysis

Map for Popular Morning Pickup Location

Map for Popular Evening Pickup Location