The dataset is obtained from the 2016 NYC Yellow Cab trip record data made available in Big Query on Google Cloud Platform. The data was originally published by the NYC Taxi and Limousine Commission (TLC). This report will be demostrating on what are the popular pickup spots of two different time frame (Morning, Evening) during a day.
The training dataset is downloaded and unzip from Kaggle website, we can see that there are totally 1.45 million rows and 11 columns.
# The URL to download the train.zip is from the kaggle website:
# website:https://www.kaggle.com/c/nyc-taxi-trip-duration/data/
unzip("train.zip")
trainDS<-read.csv2("train.csv", header=TRUE, sep=",")
dim(trainDS)
## [1] 1458644 11
The columns available are shown below. We will be using the following columns (pickup_datetime, pickup_longitute, pickup_latitude) to demostrate the popular pickup location for the taxi ride during morning time (7 am to 9 am), and evening time (5 pm to 7 pm).
names(trainDS)
## [1] "id" "vendor_id" "pickup_datetime"
## [4] "dropoff_datetime" "passenger_count" "pickup_longitude"
## [7] "pickup_latitude" "dropoff_longitude" "dropoff_latitude"
## [10] "store_and_fwd_flag" "trip_duration"
We first preprocess the data to categorize the pickup time frame of taxi ride. A new column named (“pickup_timeframe”) is added to dataset, we also change the pickup_longtitude and pickup_latitude to numeric for better processing later.
finalDS <- trainDS %>%
select(id, pickup_datetime, pickup_longitude, pickup_latitude) %>%
mutate(pt = as.POSIXct(strptime(pickup_datetime, format="%Y-%m-%d %H:%M:%S"))) %>%
mutate(pickup_timeframe=case_when(
hour(pt) >= 7 & hour(pt) <=9 ~ "morning",
hour(pt) >= 17 & hour(pt) <= 19 ~ "evening",
TRUE ~ "other")) %>%
mutate(pickup_timeframe=as.factor(pickup_timeframe)) %>%
mutate(pickup_longitude=as.numeric(as.character(pickup_longitude)))%>%
mutate(pickup_latitude=as.numeric(as.character(pickup_latitude)))
## Warning: package 'bindrcpp' was built under R version 3.4.2
To draw a map of popular pickup location for morning, we subset the data for morning ride to a new data frame named morningDS. We can see there are around 190K of morning taxi rides.
morningDS <- finalDS %>%
subset(pickup_timeframe=="morning") %>%
select(pickup_longitude, pickup_latitude, id)
dim(morningDS)
## [1] 190316 3
To show the most popular pickup location for the taxi ride, we will use two different kind of approaches. The first method is to calculate the clustered center using kmeans for longitude and latitude, and we will display the marker on the map with popup represents the count of rides.
kms <- kmeans(cbind(morningDS$pickup_longitude, morningDS$pickup_latitude), centers=10)
morningCluster <- cbind(morningDS, kms$cluster)
names(morningCluster) <- c("lng", "lat", "id", "cluster_id")
c <- morningCluster %>%
group_by(cluster_id) %>%
summarise_at(vars(id), n_distinct) %>%
mutate(lng=kms$centers[,1], lat=kms$centers[,2])
kmeansMap <- leaflet(c) %>%
addTiles() %>%
addMarkers(lng=~lng, lat=~lat,
popup=as.character(c$id),
clusterOptions = markerClusterOptions())
kmeansMap
The second map will display the raw data of the ride. This method is slower since we have many data points (190K). But upon zooming the raw data, we can clearly see the most popular pickup location is in lower and middle Manhattan.
rawMap <- leaflet(morningCluster) %>%
addTiles() %>%
setView(lng=-73.978270, lat=40.752949, zoom=12) %>%
addMarkers(lng=~lng, lat=~lat,
clusterOptions = markerClusterOptions())
rawMap
The evening taxi ride data has 257K rows. We subset the data and prodices the eveningDS data frame.
eveningDS <- finalDS %>%
subset(pickup_timeframe=="evening") %>%
select(pickup_longitude, pickup_latitude, id)
dim(eveningDS)
## [1] 257391 3
Using kMeans method, we can add a circle marker to the map to show the most popular location.
kmsEvening <- kmeans(cbind(eveningDS$pickup_longitude, eveningDS$pickup_latitude), centers=10)
eveningCluster <- cbind(eveningDS, kmsEvening$cluster)
names(eveningCluster) <- c("lng", "lat", "id", "cluster_id")
c2 <- eveningCluster %>%
group_by(cluster_id) %>%
summarise_at(vars(id), n_distinct) %>%
mutate(lng=kmsEvening$centers[,1], lat=kmsEvening$centers[,2])
kmeansMapEvening <- leaflet(c2) %>%
addTiles() %>%
addCircleMarkers(lng=~lng, lat=~lat,
popup=as.character(c2$id),
radius=(c2$id/1000),
clusterOptions = markerClusterOptions())
kmeansMapEvening
Using the rawData to show the markers, we can see the popular pickup location for the evening. It’s again near the lower and middle mahatten.
rawMapEvening <- leaflet(eveningCluster) %>%
addTiles() %>%
setView(lng=-73.978270, lat=40.752949, zoom=12) %>%
addMarkers(lng=~lng, lat=~lat,
clusterOptions = markerClusterOptions())
rawMapEvening