Data is publically available on https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page.
Analysis is provided by Amy Liu Pathak and Joshua Curley. The dataset was trimmed down to trips with at least a pick up (PU) or drop off (DO) in zone 186, in which Penn Station is located.
This is a random sample of 20 rows for FHV’s and taxis respectively. The total number of observations for FHV’s is 2,351,254, for yellow taxis is 1,441,332, and for green taxis is 2,263. Yellow and green taxis are combined in the “taxi” dataset, which is used to analyze all taxis.
datatable(fhvhv_sample,
class = 'cell-border stripe',
extensions = 'Buttons',
fillContainer = FALSE,
options = list(pageLength = 10,
dom = 'Bfrtip',
buttons = c('csv', 'excel', 'pdf'),
scrollX = TRUE,
selection="multiple"
))
datatable(taxi_sample,
class = 'cell-border stripe',
extensions = 'Buttons',
fillContainer = FALSE,
options = list(pageLength = 10,
dom = 'Bfrtip',
buttons = c('csv', 'excel', 'pdf'),
scrollX = TRUE,
selection="multiple"
))
Are there months with unusual monthly ridership levels? January stands out as being lower than average. No month stands out as being consistently higher than average. To what extent does the weekday curve fluctuate? It appears that there is consistent rise and fall.
Monthly profiles show that axis peak during the week (M-F) while HVFHV peak during weekends (Sa-Su). Seems that they have different user groups (business vs. leisure?) Yet they still both peak at 6-7pm. There are 50% more Taxi pick ups at Penn than taxi drop offs (makes sense since there is a pick up queue available at Penn); and 30% more FHV drop offs at Penn than pickups (makes sense because hailing a cab on the street is not always easy). Combined the trips are nearly balanced (this excludes green taxis). The key exception is airport trips which are the opposite (more taxi drop off at Penn from Airports than taxi pick ups at Penn bound for airports).
In the line charts, double click the month to isolate it, and single click to remove it.
ggthemr("light")
fhvhv$facet = factor(fhvhv$request_month, levels = c("January", "February", "March", "April", "May", "June", "July", "August"))
ggplot(data=fhvhv, aes(x=request_date)) +
geom_bar(aes(y = (..count..))) +
facet_wrap( ~ facet) +
labs(x = "Month", y = "Total Number of Rides") +
theme_light() +
theme(
panel.grid.major.x = element_blank(),
panel.border = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(hjust = 0.5)
) +
ggtitle("FHVHV")
taxi$facet = factor(taxi$request_month, levels = c("January", "February", "March", "April", "May", "June", "July", "August"))
ggplot(data=taxi, aes(x=request_date)) +
geom_bar(aes(y = (..count..)), fill="gold") +
facet_wrap( ~ facet) +
labs(x = "Month", y = "Total Number of Rides") +
theme_light() +
theme(
panel.grid.major.x = element_blank(),
panel.border = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(hjust = 0.5)
) +
ggtitle("Taxi")
ggplotly(
fhvhv %>%
group_by(request_month, request_date) %>%
summarise(
count = n()
) %>%
ggplot(aes(request_date, count, colour = request_month))+
geom_line(aes(group=request_month),size=2, alpha=0.5) +
theme_light() +
theme(
panel.grid.major.x = element_blank(),
panel.border = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(hjust = 0.5)
) +
ggtitle("FHVHV") +
xlab("Request Date") +
ylab("Total Number of Rides") +
scale_color_discrete(breaks=c('January', 'February', 'March', 'April', 'May', 'June', 'July', 'August'))
)
# then plot the line data
ggplotly(
taxi %>%
group_by(request_month, request_date) %>%
summarise(
count = n()
) %>%
ggplot(aes(request_date, count, colour = request_month))+
geom_line(aes(group=request_month),size=2, alpha=0.5) +
theme_light() +
theme(
panel.grid.major.x = element_blank(),
panel.border = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(hjust = 0.5)
) +
ggtitle("Taxi") +
xlab("Request Date") +
ylab("Total Number of Rides") +
scale_color_discrete(breaks=c('January', 'February', 'March', 'April', 'May', 'June', 'July', 'August'))
)
Is there a pattern between ride type’s ridership levels? Somewhat. March and May (3 and 5) are inversely related between the two ride types. Other months show alignment- January, February, and others are similarly ranked between the two ride types. There could be a number of causes (including events, policy changes, promotions) that would need to be further researched before drawing conclusions.
fhvhv %>%
count(request_month, sort=TRUE)
## request_month n
## 1 May 320614
## 2 August 318072
## 3 July 317643
## 4 June 314290
## 5 March 305728
## 6 April 303516
## 7 February 261493
## 8 January 209898
taxi %>%
count(request_month, sort=TRUE)
## request_month n
## 1 March 197414
## 2 August 195827
## 3 June 193350
## 4 July 190588
## 5 April 190217
## 6 May 187409
## 7 February 160801
## 8 January 127989
Taxi and FHV’s are combined.
q = c(.05, .15, .5, .85, .95)
all %>%
count(request_date, sort=TRUE)
## request_date n
## 1 21 133529
## 2 20 133160
## 3 22 133059
## 4 27 132847
## 5 12 132451
## 6 23 130897
## 7 24 130825
## 8 19 129426
## 9 28 128518
## 10 26 127989
## 11 17 126182
## 12 16 126046
## 13 25 125563
## 14 15 124871
## 15 18 124510
## 16 14 123763
## 17 11 123578
## 18 13 123467
## 19 10 122814
## 20 5 122191
## 21 9 121770
## 22 8 120334
## 23 6 120306
## 24 7 120237
## 25 4 118324
## 26 3 117597
## 27 1 115393
## 28 2 113495
## 29 30 112880
## 30 29 102909
## 31 31 75918
all %>%
count(request_date, sort=TRUE) %>%
summarize(Quant5 = quantile(n, probs = q[1]),
Quant15 = quantile(n, probs = q[2]),
Quant50 = quantile(n, probs = q[3]),
Quant85 = quantile(n, probs = q[4]),
Quant95 = quantile(n, probs = q[5]))
## Quant5 Quant15 Quant50 Quant85 Quant95
## 1 107894.5 116495 123763 131674 133109.5
all %>%
filter(PULocationID == 186) %>%
count(request_date, sort=TRUE)
## request_date n
## 1 21 67846
## 2 22 66393
## 3 12 66268
## 4 24 66212
## 5 23 65518
## 6 20 65213
## 7 27 64492
## 8 17 63621
## 9 19 63546
## 10 28 63377
## 11 25 63029
## 12 11 62572
## 13 26 62401
## 14 5 62113
## 15 18 62063
## 16 16 61627
## 17 15 61605
## 18 14 61546
## 19 10 61484
## 20 4 61176
## 21 9 60920
## 22 13 60723
## 23 8 60669
## 24 7 60250
## 25 6 60048
## 26 3 58007
## 27 1 57350
## 28 2 56310
## 29 30 56203
## 30 29 51743
## 31 31 39641
all %>%
filter(PULocationID == 186) %>%
count(request_date, sort=TRUE) %>%
summarize(Quant5 = quantile(n, probs = q[1]),
Quant15 = quantile(n, probs = q[2]),
Quant50 = quantile(n, probs = q[3]),
Quant85 = quantile(n, probs = q[4]),
Quant95 = quantile(n, probs = q[5]))
## Quant5 Quant15 Quant50 Quant85 Quant95
## 1 53973 57678.5 61627 65365.5 66330.5
all %>%
filter(DOLocationID == 186) %>%
count(request_date, sort=TRUE)
## request_date n
## 1 27 68985
## 2 20 68535
## 3 22 67263
## 4 12 66779
## 5 19 66465
## 6 21 66272
## 7 26 66210
## 8 23 66037
## 9 28 65724
## 10 24 65174
## 11 16 64947
## 12 15 63770
## 13 13 63266
## 14 25 63159
## 15 17 63118
## 16 18 63051
## 17 14 62751
## 18 10 61867
## 19 11 61583
## 20 9 61424
## 21 6 60785
## 22 5 60588
## 23 7 60555
## 24 8 60219
## 25 3 60143
## 26 1 58603
## 27 2 57668
## 28 4 57659
## 29 30 57179
## 30 29 51657
## 31 31 36638
all %>%
filter(DOLocationID == 186) %>%
count(request_date, sort=TRUE) %>%
summarize(Quant5 = quantile(n, probs = q[1]),
Quant15 = quantile(n, probs = q[2]),
Quant50 = quantile(n, probs = q[3]),
Quant85 = quantile(n, probs = q[4]),
Quant95 = quantile(n, probs = q[5]))
## Quant5 Quant15 Quant50 Quant85 Quant95
## 1 54418 58135.5 63051 66368.5 67899
FHVs and taxis see similarly high ridership in August. What is the peak-iest date in August? It’s arguably sometime between August 21-25 inclusive. The first list is for FHVs, the second is for taxis.
fhvhv %>%
count(request_date, sort=TRUE)
## request_date n
## 1 20 84850
## 2 12 84073
## 3 21 83751
## 4 22 83556
## 5 27 83261
## 6 23 81936
## 7 19 81200
## 8 24 81135
## 9 26 81060
## 10 28 80151
## 11 17 78900
## 12 16 78010
## 13 13 77129
## 14 14 76664
## 15 15 76496
## 16 10 76429
## 17 25 76340
## 18 18 76139
## 19 5 75358
## 20 9 75150
## 21 11 74997
## 22 6 74664
## 23 7 73801
## 24 8 72421
## 25 4 71869
## 26 1 71626
## 27 3 71256
## 28 30 70319
## 29 2 70047
## 30 29 63508
## 31 31 45158
taxi %>%
count(request_date, sort=TRUE)
## request_date n
## 1 21 49778
## 2 24 49690
## 3 27 49586
## 4 22 49503
## 5 25 49223
## 6 23 48961
## 7 11 48581
## 8 12 48378
## 9 15 48375
## 10 18 48371
## 11 28 48367
## 12 20 48310
## 13 19 48226
## 14 16 48036
## 15 8 47913
## 16 17 47282
## 17 14 47099
## 18 26 46929
## 19 5 46833
## 20 9 46620
## 21 4 46455
## 22 7 46436
## 23 10 46385
## 24 3 46341
## 25 13 46338
## 26 6 45642
## 27 1 43767
## 28 2 43448
## 29 30 42561
## 30 29 39401
## 31 31 30760
Taxis and FHV’s are combined.
all %>%
count(request_hour, sort=TRUE)
## request_hour n
## 1 18 257862
## 2 17 244817
## 3 19 240153
## 4 16 222039
## 5 15 211956
## 6 14 201887
## 7 12 194222
## 8 20 192877
## 9 11 190532
## 10 10 190281
## 11 13 188126
## 12 9 186937
## 13 21 183670
## 14 22 180895
## 15 8 164889
## 16 23 151574
## 17 7 136530
## 18 6 110441
## 19 0 105691
## 20 1 68303
## 21 5 53514
## 22 2 49210
## 23 3 35501
## 24 4 32942
all %>%
count(request_hour, sort=TRUE) %>%
summarize(Quant5 = quantile(n, probs = q[1]),
Quant15 = quantile(n, probs = q[2]),
Quant50 = quantile(n, probs = q[3]),
Quant85 = quantile(n, probs = q[4]),
Quant95 = quantile(n, probs = q[5]))
## Quant5 Quant15 Quant50 Quant85 Quant95
## 1 37557.35 60169.05 185303.5 217501.7 244117.4
all %>%
filter(PULocationID == 186) %>%
count(request_hour, sort=TRUE)
## request_hour n
## 1 18 123924
## 2 19 120364
## 3 17 112865
## 4 20 103703
## 5 21 99945
## 6 22 99314
## 7 12 96117
## 8 15 95604
## 9 11 94136
## 10 9 91592
## 11 14 90862
## 12 16 90763
## 13 10 90350
## 14 13 89362
## 15 23 88379
## 16 8 84761
## 17 7 76011
## 18 0 59921
## 19 6 57939
## 20 1 41214
## 21 2 29319
## 22 5 21366
## 23 3 20608
## 24 4 15547
all %>%
filter(PULocationID == 186) %>%
count(request_hour, sort=TRUE) %>%
summarize(Quant5 = quantile(n, probs = q[1]),
Quant15 = quantile(n, probs = q[2]),
Quant50 = quantile(n, probs = q[3]),
Quant85 = quantile(n, probs = q[4]),
Quant95 = quantile(n, probs = q[5]))
## Quant5 Quant15 Quant50 Quant85 Quant95
## 1 20721.7 34671.75 90556.5 102011.9 119239.1
all %>%
filter(DOLocationID == 186) %>%
count(request_hour, sort=TRUE)
## request_hour n
## 1 18 134999
## 2 17 132989
## 3 16 132236
## 4 19 120771
## 5 15 117365
## 6 14 112023
## 7 10 100816
## 8 13 99676
## 9 12 99067
## 10 11 97359
## 11 9 96175
## 12 20 89986
## 13 21 84398
## 14 22 82282
## 15 8 80781
## 16 23 63861
## 17 7 61054
## 18 6 52977
## 19 0 46288
## 20 5 32414
## 21 1 27531
## 22 2 20263
## 23 4 17594
## 24 3 15169
all %>%
filter(DOLocationID == 186) %>%
count(request_hour, sort=TRUE) %>%
summarize(Quant5 = quantile(n, probs = q[1]),
Quant15 = quantile(n, probs = q[2]),
Quant50 = quantile(n, probs = q[3]),
Quant85 = quantile(n, probs = q[4]),
Quant95 = quantile(n, probs = q[5]))
## Quant5 Quant15 Quant50 Quant85 Quant95
## 1 17994.35 29728.35 87192 119238.3 132876
What is the peak hour during a busy day (August 21)? 19:00, or 7:00 PM.
fhvhv %>%
filter(request_month == "August", request_date == 21) %>%
{. ->> fhvhv_hcount} # this saves the count data
ggplotly(
ggplot(fhvhv_hcount, aes(x=request_hour)) +
geom_bar( fill="lightblue", color="white") +
theme_light() +
theme(
panel.grid.major.x = element_blank(),
panel.border = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(hjust = 0.5)
) +
ggtitle("FHVHV (August 21, 2022)") +
xlab("Hour") +
ylab("Total Number of Rides")
)
taxi %>%
filter(request_month == "August", request_date == 21) %>%
{. ->> taxi_hcount} # this saves the count data
ggplotly(
ggplot(taxi_hcount, aes(x=request_hour)) +
geom_bar( fill="lightblue", color="white") +
theme_light() +
theme(
panel.grid.major.x = element_blank(),
panel.border = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(hjust = 0.5)
) +
ggtitle("Taxi (August 21, 2022)") +
xlab("Hour") +
ylab("Total Number of Rides")
)
Where are people going? Where are people coming from? The tables below (which can be sorted for max to min) show top pickup and drop off zones with respect to trips ending or beginning in zone 186. The following are top locations.
Zone 79 (East Village), 230 (Times Sq), 161 (Midtown Center), 162 (Midtown East), 170 (Murray Hill), 48 (Clinton East)
Zone 265 is all points outside of NYC (excluding Newark Airport). This likely explains the imbalance between Origin and Destination for the NYC Taxis since there can’t be legitimate out-of-NYC TLC taxi pick ups. Conversely, FHVs that do not have TLC license cannot pick up in NYC (i.e .if someone takes a non-TLC Uber from NJ to NYC the Uber must leave the city before making next pick up).
There are many short distance trips in this dataset. Particularly taxi trips from zone 186 to zone 186 (trip around the block?). Perhaps we need better wayfinding so people know it’s a 5 minute walk to their destination nearby and we could eliminate 20K trips between taxis and FHVs that never even leave the zone. With many more also going to the adjacent zones (i.e. Times Square).
fhvhv_origins <- fhvhv %>%
filter(DOLocationID == 186) %>%
count(PULocationID) %>%
rename(
"Pickup.Zone" = "PULocationID",
"Total.Number.of.Rides" = "n"
)
datatable(fhvhv_origins,
class = 'cell-border stripe',
extensions = 'Buttons',
fillContainer = FALSE,
options = list(pageLength = 10,
dom = 'Bfrtip',
buttons = c('csv', 'excel', 'pdf'),
scrollX = TRUE,
selection="multiple"
))
fhvhv_destinations <- fhvhv %>%
filter(PULocationID == 186) %>%
count(DOLocationID) %>%
rename(
"Dropoff.Zone" = "DOLocationID",
"Total.Number.of.Rides" = "n"
)
datatable(fhvhv_destinations,
class = 'cell-border stripe',
extensions = 'Buttons',
fillContainer = FALSE,
options = list(pageLength = 10,
dom = 'Bfrtip',
buttons = c('csv', 'excel', 'pdf'),
scrollX = TRUE,
selection="multiple"
))
taxi_origins <- taxi %>%
filter(DOLocationID == 186) %>%
count(PULocationID) %>%
rename(
"Pickup.Zone" = "PULocationID",
"Total.Number.of.Rides" = "n"
)
datatable(taxi_origins,
class = 'cell-border stripe',
extensions = 'Buttons',
fillContainer = FALSE,
options = list(pageLength = 10,
dom = 'Bfrtip',
buttons = c('csv', 'excel', 'pdf'),
scrollX = TRUE,
selection="multiple"
))
taxi_destinations <- taxi %>%
filter(PULocationID == 186) %>%
count(DOLocationID) %>%
rename(
"Dropoff.Zone" = "DOLocationID",
"Total.Number.of.Rides" = "n"
)
datatable(taxi_destinations,
class = 'cell-border stripe',
extensions = 'Buttons',
fillContainer = FALSE,
options = list(pageLength = 10,
dom = 'Bfrtip',
buttons = c('csv', 'excel', 'pdf'),
scrollX = TRUE,
selection="multiple"
))