Introduction

Data is publically available on https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page.

Analysis is provided by Amy Liu Pathak and Joshua Curley. The dataset was trimmed down to trips with at least a pick up (PU) or drop off (DO) in zone 186, in which Penn Station is located.

Preview data

This is a random sample of 20 rows for FHV’s and taxis respectively. The total number of observations for FHV’s is 2,351,254, for yellow taxis is 1,441,332, and for green taxis is 2,263. Yellow and green taxis are combined in the “taxi” dataset, which is used to analyze all taxis.

datatable(fhvhv_sample,
          class = 'cell-border stripe', 
          extensions = 'Buttons',
          fillContainer = FALSE, 
          options = list(pageLength = 10,
                         dom = 'Bfrtip',
                         buttons = c('csv', 'excel', 'pdf'), 
                         scrollX = TRUE, 
            selection="multiple"
          ))

datatable(taxi_sample,
          class = 'cell-border stripe', 
          extensions = 'Buttons',
          fillContainer = FALSE, 
          options = list(pageLength = 10,
                         dom = 'Bfrtip',
                         buttons = c('csv', 'excel', 'pdf'), 
                         scrollX = TRUE, 
            selection="multiple"
          ))

Trends by month

Month and day sums

Are there months with unusual monthly ridership levels? January stands out as being lower than average. No month stands out as being consistently higher than average. To what extent does the weekday curve fluctuate? It appears that there is consistent rise and fall.

Monthly profiles show that axis peak during the week (M-F) while HVFHV peak during weekends (Sa-Su). Seems that they have different user groups (business vs. leisure?) Yet they still both peak at 6-7pm. There are 50% more Taxi pick ups at Penn than taxi drop offs (makes sense since there is a pick up queue available at Penn); and 30% more FHV drop offs at Penn than pickups (makes sense because hailing a cab on the street is not always easy). Combined the trips are nearly balanced (this excludes green taxis). The key exception is airport trips which are the opposite (more taxi drop off at Penn from Airports than taxi pick ups at Penn bound for airports).

In the line charts, double click the month to isolate it, and single click to remove it.

ggthemr("light")

fhvhv$facet = factor(fhvhv$request_month, levels = c("January", "February", "March", "April", "May", "June", "July", "August"))

ggplot(data=fhvhv, aes(x=request_date)) + 
  geom_bar(aes(y = (..count..))) +
  facet_wrap( ~ facet) +
  labs(x = "Month", y = "Total Number of Rides") +
  theme_light() +
  theme(
    panel.grid.major.x = element_blank(),
    panel.border = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = 0.5)
  ) +
  ggtitle("FHVHV")

taxi$facet = factor(taxi$request_month, levels = c("January", "February", "March", "April", "May", "June", "July", "August"))

ggplot(data=taxi, aes(x=request_date)) + 
  geom_bar(aes(y = (..count..)), fill="gold") +
  facet_wrap( ~ facet) +
  labs(x = "Month", y = "Total Number of Rides") +
  theme_light() +
  theme(
    panel.grid.major.x = element_blank(),
    panel.border = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = 0.5)
  ) +
  ggtitle("Taxi")

ggplotly(
  fhvhv %>% 
  group_by(request_month, request_date) %>% 
  summarise(
    count = n()
  ) %>% 
  ggplot(aes(request_date, count, colour = request_month))+
  geom_line(aes(group=request_month),size=2, alpha=0.5) +
  theme_light() +
  theme(
    panel.grid.major.x = element_blank(),
    panel.border = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = 0.5)
  ) +
  ggtitle("FHVHV") +
  xlab("Request Date") +
  ylab("Total Number of Rides") +
  scale_color_discrete(breaks=c('January', 'February', 'March', 'April', 'May', 'June', 'July', 'August'))
)

# then plot the line data
ggplotly(
  taxi %>% 
  group_by(request_month, request_date) %>% 
  summarise(
    count = n()
  ) %>% 
  ggplot(aes(request_date, count, colour = request_month))+
  geom_line(aes(group=request_month),size=2, alpha=0.5) +
  theme_light() +
  theme(
    panel.grid.major.x = element_blank(),
    panel.border = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = 0.5)
  ) +
  ggtitle("Taxi") +
  xlab("Request Date") +
  ylab("Total Number of Rides") +
  scale_color_discrete(breaks=c('January', 'February', 'March', 'April', 'May', 'June', 'July', 'August'))
)

Month sums

Is there a pattern between ride type’s ridership levels? Somewhat. March and May (3 and 5) are inversely related between the two ride types. Other months show alignment- January, February, and others are similarly ranked between the two ride types. There could be a number of causes (including events, policy changes, promotions) that would need to be further researched before drawing conclusions.

fhvhv %>% 
  count(request_month, sort=TRUE)

##   request_month      n
## 1           May 320614
## 2        August 318072
## 3          July 317643
## 4          June 314290
## 5         March 305728
## 6         April 303516
## 7      February 261493
## 8       January 209898

taxi %>% 
  count(request_month, sort=TRUE)

##   request_month      n
## 1         March 197414
## 2        August 195827
## 3          June 193350
## 4          July 190588
## 5         April 190217
## 6           May 187409
## 7      February 160801
## 8       January 127989

Trends by day

Percentiles of ridership by day - PU’s and DO’s combined

Taxi and FHV’s are combined.

q = c(.05, .15, .5, .85, .95)

all %>% 
  count(request_date, sort=TRUE)

##    request_date      n
## 1            21 133529
## 2            20 133160
## 3            22 133059
## 4            27 132847
## 5            12 132451
## 6            23 130897
## 7            24 130825
## 8            19 129426
## 9            28 128518
## 10           26 127989
## 11           17 126182
## 12           16 126046
## 13           25 125563
## 14           15 124871
## 15           18 124510
## 16           14 123763
## 17           11 123578
## 18           13 123467
## 19           10 122814
## 20            5 122191
## 21            9 121770
## 22            8 120334
## 23            6 120306
## 24            7 120237
## 25            4 118324
## 26            3 117597
## 27            1 115393
## 28            2 113495
## 29           30 112880
## 30           29 102909
## 31           31  75918

all %>% 
  count(request_date, sort=TRUE) %>%
  summarize(Quant5 = quantile(n, probs = q[1]), 
            Quant15 = quantile(n, probs = q[2]),
            Quant50 = quantile(n, probs = q[3]),
            Quant85 = quantile(n, probs = q[4]),
            Quant95 = quantile(n, probs = q[5]))

##     Quant5 Quant15 Quant50 Quant85  Quant95
## 1 107894.5  116495  123763  131674 133109.5

Percentiles of ridership by day - PU’s

all %>% 
  filter(PULocationID == 186) %>%
  count(request_date, sort=TRUE)

##    request_date     n
## 1            21 67846
## 2            22 66393
## 3            12 66268
## 4            24 66212
## 5            23 65518
## 6            20 65213
## 7            27 64492
## 8            17 63621
## 9            19 63546
## 10           28 63377
## 11           25 63029
## 12           11 62572
## 13           26 62401
## 14            5 62113
## 15           18 62063
## 16           16 61627
## 17           15 61605
## 18           14 61546
## 19           10 61484
## 20            4 61176
## 21            9 60920
## 22           13 60723
## 23            8 60669
## 24            7 60250
## 25            6 60048
## 26            3 58007
## 27            1 57350
## 28            2 56310
## 29           30 56203
## 30           29 51743
## 31           31 39641

all %>% 
  filter(PULocationID == 186) %>%
  count(request_date, sort=TRUE) %>%
  summarize(Quant5 = quantile(n, probs = q[1]), 
            Quant15 = quantile(n, probs = q[2]),
            Quant50 = quantile(n, probs = q[3]),
            Quant85 = quantile(n, probs = q[4]),
            Quant95 = quantile(n, probs = q[5]))

##   Quant5 Quant15 Quant50 Quant85 Quant95
## 1  53973 57678.5   61627 65365.5 66330.5

Percentiles of ridership by day - DO’s

all %>% 
  filter(DOLocationID == 186) %>%
  count(request_date, sort=TRUE)

##    request_date     n
## 1            27 68985
## 2            20 68535
## 3            22 67263
## 4            12 66779
## 5            19 66465
## 6            21 66272
## 7            26 66210
## 8            23 66037
## 9            28 65724
## 10           24 65174
## 11           16 64947
## 12           15 63770
## 13           13 63266
## 14           25 63159
## 15           17 63118
## 16           18 63051
## 17           14 62751
## 18           10 61867
## 19           11 61583
## 20            9 61424
## 21            6 60785
## 22            5 60588
## 23            7 60555
## 24            8 60219
## 25            3 60143
## 26            1 58603
## 27            2 57668
## 28            4 57659
## 29           30 57179
## 30           29 51657
## 31           31 36638

all %>% 
  filter(DOLocationID == 186) %>%
  count(request_date, sort=TRUE) %>%
  summarize(Quant5 = quantile(n, probs = q[1]), 
            Quant15 = quantile(n, probs = q[2]),
            Quant50 = quantile(n, probs = q[3]),
            Quant85 = quantile(n, probs = q[4]),
            Quant95 = quantile(n, probs = q[5]))

##   Quant5 Quant15 Quant50 Quant85 Quant95
## 1  54418 58135.5   63051 66368.5   67899

A busy day

FHVs and taxis see similarly high ridership in August. What is the peak-iest date in August? It’s arguably sometime between August 21-25 inclusive. The first list is for FHVs, the second is for taxis.

fhvhv %>% 
  count(request_date, sort=TRUE)

##    request_date     n
## 1            20 84850
## 2            12 84073
## 3            21 83751
## 4            22 83556
## 5            27 83261
## 6            23 81936
## 7            19 81200
## 8            24 81135
## 9            26 81060
## 10           28 80151
## 11           17 78900
## 12           16 78010
## 13           13 77129
## 14           14 76664
## 15           15 76496
## 16           10 76429
## 17           25 76340
## 18           18 76139
## 19            5 75358
## 20            9 75150
## 21           11 74997
## 22            6 74664
## 23            7 73801
## 24            8 72421
## 25            4 71869
## 26            1 71626
## 27            3 71256
## 28           30 70319
## 29            2 70047
## 30           29 63508
## 31           31 45158

taxi %>% 
  count(request_date, sort=TRUE)

##    request_date     n
## 1            21 49778
## 2            24 49690
## 3            27 49586
## 4            22 49503
## 5            25 49223
## 6            23 48961
## 7            11 48581
## 8            12 48378
## 9            15 48375
## 10           18 48371
## 11           28 48367
## 12           20 48310
## 13           19 48226
## 14           16 48036
## 15            8 47913
## 16           17 47282
## 17           14 47099
## 18           26 46929
## 19            5 46833
## 20            9 46620
## 21            4 46455
## 22            7 46436
## 23           10 46385
## 24            3 46341
## 25           13 46338
## 26            6 45642
## 27            1 43767
## 28            2 43448
## 29           30 42561
## 30           29 39401
## 31           31 30760

Trends by hour

Percentiles of ridership by hour - PU’s and DO’s combined

Taxis and FHV’s are combined.

all %>% 
  count(request_hour, sort=TRUE)

##    request_hour      n
## 1            18 257862
## 2            17 244817
## 3            19 240153
## 4            16 222039
## 5            15 211956
## 6            14 201887
## 7            12 194222
## 8            20 192877
## 9            11 190532
## 10           10 190281
## 11           13 188126
## 12            9 186937
## 13           21 183670
## 14           22 180895
## 15            8 164889
## 16           23 151574
## 17            7 136530
## 18            6 110441
## 19            0 105691
## 20            1  68303
## 21            5  53514
## 22            2  49210
## 23            3  35501
## 24            4  32942

all %>% 
  count(request_hour, sort=TRUE) %>%
  summarize(Quant5 = quantile(n, probs = q[1]), 
            Quant15 = quantile(n, probs = q[2]),
            Quant50 = quantile(n, probs = q[3]),
            Quant85 = quantile(n, probs = q[4]),
            Quant95 = quantile(n, probs = q[5]))

##     Quant5  Quant15  Quant50  Quant85  Quant95
## 1 37557.35 60169.05 185303.5 217501.7 244117.4

Percentiles of ridership by hour - PU’s

all %>% 
  filter(PULocationID == 186) %>%
  count(request_hour, sort=TRUE)

##    request_hour      n
## 1            18 123924
## 2            19 120364
## 3            17 112865
## 4            20 103703
## 5            21  99945
## 6            22  99314
## 7            12  96117
## 8            15  95604
## 9            11  94136
## 10            9  91592
## 11           14  90862
## 12           16  90763
## 13           10  90350
## 14           13  89362
## 15           23  88379
## 16            8  84761
## 17            7  76011
## 18            0  59921
## 19            6  57939
## 20            1  41214
## 21            2  29319
## 22            5  21366
## 23            3  20608
## 24            4  15547

all %>% 
  filter(PULocationID == 186) %>%
  count(request_hour, sort=TRUE) %>%
  summarize(Quant5 = quantile(n, probs = q[1]), 
            Quant15 = quantile(n, probs = q[2]),
            Quant50 = quantile(n, probs = q[3]),
            Quant85 = quantile(n, probs = q[4]),
            Quant95 = quantile(n, probs = q[5]))

##    Quant5  Quant15 Quant50  Quant85  Quant95
## 1 20721.7 34671.75 90556.5 102011.9 119239.1

Percentiles of ridership by hour - DO’s

all %>% 
  filter(DOLocationID == 186) %>%
  count(request_hour, sort=TRUE)

##    request_hour      n
## 1            18 134999
## 2            17 132989
## 3            16 132236
## 4            19 120771
## 5            15 117365
## 6            14 112023
## 7            10 100816
## 8            13  99676
## 9            12  99067
## 10           11  97359
## 11            9  96175
## 12           20  89986
## 13           21  84398
## 14           22  82282
## 15            8  80781
## 16           23  63861
## 17            7  61054
## 18            6  52977
## 19            0  46288
## 20            5  32414
## 21            1  27531
## 22            2  20263
## 23            4  17594
## 24            3  15169

all %>% 
  filter(DOLocationID == 186) %>%
  count(request_hour, sort=TRUE) %>%
  summarize(Quant5 = quantile(n, probs = q[1]), 
            Quant15 = quantile(n, probs = q[2]),
            Quant50 = quantile(n, probs = q[3]),
            Quant85 = quantile(n, probs = q[4]),
            Quant95 = quantile(n, probs = q[5]))

##     Quant5  Quant15 Quant50  Quant85 Quant95
## 1 17994.35 29728.35   87192 119238.3  132876

A busy day and busy hour

What is the peak hour during a busy day (August 21)? 19:00, or 7:00 PM.

fhvhv %>% 
  filter(request_month == "August", request_date == 21) %>%
  {. ->> fhvhv_hcount} # this saves the count data


ggplotly(
  ggplot(fhvhv_hcount, aes(x=request_hour)) + 
   geom_bar( fill="lightblue", color="white") +
   theme_light() +
  theme(
    panel.grid.major.x = element_blank(),
    panel.border = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = 0.5)
  ) +
  ggtitle("FHVHV (August 21, 2022)") +
  xlab("Hour") +
  ylab("Total Number of Rides")
  )

taxi %>% 
  filter(request_month == "August", request_date == 21) %>%
  {. ->> taxi_hcount} # this saves the count data


ggplotly(
  ggplot(taxi_hcount, aes(x=request_hour)) + 
  geom_bar( fill="lightblue", color="white") +
  theme_light() +
  theme(
    panel.grid.major.x = element_blank(),
    panel.border = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = 0.5)
  ) +
  ggtitle("Taxi (August 21, 2022)") +
  xlab("Hour") +
  ylab("Total Number of Rides")
  )

Top origins and destinations

Where are people going? Where are people coming from? The tables below (which can be sorted for max to min) show top pickup and drop off zones with respect to trips ending or beginning in zone 186. The following are top locations.

Zone 79 (East Village), 230 (Times Sq), 161 (Midtown Center), 162 (Midtown East), 170 (Murray Hill), 48 (Clinton East)

Zone 265 is all points outside of NYC (excluding Newark Airport). This likely explains the imbalance between Origin and Destination for the NYC Taxis since there can’t be legitimate out-of-NYC TLC taxi pick ups. Conversely, FHVs that do not have TLC license cannot pick up in NYC (i.e .if someone takes a non-TLC Uber from NJ to NYC the Uber must leave the city before making next pick up).

There are many short distance trips in this dataset. Particularly taxi trips from zone 186 to zone 186 (trip around the block?). Perhaps we need better wayfinding so people know it’s a 5 minute walk to their destination nearby and we could eliminate 20K trips between taxis and FHVs that never even leave the zone. With many more also going to the adjacent zones (i.e. Times Square).

FHVHV

fhvhv_origins <- fhvhv %>% 
  filter(DOLocationID == 186) %>% 
  count(PULocationID) %>% 
  rename(
    "Pickup.Zone" = "PULocationID",
    "Total.Number.of.Rides" = "n"
    )


datatable(fhvhv_origins,
          class = 'cell-border stripe', 
          extensions = 'Buttons',
          fillContainer = FALSE, 
          options = list(pageLength = 10,
                         dom = 'Bfrtip',
                         buttons = c('csv', 'excel', 'pdf'), 
                         scrollX = TRUE, 
            selection="multiple"
          ))

fhvhv_destinations <- fhvhv %>% 
  filter(PULocationID == 186) %>% 
  count(DOLocationID) %>% 
  rename(
    "Dropoff.Zone" = "DOLocationID",
    "Total.Number.of.Rides" = "n"
    )


datatable(fhvhv_destinations,
          class = 'cell-border stripe', 
          extensions = 'Buttons',
          fillContainer = FALSE, 
          options = list(pageLength = 10,
                         dom = 'Bfrtip',
                         buttons = c('csv', 'excel', 'pdf'), 
                         scrollX = TRUE, 
            selection="multiple"
          ))

Taxi

taxi_origins <- taxi %>% 
  filter(DOLocationID == 186) %>% 
  count(PULocationID) %>% 
  rename(
    "Pickup.Zone" = "PULocationID",
    "Total.Number.of.Rides" = "n"
    )

datatable(taxi_origins,
          class = 'cell-border stripe', 
          extensions = 'Buttons',
          fillContainer = FALSE, 
          options = list(pageLength = 10,
                         dom = 'Bfrtip',
                         buttons = c('csv', 'excel', 'pdf'), 
                         scrollX = TRUE, 
            selection="multiple"
          ))

taxi_destinations <- taxi %>% 
  filter(PULocationID == 186) %>% 
  count(DOLocationID) %>% 
  rename(
    "Dropoff.Zone" = "DOLocationID",
    "Total.Number.of.Rides" = "n"
    )

datatable(taxi_destinations,
          class = 'cell-border stripe', 
          extensions = 'Buttons',
          fillContainer = FALSE, 
          options = list(pageLength = 10,
                         dom = 'Bfrtip',
                         buttons = c('csv', 'excel', 'pdf'), 
                         scrollX = TRUE, 
            selection="multiple"
          ))

2022 NYC Taxis and For Hire Vehicles, Zone 186