Trip data

The trip data included here is provided by the New York Taxi & Limousine Commission. For more details see https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page.

Although other providers are available (including local, borough and for-hire services), Yellow Cabs have been chosen to represent overall taxi traffic for two reasons: their license covers all of the city’s five boroughs, and a large amount of trip information is publicly available.

The dataset includes details on 6,896,317 trips, made by a fleet of approximately 13,500 Yellow Cabs.

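# Packages used throughout this analysis
library(tidyverse)  # read_csv, dplyr verbs, drop_na, ggplot2
library(lubridate)  # ymd_hms, day, month, year, wday, hour
library(sf)         # st_read and geom_sf support
library(viridis)    # scale_fill_viridis

yellowcab <- read_csv("C:/Users/timke/Downloads/yellow_tripdata_2019-12.csv")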

yellowcab$tpep_pickup_datetime <- ymd_hms(yellowcab$tpep_pickup_datetime)
yellowcab$day <- factor(day(yellowcab$tpep_pickup_datetime))
yellowcab$month <- factor(month(yellowcab$tpep_pickup_datetime, label = TRUE))
yellowcab$year <- factor(year(yellowcab$tpep_pickup_datetime))
yellowcab$dayofweek <- factor(wday(yellowcab$tpep_pickup_datetime, label = TRUE))
yellowcab$hour <- factor(hour(yellowcab$tpep_pickup_datetime))

# Rename the pickup zone column so it matches the key used in the taxi zone shapefile
yellowcab <- yellowcab %>%
  rename(LocationID = PULocationID)

Trip distances

yellowcab %>%
  filter(trip_distance >= 0) %>%
  ggplot() +
  geom_histogram(aes(x = trip_distance, fill= cut(trip_distance,30)), binwidth = 1) +
  labs(title = "Yellow Cab trip distance breakdown, December 2019", caption = "Source: New York Taxi & Limousine Commission") +
  scale_y_continuous(labels = function(x) format(x, big.mark = ",",
                                                 scientific = FALSE)) +
  scale_x_continuous(expand = c(0,0), limits = c(0,30)) +
  xlab('Trip distance (miles)') +
  ylab('Number of trips') +
  theme(legend.position = 'none')

Yellow Cab rides within New York City are relatively short, with most covering less than 2 miles. This is likely because the service is used most heavily within Manhattan, by far the city’s densest residential, commercial and entertainment center.

The small peak at around 18 miles corresponds to the distance between Manhattan and John F. Kennedy International Airport.
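The claim that most trips are shorter than 2 miles can be checked directly rather than read off the histogram. The following is a minimal sketch, assuming a small made-up vector of distances in place of yellowcab$trip_distance; the values are illustrative and not taken from the dataset.

```r
# Hypothetical trip distances in miles; a stand-in for yellowcab$trip_distance.
trip_distance <- c(0.8, 1.2, 1.9, 2.5, 0.5, 17.8, 3.1, 1.1)

# Share of trips shorter than 2 miles: the mean of a logical vector
# is the proportion of TRUE values.
short_share <- mean(trip_distance < 2)

# Median distance, a robust summary for a right-skewed distribution.
median_distance <- median(trip_distance)

print(short_share)
print(median_distance)
```

On the full data the same check would be mean(yellowcab$trip_distance < 2, na.rm = TRUE), with na.rm guarding against missing distances.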

Payment practices

yellowcab$payment_type[yellowcab$payment_type == 1] <- "Credit card"
yellowcab$payment_type[yellowcab$payment_type == 2] <- "Cash"
yellowcab$payment_type[yellowcab$payment_type == 3] <- "No charge"
yellowcab$payment_type[yellowcab$payment_type == 4] <- "Dispute"
yellowcab$payment_type[yellowcab$payment_type == 5] <- "Unknown"
yellowcab$payment_type[yellowcab$payment_type == 6] <- "Voided trip"

payment <- yellowcab %>%
  drop_na(payment_type) %>%
  group_by(payment_type) %>%
  summarise(payment_total = n()) 

payment %>%
  ggplot(aes(x = reorder(payment_type, desc(payment_total)), y=payment_total, fill = payment_type)) +
  geom_col() +
  geom_text(aes(label = scales::comma(payment_total)), vjust = -0.5, size=3) +
  theme(panel.background = element_blank()) +
  scale_y_continuous(labels=function(x) format(x, big.mark = ",", scientific = FALSE)) +
  scale_fill_brewer(palette = 'Set2') +
  ylab("") +
  xlab('Payment type') +
  labs(title = "Payment practices for Yellow Cab journeys, December 2019", subtitle = "Frequency of payment method", caption = "Source: New York Taxi & Limousine Commission", fill = '')

Credit card is the most frequent payment method, used more than twice as often as cash. Disputes over the charge occur in fewer than 0.3% of cases.
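The proportions quoted above follow from the counts in the payment summary table. The sketch below shows one way to derive them, assuming a table with the same two columns; the counts are hypothetical placeholders, not the December 2019 figures.

```r
library(dplyr)

# Hypothetical counts; a stand-in for the `payment` summary table.
payment <- data.frame(
  payment_type  = c("Credit card", "Cash", "No charge", "Dispute"),
  payment_total = c(700, 280, 15, 5)
)

# Each payment type as a share of all trips, largest first.
payment_share <- payment %>%
  mutate(share = payment_total / sum(payment_total)) %>%
  arrange(desc(share))

# Ratio of credit card payments to cash payments.
card_to_cash <- payment$payment_total[payment$payment_type == "Credit card"] /
  payment$payment_total[payment$payment_type == "Cash"]

print(payment_share)
print(card_to_cash)
```

Wrapping the share column in scales::percent() would give the figures in the form used in the text.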

Daily traffic patterns

cabrides <- yellowcab %>%
  group_by(day, hour) %>%
  summarise(total = n())

cabrides %>%
  ggplot(aes(day, hour, fill = total)) +
  geom_tile() +
  theme(panel.background = element_blank()) +
  scale_fill_viridis(labels=function(x) format(x, big.mark = ",", scientific = FALSE)) +
  xlab('Day') +
  ylab('Hour') +
  labs(title = "Yellow Cab pickups in New York City, December 2019", subtitle = "Distribution of rides by hour of the day", caption = "Source: New York Taxi & Limousine Commission", fill = 'Total rides')

Outside the holiday period, taxi use follows a clear daily pattern: a small peak at around 8am, after which use picks up again in the mid-afternoon. A consistent lull at 4pm marks the boundary between daytime and evening use. Demand then climbs to its most intense level at 6pm and stays high until 11pm.

At weekends this busy evening period extends well into the early morning, before dying off between 4 and 5am. Weekend mid-mornings are consequently as quiet as the middle of a weekday night.

As expected, business tapers off going into the holiday period. The weekend immediately before Christmas sees diminishing use, with the intensity weakening quickly after midnight.

Apart from a lull on Christmas Day itself, usage trends become much less pronounced over the holiday, with consistent but diminished use of the service throughout the day. One of the few notable features is a small spike in use in the late evening of New Year’s Eve.

Pick-ups in Manhattan

zonerides <- yellowcab %>%
  group_by(LocationID) %>%
  summarise(total = n())

taxizones <- st_read("C:/Users/timke/Downloads/taxi_zones.shp")

taxizonerides <- merge(taxizones, zonerides, by='LocationID')

taxizonerides <- taxizonerides %>%
  filter(borough == "Manhattan")

ggplot() +
  geom_sf(data=taxizonerides, aes(fill=total)) +
  coord_sf() +
  theme(panel.background = element_blank(), axis.text = element_blank(), axis.ticks = element_blank()) +
  labs(title = "Yellow Cab pickup locations in Manhattan, December 2019", subtitle = "Geographical spread of pickups by taxi zone", caption ="Source: New York Taxi & Limousine Commission", fill = 'Rides') +
  scale_fill_viridis(labels=function(x) format(x, big.mark = ",", scientific = FALSE))

The busiest pickup zones are clustered around the tourist area in Midtown and appear to follow Broadway closely, from Lincoln Center to Union Square. The Upper East Side also sees especially heavy use.

Tipping in Manhattan

zonerides2 <- yellowcab %>%
  filter(tip_amount > 0) %>%
  mutate(tip = (tip_amount/(total_amount-tip_amount))) %>%
  filter(tip < 0.6) %>%
  group_by(LocationID) %>%
  summarise(total = n(), avg_tip = mean(tip))

taxizonerides2 <- merge(taxizones, zonerides2, by='LocationID')

taxizonerides2 <- taxizonerides2 %>%
  filter(borough == "Manhattan")

options(scipen=999)
ggplot() +
  geom_sf(data=taxizonerides2, aes(fill=avg_tip)) +
  coord_sf() +
  theme(panel.background = element_blank(), axis.text = element_blank(), axis.ticks = element_blank()) +
  labs(title = "Tips by pickup location, December 2019", subtitle = "Average tip as a percentage of the fare", caption ="Source: New York Taxi & Limousine Commission", fill ="") +
  scale_fill_viridis(labels = scales::percent)

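The tip is calculated here as a proportion of the total charge before the tip (total_amount minus tip_amount), which includes surcharges, taxes and tolls as well as the time-and-distance fare. Rides with no recorded tip have been excluded, as have rides where the calculated tip is 60% or more. Because the TLC data records only credit card tips, cash tips are not reflected.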
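The tip calculation in the code above divides each tip by the total charge less the tip, then averages by pickup zone. The sketch below walks through that arithmetic on four made-up rides; the column names follow the TLC trip records, but the values are hypothetical.

```r
library(dplyr)

# Hypothetical rides; values chosen only to illustrate each filter.
rides <- data.frame(
  LocationID   = c(1, 1, 2, 2),
  tip_amount   = c(2, 0, 3, 10),
  total_amount = c(12, 8, 15, 15)
)

tips <- rides %>%
  # Drop rides with no recorded tip.
  filter(tip_amount > 0) %>%
  # Tip relative to the charge before the tip: 2/10, 3/12 and 10/5.
  mutate(tip = tip_amount / (total_amount - tip_amount)) %>%
  # Drop outliers where the tip is 60% or more (here, the 10/5 ride).
  filter(tip < 0.6) %>%
  group_by(LocationID) %>%
  summarise(total = n(), avg_tip = mean(tip))

print(tips)
```

In this toy data each zone keeps one ride, giving average tips of 20% and 25%. To measure tips against the time-and-distance fare alone, one option would be to divide by fare_amount instead, which would change the results shown on the map.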

Although the Upper East Side is a wealthy area, trips beginning there are notable for a lower average tip than elsewhere. One possible explanation is the large number of rides that originate in that zone; frequent travel may well lead to more conservative tipping practices.