Project 1

Author

Michael Desir

NYC Vehicle-For-Hire Data Analysis for March 2020

Yellow taxi on blurred background

Credit Kai Pilger from Unsplash

In March 2020, we were all witness to one of the most expansive shutdowns ever put in place, where lights were switched off and doors locked, some temporarily and others permanently. For decades, taxi service has been a defining characteristic of New York City. This project attempts to analyze vehicle-for-hire activity between February 29 and March 31, 2020. The following are a few notable variables and their definitions:

  • dispatch_base: The NYC Taxi and Limousine Commission ID of the base that dispatched the trip
  • pickup_location, aka PULocationID: TLC Taxi Zone in which the trip began
  • dropoff_location, aka DOLocationID: TLC Taxi Zone in which the trip ended

load libraries

library(tidyverse)
library(RColorBrewer)
library(arrow)
library(ggfortify)
library(lubridate)
library(treemapify)
library(streamgraph)
library(paletteer)
library(ggalluvial)

bring in taxi dataset

setwd("C:/Users/desir_7411ic3/Desktop/Montgomery College/DATA110/Project 1")
taxi_orig = read_parquet("fhv_tripdata_2020-03.parquet")
head(taxi_orig)
# A tibble: 6 × 7
  dispatching_base_num pickup_datetime     dropOff_datetime    PUlocationID
  <chr>                <dttm>              <dttm>                     <dbl>
1 B00013               2020-02-29 19:52:09 2020-02-29 20:18:43          264
2 B00013               2020-02-29 19:59:46 2020-02-29 20:14:18          264
3 B00013               2020-02-29 19:48:29 2020-02-29 20:33:03          264
4 B00013               2020-02-29 19:51:41 2020-02-29 20:32:46          264
5 B00013               2020-02-29 19:58:29 2020-02-29 20:36:04          264
6 B00013               2020-02-29 19:42:04 2020-02-29 20:04:15          264
# ℹ 3 more variables: DOlocationID <dbl>, SR_Flag <???>,
#   Affiliated_base_number <chr>

reorganize datetime columns

taxi_dated <- taxi_orig %>% 
  select(-SR_Flag) %>%
  na.omit(taxi_dated) %>%
  mutate(diff = difftime(dropOff_datetime,pickup_datetime,units="mins")) %>%
  separate(pickup_datetime,c("date","time"),sep=" ") %>%
  separate(dropOff_datetime,c("date_dropoff","time_dropoff"),sep=" ") %>%
  select(-date_dropoff)
head(taxi_dated)
# A tibble: 6 × 8
  dispatching_base_num date       time    time_dropoff PUlocationID DOlocationID
  <chr>                <chr>      <chr>   <chr>               <dbl>        <dbl>
1 B00013               2020-02-29 19:52:… 20:18:43              264          264
2 B00013               2020-02-29 19:59:… 20:14:18              264          264
3 B00013               2020-02-29 19:48:… 20:33:03              264          264
4 B00013               2020-02-29 19:51:… 20:32:46              264          264
5 B00013               2020-02-29 19:58:… 20:36:04              264          264
6 B00013               2020-02-29 19:42:… 20:04:15              264          264
# ℹ 2 more variables: Affiliated_base_number <chr>, diff <drtn>

re-ordering and renaming

taxi_dated <- taxi_dated %>% 
  mutate(across(c('diff'), round, 2)) %>%
  rename_at("PUlocationID",~"pickup_location") %>%
  rename_at("DOlocationID",~"dropoff_location") %>%
  rename_at("dispatching_base_num",~"dispatch_base") %>%
  rename_at("Affiliated_base_number",~"taxi_base") %>%
  rename_at("diff",~"ride_duration") %>%
  na.omit(taxi_dated) %>%
  select(dispatch_base,taxi_base,date,time,ride_duration,pickup_location,dropoff_location)

omit blank rows

taxi_dated <- taxi_dated[!(taxi_dated$taxi_base==""),]
taxi_dated <- taxi_dated[!(taxi_dated$ride_duration==""),]

linear regression

reg_model_data <- aggregate(ride_duration ~ pickup_location, data=taxi_dated, FUN=mean)
reg_model <- lm(pickup_location ~ ride_duration, data=reg_model_data)
summary(reg_model)

Call:
lm(formula = pickup_location ~ ride_duration, data = reg_model_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-131.879  -66.988    0.829   67.121  132.169 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   133.44831   15.00706   8.892   <2e-16 ***
ride_duration  -0.01403    0.51565  -0.027    0.978    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 77.21 on 259 degrees of freedom
Multiple R-squared:  2.858e-06, Adjusted R-squared:  -0.003858 
F-statistic: 0.0007401 on 1 and 259 DF,  p-value: 0.9783
reg <- ggplot(reg_model_data,aes(pickup_location,ride_duration,color=pickup_location)) + geom_smooth()
reg
Don't know how to automatically pick scale for object of type <difftime>.
Defaulting to continuous.
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: The following aesthetics were dropped during statistical transformation:
colour.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

Equation: ride_duration = -0.05(pickup_location) + 131.44

This linear regression posits pickup locations against ride duration. Taxi zones are roughly ordered by the year they grew in to their own neighborhood. Even allowing for gentrification and urban development, what this model finds is that the newer a community is, the less time it takes people to get where they want to go. This isn’t a surprising trend, as it makes sense to live as close to the center of your world as possible, but it is notable. Unfortunately, it is not statistically significant with a very high p-value of 0.9225. It should also be noted that this equation does not account for the majority of the data points, with an adjusted R^2 value near 0.

change date format

taxi_dated <- taxi_dated %>%
  mutate(date = as.Date(taxi_dated$date))
reg_model <- lm(pickup_location ~ dropoff_location, data=taxi_dated)

compare pick-up and drop-off location zones

zoning_graph <- ggplot(taxi_dated,aes(x=pickup_location,dropoff_location,color="coral3")) +
  labs(title="Pickup Location vs. Drop-Off Location, NYC March 2020",
       x="Pickup Location Zone ###",
       y="Drop-Off Location Zone ###",
       caption="Sourced from NYC Taxi and Limousine Commission") +
  geom_smooth(size=0.75) +
  scale_x_continuous(breaks=seq(0,300,by=20)) + 
  theme_bw() +
  theme(
    legend.position = "none",
    axis.text.x = element_text(angle=45)
  )
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
zoning_graph
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

create dataset with top 20 dispatch bases and another with top 5

top20 <- taxi_dated %>%
  group_by(dispatch_base) %>%
  summarise(num_rides=n())
top20 <- top20 %>%
  top_n(n=20)
Selecting by num_rides
top5 <- top20 %>%
  top_n(n=5)
Selecting by num_rides
top20
# A tibble: 20 × 2
   dispatch_base num_rides
   <chr>             <int>
 1 B00256            17444
 2 B00310            22982
 3 B00647            20525
 4 B00856            29617
 5 B00900            19579
 6 B00937            15182
 7 B01087            12903
 8 B01145            37142
 9 B01231            13048
10 B01312            37521
11 B01437            15118
12 B01626            14677
13 B01899            19014
14 B02133            12869
15 B02437            14797
16 B02550            51279
17 B02794            15062
18 B02849            14080
19 B03060            15870
20 B03160            33385
top5
# A tibble: 5 × 2
  dispatch_base num_rides
  <chr>             <int>
1 B00856            29617
2 B01145            37142
3 B01312            37521
4 B02550            51279
5 B03160            33385

create treemap for 20 busiest taxi dispatch stations

top20_tree <- ggplot(top20,aes(area=num_rides,fill=num_rides,index=dispatch_base)) + geom_treemap() +
  geom_treemap_text(label=top20$dispatch_base,colour="lightgrey") +
  scale_fill_viridis_c() +
  labs(title="20 Busiest VFH Dispatch Bases in NYC, March 2020",caption="Size/Color refers to num_rides")
top20_tree

dataset for time series

chrono <- taxi_dated %>%
  group_by(date) %>%
  summarise(rides=n())

plot rides over month

chrono_plot <- ggplot(chrono,aes(date,rides,(date))) +
  geom_point(aes(color=rides)) +
  labs(title="Rides per Day in NYC, March 2020",
       x="Date",
       y="Taxi Rides",
       caption="From NYC Taxi and Limousine Commission") +
  theme_dark() +
  scale_color_viridis_c(option="magma")
chrono_plot + geom_line(aes(date,rides,color=rides))

create a subset with dispatch bases, dates, and ridership

top5_list <- top5$dispatch_base
top5_dated <- taxi_dated %>%
  group_by(date,dispatch_base) %>%
  summarise(num=n())
top5_dated <- subset(top5_dated,dispatch_base %in% top5_list)
top5_dated <- top5_dated[,c("dispatch_base","num","date")]

create an alluvial

main <- ggplot(top5_dated,
               aes(x=date,y=num,alluvium=dispatch_base)) +
  geom_alluvium(aes(fill = dispatch_base), 
                color = "darkgrey",
                width = .1, 
                alpha = .8,
                decreasing = FALSE) +
  labs(title="NYC VFH Ridership, 5 Largest Dispatchers, March 2020",
       x="Date, March 2020",
       y="Number of rides",
       caption="From NYC Taxi and Limousine Commission") +
  scale_fill_paletteer_d("ggsci::default_jama")
main

This graph is actually surprising. Vehicle-For-Hire dispatch base B02550 (the red one) is not a taxi or limousine service. It refers to Agape Luxury, a non-emergency medical transportation service. All throughout the month, they were far ahead of any of their competitors or colleagues. Looking at the graph as a whole, the decline is clear (please note that the constant dips are weekends, not special events). The graph notes an average of a 60% decline from the beginning of the month to the end of the month, with the notable dates being the transition from the week of March the 9th to the week of March the 16, and it literally went downhill from there. The most constant dispatcher, B01145 (tan, referring to Ultra Radio Dispatcher) is a popular taxi service because it allows any type of reservation to be made/viewed in its mobile app.

This project clearly showed the massive decline in vehicle-for-hire activity during the month of March 2020. Unfortunately, these aren’t just statistics. In 2014, a NYC taxi license cost around a million dollars, though it had fallen to $137,000 by 2019. Nonetheless, many of the city’s 50,000 taxi drivers were stuck with mountains of debt and no way to pay it off. This did not help the growing suicide rates among taxi drivers due to competition from corporate ridesharing. While success rates have gone up, some drivers and their businesses simply never recovered.