library(tidyverse)
library(RColorBrewer)
library(arrow)
library(ggfortify)
library(lubridate)
library(treemapify)
library(streamgraph)
library(paletteer)
library(ggalluvial)
Project 1
NYC Vehicle-For-Hire Data Analysis for March 2020
In March 2020, we were all witness to one of the most expansive shutdowns ever put in place, where lights were switched off and doors locked, some temporarily and others permanently. For decades, taxi service has been a defining characteristic of New York City. This project attempts to analyze vehicle-for-hire activity between February 29 and March 31, 2020. The following are a few notable variables and their definitions:
- dispatch_base: The NYC Taxi and Limousine Commission ID of the base that dispatched the trip
- pickup_location, aka PULocationID: TLC Taxi Zone in which the trip began
- dropoff_location, aka DOLocationID: TLC Taxi Zone in which the trip ended
load libraries
bring in taxi dataset
setwd("C:/Users/desir_7411ic3/Desktop/Montgomery College/DATA110/Project 1")
= read_parquet("fhv_tripdata_2020-03.parquet")
taxi_orig head(taxi_orig)
# A tibble: 6 × 7
dispatching_base_num pickup_datetime dropOff_datetime PUlocationID
<chr> <dttm> <dttm> <dbl>
1 B00013 2020-02-29 19:52:09 2020-02-29 20:18:43 264
2 B00013 2020-02-29 19:59:46 2020-02-29 20:14:18 264
3 B00013 2020-02-29 19:48:29 2020-02-29 20:33:03 264
4 B00013 2020-02-29 19:51:41 2020-02-29 20:32:46 264
5 B00013 2020-02-29 19:58:29 2020-02-29 20:36:04 264
6 B00013 2020-02-29 19:42:04 2020-02-29 20:04:15 264
# ℹ 3 more variables: DOlocationID <dbl>, SR_Flag <???>,
# Affiliated_base_number <chr>
reorganize datetime columns
<- taxi_orig %>%
taxi_dated select(-SR_Flag) %>%
na.omit(taxi_dated) %>%
mutate(diff = difftime(dropOff_datetime,pickup_datetime,units="mins")) %>%
separate(pickup_datetime,c("date","time"),sep=" ") %>%
separate(dropOff_datetime,c("date_dropoff","time_dropoff"),sep=" ") %>%
select(-date_dropoff)
head(taxi_dated)
# A tibble: 6 × 8
dispatching_base_num date time time_dropoff PUlocationID DOlocationID
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 B00013 2020-02-29 19:52:… 20:18:43 264 264
2 B00013 2020-02-29 19:59:… 20:14:18 264 264
3 B00013 2020-02-29 19:48:… 20:33:03 264 264
4 B00013 2020-02-29 19:51:… 20:32:46 264 264
5 B00013 2020-02-29 19:58:… 20:36:04 264 264
6 B00013 2020-02-29 19:42:… 20:04:15 264 264
# ℹ 2 more variables: Affiliated_base_number <chr>, diff <drtn>
re-ordering and renaming
<- taxi_dated %>%
taxi_dated mutate(across(c('diff'), round, 2)) %>%
rename_at("PUlocationID",~"pickup_location") %>%
rename_at("DOlocationID",~"dropoff_location") %>%
rename_at("dispatching_base_num",~"dispatch_base") %>%
rename_at("Affiliated_base_number",~"taxi_base") %>%
rename_at("diff",~"ride_duration") %>%
na.omit(taxi_dated) %>%
select(dispatch_base,taxi_base,date,time,ride_duration,pickup_location,dropoff_location)
omit blank rows
<- taxi_dated[!(taxi_dated$taxi_base==""),]
taxi_dated <- taxi_dated[!(taxi_dated$ride_duration==""),] taxi_dated
linear regression
<- aggregate(ride_duration ~ pickup_location, data=taxi_dated, FUN=mean)
reg_model_data <- lm(pickup_location ~ ride_duration, data=reg_model_data)
reg_model summary(reg_model)
Call:
lm(formula = pickup_location ~ ride_duration, data = reg_model_data)
Residuals:
Min 1Q Median 3Q Max
-131.879 -66.988 0.829 67.121 132.169
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 133.44831 15.00706 8.892 <2e-16 ***
ride_duration -0.01403 0.51565 -0.027 0.978
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 77.21 on 259 degrees of freedom
Multiple R-squared: 2.858e-06, Adjusted R-squared: -0.003858
F-statistic: 0.0007401 on 1 and 259 DF, p-value: 0.9783
<- ggplot(reg_model_data,aes(pickup_location,ride_duration,color=pickup_location)) + geom_smooth()
reg reg
Don't know how to automatically pick scale for object of type <difftime>.
Defaulting to continuous.
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: The following aesthetics were dropped during statistical transformation:
colour.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
Equation: ride_duration = -0.05(pickup_location) + 131.44
This linear regression posits pickup locations against ride duration. Taxi zones are roughly ordered by the year they grew in to their own neighborhood. Even allowing for gentrification and urban development, what this model finds is that the newer a community is, the less time it takes people to get where they want to go. This isn’t a surprising trend, as it makes sense to live as close to the center of your world as possible, but it is notable. Unfortunately, it is not statistically significant with a very high p-value of 0.9225. It should also be noted that this equation does not account for the majority of the data points, with an adjusted R^2 value near 0.
change date format
<- taxi_dated %>%
taxi_dated mutate(date = as.Date(taxi_dated$date))
<- lm(pickup_location ~ dropoff_location, data=taxi_dated) reg_model
compare pick-up and drop-off location zones
<- ggplot(taxi_dated,aes(x=pickup_location,dropoff_location,color="coral3")) +
zoning_graph labs(title="Pickup Location vs. Drop-Off Location, NYC March 2020",
x="Pickup Location Zone ###",
y="Drop-Off Location Zone ###",
caption="Sourced from NYC Taxi and Limousine Commission") +
geom_smooth(size=0.75) +
scale_x_continuous(breaks=seq(0,300,by=20)) +
theme_bw() +
theme(
legend.position = "none",
axis.text.x = element_text(angle=45)
)
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
zoning_graph
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
create dataset with top 20 dispatch bases and another with top 5
<- taxi_dated %>%
top20 group_by(dispatch_base) %>%
summarise(num_rides=n())
<- top20 %>%
top20 top_n(n=20)
Selecting by num_rides
<- top20 %>%
top5 top_n(n=5)
Selecting by num_rides
top20
# A tibble: 20 × 2
dispatch_base num_rides
<chr> <int>
1 B00256 17444
2 B00310 22982
3 B00647 20525
4 B00856 29617
5 B00900 19579
6 B00937 15182
7 B01087 12903
8 B01145 37142
9 B01231 13048
10 B01312 37521
11 B01437 15118
12 B01626 14677
13 B01899 19014
14 B02133 12869
15 B02437 14797
16 B02550 51279
17 B02794 15062
18 B02849 14080
19 B03060 15870
20 B03160 33385
top5
# A tibble: 5 × 2
dispatch_base num_rides
<chr> <int>
1 B00856 29617
2 B01145 37142
3 B01312 37521
4 B02550 51279
5 B03160 33385
create treemap for 20 busiest taxi dispatch stations
<- ggplot(top20,aes(area=num_rides,fill=num_rides,index=dispatch_base)) + geom_treemap() +
top20_tree geom_treemap_text(label=top20$dispatch_base,colour="lightgrey") +
scale_fill_viridis_c() +
labs(title="20 Busiest VFH Dispatch Bases in NYC, March 2020",caption="Size/Color refers to num_rides")
top20_tree
dataset for time series
<- taxi_dated %>%
chrono group_by(date) %>%
summarise(rides=n())
plot rides over month
<- ggplot(chrono,aes(date,rides,(date))) +
chrono_plot geom_point(aes(color=rides)) +
labs(title="Rides per Day in NYC, March 2020",
x="Date",
y="Taxi Rides",
caption="From NYC Taxi and Limousine Commission") +
theme_dark() +
scale_color_viridis_c(option="magma")
+ geom_line(aes(date,rides,color=rides)) chrono_plot
create a subset with dispatch bases, dates, and ridership
<- top5$dispatch_base
top5_list <- taxi_dated %>%
top5_dated group_by(date,dispatch_base) %>%
summarise(num=n())
<- subset(top5_dated,dispatch_base %in% top5_list)
top5_dated <- top5_dated[,c("dispatch_base","num","date")] top5_dated
create an alluvial
<- ggplot(top5_dated,
main aes(x=date,y=num,alluvium=dispatch_base)) +
geom_alluvium(aes(fill = dispatch_base),
color = "darkgrey",
width = .1,
alpha = .8,
decreasing = FALSE) +
labs(title="NYC VFH Ridership, 5 Largest Dispatchers, March 2020",
x="Date, March 2020",
y="Number of rides",
caption="From NYC Taxi and Limousine Commission") +
scale_fill_paletteer_d("ggsci::default_jama")
main
This graph is actually surprising. Vehicle-For-Hire dispatch base B02550 (the red one) is not a taxi or limousine service. It refers to Agape Luxury, a non-emergency medical transportation service. All throughout the month, they were far ahead of any of their competitors or colleagues. Looking at the graph as a whole, the decline is clear (please note that the constant dips are weekends, not special events). The graph notes an average of a 60% decline from the beginning of the month to the end of the month, with the notable dates being the transition from the week of March the 9th to the week of March the 16, and it literally went downhill from there. The most constant dispatcher, B01145 (tan, referring to Ultra Radio Dispatcher) is a popular taxi service because it allows any type of reservation to be made/viewed in its mobile app.
This project clearly showed the massive decline in vehicle-for-hire activity during the month of March 2020. Unfortunately, these aren’t just statistics. In 2014, a NYC taxi license cost around a million dollars, though it had fallen to $137,000 by 2019. Nonetheless, many of the city’s 50,000 taxi drivers were stuck with mountains of debt and no way to pay it off. This did not help the growing suicide rates among taxi drivers due to competition from corporate ridesharing. While success rates have gone up, some drivers and their businesses simply never recovered.