library(tidyverse)
library(RColorBrewer)
library(arrow)
library(ggfortify)
library(lubridate)
library(treemapify)
library(streamgraph)
library(paletteer)
library(ggalluvial)Project 1
NYC Vehicle-For-Hire Data Analysis for March 2020
In March 2020, we were all witness to one of the most expansive shutdowns ever put in place, where lights were switched off and doors locked, some temporarily and others permanently. For decades, taxi service has been a defining characteristic of New York City. This project attempts to analyze vehicle-for-hire activity between February 29 and March 31, 2020. The following are a few notable variables and their definitions:
- dispatch_base: The NYC Taxi and Limousine Commission ID of the base that dispatched the trip
- pickup_location, aka PULocationID: TLC Taxi Zone in which the trip began
- dropoff_location, aka DOLocationID: TLC Taxi Zone in which the trip ended
load libraries
bring in taxi dataset
setwd("C:/Users/desir_7411ic3/Desktop/Montgomery College/DATA110/Project 1")
taxi_orig = read_parquet("fhv_tripdata_2020-03.parquet")
head(taxi_orig)# A tibble: 6 × 7
dispatching_base_num pickup_datetime dropOff_datetime PUlocationID
<chr> <dttm> <dttm> <dbl>
1 B00013 2020-02-29 19:52:09 2020-02-29 20:18:43 264
2 B00013 2020-02-29 19:59:46 2020-02-29 20:14:18 264
3 B00013 2020-02-29 19:48:29 2020-02-29 20:33:03 264
4 B00013 2020-02-29 19:51:41 2020-02-29 20:32:46 264
5 B00013 2020-02-29 19:58:29 2020-02-29 20:36:04 264
6 B00013 2020-02-29 19:42:04 2020-02-29 20:04:15 264
# ℹ 3 more variables: DOlocationID <dbl>, SR_Flag <???>,
# Affiliated_base_number <chr>
reorganize datetime columns
taxi_dated <- taxi_orig %>%
select(-SR_Flag) %>%
na.omit(taxi_dated) %>%
mutate(diff = difftime(dropOff_datetime,pickup_datetime,units="mins")) %>%
separate(pickup_datetime,c("date","time"),sep=" ") %>%
separate(dropOff_datetime,c("date_dropoff","time_dropoff"),sep=" ") %>%
select(-date_dropoff)
head(taxi_dated)# A tibble: 6 × 8
dispatching_base_num date time time_dropoff PUlocationID DOlocationID
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 B00013 2020-02-29 19:52:… 20:18:43 264 264
2 B00013 2020-02-29 19:59:… 20:14:18 264 264
3 B00013 2020-02-29 19:48:… 20:33:03 264 264
4 B00013 2020-02-29 19:51:… 20:32:46 264 264
5 B00013 2020-02-29 19:58:… 20:36:04 264 264
6 B00013 2020-02-29 19:42:… 20:04:15 264 264
# ℹ 2 more variables: Affiliated_base_number <chr>, diff <drtn>
re-ordering and renaming
taxi_dated <- taxi_dated %>%
mutate(across(c('diff'), round, 2)) %>%
rename_at("PUlocationID",~"pickup_location") %>%
rename_at("DOlocationID",~"dropoff_location") %>%
rename_at("dispatching_base_num",~"dispatch_base") %>%
rename_at("Affiliated_base_number",~"taxi_base") %>%
rename_at("diff",~"ride_duration") %>%
na.omit(taxi_dated) %>%
select(dispatch_base,taxi_base,date,time,ride_duration,pickup_location,dropoff_location)omit blank rows
taxi_dated <- taxi_dated[!(taxi_dated$taxi_base==""),]
taxi_dated <- taxi_dated[!(taxi_dated$ride_duration==""),]linear regression
reg_model_data <- aggregate(ride_duration ~ pickup_location, data=taxi_dated, FUN=mean)
reg_model <- lm(pickup_location ~ ride_duration, data=reg_model_data)
summary(reg_model)
Call:
lm(formula = pickup_location ~ ride_duration, data = reg_model_data)
Residuals:
Min 1Q Median 3Q Max
-131.879 -66.988 0.829 67.121 132.169
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 133.44831 15.00706 8.892 <2e-16 ***
ride_duration -0.01403 0.51565 -0.027 0.978
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 77.21 on 259 degrees of freedom
Multiple R-squared: 2.858e-06, Adjusted R-squared: -0.003858
F-statistic: 0.0007401 on 1 and 259 DF, p-value: 0.9783
reg <- ggplot(reg_model_data,aes(pickup_location,ride_duration,color=pickup_location)) + geom_smooth()
regDon't know how to automatically pick scale for object of type <difftime>.
Defaulting to continuous.
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Warning: The following aesthetics were dropped during statistical transformation:
colour.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
variable into a factor?
Equation: ride_duration = -0.05(pickup_location) + 131.44
This linear regression posits pickup locations against ride duration. Taxi zones are roughly ordered by the year they grew in to their own neighborhood. Even allowing for gentrification and urban development, what this model finds is that the newer a community is, the less time it takes people to get where they want to go. This isn’t a surprising trend, as it makes sense to live as close to the center of your world as possible, but it is notable. Unfortunately, it is not statistically significant with a very high p-value of 0.9225. It should also be noted that this equation does not account for the majority of the data points, with an adjusted R^2 value near 0.
change date format
taxi_dated <- taxi_dated %>%
mutate(date = as.Date(taxi_dated$date))
reg_model <- lm(pickup_location ~ dropoff_location, data=taxi_dated)compare pick-up and drop-off location zones
zoning_graph <- ggplot(taxi_dated,aes(x=pickup_location,dropoff_location,color="coral3")) +
labs(title="Pickup Location vs. Drop-Off Location, NYC March 2020",
x="Pickup Location Zone ###",
y="Drop-Off Location Zone ###",
caption="Sourced from NYC Taxi and Limousine Commission") +
geom_smooth(size=0.75) +
scale_x_continuous(breaks=seq(0,300,by=20)) +
theme_bw() +
theme(
legend.position = "none",
axis.text.x = element_text(angle=45)
)Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
zoning_graph`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
create dataset with top 20 dispatch bases and another with top 5
top20 <- taxi_dated %>%
group_by(dispatch_base) %>%
summarise(num_rides=n())
top20 <- top20 %>%
top_n(n=20)Selecting by num_rides
top5 <- top20 %>%
top_n(n=5)Selecting by num_rides
top20# A tibble: 20 × 2
dispatch_base num_rides
<chr> <int>
1 B00256 17444
2 B00310 22982
3 B00647 20525
4 B00856 29617
5 B00900 19579
6 B00937 15182
7 B01087 12903
8 B01145 37142
9 B01231 13048
10 B01312 37521
11 B01437 15118
12 B01626 14677
13 B01899 19014
14 B02133 12869
15 B02437 14797
16 B02550 51279
17 B02794 15062
18 B02849 14080
19 B03060 15870
20 B03160 33385
top5# A tibble: 5 × 2
dispatch_base num_rides
<chr> <int>
1 B00856 29617
2 B01145 37142
3 B01312 37521
4 B02550 51279
5 B03160 33385
create treemap for 20 busiest taxi dispatch stations
top20_tree <- ggplot(top20,aes(area=num_rides,fill=num_rides,index=dispatch_base)) + geom_treemap() +
geom_treemap_text(label=top20$dispatch_base,colour="lightgrey") +
scale_fill_viridis_c() +
labs(title="20 Busiest VFH Dispatch Bases in NYC, March 2020",caption="Size/Color refers to num_rides")
top20_treedataset for time series
chrono <- taxi_dated %>%
group_by(date) %>%
summarise(rides=n())plot rides over month
chrono_plot <- ggplot(chrono,aes(date,rides,(date))) +
geom_point(aes(color=rides)) +
labs(title="Rides per Day in NYC, March 2020",
x="Date",
y="Taxi Rides",
caption="From NYC Taxi and Limousine Commission") +
theme_dark() +
scale_color_viridis_c(option="magma")
chrono_plot + geom_line(aes(date,rides,color=rides))create a subset with dispatch bases, dates, and ridership
top5_list <- top5$dispatch_base
top5_dated <- taxi_dated %>%
group_by(date,dispatch_base) %>%
summarise(num=n())
top5_dated <- subset(top5_dated,dispatch_base %in% top5_list)
top5_dated <- top5_dated[,c("dispatch_base","num","date")]create an alluvial
main <- ggplot(top5_dated,
aes(x=date,y=num,alluvium=dispatch_base)) +
geom_alluvium(aes(fill = dispatch_base),
color = "darkgrey",
width = .1,
alpha = .8,
decreasing = FALSE) +
labs(title="NYC VFH Ridership, 5 Largest Dispatchers, March 2020",
x="Date, March 2020",
y="Number of rides",
caption="From NYC Taxi and Limousine Commission") +
scale_fill_paletteer_d("ggsci::default_jama")
mainThis graph is actually surprising. Vehicle-For-Hire dispatch base B02550 (the red one) is not a taxi or limousine service. It refers to Agape Luxury, a non-emergency medical transportation service. All throughout the month, they were far ahead of any of their competitors or colleagues. Looking at the graph as a whole, the decline is clear (please note that the constant dips are weekends, not special events). The graph notes an average of a 60% decline from the beginning of the month to the end of the month, with the notable dates being the transition from the week of March the 9th to the week of March the 16, and it literally went downhill from there. The most constant dispatcher, B01145 (tan, referring to Ultra Radio Dispatcher) is a popular taxi service because it allows any type of reservation to be made/viewed in its mobile app.
This project clearly showed the massive decline in vehicle-for-hire activity during the month of March 2020. Unfortunately, these aren’t just statistics. In 2014, a NYC taxi license cost around a million dollars, though it had fallen to $137,000 by 2019. Nonetheless, many of the city’s 50,000 taxi drivers were stuck with mountains of debt and no way to pay it off. This did not help the growing suicide rates among taxi drivers due to competition from corporate ridesharing. While success rates have gone up, some drivers and their businesses simply never recovered.