library(tidyverse)
library(arrow)
library(lubridate)
library(sf)
library(cowplot)

Introduction and Exploratory Data Analysis

I’ll focus on a massive dataset detailing all taxis rides in New York City since 2009. The data is maintained by the NYC government at the following site.

https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

The data are maintained in a series of many files, broken out month year, month and taxi type (e.g. yellow, green, rideshare). I’ll begin by reading in a single file to get a sense of the data. The data is stored in .parquet format, so I’ll need the arrow package’s read_parquet function.

pq_file <- tempfile()
url <- paste0('https://d37ci6vzurychx.cloudfront.net/',
              'trip-data/yellow_tripdata_2011-01.parquet')

download.file(url, pq_file, mode = 'wb')

df <- read_parquet(pq_file, as_data_frame = TRUE)

head(df)
## # A tibble: 6 × 19
##   VendorID tpep_pickup_datetime tpep_dropoff_datetime passenge…¹ trip_…² Ratec…³
##      <int> <dttm>               <dttm>                     <int>   <dbl>   <int>
## 1        2 2010-12-31 19:10:00  2010-12-31 19:12:00            4       0       1
## 2        2 2010-12-31 19:04:00  2010-12-31 19:13:00            4       0       1
## 3        2 2010-12-31 19:14:00  2010-12-31 19:16:00            4       0       1
## 4        2 2010-12-31 19:04:00  2010-12-31 19:06:00            5       0       1
## 5        2 2010-12-31 19:08:00  2010-12-31 19:08:00            5       0       1
## 6        2 2010-12-31 19:23:00  2010-12-31 19:23:00            1       0       1
## # … with 13 more variables: store_and_fwd_flag <chr>, PULocationID <int>,
## #   DOLocationID <int>, payment_type <int>, fare_amount <dbl>, extra <dbl>,
## #   mta_tax <dbl>, tip_amount <dbl>, tolls_amount <dbl>,
## #   improvement_surcharge <dbl>, total_amount <dbl>,
## #   congestion_surcharge <dbl>, airport_fee <dbl>, and abbreviated variable
## #   names ¹​passenger_count, ²​trip_distance, ³​RatecodeID

It looks like we’re dealing with a LOT of data. I’ll put together a quick summary to see how many rides we’re looking at.

df %>%
  filter(year(tpep_pickup_datetime) != 2022 | month(tpep_pickup_datetime) != 1) %>%
  group_by(date(tpep_pickup_datetime)) %>%
  summarize(ride_count = n())
## # A tibble: 32 × 2
##    `date(tpep_pickup_datetime)` ride_count
##    <date>                            <int>
##  1 2010-12-31                        56071
##  2 2011-01-01                       155237
##  3 2011-01-02                       156078
##  4 2011-01-03                       372666
##  5 2011-01-04                       411178
##  6 2011-01-05                       442927
##  7 2011-01-06                       488474
##  8 2011-01-07                       535683
##  9 2011-01-08                       521358
## 10 2011-01-09                       360196
## # … with 22 more rows

For single taxi type in a single month, we have ###### individual observations. So, it seems like we’ll need to do some aggregation. Moreover, even though this file is meant to captured only January 2022, there appear to be some observations from other years and months. Given that I plan to aggregate by month and year, I’ll need to perform some cleaning on each files.

I also notice that files for all taxi types are not available back to 2009. Green taxis were introduced in 2013, and for-hire vehicles (which refers to private rideshare cabs like Uber and Lyft) were not tracked until 2015 (originally under “fhv” and then split into high and low volume categories, represented as “fhvhv” and “fhv”, respectively).

After performing the above exploration on files from for each taxi type and across several years, it became clear that the file format has also changed a few times over the years. While I’d like to analyze costs and volume of passengers, those variables appear inconsistently across the data. The variables that appear most consistently are date-time and pickup/dropoff location. So, I’ll plan to focus on the total volume of rides across time and geography.

Because of the size of the data, I’ll need to aggregate as I read in. So, before working to bring in ALL the data, I’ll run a sample aggregation to test ensure I’m aggregating in a form that will support the above analysis. I’ll use the dataframe from the sample above from January 2011 for yellow cabs. Because the column names change over time and across different taxi types, I will first standardize the names for key columns. Then, I pull out the month and year based on the pick-up date, clean up any NAs in location fields, then filter out any cab rides that don’t have dates in the month/year in question. I had trouble creating distinct counts for pickups and dropoffs across locations with a single summarize function, so I’ll split the process in half. I create two summary data frames that separately sum all pickups and dropoffs by year, month and location. Then, I create a placeholder dataframe that lists all locations IDs (from 1 to 265, with 0 for NAs) and finally use a left_join to bring in the counts of pickups and dropoffs.

colnames(df) <- tolower(colnames(df)) %>%
        str_replace_all('.*pickup_date.*','pickup_date') %>%
        str_replace_all('total_am.*','total_amount') %>%
        str_replace_all('.*p.*location.*','pickup_location') %>%
        str_replace_all('.*d.*location.*','dropoff_location')

df <- df %>%
        mutate(year = year(pickup_date),
               month = month(pickup_date),
               pickup_location = replace_na(pickup_location,0),
               dropoff_location = replace_na(dropoff_location,0)) %>%
        filter(year == 2011 &
                 month == 1)

aggr_data1 <- df %>%
  group_by(year, month, pickup_location) %>%
  summarize(pickups = n(), .groups = 'keep')

aggr_data2 <- df %>%
  group_by(year, month, dropoff_location) %>%
  summarize(dropoffs = n(), .groups = 'keep')

holder <- data.frame(year = 2022,
                     month = 1,
                     location = seq(from = 0, to = 265))
  
holder %>%
  left_join(aggr_data1, by = c('year','month','location' = 'pickup_location')) %>%
  left_join(aggr_data2, by = c('year','month','location' = 'dropoff_location')) %>%
  head(10)
##    year month location pickups dropoffs
## 1  2022     1        0      NA       NA
## 2  2022     1        1      NA       NA
## 3  2022     1        2      NA       NA
## 4  2022     1        3      NA       NA
## 5  2022     1        4      NA       NA
## 6  2022     1        5      NA       NA
## 7  2022     1        6      NA       NA
## 8  2022     1        7      NA       NA
## 9  2022     1        8      NA       NA
## 10 2022     1        9      NA       NA

This approach gives me what I need in a size that I can handle, so let’s move ahead!

Read and aggregate data

Considering all of the above, I constructed the below function to pull and aggregate data. The function requires a specific taxi type, a starting year and an ending year. It reads through all files matching these input parameters and aggregates the volume of rides, broken out by year, month and location. Error handling accounts for missing files, and progress is tracked in a distinct progress dataframe.

pull_taxi_data_aggr <- function(taxi_type, start_year, end_year, quiet = FALSE) {
  aggregated_df <- data.frame(matrix(nrow = 0, ncol = 0))
  process_summary <- data.frame(matrix(nrow = 0, ncol = 0))
  
  for (yr in start_year:end_year) {
    for (mnth in 1:12) {
      # attempt file download
      time_check <- now()
      year_month <- paste0(yr,'-',sprintf('%02d',mnth))
      pq_file <- tempfile()
      url <- paste0('https://d37ci6vzurychx.cloudfront.net/trip-data/',
                    taxi_type,'_tripdata_',year_month,'.parquet')
      suppressWarnings({
        test <- tryCatch(download.file(url, pq_file, mode = 'wb', quiet = TRUE), 
                         error = function(e) e)
      })
      if('error' %in% class(test)) {
        progress <- data.frame(year_month, file_size = NA, time_check)
        process_summary <- bind_rows(process_summary, progress)
        next
      }
      
      # mark progress
      file_size <- file.size(pq_file) / 10^6
      progress <- data.frame(year_month, file_size , time_check)
      process_summary <- bind_rows(process_summary, progress)
      if (quiet == FALSE) {
       print(paste0('Progress  --  ',
                           'File: ',year_month,'  --  ',
                           'Size: ',file_size,'MB  --  ',
                           'Time: ',time_check)) 
      }
      
      # read to df and clean    
      pq_data <- read_parquet(pq_file, as_data_frame = TRUE)
      colnames(pq_data) <- tolower(colnames(pq_data)) %>%
        str_replace_all('.*pickup_date.*','pickup_date') %>%
        str_replace_all('total_am.*','total_amount') %>%
        str_replace_all('.*p.*location.*','pickup_location') %>%
        str_replace_all('.*d.*location.*','dropoff_location')
      
      pq_data <- pq_data %>%
        mutate(year = year(pickup_date),
               month = month(pickup_date),
               pickup_location = replace_na(pickup_location,0),
               dropoff_location = replace_na(dropoff_location,0)) %>%
        filter(year == yr &
                 month == mnth)
      
      #aggregate and add to collection df
      pickup_data <- pq_data %>%
        group_by(year, month, pickup_location) %>%
        summarize(pickups = n(), .groups = 'keep')
      
      dropoff_data <- pq_data %>%
        group_by(year, month, dropoff_location) %>%
        summarize(dropoffs = n(), .groups = 'keep')
      
      aggr_data <- data.frame(year = yr,
                                  month = mnth,
                                  location = seq(from = 0, to = 265)) %>%
        left_join(pickup_data, by = c('year','month','location' = 'pickup_location')) %>%
        left_join(dropoff_data, by = c('year','month','location' = 'dropoff_location'))
      
      aggregated_df <- bind_rows(aggregated_df, aggr_data)
      file.remove(pq_file)
    }
  }
  aggregated_df <- aggregated_df %>% mutate(type = taxi_type)
  process_summary <- process_summary %>% mutate(type = taxi_type)
  return(list(aggregated_df, 
              process_summary))
}

We then use this function in a loop to grab data for all four taxi types from 2009 to 2020. I ran this process previously and it took many several hours to download, read, clean and aggregates all the data. So, rather than have to repeat that whenever I wish to knit, I saved the output as a .csv. If I wish to skip this process, I can simply code the preloaded variable to Yes, and that previously aggregated data will be loaded to complete the rest of the functions in this markdown.

preloaded <- 'Yes'

if (preloaded == 'No') {
  greens <- pull_taxi_data_aggr(taxi_type = 'green', 
                               start_year = 2013,
                               end_year = 2022,
                               quiet = FALSE)
  
  fhvs <- pull_taxi_data_aggr(taxi_type = 'fhv', 
                             start_year = 2015,
                             end_year = 2022,
                             quiet = FALSE)
  
  fhvhvs <- pull_taxi_data_aggr(taxi_type = 'fhvhv', 
                               start_year = 2019,
                               end_year = 2022,
                               quiet = FALSE)
  
  yellows <- pull_taxi_data_aggr(taxi_type = 'yellow', 
                               start_year = 2011,
                               end_year = 2022,
                               quiet = FALSE)
  
  aggregated_df_all <- bind_rows(greens[[1]],
                               fhvs[[1]],
                               fhvhvs[[1]],
                               yellows[[1]])

  full_process_summary <-  bind_rows(greens[[2]],
                                     fhvs[[2]],
                                     fhvhvs[[2]],
                                     yellows[[2]])
  
  write_csv(aggregated_df_all, 'data/aggregated_taxi_data.csv')
  write_csv(full_process_summary, 'data/taxi_process_summary.csv')
  
} else {
  aggregated_df_all<- read_csv('data/aggregated_taxi_data.csv')
  full_process_summary <- read_csv('data/taxi_process_summary.csv')
}
## Rows: 104006 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): type
## dbl (5): year, month, location, pickups, dropoffs
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 408 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): year_month, type
## dbl  (1): file_size
## dttm (1): time_check
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

With our read-in function complete, I’ll construct a few summaries to check that data quality was maintained. I conclude the following:
1. Only 17 files were not downloaded. These consist of all green taxi files from 2013. I was also unable to download these files manually, so there appears to be some issue with nyc.gov. The remainder relate to files from December 2022, which are not yet uploaded to nyc.gov. Finally, one fhvhv file from Jan 2019 is missing, but again, this is missing from nyc.gov.
2. There are 1654 rows in which both pickups and dropoffs are empty, which raises some eyebrows. These appear to be related to low volume locations, but I purused the NA rows manually, and there did not appear to be any locations which had ZERO pickups or dropoffs, so my concerns of a data quality issue were eased.
3. There are 2.9 billion taxi trips captured in total in the dataset.
4. There are ~218 million pickups and ~411 dropoffs with missing locations. These are primarily related to fhv rides, for which locations were not tracked in the earlier datasets. This constitutes ~8 to ~14% of the dataset, which is quite material. In future analyses, I’ll need to think of some kind of workaround to address this gap, but for now, we’ll simply acknowledge this limitation before moving ahead.

filter(full_process_summary, is.na(file_size))
## # A tibble: 17 × 4
##    year_month file_size time_check          type  
##    <chr>          <dbl> <dttm>              <chr> 
##  1 2013-01           NA 2023-03-05 05:45:48 green 
##  2 2013-02           NA 2023-03-05 05:45:48 green 
##  3 2013-03           NA 2023-03-05 05:45:48 green 
##  4 2013-04           NA 2023-03-05 05:45:48 green 
##  5 2013-05           NA 2023-03-05 05:45:48 green 
##  6 2013-06           NA 2023-03-05 05:45:48 green 
##  7 2013-07           NA 2023-03-05 05:45:48 green 
##  8 2013-08           NA 2023-03-05 05:45:48 green 
##  9 2013-09           NA 2023-03-05 05:45:48 green 
## 10 2013-10           NA 2023-03-05 05:45:49 green 
## 11 2013-11           NA 2023-03-05 05:45:49 green 
## 12 2013-12           NA 2023-03-05 05:45:49 green 
## 13 2022-12           NA 2023-03-05 05:47:04 green 
## 14 2022-12           NA 2023-03-05 13:49:45 fhv   
## 15 2019-01           NA 2023-03-05 14:01:23 fhvhv 
## 16 2022-12           NA 2023-03-05 14:20:25 fhvhv 
## 17 2022-12           NA 2023-03-05 16:41:59 yellow
filter(aggregated_df_all, is.na(pickups) & is.na(dropoffs))
## # A tibble: 1,654 × 6
##     year month location pickups dropoffs type 
##    <dbl> <dbl>    <dbl>   <dbl>    <dbl> <chr>
##  1  2014     1        0      NA       NA green
##  2  2014     1       99      NA       NA green
##  3  2014     1      103      NA       NA green
##  4  2014     1      104      NA       NA green
##  5  2014     1      105      NA       NA green
##  6  2014     1      110      NA       NA green
##  7  2014     1      199      NA       NA green
##  8  2014     2        0      NA       NA green
##  9  2014     2        2      NA       NA green
## 10  2014     2       99      NA       NA green
## # … with 1,644 more rows
aggregated_df_all %>%
  summarize(pickups_total = sum(pickups, na.rm = TRUE),
            dropoffs_total = sum(dropoffs, na.rm = TRUE),
            .groups = 'keep')
## # A tibble: 1 × 2
##   pickups_total dropoffs_total
##           <dbl>          <dbl>
## 1    2905885647     2905885651
aggregated_df_all %>%
  group_by(location) %>%
  summarize(pickups_total = sum(pickups, na.rm = TRUE),
            dropoffs_total = sum(dropoffs, na.rm = TRUE),
            .groups = 'keep') %>%
  filter(location == 0 | location == 264 | location == 265)
## # A tibble: 3 × 3
## # Groups:   location [3]
##   location pickups_total dropoffs_total
##      <dbl>         <dbl>          <dbl>
## 1        0     163845457      304738931
## 2      264      49748690       28165142
## 3      265       5177406       79124536

Analysis

We can now plot the volume of rides over time, separated by taxi type (we use pickups, but as shown above, the volumes are essentially equal). A very clear trend emerges. We see the volume of yellow taxis slowly decline, while rideshare volumes increase. All taxi usage falls dramatically in 2020 due to COVID. Shortly after, rideshares begin returning to pre-pandemic levels, but yellow taxi volumes appear to just barely return.

aggregated_df_all %>%
  mutate(date = ymd(paste0(year,'-',sprintf('%02d',month),'-01'))) %>%
  group_by(date, type) %>%
  summarize(pickups = sum(pickups, na.rm = TRUE), 
            .groups = 'keep') %>%
  ggplot(aes(x = date, y = pickups, fill = type)) + 
  geom_area() +
  scale_fill_manual(values = c('blue','deepskyblue3','darkolivegreen3','gold3')) +
  scale_y_continuous(labels = scales::comma)

It also becomes clear that green taxis represent a very small sliver of overall volume. The same applies for fhv after 2019, when the heavy hitter (Uber, Lyft, etc.) were separated into the “high-volume” category. To make things clearer as we move ahead, I’ll group both fhv types under a new heading, “rideshare”, and I’ll group both yellow and green under the yellow category to broadly represent “public” NYC cabs. The plot below shows the same view as above, but with these consolidated groups.

aggregated_df_all_2  <- aggregated_df_all %>%
  mutate(date = ymd(paste0(year,'-',sprintf('%02d',month),'-01')),
         type = case_when(type == 'fhv' | type == 'fhvhv' ~ 'rideshare',
                          type == 'yellow' | type == 'green' ~ 'yellow'))

aggregated_df_all_2 %>%
  group_by(date, type) %>%
  summarize(pickups = sum(pickups, na.rm = TRUE), 
            .groups = 'keep') %>%
  ggplot(aes(x = date, y = pickups, fill = type)) + 
  geom_area() +
  scale_fill_manual(values = c('deepskyblue3','gold3')) +
  scale_y_continuous(labels = scales::comma)

The last thing we’ll plot is a view of geography, to see if there are any clear differences in where hail cabs and rideshares are most used. The NYC Taxi dataset also offers shapefiles that map out the various taxi locations in the taxi dataset. Below, we download those shape files and map them into a new combined dataframe.

shapefile <- tempfile()
url <- 'https://d37ci6vzurychx.cloudfront.net/misc/taxi_zones.zip'
download.file(url, shapefile) #\, mode = 'wb')
unzip(shapefile, exdir = 'data/nycshapefiles/')
nyc_shapes <- st_read('data/nycshapefiles/taxi_zones.shp')
## Reading layer `taxi_zones' from data source 
##   `C:\Users\keith\Documents\GitHub\cuny\D607\data\nycshapefiles\taxi_zones.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 263 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
## Projected CRS: NAD83 / New York Long Island (ftUS)
combo_df <- aggregated_df_all_2 %>%
  left_join(nyc_shapes, by = c('location' = 'LocationID')) %>%
  filter(borough != 'Staten Island')

I’d like to put together a few map plots, so I’ll start by creating a function.

map_plot_taxi_data <- function(input, title, taxi_type, ride_type, year,
                               color = 'dodgerblue4', legend = TRUE) {
  input %>%
  filter(type == taxi_type, year == year) %>%
  group_by(year, location) %>%
  mutate(Ride_Volume = sum((!!as.symbol(ride_type)), na.rm = TRUE)) %>%
  ungroup() %>%
  ggplot() +
  geom_sf(aes(geometry = geometry, fill = Ride_Volume),
          show.legend = legend) +
  scale_fill_gradient(low = 'gray100', 
                      high = color,
                      limits = c(0,4*10^6),
                      labels = scales::comma) +
  ggtitle(title) +
  theme(axis.text.x=element_blank(),
        axis.text.y=element_blank())
}

Let’s test it out! We’ll start with a view of yellow (and green) cab pickups in 2014.

map_plot_taxi_data(input = combo_df, title = 'Yellow Pickups 2012',
                   taxi_type = 'yellow', ride_type = 'pickups', year = 2014, 
                   color = 'gold3', legend = TRUE)

Next, we’ll compare a few plots, showing dropoffs and pickups for rideshares and yellows in 2021. I’ve remove dthe legends to save space, but note that these plots (and all others below) have the same scale as the plot above, with color density ranging from 0 to 4 million rides.

p1 <- map_plot_taxi_data(input = combo_df, title = 'Rideshare Pickups 2021',
                   taxi_type = 'rideshare', ride_type = 'pickups', year = 2021, 
                   color = 'dodgerblue4', legend = FALSE)

p2 <- map_plot_taxi_data(input = combo_df, title = 'Rideshare Dropoffs 2021',
                   taxi_type = 'rideshare', ride_type = 'dropoffs', year = 2021, 
                   color = 'dodgerblue4', legend = FALSE)

p3 <- map_plot_taxi_data(input = combo_df, title = 'Yellow Pickups 2021',
                   taxi_type = 'yellow', ride_type = 'pickups', year = 2021, 
                   color = 'gold3', legend = FALSE)

p4 <- map_plot_taxi_data(input = combo_df, title = 'Yellow Dropoffs 2021',
                   taxi_type = 'yellow', ride_type = 'dropoffs', year = 2021, 
                   color = 'gold3', legend = FALSE)

plot_grid(p1, p2, p3, p4, nrow = 2, ncol = 2)

Finally, we’ll put together a series of plots to see how (or if) the mapping of pickups have changed over time. We’ll start with rideshares, looking at pickups from 2015 to 2022.

rideshare_plist <- list()

for (yr in 2015:2022) {
  plt <- map_plot_taxi_data(input = combo_df, title = yr,
                   taxi_type = 'rideshare', ride_type = 'pickups', year = yr, 
                   color = 'dodgerblue4', legend = FALSE)
  rideshare_plist[[yr-2014]] <- plt
}

plot_grid(plotlist = rideshare_plist, nrow = 2, ncol = 4)

And again for yellow cabs!

yellow_plist <- list()

for (yr in 2015:2022) {
  plt <- map_plot_taxi_data(input = combo_df, title = yr,
                   taxi_type = 'yellow', ride_type = 'pickups', year = yr, 
                   color = 'gold3', legend = FALSE)
  yellow_plist[[yr-2014]] <- plt
}

plot_grid(plotlist = yellow_plist, nrow = 2, ncol = 4)