Lab 10: Tools for Working with Big Data

Author

Amanda Rose Knudsen

Overview

This is a two part lab assignment. For both parts we will be using a dataset of New York City Taxi trips. In the first part you will complete a problem using data.tables and in the second part you will create a local spark instance on your computer and use the sparklyr library to simulate how you would process the dataset if it were stored on a distributed computing cluster.

You will need to download up to three data files for this lab, which are available via shared links in a google drive folder. The first is a dataset of taxi trips that took place in 2021. This file is 3GB in size so if your computer doesn’t have a large amount of RAM (at least 16GB), consider using the smaller alternative (but be clear which one you are using when you complete your assignment):

Click here to download 2021 Taxi Rides

The second dataset just contains the rides from November for the 2021 dataset, and as a result is 10 times smaller:

Click here to download the 2021 Taxi rides in November

Finally, the final dataset contains details on the meaning of the taxi location code, which is important for determining the actual geographic location of the pickup and dropoff of each taxi ride:

Click here to download the taxi location codes dataset

Problem 1

Use data.tables to load 2021 NYC taxi dataset (or, if your computer has low memory, the alternative dataset of just November 2021) and the dataset that describes the location of each taxi location code, and output the memory address of the data.table obtained after loading. Performing all operations in place, recode the the drop off and pick up time variables as date-times (whether or how you will do this depends on how your reader interprets the file). Then create new columns in the data table which are equal to the duration of each taxi ride (in whatever time units you prefer) and the average speed in miles per hour of each taxi ride. Next, set equal to NA all values of the ride speed where ride speed is either negative, greater than 90 mph, or where the ride time is longer than 3 hours.

I will be using the November 2021 taxi dataset.

taxi_november_2021 = fread("Data/taxi_november_2021.csv")
taxi_zone_lookup = fread("Data/taxi_zone_lookup.csv")

address(taxi_november_2021)

[1] "0x7f7c9d410600"

address(taxi_zone_lookup)

[1] "0x7f7cca1d2800"

taxi_november_2021[, 
                   tpep_pickup_datetime := as.POSIXct(
                     tpep_pickup_datetime, format = "%Y-%m-%d %H:%M:%S")]
taxi_november_2021[, 
                   tpep_dropoff_datetime := as.POSIXct(
                     tpep_dropoff_datetime, format = "%Y-%m-%d %H:%M:%S")]

taxi_november_2021[, 
                   ride_duration := as.numeric(
                     difftime(tpep_dropoff_datetime, tpep_pickup_datetime, 
                              units = "secs"))]
taxi_november_2021[, ride_speed := trip_distance / (ride_duration / 3600)]

taxi_november_2021[
  ride_speed < 0 | ride_speed > 90 | ride_duration > 3 * 3600, ride_speed := NA]

Next join with the location information so that the borough of origin and destination of each taxi ride is present in the data.table (this may require joining twice). Verify that this final data.table has the same memory address as the original data.frame. Hint: lubridate has a variety of functions for working with characters that represent time stamps.

taxi_november_2021[
  taxi_zone_lookup, 
  Pickup_Borough := i.Borough, 
  on = .(PULocationID = LocationID)
]

taxi_november_2021[
  taxi_zone_lookup, 
  Dropoff_Borough := i.Borough, 
  on = .(DOLocationID = LocationID)
]

address(taxi_november_2021)

[1] "0x7f7c9d410600"

Great! It’s still in the same place in memory.

For each combination of origin and destination boroughs, calculate the average speed (technically the average of the average speed) of taxi rides between those two boroughs and the total number of taxi rides between those boroughs, sort in descending order by average speed, and display the full answer.

taxi_november_2021[
  , .(
    Avg_Speed = mean(ride_speed, na.rm = TRUE),
    Total_Rides = .N
  ),
  by = .(Pickup_Borough, Dropoff_Borough)
][
  order(-Avg_Speed)
]

    Pickup_Borough Dropoff_Borough   Avg_Speed Total_Rides
            <char>          <char>       <num>       <int>
 1:  Staten Island             N/A 43.84049417          26
 2:            N/A   Staten Island 38.11481534          30
 3:  Staten Island          Queens 37.66992704          49
 4:  Staten Island           Bronx 37.58999241          23
 5:        Unknown             N/A 37.11519761         172
 6:          Bronx   Staten Island 36.62865538          20
 7:         Queens   Staten Island 33.44828501         320
 8:         Queens             N/A 32.87632058        5342
 9:            N/A             EWR 32.04668158          65
10:         Queens             EWR 31.16870975         261
11:  Staten Island        Brooklyn 30.99428937          88
12:        Unknown             EWR 29.71935144          76
13:       Brooklyn   Staten Island 29.57314078          96
14:        Unknown   Staten Island 28.98841566           9
15:         Queens           Bronx 27.84043705        5997
16:      Manhattan   Staten Island 27.52692660         422
17:  Staten Island       Manhattan 27.39645783          62
18:      Manhattan             EWR 26.41711331        8075
19:          Bronx             N/A 26.32886246          51
20:        Unknown           Bronx 26.00175539         256
21:            EWR       Manhattan 25.99845583           9
22:         Queens        Brooklyn 24.04950708       39361
23:       Brooklyn             N/A 23.45423925          86
24:      Manhattan             N/A 23.41831935        3305
25:            EWR             N/A 23.40482363           7
26:         Queens       Manhattan 22.96394386      153276
27:          Bronx          Queens 22.72712695         712
28:        Unknown        Brooklyn 22.22472080        1347
29:        Unknown          Queens 21.63697244        1938
30:       Brooklyn             EWR 21.62976079          91
31:          Bronx        Brooklyn 20.81530846         630
32:      Manhattan          Queens 20.56515281      108236
33:       Brooklyn          Queens 20.51130116        3171
34:       Brooklyn           Bronx 20.43693073         652
35:      Manhattan           Bronx 19.44195546       13140
36:         Queens          Queens 18.50440644       58761
37:          Bronx       Manhattan 15.63014678        2572
38:       Brooklyn       Manhattan 15.27951436       10653
39:        Unknown       Manhattan 15.15946702       13717
40:      Manhattan        Brooklyn 14.97504723       74550
41:         Queens         Unknown 13.19701650        1437
42:            N/A       Manhattan 11.58602777        2053
43:            N/A          Queens 11.38797018        1616
44:            N/A           Bronx 11.12427179         475
45:        Unknown         Unknown 10.39344196       12976
46:            N/A        Brooklyn 10.35626799        1193
47:       Brooklyn        Brooklyn 10.33467186       11387
48:          Bronx           Bronx 10.24424316        2349
49:       Brooklyn         Unknown 10.16165759         109
50:      Manhattan       Manhattan 10.09891331     2921076
51:  Staten Island   Staten Island  9.90185138          52
52:          Bronx         Unknown  9.75913447          27
53:      Manhattan         Unknown  9.46121244        8505
54:            N/A             N/A  5.51945874        1322
55:            EWR         Unknown  5.04395604          61
56:            EWR             EWR  3.11855668         472
57:            N/A         Unknown  2.81755172         127
58:            EWR          Queens  0.09141056           3
59:  Staten Island         Unknown  0.00000000           3
    Pickup_Borough Dropoff_Borough   Avg_Speed Total_Rides

Interestingly, or perhaps not that interestingly, from Staten Island to “N/A” has the highest average speed. We see a lot of “N/A” for good reason - taxis do not simply “stop” at the NYC border. I’d guess from Staten Island the “N/A” represents New Jersey or New York (non-NYC). Same with any situation where we see “N/A”.

Problem 2

Create a local spark instance on your computer and load the November taxi dataset and taxi location dataset into your spark instance by using spark_read_csv. Join the datasets so that that the taxi ride data has data on the borough of origin of each taxi ride. Create a new column in the dataset equal to the tip percentage, and filter the tip percentage data so that it excludes data points where either the tip or fare was less 0. Then, for taxi rides originating in each borough, calculate the mean and maximum tip percentage.

sc <- spark_connect(master = "local", version = "3.3.2")

Note that following along with our reading led me to install version 2.3 of spark, but I received error messages that said apparently this is no longer supported by sparklyr. So, I installed version 3.3.2 instead.

taxi_nov_table <- spark_read_csv(sc, name = "taxi", 
                                 path = "Data/taxi_november_2021.csv", 
                                 infer_schema = TRUE, header = TRUE)
location_table <- spark_read_csv(sc, name = "location", 
                               path = "Data/taxi_zone_lookup.csv", 
                               infer_schema = TRUE, header = TRUE)

taxi_with_borough <- taxi_nov_table |> 
  left_join(location_table, by = c("PULocationID" = "LocationID"))

Next we’ll calculat the tip percentage and filter where the tip or fare was less than 0 (so it’s greater than or equal to 0 for each):

taxi_nov_tips <- taxi_with_borough |> 
  mutate(
    tip_percentage = (tip_amount / total_amount) * 100
    ) |> 
  filter(tip_amount >= 0, total_amount >= 0)

Next we’ll group by Borough and calculate the mean and maximum tip percentage.

tip_nov_taxi_overview <- taxi_nov_tips |> 
  group_by(Borough) |> 
  summarise(
    mean_tip_percentage = mean(tip_percentage, na.rm = TRUE),
    max_tip_percentage = max(tip_percentage, na.rm = TRUE)
  )

tip_nov_taxi_overview |> collect()

# A tibble: 8 × 3
  Borough       mean_tip_percentage max_tip_percentage
  <chr>                       <dbl>              <dbl>
1 Brooklyn                     8.44               96.9
2 EWR                         11.7                98.5
3 N/A                          2.39               99.9
4 Staten Island                1.85               92.9
5 Manhattan                   12.2               137. 
6 Unknown                     11.9               100  
7 Queens                      10.8                99.7
8 Bronx                        1.25               91.8

For taxi rides on November 25th, filter the data so that tip percentage includes only rows where the tip and fare were non-negative, and the tip percentage is less than 100%. Make a plot of the distribution of cab fares for taxi rides originating in each borough (using ggplot2 or ggridges). Perform all of the computations and data wrangling in Spark, and only collect the final tibbles to display your results make your plot.

nov25 <- taxi_nov_tips |> 
  filter(to_date(tpep_pickup_datetime) == "2021-11-25",
         tip_amount >= 0,
         fare_amount >= 0,
         tip_percentage < 100)

taxi_nov25_borough <- nov25 |> 
  select(Borough, fare_amount) |> 
  collect()

ggplot(taxi_nov25_borough, aes(x = fare_amount, y = Borough)) +
  geom_point() +
  labs(
    title = "November 25: Cab Fares by Borough of Origin",
    x = "Fare Amount in $USD",
    y = "Borough"
  )

Interesting to see here that EWR (Newark airport - definitely not a borough!) shows up listed as a “borough” rather than “N/A” or “Unknown”.