taxi_november_2021 = fread("Data/taxi_november_2021.csv")
taxi_zone_lookup = fread("Data/taxi_zone_lookup.csv")Lab 10: Tools for Working with Big Data
Overview
This is a two part lab assignment. For both parts we will be using a dataset of New York City Taxi trips. In the first part you will complete a problem using data.tables and in the second part you will create a local spark instance on your computer and use the sparklyr library to simulate how you would process the dataset if it were stored on a distributed computing cluster.
You will need to download up to three data files for this lab, which are available via shared links in a google drive folder. The first is a dataset of taxi trips that took place in 2021. This file is 3GB in size so if your computer doesn’t have a large amount of RAM (at least 16GB), consider using the smaller alternative (but be clear which one you are using when you complete your assignment):
The second dataset just contains the rides from November for the 2021 dataset, and as a result is 10 times smaller:
Finally, the final dataset contains details on the meaning of the taxi location code, which is important for determining the actual geographic location of the pickup and dropoff of each taxi ride:
Problem 1
- Use
data.tablesto load 2021 NYC taxi dataset (or, if your computer has low memory, the alternative dataset of just November 2021) and the dataset that describes the location of each taxi location code, and output the memory address of the data.table obtained after loading. Performing all operations in place, recode the the drop off and pick up time variables as date-times (whether or how you will do this depends on how your reader interprets the file). Then create new columns in the data table which are equal to the duration of each taxi ride (in whatever time units you prefer) and the average speed in miles per hour of each taxi ride. Next, set equal toNAall values of the ride speed where ride speed is either negative, greater than 90 mph, or where the ride time is longer than 3 hours.
I will be using the November 2021 taxi dataset.
address(taxi_november_2021)[1] "0x7f7c9d410600"
address(taxi_zone_lookup)[1] "0x7f7cca1d2800"
taxi_november_2021[,
tpep_pickup_datetime := as.POSIXct(
tpep_pickup_datetime, format = "%Y-%m-%d %H:%M:%S")]
taxi_november_2021[,
tpep_dropoff_datetime := as.POSIXct(
tpep_dropoff_datetime, format = "%Y-%m-%d %H:%M:%S")]taxi_november_2021[,
ride_duration := as.numeric(
difftime(tpep_dropoff_datetime, tpep_pickup_datetime,
units = "secs"))]
taxi_november_2021[, ride_speed := trip_distance / (ride_duration / 3600)]taxi_november_2021[
ride_speed < 0 | ride_speed > 90 | ride_duration > 3 * 3600, ride_speed := NA]Next join with the location information so that the borough of origin and destination of each taxi ride is present in the data.table (this may require joining twice). Verify that this final data.table has the same memory address as the original data.frame. Hint: lubridate has a variety of functions for working with characters that represent time stamps.
taxi_november_2021[
taxi_zone_lookup,
Pickup_Borough := i.Borough,
on = .(PULocationID = LocationID)
]taxi_november_2021[
taxi_zone_lookup,
Dropoff_Borough := i.Borough,
on = .(DOLocationID = LocationID)
]address(taxi_november_2021)[1] "0x7f7c9d410600"
Great! It’s still in the same place in memory.
- For each combination of origin and destination boroughs, calculate the average speed (technically the average of the average speed) of taxi rides between those two boroughs and the total number of taxi rides between those boroughs, sort in descending order by average speed, and display the full answer.
taxi_november_2021[
, .(
Avg_Speed = mean(ride_speed, na.rm = TRUE),
Total_Rides = .N
),
by = .(Pickup_Borough, Dropoff_Borough)
][
order(-Avg_Speed)
] Pickup_Borough Dropoff_Borough Avg_Speed Total_Rides
<char> <char> <num> <int>
1: Staten Island N/A 43.84049417 26
2: N/A Staten Island 38.11481534 30
3: Staten Island Queens 37.66992704 49
4: Staten Island Bronx 37.58999241 23
5: Unknown N/A 37.11519761 172
6: Bronx Staten Island 36.62865538 20
7: Queens Staten Island 33.44828501 320
8: Queens N/A 32.87632058 5342
9: N/A EWR 32.04668158 65
10: Queens EWR 31.16870975 261
11: Staten Island Brooklyn 30.99428937 88
12: Unknown EWR 29.71935144 76
13: Brooklyn Staten Island 29.57314078 96
14: Unknown Staten Island 28.98841566 9
15: Queens Bronx 27.84043705 5997
16: Manhattan Staten Island 27.52692660 422
17: Staten Island Manhattan 27.39645783 62
18: Manhattan EWR 26.41711331 8075
19: Bronx N/A 26.32886246 51
20: Unknown Bronx 26.00175539 256
21: EWR Manhattan 25.99845583 9
22: Queens Brooklyn 24.04950708 39361
23: Brooklyn N/A 23.45423925 86
24: Manhattan N/A 23.41831935 3305
25: EWR N/A 23.40482363 7
26: Queens Manhattan 22.96394386 153276
27: Bronx Queens 22.72712695 712
28: Unknown Brooklyn 22.22472080 1347
29: Unknown Queens 21.63697244 1938
30: Brooklyn EWR 21.62976079 91
31: Bronx Brooklyn 20.81530846 630
32: Manhattan Queens 20.56515281 108236
33: Brooklyn Queens 20.51130116 3171
34: Brooklyn Bronx 20.43693073 652
35: Manhattan Bronx 19.44195546 13140
36: Queens Queens 18.50440644 58761
37: Bronx Manhattan 15.63014678 2572
38: Brooklyn Manhattan 15.27951436 10653
39: Unknown Manhattan 15.15946702 13717
40: Manhattan Brooklyn 14.97504723 74550
41: Queens Unknown 13.19701650 1437
42: N/A Manhattan 11.58602777 2053
43: N/A Queens 11.38797018 1616
44: N/A Bronx 11.12427179 475
45: Unknown Unknown 10.39344196 12976
46: N/A Brooklyn 10.35626799 1193
47: Brooklyn Brooklyn 10.33467186 11387
48: Bronx Bronx 10.24424316 2349
49: Brooklyn Unknown 10.16165759 109
50: Manhattan Manhattan 10.09891331 2921076
51: Staten Island Staten Island 9.90185138 52
52: Bronx Unknown 9.75913447 27
53: Manhattan Unknown 9.46121244 8505
54: N/A N/A 5.51945874 1322
55: EWR Unknown 5.04395604 61
56: EWR EWR 3.11855668 472
57: N/A Unknown 2.81755172 127
58: EWR Queens 0.09141056 3
59: Staten Island Unknown 0.00000000 3
Pickup_Borough Dropoff_Borough Avg_Speed Total_Rides
Interestingly, or perhaps not that interestingly, from Staten Island to “N/A” has the highest average speed. We see a lot of “N/A” for good reason - taxis do not simply “stop” at the NYC border. I’d guess from Staten Island the “N/A” represents New Jersey or New York (non-NYC). Same with any situation where we see “N/A”.
Problem 2
- Create a local
sparkinstance on your computer and load the November taxi dataset and taxi location dataset into your spark instance by usingspark_read_csv. Join the datasets so that that the taxi ride data has data on the borough of origin of each taxi ride. Create a new column in the dataset equal to the tip percentage, and filter the tip percentage data so that it excludes data points where either the tip or fare was less 0. Then, for taxi rides originating in each borough, calculate the mean and maximum tip percentage.
sc <- spark_connect(master = "local", version = "3.3.2")Note that following along with our reading led me to install version 2.3 of spark, but I received error messages that said apparently this is no longer supported by sparklyr. So, I installed version 3.3.2 instead.
taxi_nov_table <- spark_read_csv(sc, name = "taxi",
path = "Data/taxi_november_2021.csv",
infer_schema = TRUE, header = TRUE)
location_table <- spark_read_csv(sc, name = "location",
path = "Data/taxi_zone_lookup.csv",
infer_schema = TRUE, header = TRUE)taxi_with_borough <- taxi_nov_table |>
left_join(location_table, by = c("PULocationID" = "LocationID"))Next we’ll calculat the tip percentage and filter where the tip or fare was less than 0 (so it’s greater than or equal to 0 for each):
taxi_nov_tips <- taxi_with_borough |>
mutate(
tip_percentage = (tip_amount / total_amount) * 100
) |>
filter(tip_amount >= 0, total_amount >= 0) Next we’ll group by Borough and calculate the mean and maximum tip percentage.
tip_nov_taxi_overview <- taxi_nov_tips |>
group_by(Borough) |>
summarise(
mean_tip_percentage = mean(tip_percentage, na.rm = TRUE),
max_tip_percentage = max(tip_percentage, na.rm = TRUE)
)tip_nov_taxi_overview |> collect()# A tibble: 8 × 3
Borough mean_tip_percentage max_tip_percentage
<chr> <dbl> <dbl>
1 Brooklyn 8.44 96.9
2 EWR 11.7 98.5
3 N/A 2.39 99.9
4 Staten Island 1.85 92.9
5 Manhattan 12.2 137.
6 Unknown 11.9 100
7 Queens 10.8 99.7
8 Bronx 1.25 91.8
- For taxi rides on November 25th, filter the data so that tip percentage includes only rows where the tip and fare were non-negative, and the tip percentage is less than 100%. Make a plot of the distribution of cab fares for taxi rides originating in each borough (using ggplot2 or ggridges). Perform all of the computations and data wrangling in Spark, and only collect the final tibbles to display your results make your plot.
nov25 <- taxi_nov_tips |>
filter(to_date(tpep_pickup_datetime) == "2021-11-25",
tip_amount >= 0,
fare_amount >= 0,
tip_percentage < 100)taxi_nov25_borough <- nov25 |>
select(Borough, fare_amount) |>
collect()ggplot(taxi_nov25_borough, aes(x = fare_amount, y = Borough)) +
geom_point() +
labs(
title = "November 25: Cab Fares by Borough of Origin",
x = "Fare Amount in $USD",
y = "Borough"
) Interesting to see here that EWR (Newark airport - definitely not a borough!) shows up listed as a “borough” rather than “N/A” or “Unknown”.