New York City, known for its bustling streets and iconic yellow cabs, has one of the largest and most dynamic taxi industries in the world. Every day, thousands of yellow cabs navigate the city’s grid, transporting millions of passengers through its diverse boroughs. These cabs are not just a form of transportation—they represent the driving force of a city that never sleeps. As someone who has grown up with cab drivers in my family, I’ve always been intrigued as to what can makes them successful.
This analysis is rooted in the idea that the key to success for a New York City cab driver lies in understanding the data behind their trips and how factors like time of day, route choice, and passenger type influence earnings and overall success. Through a simple analysis of real trip data from NYC cabs, collected in September 2024, I aim to explore what makes a cab driver successful. By studying these patterns, I hope to uncover insights that could help current and aspiring drivers improve their strategies, enhance their efficiency, and ultimately thrive in one of the most competitive and rewarding professions in the city.
The Data will be sourced from - https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
In this analysis I will follow Hadley Wickham Data Science Workflow.
Hadley Wickham has developed a structured approach to data science that emphasizes a clear, repeatable workflow for data analysis. His workflow is designed to ensure that data science projects are efficient, reproducible, and well-documented. The key steps in his workflow include:
Import: Load the necessary data into a suitable format for analysis. Tidy: Clean the data, handling missing values, errors, and ensuring consistency. Transform: Manipulate and reshape the data to derive new features or aggregates that make analysis easier. Visualize: Create visualizations to explore patterns, trends, and relationships in the data. Model: Apply statistical or machine learning models to make predictions or infer insights. Communicate: Share your findings through reports, presentations, or dashboards.
library(tidyverse)
library(arrow)
library(rvest)
library(pdftools)
library(tidyr)
From three different sources
data_green_tripdata <- read_parquet("green_tripdata_2024-09.parquet")
data_yellow_tripdata <- read_parquet("yellow_tripdata_2024-09.parquet")
taxi_zone_lookup <- read_csv("taxi_zone_lookup.csv")
pdf_file_yellow <- "data_dictionary_trip_records_yellow.pdf"
pdf_file_green <- "data_dictionary_trip_records_green.pdf"
pdf_text <- pdftools::pdf_text(pdf_file_yellow)
all_text_yellow <- paste(pdf_text, collapse = "\n")
pdf_text <- pdftools::pdf_text(pdf_file_green)
all_text_green <- paste(pdf_text, collapse = "\n")
print(all_text_yellow)
## [1] "Data Dictionary – Yellow Taxi Trip Records May 11, 2022 Page 1 of 2\n\nThis data dictionary describes yellow taxi trip data. For a dictionary describing green taxi data, or a map\nof the TLC Taxi Zones, please visit http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml.\n Field Name Description\n\n VendorID A code indicating the TPEP provider that provided the record.\n\n 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.\n tpep_pickup_datetime The date and time when the meter was engaged.\n tpep_dropoff_datetime The date and time when the meter was disengaged.\n Passenger_count The number of passengers in the vehicle.\n\n This is a driver-entered value.\n Trip_distance The elapsed trip distance in miles reported by the taximeter.\n PULocationID TLC Taxi Zone in which the taximeter was engaged\n DOLocationID TLC Taxi Zone in which the taximeter was disengaged\n RateCodeID The final rate code in effect at the end of the trip.\n\n 1= Standard rate\n 2=JFK\n 3=Newark\n 4=Nassau or Westchester\n 5=Negotiated fare\n 6=Group ride\n Store_and_fwd_flag This flag indicates whether the trip record was held in vehicle\n memory before sending to the vendor, aka “store and forward,”\n because the vehicle did not have a connection to the server.\n\n Y= store and forward trip\n N= not a store and forward trip\n Payment_type A numeric code signifying how the passenger paid for the trip.\n 1= Credit card\n 2= Cash\n 3= No charge\n 4= Dispute\n 5= Unknown\n 6= Voided trip\n Fare_amount The time-and-distance fare calculated by the meter.\n Extra Miscellaneous extras and surcharges. Currently, this only includes\n the $0.50 and $1 rush hour and overnight charges.\n MTA_tax $0.50 MTA tax that is automatically triggered based on the metered\n rate in use.\n Improvement_surcharge $0.30 improvement surcharge assessed trips at the flag drop. The\n improvement surcharge began being levied in 2015.\n Tip_amount Tip amount – This field is automatically populated for credit card\n tips. Cash tips are not included.\n Tolls_amount Total amount of all tolls paid in trip.\n Total_amount The total amount charged to passengers. Does not include cash tips.\n Congestion_Surcharge Total amount collected in trip for NYS congestion surcharge.\n Airport_fee $1.25 for pick up only at LaGuardia and John F. Kennedy Airports\n\nData Dictionary – Yellow Taxi Trip Records May 11, 2022 Page 2 of 2\n"
lines <- unlist(strsplit(all_text_yellow, "\n"))
lines <- lines[nchar(lines) > 0] # Remove empty lines
lines <- trimws(lines) # Trim leading/trailing whitespace
fields <- list()
current_description <- ""
for (line in lines) {
match <- regexec("^([A-Za-z_]+(?:[A-Za-z0-9_]*))\\s{2,}(.*)$", line)
matched <- regmatches(line, match)
if (length(matched[[1]]) > 1) {
if (nchar(current_description) > 0) {
fields[[length(fields) + 1]] <- c(field_name, current_description)
}
field_name <- matched[[1]][2]
description <- matched[[1]][3]
current_description <- description
} else {
if (grepl("^\\d=", line)) {
current_description <- paste(current_description, line)
}
}
}
if (nchar(current_description) > 0) {
fields[[length(fields) + 1]] <- c(field_name, current_description)
}
fields_yellow <- do.call(rbind, fields)
colnames(fields_yellow) <- c("Field_Name", "Description")
print(fields_yellow)
## Field_Name
## [1,] "VendorID"
## [2,] "tpep_pickup_datetime"
## [3,] "tpep_dropoff_datetime"
## [4,] "Passenger_count"
## [5,] "Trip_distance"
## [6,] "PULocationID"
## [7,] "DOLocationID"
## [8,] "RateCodeID"
## [9,] "Store_and_fwd_flag"
## [10,] "Payment_type"
## [11,] "Fare_amount"
## [12,] "Extra"
## [13,] "MTA_tax"
## [14,] "Improvement_surcharge"
## [15,] "Tip_amount"
## [16,] "Tolls_amount"
## [17,] "Total_amount"
## [18,] "Congestion_Surcharge"
## [19,] "Airport_fee"
## Description
## [1,] "A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc."
## [2,] "The date and time when the meter was engaged."
## [3,] "The date and time when the meter was disengaged."
## [4,] "The number of passengers in the vehicle."
## [5,] "The elapsed trip distance in miles reported by the taximeter."
## [6,] "TLC Taxi Zone in which the taximeter was engaged"
## [7,] "TLC Taxi Zone in which the taximeter was disengaged"
## [8,] "The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride"
## [9,] "This flag indicates whether the trip record was held in vehicle"
## [10,] "A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip"
## [11,] "The time-and-distance fare calculated by the meter."
## [12,] "Miscellaneous extras and surcharges. Currently, this only includes"
## [13,] "$0.50 MTA tax that is automatically triggered based on the metered"
## [14,] "$0.30 improvement surcharge assessed trips at the flag drop. The"
## [15,] "Tip amount – This field is automatically populated for credit card"
## [16,] "Total amount of all tolls paid in trip."
## [17,] "The total amount charged to passengers. Does not include cash tips."
## [18,] "Total amount collected in trip for NYS congestion surcharge."
## [19,] "$1.25 for pick up only at LaGuardia and John F. Kennedy Airports"
lines <- unlist(strsplit(all_text_green, "\n"))
lines <- lines[nchar(lines) > 0] # Remove empty lines
lines <- trimws(lines) # Trim leading/trailing whitespace
fields <- list()
current_description <- ""
for (line in lines) {
match <- regexec("^([A-Za-z_]+(?:[A-Za-z0-9_]*))\\s{2,}(.*)$", line)
matched <- regmatches(line, match)
if (length(matched[[1]]) > 1) {
if (nchar(current_description) > 0) {
fields[[length(fields) + 1]] <- c(field_name, current_description)
}
field_name <- matched[[1]][2]
description <- matched[[1]][3]
current_description <- description
} else {
if (grepl("^\\d=", line)) {
current_description <- paste(current_description, line)
}
}
}
if (nchar(current_description) > 0) {
fields[[length(fields) + 1]] <- c(field_name, current_description)
}
fields_green <- do.call(rbind, fields)
colnames(fields_green) <- c("Field_Name", "Description")
print(fields_green)
## Field_Name
## [1,] "VendorID"
## [2,] "lpep_pickup_datetime"
## [3,] "lpep_dropoff_datetime"
## [4,] "Passenger_count"
## [5,] "Trip_distance"
## [6,] "PULocationID"
## [7,] "DOLocationID"
## [8,] "RateCodeID"
## [9,] "Store_and_fwd_flag"
## [10,] "Payment_type"
## [11,] "Fare_amount"
## [12,] "Extra"
## [13,] "MTA_tax"
## [14,] "Improvement_surcharge"
## [15,] "Tip_amount"
## [16,] "Tolls_amount"
## [17,] "Total_amount"
## [18,] "Trip_type"
## Description
## [1,] "A code indicating the LPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc."
## [2,] "The date and time when the meter was engaged."
## [3,] "The date and time when the meter was disengaged."
## [4,] "The number of passengers in the vehicle."
## [5,] "The elapsed trip distance in miles reported by the taximeter."
## [6,] "TLC Taxi Zone in which the taximeter was engaged"
## [7,] "TLC Taxi Zone in which the taximeter was disengaged"
## [8,] "The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride"
## [9,] "This flag indicates whether the trip record was held in vehicle"
## [10,] "A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip"
## [11,] "The time-and-distance fare calculated by the meter."
## [12,] "Miscellaneous extras and surcharges. Currently, this only includes"
## [13,] "$0.50 MTA tax that is automatically triggered based on the metered"
## [14,] "$0.30 improvement surcharge assessed on hailed trips at the flag"
## [15,] "Tip amount – This field is automatically populated for credit card"
## [16,] "Total amount of all tolls paid in trip."
## [17,] "The total amount charged to passengers. Does not include cash tips."
## [18,] "A code indicating whether the trip was a street-hail or a dispatch 1= Street-hail 2= Dispatch"
fields_df_yellow <- as.data.frame(fields_yellow)
fields_df_green <- as.data.frame(fields_green)
glimpse(fields_df_yellow)
## Rows: 19
## Columns: 2
## $ Field_Name <chr> "VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime…
## $ Description <chr> "A code indicating the TPEP provider that provided the rec…
library(dplyr)
indices_yellow <- which(fields_df_yellow$Field_Name %in% c("VendorID", "RateCodeID", "Payment_type"))
indices_green <- which(fields_df_green$Field_Name %in% c("VendorID", "RateCodeID", "Payment_type", "Trip_type"))
fields_df_yellow <- fields_df_yellow[indices_yellow, ]
fields_df_green <- fields_df_green[indices_green, ]
head(fields_df_yellow)
## Field_Name
## 1 VendorID
## 8 RateCodeID
## 10 Payment_type
## Description
## 1 A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.
## 8 The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride
## 10 A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip
head(fields_df_green)
## Field_Name
## 1 VendorID
## 8 RateCodeID
## 10 Payment_type
## 18 Trip_type
## Description
## 1 A code indicating the LPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.
## 8 The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride
## 10 A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip
## 18 A code indicating whether the trip was a street-hail or a dispatch 1= Street-hail 2= Dispatch
library(dplyr)
library(stringr)
fields_df <- fields_df_yellow %>%
mutate(across(c(-Field_Name), ~ str_extract(.x, "(?<=[^\\d])\\d.*")))
fields_df$Field_Name[2] <- "RatecodeID"
fields_df$Field_Name[3] <- "payment_type"
print(fields_df)
## Field_Name
## 1 VendorID
## 8 RatecodeID
## 10 payment_type
## Description
## 1 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.
## 8 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride
## 10 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip
library(tidyverse)
fields_df_tidy <- fields_df %>%
mutate(Description = str_split(Description, " (?=\\d+=)")) %>%
unnest(cols = c(Description)) %>%
mutate(
Code_Number = str_extract(Description, "^[0-9]+"),
Description = str_remove(Description, "^[0-9]+=\\s*")
) %>%
mutate(
Code_Number = str_trim(Code_Number),
Description = str_trim(Description)
)
print(fields_df_tidy)
## # A tibble: 14 × 3
## Field_Name Description Code_Number
## <chr> <chr> <chr>
## 1 VendorID Creative Mobile Technologies, LLC; 1
## 2 VendorID VeriFone Inc. 2
## 3 RatecodeID Standard rate 1
## 4 RatecodeID JFK 2
## 5 RatecodeID Newark 3
## 6 RatecodeID Nassau or Westchester 4
## 7 RatecodeID Negotiated fare 5
## 8 RatecodeID Group ride 6
## 9 payment_type Credit card 1
## 10 payment_type Cash 2
## 11 payment_type No charge 3
## 12 payment_type Dispute 4
## 13 payment_type Unknown 5
## 14 payment_type Voided trip 6
fields_df_tidy$Code_Number <- gsub("[^0-9]", "", fields_df_tidy$Code_Number)
fields_df_tidy$Code_Number <- as.numeric(fields_df_tidy$Code_Number)
glimpse(fields_df_tidy)
## Rows: 14
## Columns: 3
## $ Field_Name <chr> "VendorID", "VendorID", "RatecodeID", "RatecodeID", "Ratec…
## $ Description <chr> "Creative Mobile Technologies, LLC;", "VeriFone Inc.", "St…
## $ Code_Number <dbl> 1, 2, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6
#data_green_tripdata
#data_yellow_tripdata
#taxi_zone_lookup
#fields_df_tidy
data_green_tripdata <- data_green_tripdata %>%
mutate(cab_type = "green")
data_yellow_tripdata <- data_yellow_tripdata %>%
mutate(cab_type = "yellow")
data_tripdata <- bind_rows(data_green_tripdata, data_yellow_tripdata)
data_tripdata <- data_tripdata %>%
left_join(taxi_zone_lookup %>%
select(LocationID, Zone) %>%
rename(PUZone = Zone),
by = c("PULocationID" = "LocationID"))
data_tripdata <- data_tripdata %>%
left_join(taxi_zone_lookup %>%
select(LocationID, Zone) %>%
rename(DOZone = Zone),
by = c("DOLocationID" = "LocationID"))
data_tripdata <- data_tripdata %>%
left_join(fields_df_tidy %>%
filter(Field_Name == "VendorID") %>%
select(Code_Number, Description) %>%
rename(VendorDescription = Description),
by = c("VendorID" = "Code_Number"))
data_tripdata <- data_tripdata %>%
left_join(fields_df_tidy %>%
filter(Field_Name == "RatecodeID") %>%
select(Code_Number, Description) %>%
rename(RatecodeDescription = Description),
by = c("RatecodeID" = "Code_Number"))
data_tripdata <- data_tripdata %>%
left_join(fields_df_tidy %>%
filter(Field_Name == "payment_type") %>%
select(Code_Number, Description) %>%
rename(PaymentTypeDescription = Description),
by = c("payment_type" = "Code_Number"))
## Rows: 3,687,470
## Columns: 29
## $ VendorID <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ lpep_pickup_datetime <dttm> 2024-08-31 20:06:40, 2024-08-31 20:51:31, 2024…
## $ lpep_dropoff_datetime <dttm> 2024-08-31 20:31:15, 2024-08-31 21:14:11, 2024…
## $ store_and_fwd_flag <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", "N…
## $ RatecodeID <dbl> 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1,…
## $ PULocationID <dbl> 92, 80, 42, 41, 42, 223, 93, 66, 66, 130, 74, 9…
## $ DOLocationID <dbl> 37, 80, 116, 42, 42, 83, 10, 148, 262, 191, 42,…
## $ passenger_count <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2,…
## $ trip_distance <dbl> 10.66, 3.89, 0.80, 1.84, 0.00, 2.19, 7.27, 3.15…
## $ fare_amount <dbl> 43.6, 22.6, 6.5, 11.4, 29.0, 11.4, 30.3, 19.8, …
## $ extra <dbl> 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,…
## $ mta_tax <dbl> 0.5, 0.5, 0.5, 0.5, 0.0, 0.5, 0.5, 0.5, 0.5, 0.…
## $ tip_amount <dbl> 0.00, 5.02, 0.00, 0.00, 0.00, 0.00, 6.56, 5.01,…
## $ tolls_amount <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
## $ ehail_fee <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ improvement_surcharge <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ total_amount <dbl> 46.10, 30.12, 9.00, 13.90, 30.00, 13.90, 39.36,…
## $ payment_type <dbl> 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ trip_type <int> 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ congestion_surcharge <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 2.75,…
## $ cab_type <chr> "green", "green", "green", "green", "green", "g…
## $ tpep_pickup_datetime <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ tpep_dropoff_datetime <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ Airport_fee <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ PUZone <chr> "Flushing", "East Williamsburg", "Central Harle…
## $ DOZone <chr> "Bushwick South", "East Williamsburg", "Hamilto…
## $ VendorDescription <chr> "VeriFone Inc.", "VeriFone Inc.", "VeriFone Inc…
## $ RatecodeDescription <chr> "Standard rate", "Standard rate", "Standard rat…
## $ PaymentTypeDescription <chr> "Cash", "Credit card", "Credit card", "Cash", "…
data_tripdata$total_time <- with(data_tripdata, {
time1 <- ifelse(!is.na(lpep_pickup_datetime) & !is.na(lpep_dropoff_datetime),
as.numeric(difftime(lpep_dropoff_datetime, lpep_pickup_datetime, units = "mins")), 0)
time2 <- ifelse(!is.na(tpep_pickup_datetime) & !is.na(tpep_dropoff_datetime),
as.numeric(difftime(tpep_dropoff_datetime, tpep_pickup_datetime, units = "mins")), 0)
time1 + time2
})
glimpse(data_tripdata$total_time)
## num [1:3687470] 24.58 22.67 4.27 8.68 1.35 ...
data_tripdata <- data_tripdata %>%
select(-lpep_pickup_datetime, -lpep_dropoff_datetime, -tpep_pickup_datetime, -tpep_dropoff_datetime, -VendorID, -RatecodeID, -PULocationID, -DOLocationID, -payment_type, -trip_type, -store_and_fwd_flag)
glimpse(data_tripdata)
## Rows: 3,687,470
## Columns: 19
## $ passenger_count <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2,…
## $ trip_distance <dbl> 10.66, 3.89, 0.80, 1.84, 0.00, 2.19, 7.27, 3.15…
## $ fare_amount <dbl> 43.6, 22.6, 6.5, 11.4, 29.0, 11.4, 30.3, 19.8, …
## $ extra <dbl> 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,…
## $ mta_tax <dbl> 0.5, 0.5, 0.5, 0.5, 0.0, 0.5, 0.5, 0.5, 0.5, 0.…
## $ tip_amount <dbl> 0.00, 5.02, 0.00, 0.00, 0.00, 0.00, 6.56, 5.01,…
## $ tolls_amount <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
## $ ehail_fee <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ improvement_surcharge <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ total_amount <dbl> 46.10, 30.12, 9.00, 13.90, 30.00, 13.90, 39.36,…
## $ congestion_surcharge <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 2.75,…
## $ cab_type <chr> "green", "green", "green", "green", "green", "g…
## $ Airport_fee <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ PUZone <chr> "Flushing", "East Williamsburg", "Central Harle…
## $ DOZone <chr> "Bushwick South", "East Williamsburg", "Hamilto…
## $ VendorDescription <chr> "VeriFone Inc.", "VeriFone Inc.", "VeriFone Inc…
## $ RatecodeDescription <chr> "Standard rate", "Standard rate", "Standard rat…
## $ PaymentTypeDescription <chr> "Cash", "Credit card", "Credit card", "Cash", "…
## $ total_time <dbl> 24.583333, 22.666667, 4.266667, 8.683333, 1.350…
## passenger_count trip_distance fare_amount
## 485435 0 0
## extra mta_tax tip_amount
## 0 0 0
## tolls_amount ehail_fee improvement_surcharge
## 0 3687470 0
## total_amount congestion_surcharge cab_type
## 0 485435 0
## Airport_fee PUZone DOZone
## 538171 0 0
## VendorDescription RatecodeDescription PaymentTypeDescription
## 109 522998 485435
## total_time
## 0
## passenger_count trip_distance fare_amount
## 485435 0 0
## extra mta_tax tip_amount
## 0 0 0
## tolls_amount ehail_fee improvement_surcharge
## 0 0 0
## total_amount congestion_surcharge cab_type
## 0 0 0
## Airport_fee PUZone DOZone
## 0 0 0
## VendorDescription RatecodeDescription PaymentTypeDescription
## 109 522998 485435
## total_time
## 0
library(ggplot2)
summary(data_tripdata$total_amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1000.00 15.96 21.48 28.51 31.50 1886.02
data_tripdata <- data_tripdata %>%
filter(total_amount >= 0)
count_above_500 <- data_tripdata %>%
filter(total_amount > 500) %>%
nrow()
count_above_500
## [1] 83
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 16.30 21.80 29.39 31.85 500.00
## # A tibble: 6 × 2
## PUZone avg_total_amount
## <chr> <dbl>
## 1 Rossville/Woodrow 131.
## 2 Outside of NYC 104.
## 3 Newark Airport 94.5
## 4 JFK Airport 82.9
## 5 Eltingville/Annadale/Prince's Bay 74.8
## 6 Baisley Park 70.6
## # A tibble: 6 × 2
## DOZone avg_total_amount
## <chr> <dbl>
## 1 Freshkills Park 205.
## 2 Arden Heights 161.
## 3 Rossville/Woodrow 154.
## 4 Eltingville/Annadale/Prince's Bay 145.
## 5 Charleston/Tottenville 145.
## 6 Great Kills 140.
filtered_data <- data_tripdata %>%
filter(total_time >= 0 & total_time <= 360)
correlation_filtered <- cor(filtered_data$total_time, filtered_data$total_amount, use = "complete.obs")
correlation_filtered
## [1] 0.7604364
So if you want to be a succesful cab driver then your better off driving a yellow cab and taking as much trips from JFK as you can. However the most profitable routes are usually from and to Staten Island.
-Have any questions?