DATA607Final

Introduction

New York City, known for its bustling streets and iconic yellow cabs, has one of the largest and most dynamic taxi industries in the world. Every day, thousands of yellow cabs navigate the city’s grid, transporting millions of passengers through its diverse boroughs. These cabs are not just a form of transportation—they represent the driving force of a city that never sleeps. As someone who has grown up with cab drivers in my family, I’ve always been intrigued as to what can makes them successful.

Purpose

This analysis is rooted in the idea that the key to success for a New York City cab driver lies in understanding the data behind their trips and how factors like time of day, route choice, and passenger type influence earnings and overall success. Through a simple analysis of real trip data from NYC cabs, collected in September 2024, I aim to explore what makes a cab driver successful. By studying these patterns, I hope to uncover insights that could help current and aspiring drivers improve their strategies, enhance their efficiency, and ultimately thrive in one of the most competitive and rewarding professions in the city.

The Data will be sourced from - https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

Data Science Workflow

In this analysis I will follow Hadley Wickham Data Science Workflow.

Hadley Wickham has developed a structured approach to data science that emphasizes a clear, repeatable workflow for data analysis. His workflow is designed to ensure that data science projects are efficient, reproducible, and well-documented. The key steps in his workflow include:

Import: Load the necessary data into a suitable format for analysis. Tidy: Clean the data, handling missing values, errors, and ensuring consistency. Transform: Manipulate and reshape the data to derive new features or aggregates that make analysis easier. Visualize: Create visualizations to explore patterns, trends, and relationships in the data. Model: Apply statistical or machine learning models to make predictions or infer insights. Communicate: Share your findings through reports, presentations, or dashboards.

The Libraries

library(tidyverse)
library(arrow)
library(rvest)
library(pdftools)
library(tidyr)

Import the Data

From three different sources

data_green_tripdata <- read_parquet("green_tripdata_2024-09.parquet")
data_yellow_tripdata <- read_parquet("yellow_tripdata_2024-09.parquet")

taxi_zone_lookup <- read_csv("taxi_zone_lookup.csv")

pdf_file_yellow <- "data_dictionary_trip_records_yellow.pdf"
pdf_file_green <- "data_dictionary_trip_records_green.pdf"

Data Tidy - Reading PDF’s

pdf_text <- pdftools::pdf_text(pdf_file_yellow)
all_text_yellow <- paste(pdf_text, collapse = "\n")
pdf_text <- pdftools::pdf_text(pdf_file_green)
all_text_green <- paste(pdf_text, collapse = "\n")

print(all_text_yellow)

## [1] "Data Dictionary – Yellow Taxi Trip Records       May 11, 2022                      Page 1 of 2\n\nThis data dictionary describes yellow taxi trip data. For a dictionary describing green taxi data, or a map\nof the TLC Taxi Zones, please visit http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml.\n             Field Name                                             Description\n\n  VendorID                           A code indicating the TPEP provider that provided the record.\n\n                                     1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.\n  tpep_pickup_datetime               The date and time when the meter was engaged.\n  tpep_dropoff_datetime              The date and time when the meter was disengaged.\n  Passenger_count                    The number of passengers in the vehicle.\n\n                                     This is a driver-entered value.\n  Trip_distance                      The elapsed trip distance in miles reported by the taximeter.\n  PULocationID                       TLC Taxi Zone in which the taximeter was engaged\n  DOLocationID                       TLC Taxi Zone in which the taximeter was disengaged\n  RateCodeID                         The final rate code in effect at the end of the trip.\n\n                                     1= Standard rate\n                                     2=JFK\n                                     3=Newark\n                                     4=Nassau or Westchester\n                                     5=Negotiated fare\n                                     6=Group ride\n  Store_and_fwd_flag                 This flag indicates whether the trip record was held in vehicle\n                                     memory before sending to the vendor, aka “store and forward,”\n                                     because the vehicle did not have a connection to the server.\n\n                                     Y= store and forward trip\n                                     N= not a store and forward trip\n  Payment_type                       A numeric code signifying how the passenger paid for the trip.\n                                     1= Credit card\n                                     2= Cash\n                                     3= No charge\n                                     4= Dispute\n                                     5= Unknown\n                                     6= Voided trip\n  Fare_amount                        The time-and-distance fare calculated by the meter.\n  Extra                              Miscellaneous extras and surcharges. Currently, this only includes\n                                     the $0.50 and $1 rush hour and overnight charges.\n  MTA_tax                            $0.50 MTA tax that is automatically triggered based on the metered\n                                     rate in use.\n  Improvement_surcharge              $0.30 improvement surcharge assessed trips at the flag drop. The\n                                     improvement surcharge began being levied in 2015.\n  Tip_amount                         Tip amount – This field is automatically populated for credit card\n                                     tips. Cash tips are not included.\n  Tolls_amount                       Total amount of all tolls paid in trip.\n  Total_amount                       The total amount charged to passengers. Does not include cash tips.\n  Congestion_Surcharge               Total amount collected in trip for NYS congestion surcharge.\n  Airport_fee                        $1.25 for pick up only at LaGuardia and John F. Kennedy Airports\n\nData Dictionary – Yellow Taxi Trip Records   May 11, 2022   Page 2 of 2\n"

Data Tidy - Using regex to extract the field names and descriptions

lines <- unlist(strsplit(all_text_yellow, "\n"))

lines <- lines[nchar(lines) > 0]  # Remove empty lines
lines <- trimws(lines)  # Trim leading/trailing whitespace

fields <- list()

current_description <- ""

for (line in lines) {
  match <- regexec("^([A-Za-z_]+(?:[A-Za-z0-9_]*))\\s{2,}(.*)$", line)
  matched <- regmatches(line, match)
  
  if (length(matched[[1]]) > 1) {
    if (nchar(current_description) > 0) {
      fields[[length(fields) + 1]] <- c(field_name, current_description)
    }
    
    field_name <- matched[[1]][2]
    description <- matched[[1]][3]
    
    current_description <- description
  } else {
    if (grepl("^\\d=", line)) {
      current_description <- paste(current_description, line)
    }
  }
}

if (nchar(current_description) > 0) {
  fields[[length(fields) + 1]] <- c(field_name, current_description)
}

fields_yellow <- do.call(rbind, fields)
colnames(fields_yellow) <- c("Field_Name", "Description")

print(fields_yellow)

##       Field_Name             
##  [1,] "VendorID"             
##  [2,] "tpep_pickup_datetime" 
##  [3,] "tpep_dropoff_datetime"
##  [4,] "Passenger_count"      
##  [5,] "Trip_distance"        
##  [6,] "PULocationID"         
##  [7,] "DOLocationID"         
##  [8,] "RateCodeID"           
##  [9,] "Store_and_fwd_flag"   
## [10,] "Payment_type"         
## [11,] "Fare_amount"          
## [12,] "Extra"                
## [13,] "MTA_tax"              
## [14,] "Improvement_surcharge"
## [15,] "Tip_amount"           
## [16,] "Tolls_amount"         
## [17,] "Total_amount"         
## [18,] "Congestion_Surcharge" 
## [19,] "Airport_fee"          
##       Description                                                                                                                                   
##  [1,] "A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc."                        
##  [2,] "The date and time when the meter was engaged."                                                                                               
##  [3,] "The date and time when the meter was disengaged."                                                                                            
##  [4,] "The number of passengers in the vehicle."                                                                                                    
##  [5,] "The elapsed trip distance in miles reported by the taximeter."                                                                               
##  [6,] "TLC Taxi Zone in which the taximeter was engaged"                                                                                            
##  [7,] "TLC Taxi Zone in which the taximeter was disengaged"                                                                                         
##  [8,] "The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride"
##  [9,] "This flag indicates whether the trip record was held in vehicle"                                                                             
## [10,] "A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip"     
## [11,] "The time-and-distance fare calculated by the meter."                                                                                         
## [12,] "Miscellaneous extras and surcharges. Currently, this only includes"                                                                          
## [13,] "$0.50 MTA tax that is automatically triggered based on the metered"                                                                          
## [14,] "$0.30 improvement surcharge assessed trips at the flag drop. The"                                                                            
## [15,] "Tip amount – This field is automatically populated for credit card"                                                                          
## [16,] "Total amount of all tolls paid in trip."                                                                                                     
## [17,] "The total amount charged to passengers. Does not include cash tips."                                                                         
## [18,] "Total amount collected in trip for NYS congestion surcharge."                                                                                
## [19,] "$1.25 for pick up only at LaGuardia and John F. Kennedy Airports"

lines <- unlist(strsplit(all_text_green, "\n"))

lines <- lines[nchar(lines) > 0]  # Remove empty lines
lines <- trimws(lines)  # Trim leading/trailing whitespace

fields <- list()

current_description <- ""

for (line in lines) {
  match <- regexec("^([A-Za-z_]+(?:[A-Za-z0-9_]*))\\s{2,}(.*)$", line)
  matched <- regmatches(line, match)
  
  if (length(matched[[1]]) > 1) {
    if (nchar(current_description) > 0) {
      fields[[length(fields) + 1]] <- c(field_name, current_description)
    }
    
    field_name <- matched[[1]][2]
    description <- matched[[1]][3]
    
    current_description <- description
  } else {
    if (grepl("^\\d=", line)) {
      current_description <- paste(current_description, line)
    }
  }
}

if (nchar(current_description) > 0) {
  fields[[length(fields) + 1]] <- c(field_name, current_description)
}

fields_green <- do.call(rbind, fields)
colnames(fields_green) <- c("Field_Name", "Description")

print(fields_green)

##       Field_Name             
##  [1,] "VendorID"             
##  [2,] "lpep_pickup_datetime" 
##  [3,] "lpep_dropoff_datetime"
##  [4,] "Passenger_count"      
##  [5,] "Trip_distance"        
##  [6,] "PULocationID"         
##  [7,] "DOLocationID"         
##  [8,] "RateCodeID"           
##  [9,] "Store_and_fwd_flag"   
## [10,] "Payment_type"         
## [11,] "Fare_amount"          
## [12,] "Extra"                
## [13,] "MTA_tax"              
## [14,] "Improvement_surcharge"
## [15,] "Tip_amount"           
## [16,] "Tolls_amount"         
## [17,] "Total_amount"         
## [18,] "Trip_type"            
##       Description                                                                                                                                   
##  [1,] "A code indicating the LPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc."                        
##  [2,] "The date and time when the meter was engaged."                                                                                               
##  [3,] "The date and time when the meter was disengaged."                                                                                            
##  [4,] "The number of passengers in the vehicle."                                                                                                    
##  [5,] "The elapsed trip distance in miles reported by the taximeter."                                                                               
##  [6,] "TLC Taxi Zone in which the taximeter was engaged"                                                                                            
##  [7,] "TLC Taxi Zone in which the taximeter was disengaged"                                                                                         
##  [8,] "The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride"
##  [9,] "This flag indicates whether the trip record was held in vehicle"                                                                             
## [10,] "A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip"     
## [11,] "The time-and-distance fare calculated by the meter."                                                                                         
## [12,] "Miscellaneous extras and surcharges. Currently, this only includes"                                                                          
## [13,] "$0.50 MTA tax that is automatically triggered based on the metered"                                                                          
## [14,] "$0.30 improvement surcharge assessed on hailed trips at the flag"                                                                            
## [15,] "Tip amount – This field is automatically populated for credit card"                                                                          
## [16,] "Total amount of all tolls paid in trip."                                                                                                     
## [17,] "The total amount charged to passengers. Does not include cash tips."                                                                         
## [18,] "A code indicating whether the trip was a street-hail or a dispatch 1= Street-hail 2= Dispatch"

Data Tidy - Convert the text data into dataframes

fields_df_yellow <- as.data.frame(fields_yellow)
fields_df_green <- as.data.frame(fields_green)
glimpse(fields_df_yellow)

## Rows: 19
## Columns: 2
## $ Field_Name  <chr> "VendorID", "tpep_pickup_datetime", "tpep_dropoff_datetime…
## $ Description <chr> "A code indicating the TPEP provider that provided the rec…

Data Tidy - Remove Columns not Needed

library(dplyr)
indices_yellow <- which(fields_df_yellow$Field_Name %in% c("VendorID", "RateCodeID", "Payment_type"))
indices_green <- which(fields_df_green$Field_Name %in% c("VendorID", "RateCodeID", "Payment_type", "Trip_type"))

fields_df_yellow <- fields_df_yellow[indices_yellow, ]
fields_df_green <- fields_df_green[indices_green, ]

head(fields_df_yellow)

##      Field_Name
## 1      VendorID
## 8    RateCodeID
## 10 Payment_type
##                                                                                                                                     Description
## 1                          A code indicating the TPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.
## 8  The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride
## 10      A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip

head(fields_df_green)

##      Field_Name
## 1      VendorID
## 8    RateCodeID
## 10 Payment_type
## 18    Trip_type
##                                                                                                                                     Description
## 1                          A code indicating the LPEP provider that provided the record. 1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.
## 8  The final rate code in effect at the end of the trip. 1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride
## 10      A numeric code signifying how the passenger paid for the trip. 1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip
## 18                                                A code indicating whether the trip was a street-hail or a dispatch 1= Street-hail 2= Dispatch

Data Tidy - Even More Regex (Also dropped one since they were the same)

library(dplyr)
library(stringr)

fields_df <- fields_df_yellow %>%
  mutate(across(c(-Field_Name), ~ str_extract(.x, "(?<=[^\\d])\\d.*")))

fields_df$Field_Name[2] <- "RatecodeID"
fields_df$Field_Name[3] <- "payment_type"

Data Tidy - Even More Regex (Also dropped one since they were the same)

print(fields_df)

##      Field_Name
## 1      VendorID
## 8    RatecodeID
## 10 payment_type
##                                                                               Description
## 1                                  1= Creative Mobile Technologies, LLC; 2= VeriFone Inc.
## 8  1= Standard rate 2=JFK 3=Newark 4=Nassau or Westchester 5=Negotiated fare 6=Group ride
## 10               1= Credit card 2= Cash 3= No charge 4= Dispute 5= Unknown 6= Voided trip

Data Tidy - Use regex to convert description in a dictionary.

library(tidyverse)
fields_df_tidy <- fields_df %>%
  mutate(Description = str_split(Description, " (?=\\d+=)")) %>%
  unnest(cols = c(Description)) %>%
  mutate(
    Code_Number = str_extract(Description, "^[0-9]+"),
    Description = str_remove(Description, "^[0-9]+=\\s*")
  ) %>%
  mutate(
    Code_Number = str_trim(Code_Number),
    Description = str_trim(Description)
  )

Data Tidy - Use regex to convert description in a dictionary.

print(fields_df_tidy)

## # A tibble: 14 × 3
##    Field_Name   Description                        Code_Number
##    <chr>        <chr>                              <chr>      
##  1 VendorID     Creative Mobile Technologies, LLC; 1          
##  2 VendorID     VeriFone Inc.                      2          
##  3 RatecodeID   Standard rate                      1          
##  4 RatecodeID   JFK                                2          
##  5 RatecodeID   Newark                             3          
##  6 RatecodeID   Nassau or Westchester              4          
##  7 RatecodeID   Negotiated fare                    5          
##  8 RatecodeID   Group ride                         6          
##  9 payment_type Credit card                        1          
## 10 payment_type Cash                               2          
## 11 payment_type No charge                          3          
## 12 payment_type Dispute                            4          
## 13 payment_type Unknown                            5          
## 14 payment_type Voided trip                        6

Data Tidy - Converting Code Number to Numeric

fields_df_tidy$Code_Number <- gsub("[^0-9]", "", fields_df_tidy$Code_Number)
fields_df_tidy$Code_Number <- as.numeric(fields_df_tidy$Code_Number)
glimpse(fields_df_tidy)

## Rows: 14
## Columns: 3
## $ Field_Name  <chr> "VendorID", "VendorID", "RatecodeID", "RatecodeID", "Ratec…
## $ Description <chr> "Creative Mobile Technologies, LLC;", "VeriFone Inc.", "St…
## $ Code_Number <dbl> 1, 2, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6

Data Tidy - Four Clean Datasets

#data_green_tripdata
#data_yellow_tripdata
#taxi_zone_lookup
#fields_df_tidy

Data Transform - Combining Green and Yellow Data

data_green_tripdata <- data_green_tripdata %>%
  mutate(cab_type = "green")
data_yellow_tripdata <- data_yellow_tripdata %>%
  mutate(cab_type = "yellow")

data_tripdata <- bind_rows(data_green_tripdata, data_yellow_tripdata)

Data Transform - Joining trip data with location Zones

data_tripdata <- data_tripdata %>%
  left_join(taxi_zone_lookup %>%
              select(LocationID, Zone) %>%
              rename(PUZone = Zone),
            by = c("PULocationID" = "LocationID"))

data_tripdata <- data_tripdata %>%
  left_join(taxi_zone_lookup %>%
              select(LocationID, Zone) %>%
              rename(DOZone = Zone),
            by = c("DOLocationID" = "LocationID"))

Data Transform - Joining trip data with ID values from dictionary

data_tripdata <- data_tripdata %>%
  left_join(fields_df_tidy %>%
              filter(Field_Name == "VendorID") %>%
              select(Code_Number, Description) %>%
              rename(VendorDescription = Description),
            by = c("VendorID" = "Code_Number"))

data_tripdata <- data_tripdata %>%
  left_join(fields_df_tidy %>%
              filter(Field_Name == "RatecodeID") %>%
              select(Code_Number, Description) %>%
              rename(RatecodeDescription = Description),
            by = c("RatecodeID" = "Code_Number"))

data_tripdata <- data_tripdata %>%
  left_join(fields_df_tidy %>%
              filter(Field_Name == "payment_type") %>%
              select(Code_Number, Description) %>%
              rename(PaymentTypeDescription = Description),
            by = c("payment_type" = "Code_Number"))

Data Transform - All Four Datasets Joined

## Rows: 3,687,470
## Columns: 29
## $ VendorID               <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ lpep_pickup_datetime   <dttm> 2024-08-31 20:06:40, 2024-08-31 20:51:31, 2024…
## $ lpep_dropoff_datetime  <dttm> 2024-08-31 20:31:15, 2024-08-31 21:14:11, 2024…
## $ store_and_fwd_flag     <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", "N…
## $ RatecodeID             <dbl> 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1,…
## $ PULocationID           <dbl> 92, 80, 42, 41, 42, 223, 93, 66, 66, 130, 74, 9…
## $ DOLocationID           <dbl> 37, 80, 116, 42, 42, 83, 10, 148, 262, 191, 42,…
## $ passenger_count        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2,…
## $ trip_distance          <dbl> 10.66, 3.89, 0.80, 1.84, 0.00, 2.19, 7.27, 3.15…
## $ fare_amount            <dbl> 43.6, 22.6, 6.5, 11.4, 29.0, 11.4, 30.3, 19.8, …
## $ extra                  <dbl> 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,…
## $ mta_tax                <dbl> 0.5, 0.5, 0.5, 0.5, 0.0, 0.5, 0.5, 0.5, 0.5, 0.…
## $ tip_amount             <dbl> 0.00, 5.02, 0.00, 0.00, 0.00, 0.00, 6.56, 5.01,…
## $ tolls_amount           <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
## $ ehail_fee              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ improvement_surcharge  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ total_amount           <dbl> 46.10, 30.12, 9.00, 13.90, 30.00, 13.90, 39.36,…
## $ payment_type           <dbl> 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ trip_type              <int> 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ congestion_surcharge   <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 2.75,…
## $ cab_type               <chr> "green", "green", "green", "green", "green", "g…
## $ tpep_pickup_datetime   <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ tpep_dropoff_datetime  <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ Airport_fee            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ PUZone                 <chr> "Flushing", "East Williamsburg", "Central Harle…
## $ DOZone                 <chr> "Bushwick South", "East Williamsburg", "Hamilto…
## $ VendorDescription      <chr> "VeriFone Inc.", "VeriFone Inc.", "VeriFone Inc…
## $ RatecodeDescription    <chr> "Standard rate", "Standard rate", "Standard rat…
## $ PaymentTypeDescription <chr> "Cash", "Credit card", "Credit card", "Cash", "…

Data Transform - Calculate total time between pickup and dropoff

data_tripdata$total_time <- with(data_tripdata, {
  time1 <- ifelse(!is.na(lpep_pickup_datetime) & !is.na(lpep_dropoff_datetime), 
                  as.numeric(difftime(lpep_dropoff_datetime, lpep_pickup_datetime, units = "mins")), 0)
  time2 <- ifelse(!is.na(tpep_pickup_datetime) & !is.na(tpep_dropoff_datetime), 
                  as.numeric(difftime(tpep_dropoff_datetime, tpep_pickup_datetime, units = "mins")), 0)
  
  time1 + time2
})

glimpse(data_tripdata$total_time)

##  num [1:3687470] 24.58 22.67 4.27 8.68 1.35 ...

Data Transform - Dropping features not need for this analysis.

data_tripdata <- data_tripdata %>%
  select(-lpep_pickup_datetime, -lpep_dropoff_datetime, -tpep_pickup_datetime, -tpep_dropoff_datetime, -VendorID, -RatecodeID, -PULocationID, -DOLocationID, -payment_type, -trip_type, -store_and_fwd_flag)

glimpse(data_tripdata)

## Rows: 3,687,470
## Columns: 19
## $ passenger_count        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2,…
## $ trip_distance          <dbl> 10.66, 3.89, 0.80, 1.84, 0.00, 2.19, 7.27, 3.15…
## $ fare_amount            <dbl> 43.6, 22.6, 6.5, 11.4, 29.0, 11.4, 30.3, 19.8, …
## $ extra                  <dbl> 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,…
## $ mta_tax                <dbl> 0.5, 0.5, 0.5, 0.5, 0.0, 0.5, 0.5, 0.5, 0.5, 0.…
## $ tip_amount             <dbl> 0.00, 5.02, 0.00, 0.00, 0.00, 0.00, 6.56, 5.01,…
## $ tolls_amount           <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
## $ ehail_fee              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ improvement_surcharge  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ total_amount           <dbl> 46.10, 30.12, 9.00, 13.90, 30.00, 13.90, 39.36,…
## $ congestion_surcharge   <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 2.75,…
## $ cab_type               <chr> "green", "green", "green", "green", "green", "g…
## $ Airport_fee            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ PUZone                 <chr> "Flushing", "East Williamsburg", "Central Harle…
## $ DOZone                 <chr> "Bushwick South", "East Williamsburg", "Hamilto…
## $ VendorDescription      <chr> "VeriFone Inc.", "VeriFone Inc.", "VeriFone Inc…
## $ RatecodeDescription    <chr> "Standard rate", "Standard rate", "Standard rat…
## $ PaymentTypeDescription <chr> "Cash", "Credit card", "Credit card", "Cash", "…
## $ total_time             <dbl> 24.583333, 22.666667, 4.266667, 8.683333, 1.350…

Data Transform - Finding and Replacing features with na values

##        passenger_count          trip_distance            fare_amount 
##                 485435                      0                      0 
##                  extra                mta_tax             tip_amount 
##                      0                      0                      0 
##           tolls_amount              ehail_fee  improvement_surcharge 
##                      0                3687470                      0 
##           total_amount   congestion_surcharge               cab_type 
##                      0                 485435                      0 
##            Airport_fee                 PUZone                 DOZone 
##                 538171                      0                      0 
##      VendorDescription    RatecodeDescription PaymentTypeDescription 
##                    109                 522998                 485435 
##             total_time 
##                      0

Data Transform - Finding and Replacing features with na values

##        passenger_count          trip_distance            fare_amount 
##                 485435                      0                      0 
##                  extra                mta_tax             tip_amount 
##                      0                      0                      0 
##           tolls_amount              ehail_fee  improvement_surcharge 
##                      0                      0                      0 
##           total_amount   congestion_surcharge               cab_type 
##                      0                      0                      0 
##            Airport_fee                 PUZone                 DOZone 
##                      0                      0                      0 
##      VendorDescription    RatecodeDescription PaymentTypeDescription 
##                    109                 522998                 485435 
##             total_time 
##                      0

Exploratory Data Analysis

EDA - Using ggplot2 for visualizations

library(ggplot2)

EDA - Top 10 Most Frequent Pick-up Zones

EDA - Top 10 Most Frequent Drop-off Zones

EDA - Frequency by payment type

EDA - Density Plot of Total Amount (Somethings Wrong)

EDA - Fixing Outlier’s

summary(data_tripdata$total_amount)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1000.00    15.96    21.48    28.51    31.50  1886.02

data_tripdata <- data_tripdata %>%
  filter(total_amount >= 0)

count_above_500 <- data_tripdata %>%
  filter(total_amount > 500) %>%
  nrow()
count_above_500

## [1] 83

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   16.30   21.80   29.39   31.85  500.00

EDA - Fixed Density Plot

EDA - Top 15 Pickup Zones total average amount

## # A tibble: 6 × 2
##   PUZone                            avg_total_amount
##   <chr>                                        <dbl>
## 1 Rossville/Woodrow                            131. 
## 2 Outside of NYC                               104. 
## 3 Newark Airport                                94.5
## 4 JFK Airport                                   82.9
## 5 Eltingville/Annadale/Prince's Bay             74.8
## 6 Baisley Park                                  70.6

EDA - Top 15 Pickup Zones total average amount

EDA - Top 15 drop off zones by avg total amount

## # A tibble: 6 × 2
##   DOZone                            avg_total_amount
##   <chr>                                        <dbl>
## 1 Freshkills Park                               205.
## 2 Arden Heights                                 161.
## 3 Rossville/Woodrow                             154.
## 4 Eltingville/Annadale/Prince's Bay             145.
## 5 Charleston/Tottenville                        145.
## 6 Great Kills                                   140.

EDA - Top 15 drop off zones by avg total amount

EDA - Average Total Amount by Cab Type

EDA - Top 10 PUZone by Total Amount (Trips > 100)

EDA - Is there a correlation between longer trips and amount earned?

filtered_data <- data_tripdata %>%
  filter(total_time >= 0 & total_time <= 360)

correlation_filtered <- cor(filtered_data$total_time, filtered_data$total_amount, use = "complete.obs")
correlation_filtered

## [1] 0.7604364

EDA - Is there a correlation between longer trips and amount earned?

Conclusions - Basics

On average you make 29.39 per trip as a cab driver in NYC.
Yellow makes more on average then green cab.
Longer drives don’t always earn a significant amount more.
$15 - $30 is a good expected range however there was 63,000+ $100 trips in September.
An overwhelming majority of $100+ are from JFK then Laguardia.

Conclusions - Top 5 pickups zones by average earned is

Rossville/Woodrow $131.10667
Outside of NYC $103.87056
Newark Airport $94.54231
JFK Airport $82.94165
Eltingville/Annadale/Prince’s Bay $74.82500

Conclusions - Top 5 dropoff zones by average earned is

Freshkills Park $205.3375
Arden Heights $160.6545
Rossville/Woodrow 153.7494
Eltingville/Annadale/Prince’s Bay $145.3672
Charleston/Tottenville $144.6669
Great Kills $140.3974

Challenges

Getting the Taxi Dictionary to work
Wasn’t able to webscrape into a dataframe
Download PDF
Use multiple stages of regex and extract
Finally tidy data dictionary

Things learned not learned in class

Working with PDF’s
Building Slides from R using ioslides

Questions/Answers

-Have any questions?

DATA607Final

Erick Hadi

2024-12-01

Introduction

Purpose

Data Science Workflow

The Libraries

Import the Data

Data Tidy - Reading PDF’s

Data Tidy - Using regex to extract the field names and descriptions

Data Tidy - Convert the text data into dataframes

Data Tidy - Remove Columns not Needed

Data Tidy - Even More Regex (Also dropped one since they were the same)

Data Tidy - Even More Regex (Also dropped one since they were the same)

Data Tidy - Use regex to convert description in a dictionary.

Data Tidy - Use regex to convert description in a dictionary.

Data Tidy - Converting Code Number to Numeric

Data Tidy - Four Clean Datasets

Data Transform - Combining Green and Yellow Data

Data Transform - Joining trip data with location Zones

Data Transform - Joining trip data with ID values from dictionary

Data Transform - All Four Datasets Joined

Data Transform - Calculate total time between pickup and dropoff

Data Transform - Dropping features not need for this analysis.

Data Transform - Finding and Replacing features with na values

Data Transform - Finding and Replacing features with na values

Exploratory Data Analysis

EDA - Using ggplot2 for visualizations

EDA - Top 10 Most Frequent Pick-up Zones

EDA - Top 10 Most Frequent Drop-off Zones

EDA - Frequency by payment type

EDA - Density Plot of Total Amount (Somethings Wrong)

EDA - Fixing Outlier’s

EDA - Fixed Density Plot

EDA - Top 15 Pickup Zones total average amount

EDA - Top 15 Pickup Zones total average amount

EDA - Top 15 drop off zones by avg total amount

EDA - Top 15 drop off zones by avg total amount

EDA - Average Total Amount by Cab Type

EDA - Top 10 PUZone by Total Amount (Trips > 100)

EDA - Is there a correlation between longer trips and amount earned?

EDA - Is there a correlation between longer trips and amount earned?

Conclusions - Basics

Conclusions - Top 5 pickups zones by average earned is

Conclusions - Top 5 dropoff zones by average earned is

Conclusions - Recommend

Challenges

Things learned not learned in class

Questions/Answers