library(tidyverse)
library(knitr)
library(dplyr)
library(highcharter)
Project 3
Source: https://www.pickpik.com/yellow-taxis-buses-streets-new-york-city-urban-77471
Introduction
The dataset I am using for Project 3 is the January 2025 Yellow Taxi Trip Record Data published by the NYC Taxi and Limousine Commission (TLC). The TLC receives the records from technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP). According to the TLC, some variables were recorded automatically when the meter was engaged and disengaged, while others, such as passenger count, were entered manually by the driver at the beginning of the trip. The data was collected and reported by two main vendors, Creative Mobile Technologies LLC and VeriFone Inc.
The dataset contains 20 variables describing each taxi trip, but I will mostly focus on location zone, date and time, fare amount, tip amount, passenger count, and payment method. I am going to explore taxi demand, average prices, changes throughout the month, and how different factors affect the total cost of a taxi trip.
I chose this topic and dataset because I have been to NYC a few times but never took a taxi there, since I was worried the prices would be too high. When I saw this dataset, I was intrigued to investigate whether I was right or wrong. I also liked how meaningful the data is, because it uses real trip records instead of survey or simulated data.
Loading: Libraries and Dataset
# importing dataset
setwd("~/Desktop/Desktop - Jackie’s MacBook Pro/DATA 110")
nyc_taxi <- read_csv("~/Desktop/Desktop - Jackie’s MacBook Pro/DATA 110/yellow_tripdata_01.2025.csv")
Cleaning: Removing, filtering, and renaming
To correctly interpret and clean the dataset, I used the TLC yellow taxi trip data dictionary, which defines the key variables and codes used in the trip records. For example, the data dictionary documents what each field represents and clarifies that some variables come directly from the taximeter while others are entered by the driver. This was important to research to avoid misinterpretation and unfounded assumptions when deciding which variables and records to filter out during cleaning.
Certified list from “www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf”
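Before committing to a set of cleaning rules, it can help to count how many rows each rule would remove. The following is a minimal base R sketch on a toy data frame, not the TLC file itself. All values are hypothetical, and the only assumption carried over from the real data is that missing passenger counts appear as the literal text `\N`, which forces that column to be read as character.

```r
# Toy data (hypothetical values, not taken from the TLC file)
toy <- data.frame(
  passenger_count = c("1", "0", "\\N", "2", "1"),
  trip_distance   = c(2.5, 1.0, 3.2, 0, 4.1),
  fare_amount     = c(14.2, 8.0, 18.5, 3.0, -5.0),
  stringsAsFactors = FALSE
)

# One logical vector per cleaning rule; TRUE means the row violates the rule
rules <- list(
  zero_passengers    = toy$passenger_count == "0",
  missing_passengers = toy$passenger_count == "\\N",
  zero_distance      = toy$trip_distance == 0,
  non_positive_fare  = toy$fare_amount <= 0
)

# Number of rows flagged by each rule
dropped <- vapply(rules, sum, integer(1))
print(dropped)

# Keep only the rows that pass every rule
keep <- !Reduce(`|`, rules)
toy_clean <- toy[keep, ]
print(nrow(toy_clean))
```

Looking at the per-rule counts first makes it easier to spot a rule that removes far more rows than expected, which usually points to a misread column type or code.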
Removing and filtering
taxi_select <- nyc_taxi |>
  # removing variables that I won't be focusing on
  select(-VendorID, -mta_tax, -improvement_surcharge, -store_and_fwd_flag, -Airport_fee, -congestion_surcharge, -tolls_amount, -extra, -cbd_congestion_fee) |>
  # filtering out missing ("\N") or zero values that skew the data
  filter(passenger_count != "0",
         passenger_count != "\\N",
         trip_distance != 0,
         fare_amount != 0) |>
  # filtering out negative dollar amounts, which look like inaccurate data inputs
  filter(total_amount >= 0,
         fare_amount >= 0) |>
  # keeping only credit card (1) and cash (2) payments because "dispute" and "no charge" are either voided or uncompleted fares
  filter(payment_type %in% c("1", "2"))
Renaming:
taxi_clean <- taxi_select |>
  # changing categorical variables that were input as numerical.
  mutate(payment_type = recode(payment_type,
                               "1" = "Credit Card",
                               "2" = "Cash")) |>
  mutate(RatecodeID = recode(RatecodeID,
                             "1" = "Standard Rate",
                             "2" = "JFK",
                             "3" = "Newark",
                             "4" = "Nassau or Westchester",
                             "5" = "Negotiated fare",
                             "6" = "Group ride",
                             "99" = "Null/unknown"))
Multiple Linear Regression: Predicting Total Trip Cost
library(ggfortify)
Adding Bounds
# adding some bounds after seeing a few outliers in initial render
taxi_bounds <- taxi_clean |>
  filter(
    trip_distance <= 150,
    total_amount <= 500)
Linear Regression Model: Predicting Trip Cost
fit3 <- lm(total_amount ~ trip_distance + passenger_count + tip_amount + payment_type, data = taxi_bounds)
summary(fit3)
Call:
lm(formula = total_amount ~ trip_distance + passenger_count +
    tip_amount + payment_type, data = taxi_bounds)

Residuals:
    Min      1Q  Median      3Q     Max
-400.13   -2.11   -0.58    1.40  441.98

Coefficients:
                         Estimate Std. Error  t value Pr(>|t|)
(Intercept)             11.738382   0.010299 1139.803  < 2e-16 ***
trip_distance            3.690533   0.001056 3495.157  < 2e-16 ***
passenger_count2         0.424114   0.010049   42.206  < 2e-16 ***
passenger_count3         0.841724   0.019996   42.094  < 2e-16 ***
passenger_count4         1.772517   0.025301   70.058  < 2e-16 ***
passenger_count5        -0.275446   0.043610   -6.316 2.68e-10 ***
passenger_count6        -0.436842   0.052948   -8.250  < 2e-16 ***
passenger_count7        62.145111   5.757214   10.794  < 2e-16 ***
passenger_count8        13.659283   2.350377    5.812 6.19e-09 ***
passenger_count9        29.024913   3.323929    8.732  < 2e-16 ***
tip_amount               1.749231   0.001270 1376.823  < 2e-16 ***
payment_typeCredit Card -2.756929   0.011499 -239.748  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.757 on 2769520 degrees of freedom
Multiple R-squared:  0.9268,    Adjusted R-squared:  0.9268
F-statistic: 3.188e+06 on 11 and 2769520 DF,  p-value: < 2.2e-16
autoplot(fit3)
Interpretation
Equation: total_amount=11.738 + 3.691(trip_distance) + 0.424(passenger_count2) + 0.842(passenger_count3) + 1.773(passenger_count4) − 0.275(passenger_count5) − 0.437(passenger_count6) + 62.145(passenger_count7) + 13.659(passenger_count8) + 29.025(passenger_count9) + 1.749(tip_amount) − 2.757(payment_type_creditCard)
Before fitting the final model, I had to set bounds because there were some extreme outliers, with some points at a $70,000 trip cost and others at 40,000 miles in distance. I removed RatecodeID because it had the highest p-value, meaning it was the least statistically significant predictor.
The expected total taxi cost equals a base cost of $11.74 (the intercept, which corresponds to a one-passenger cash trip with zero distance and zero tip), plus $3.69 for each additional mile traveled, plus $1.75 for each additional dollar tipped. Passenger count and payment type are categorical variables, so their coefficients represent differences in total cost relative to the baseline categories (one passenger and cash). Holding the other variables constant, trips paid in cash have a $2.76 higher average total cost than trips paid by credit card. The model explains approximately 92.68% of the variation in total taxi trip cost. All of the coefficients are statistically significant at the 0.001 level, meaning that trip distance, tip amount, passenger count, and payment type are strongly associated with total trip cost.
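To make the coefficients concrete, here is a small worked example that plugs them into the fitted equation by hand. It is only a sketch: the coefficient values are copied from the regression output, while the trip itself (3 miles, one passenger, $4 tip) and the helper name `predict_cost` are made up for illustration.

```r
# Coefficients copied from the summary(fit3) output
coefs <- c(
  intercept     = 11.738382,
  trip_distance = 3.690533,
  tip_amount    = 1.749231,
  credit_card   = -2.756929
)

# Hypothetical helper: predicted total for a one-passenger trip
# (one passenger is the baseline level, so no passenger term is added)
predict_cost <- function(distance, tip, credit_card) {
  unname(coefs["intercept"] +
           coefs["trip_distance"] * distance +
           coefs["tip_amount"] * tip +
           coefs["credit_card"] * as.numeric(credit_card))
}

# Hypothetical trip: 3 miles with a $4 tip
card_cost <- predict_cost(distance = 3, tip = 4, credit_card = TRUE)
cash_cost <- predict_cost(distance = 3, tip = 4, credit_card = FALSE)

print(round(card_cost, 2))              # 27.05
print(round(cash_cost, 2))              # 29.81
print(round(cash_cost - card_cost, 2))  # 2.76
```

The gap between the two predictions is exactly the payment type coefficient, because everything else about the two hypothetical trips is identical.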
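I was expecting credit card payments to be associated with a higher trip cost, since recorded tips are included in the total amount for credit card trips while cash tips are not included in the total amount. Because the model already accounts for tip_amount as its own variable, the payment type coefficient reflects the difference that remains after tips are controlled for.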
Visualization One: Total Cost VS Trip Distance by Payment Type
# separating date and time because I will be using time of day in one visualization
taxi_time <- taxi_bounds |>
  separate(tpep_pickup_datetime, into = c("PU_date", "PU_time"), sep = " ", convert = TRUE) |>
  separate(tpep_dropoff_datetime, into = c("DO_date", "DO_time"), sep = " ", convert = TRUE)
p_dist <-
  ggplot(taxi_bounds, aes(x = trip_distance,
                          y = total_amount,
                          size = tip_amount,
                          color = passenger_count)) +
  geom_point(alpha = 0.25) +
  facet_wrap(~ payment_type) +
  scale_color_brewer(palette = "Set1") +
  labs(title = "Total Taxi Cost VS Trip Distance by Payment Type",
       x = "Trip Distance (miles)",
       y = "Total Amount (USD)",
       caption = "Source: NYC Taxi & Limousine Commission (TPEP)",
       color = "Passenger Count",
       size = "Tip Amount") +
  theme_minimal()
p_dist
Explanation
For my first visualization, I made a faceted scatter plot showing how the total trip cost changes as trip distance increases for each payment type. Each point represents one taxi trip, and the size of each point represents the tip amount. I also colored each point by passenger count. I noticed that the most common passenger count was 1 and that higher passenger counts did not correspond to higher tip amounts, which I found surprising. Overall, the chart shows a clear positive relationship between trip distance and total amount, which means longer trips usually cost more. There are also many points with high totals but short distances, which is likely due to tip amounts. Also note that the Cash panel does not show different point sizes because tips are not recorded for cash payments.
Visualization Two: Taxi Demand and Average Trip Cost
Organizing Data Before Charting
daily_summary <- taxi_time |>
  # Grouping by date
  group_by(PU_date) |>
  # Getting the trip counts and averages
  summarize(trip_count = n(), avg_total_amount = mean(total_amount, na.rm = TRUE)) |>
  mutate(PU_date = as.Date(PU_date))
Highcharter: Trips and Cost
highchart() |>
  # Since I am using two Y axes, I had to research how to add the second one. I found this page, where I got "yAxis = 0", "yAxis = 1", "hc_yAxis_multiples", and "opposite = FALSE". This allowed me to add a second axis and its title.
  # Source: https://stackoverflow.com/questions/40084416/two-y-axis-in-highcharter-in-r
  # Left axis: Total Trips
  hc_add_series(data = daily_summary,
                type = "column",
                name = "Number of Trips",
                hcaes(x = PU_date,
                      y = trip_count),
                yAxis = 0,
                colorByPoint = TRUE) |> # Wanted to color the bars by day. Found this line on https://api.highcharts.com/highcharts/plotOptions.column.colorByPoint
  # Right Axis: Trip Cost
  hc_add_series(data = daily_summary,
                type = "line",
                name = "Average Trip Cost ($)",
                hcaes(x = PU_date,
                      y = avg_total_amount),
                yAxis = 1) |>
  # Text Inputs
  hc_title(text = "Daily Taxi Demand and Average Trip Cost (Jan, 2025)") |>
  hc_subtitle(text = "Bars show total trips per day and line shows average cost per trip. | Red colors = Weekend Days") |>
  hc_xAxis(title = list(text = "Date of Trip"), type = "datetime") |>
  hc_yAxis_multiples(
    list(title = list(text = "Number of Taxi Trips"), opposite = FALSE),
    list(title = list(text = "Average Total Trip Cost ($)"), opposite = TRUE)) |>
  hc_caption(text = "Source: NYC Taxi and Limousine Commission Yellow Taxi Trip Records") |>
  hc_colors(c("#0e4c6d","#03324e","#01263d","#ef4043","#be1e2d","#c43240","#72bad5")) |>
  # Tool tip
  hc_tooltip(
    shared = FALSE,
    pointFormat = paste(
      "Total Trips: {point.trip_count} <br>",
      "Avg Cost: ${point.avg_total_amount} <br>")) |>
  hc_add_theme(hc_theme_economist())
Explanation
My second visualization uses Highcharter to add interactivity and show two daily patterns throughout January 2025. The bars show the total number of taxi trips per day, which represents the demand for taxis, and the line shows the average cost per trip each day, which represents pricing. Looking at the bars and the line together, you can tell which days have higher demand and which have higher average trip costs. I color-coded the days and made Friday through Sunday red to see whether weekends show higher demand or higher trip costs. Fridays and Saturdays did have more trips, but surprisingly Thursdays had the highest counts throughout the month. The busiest single day was Thursday, Jan 16, with more than 108k trips. The slowest day was Monday, Jan 20, with about 64k trips, yet it had the highest average trip cost at about $30.
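The weekday comparison can also be checked numerically instead of by eye. The sketch below uses base R on a toy data frame. Only the two dates and rough counts quoted above (Jan 16 and Jan 20) mirror the report, and the other rows are hypothetical. It uses `format(date, "%u")` for the weekday number because, unlike `weekdays()`, it does not depend on the system language.

```r
# Hypothetical daily counts; only Jan 16 and Jan 20 mirror figures from the text
daily <- data.frame(
  PU_date    = as.Date(c("2025-01-16", "2025-01-17", "2025-01-18", "2025-01-20")),
  trip_count = c(108000, 95000, 97000, 64000)
)

# "%u" gives the ISO weekday number: 1 = Monday ... 7 = Sunday
daily$weekday_num <- as.integer(format(daily$PU_date, "%u"))

# Friday through Sunday, matching the days colored red in the chart
daily$is_weekend <- daily$weekday_num >= 5

# Average trips per weekday number
by_weekday <- aggregate(trip_count ~ weekday_num, data = daily, FUN = mean)
print(by_weekday)
```

With the full month of data, the same `aggregate()` call would average the four or five occurrences of each weekday, which makes the Thursday pattern easy to confirm.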
Conclusion
For this project, I was able to explore how trip variables influence taxi pricing and how demand changed across a typical month. By cleaning the data, I was able to analyze trends and patterns based on the day and distance. The regression results show that trip distance and tip amount are the strongest predictors of total trip cost, which makes sense given how taxi fares are structured. Payment type and passenger count also played a role in the total trip cost, but their effects were smaller compared to distance and tipping.
The first visualization focuses on individual trips and shows a clear relationship between distance and total cost. Separating the data by payment type helped me see whether there are noticeable differences in how total costs are distributed. Since cash tips are not reported, I was able to compare how total cost looks when tips are recorded versus when they are not. I also saw that there were far more credit card payments than cash payments.
The second visualization focuses on daily patterns by comparing the number of trips per day with the average cost. The result suggests that higher demand does not automatically lead to higher average costs, which was interesting because I expected busier days to be more expensive. I noticed that even though demand fluctuates throughout the month, the average cost stays at around $27 per trip.
There were a few things I wanted to focus on, but I was either unable to figure them out or decided to go with another approach. I attempted to calculate the total time elapsed for each trip using pickup and dropoff times, but it was more complex than I thought. I also planned to create an interactive map by matching taxi LocationIDs with geographic coordinates, since the zones are listed only as numeric IDs. Though I was able to gather the LocationID-to-borough list, I was unable to merge the two datasets and convert the boroughs to coordinates.
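Both unfinished ideas can be approached with base R. The sketch below is a starting point under stated assumptions, not a tested solution for the full dataset. The trips and the zone lookup table are toy data with hypothetical values, the column name `PULocationID` is assumed for the pickup zone, and the coordinates are rough placeholders. The second toy trip crosses midnight, which is the case that breaks when only the separated time-of-day columns are subtracted.

```r
# Toy trips with hypothetical values; timestamps use "YYYY-MM-DD HH:MM:SS"
trips <- data.frame(
  tpep_pickup_datetime  = c("2025-01-16 08:05:00", "2025-01-16 23:50:00"),
  tpep_dropoff_datetime = c("2025-01-16 08:27:30", "2025-01-17 00:10:00"),
  PULocationID          = c(161L, 132L),  # assumed pickup-zone column name
  stringsAsFactors = FALSE
)

# Parse the full timestamps (date and time together) in one time zone,
# then subtract; this handles trips that cross midnight
pickup  <- as.POSIXct(trips$tpep_pickup_datetime,  tz = "UTC")
dropoff <- as.POSIXct(trips$tpep_dropoff_datetime, tz = "UTC")
trips$duration_min <- as.numeric(difftime(dropoff, pickup, units = "mins"))

# Hypothetical lookup table: zone ID -> borough plus placeholder coordinates
zones <- data.frame(
  LocationID = c(132L, 161L),
  Borough    = c("Queens", "Manhattan"),
  lat        = c(40.6, 40.8),    # rough placeholders, not real centroids
  lon        = c(-73.8, -74.0),
  stringsAsFactors = FALSE
)

# Left join: keep every trip and attach zone details where the ID matches
trips_zoned <- merge(trips, zones,
                     by.x = "PULocationID", by.y = "LocationID",
                     all.x = TRUE)
print(trips_zoned[, c("PULocationID", "Borough", "duration_min")])
```

If the datetime columns were already parsed as date-times on import, the `as.POSIXct()` step is unnecessary and `difftime()` can be applied to the original columns before they are separated.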
Bibliography
“TLC Trip Record Data.” www.NYC.gov, https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page. Accessed 10 Dec. 2025.
“Data Dictionary – Yellow Taxi Trip Records.” www.NYC.gov, https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf. Accessed 10 Dec. 2025.
“Highcharts Core API Option: Colorbypoint.” Highcharts, https://api.highcharts.com/highcharts/plotOptions.column.colorByPoint. Accessed 13 Dec. 2025.
Fong, David. “Two Y Axis in Highcharter in R.” Stack Overflow, 30 Aug. 2020, https://stackoverflow.com/questions/40084416/two-y-axis-in-highcharter-in-r.