Dataset: NYC Yellow Taxi Trip Records — January to
August 2023
Source: NYC Taxi and Limousine Commission (TLC)
Target Audience: Urban transportation analysts, taxi
fleet managers, and pricing strategists.
Core Goal: Build a combined predictive analytics
framework that estimates base fare (regression) and predicts passenger
tipping behaviour (classification) to support dynamic pricing
decisions.
NYC Yellow Taxis generate rich trip-level data that captures the interplay between fare pricing, geography, time-of-day, and passenger generosity. By combining a regression model (predicting the base metered fare) with a multi-class classification model (categorising the passenger’s tipping level into No Tip, Low, Medium, or High), we construct an end-to-end system capable of simulating total driver compensation for any routing and timing scenario. This, in turn, enables the design of a transparent, data-driven dynamic pricing strategy that rewards drivers fairly while keeping passenger costs competitive.
Classification : Predict the tipping level —
No Tip, Low, Medium, or
High — using trip characteristics, temporal signals, and
geographic zone information.
Tipping constitutes a significant and variable component of driver income that is entirely outside the metered fare structure. By predicting tipping behavior before a trip begins, fleet operators can better estimate total driver compensation per route, identify systematically under-tipped trip types, and design incentive structures that direct drivers toward higher-yield opportunities.
Regression : Predict the base metered fare amount from the same feature set, providing the cost foundation for the overall pricing framework.
While the NYC taximeter deterministically calculates fares during a trip, a pre-trip fare estimate enables dynamic pricing simulations, transparent upfront cost communication to passengers, and route profitability analysis, all without requiring the trip to be completed first.
The raw dataset contains 19,493,060 rows and 21 columns, representing every yellow taxi trip logged between January and August 2023. Fields include: pickup and dropoff timestamps, TLC Zone IDs, trip distance, passenger count, itemised fare components (base fare, surcharges, tolls, tip), payment method, vendor, and rate code.
To balance computational feasibility with distributional integrity, we randomly sample 1% of records per calendar month (stratified), yielding approximately 194,930 rows that faithfully preserve seasonal demand patterns.
# ------------------------------------------------------------------
# DATA LOADING & STRATIFIED SAMPLING
# ------------------------------------------------------------------
raw <- read_csv("nyc_yellow_taxi_trip_records_from_Jan_to_Aug_2023.csv",
col_names = TRUE, show_col_types = FALSE)
taxi_zone <- read_csv("taxi_zone_lookup.csv",
col_names = TRUE, show_col_types = FALSE)
set.seed(123)
df <- raw %>%
group_by(
pickup_month = month(
as.POSIXct(tpep_pickup_datetime, format = "%Y-%m-%d %H:%M:%S"),
label = TRUE
)
) %>%
slice_sample(prop = 0.01) %>%
ungroup() %>%
select(-pickup_month)
cat("Sampled dataset dimensions:", dim(df), "\n")## Sampled dataset dimensions: 194927 21
Raw location fields (PULocationID,
DOLocationID) are integer zone codes that carry no inherent
meaning. We join a taxi zone lookup table to resolve each code into a
Borough, Zone name, and
Service zone, enriching the spatial context
dramatically.
# ------------------------------------------------------------------
# GEO-DATA ENRICHMENT — Join taxi zone metadata
# ------------------------------------------------------------------
df <- df %>%
left_join(taxi_zone, by = c("PULocationID" = "LocationID")) %>%
rename(PU_Borough = Borough, PU_Zone = Zone, PU_service_zone = service_zone) %>%
left_join(taxi_zone, by = c("DOLocationID" = "LocationID")) %>%
rename(DO_Borough = Borough, DO_Zone = Zone, DO_service_zone = service_zone)
cat("Post-enrichment dimensions:", dim(df), "\n")## Post-enrichment dimensions: 194927 27
# ------------------------------------------------------------------
# DATA TYPE CONVERSION
# ------------------------------------------------------------------
df <- df %>%
mutate(
RatecodeID = as.factor(RatecodeID),
VendorID = as.factor(VendorID),
PULocationID = as.factor(PULocationID),
DOLocationID = as.factor(DOLocationID),
store_and_fwd_flag = as.factor(store_and_fwd_flag),
PU_Borough = as.factor(PU_Borough),
DO_Borough = as.factor(DO_Borough),
PU_Zone = as.factor(PU_Zone),
DO_Zone = as.factor(DO_Zone),
PU_service_zone = as.factor(PU_service_zone),
DO_service_zone = as.factor(DO_service_zone)
)# ------------------------------------------------------------------
# DATA QUALITY ASSESSMENT
# ------------------------------------------------------------------
data_quality <- data.frame(
Non_Null = sapply(df, function(x) sum(!is.na(x))),
Null_Count = sapply(df, function(x) sum(is.na(x))),
Null_Pct = round(sapply(df, function(x) mean(is.na(x)) * 100), 2),
Unique_Values = sapply(df, function(x) length(unique(x)))
)
kable(head(data_quality, 15), caption = "Data Quality Assessment — First 15 Columns")| Non_Null | Null_Count | Null_Pct | Unique_Values | |
|---|---|---|---|---|
| …1 | 194927 | 0 | 0.00 | 194927 |
| VendorID | 194927 | 0 | 0.00 | 3 |
| tpep_pickup_datetime | 194927 | 0 | 0.00 | 193326 |
| tpep_dropoff_datetime | 194927 | 0 | 0.00 | 193338 |
| passenger_count | 189737 | 5190 | 2.66 | 9 |
| trip_distance | 194927 | 0 | 0.00 | 2732 |
| RatecodeID | 189737 | 5190 | 2.66 | 8 |
| store_and_fwd_flag | 189737 | 5190 | 2.66 | 3 |
| PULocationID | 194927 | 0 | 0.00 | 233 |
| DOLocationID | 194927 | 0 | 0.00 | 254 |
| payment_type | 194927 | 0 | 0.00 | 5 |
| fare_amount | 194927 | 0 | 0.00 | 2985 |
| extra | 194927 | 0 | 0.00 | 43 |
| mta_tax | 194927 | 0 | 0.00 | 5 |
| tip_amount | 194927 | 0 | 0.00 | 2376 |
The table below summarises the key columns in the raw dataset:
| Column | Type | Description |
|---|---|---|
VendorID |
Factor | TPEP provider (1 = Creative Mobile, 2 = VeriFone) |
tpep_pickup_datetime |
POSIXct | Pickup date and time |
tpep_dropoff_datetime |
POSIXct | Dropoff date and time |
passenger_count |
Numeric | Number of passengers (driver-reported) |
trip_distance |
Numeric | Trip distance in miles |
RatecodeID |
Factor | Rate code (1 = Standard, 2 = JFK, 3 = Newark, etc.) |
PULocationID / DOLocationID |
Factor | TLC Taxi Zone IDs |
payment_type |
Numeric | Payment method (1 = Credit card, 2 = Cash, etc.) |
fare_amount |
Numeric | Time-and-distance metered fare ($) |
tip_amount |
Numeric | Credit-card tip ($) |
congestion_surcharge |
Numeric | Congestion surcharge ($2.50 in Manhattan) |
airport_fee |
Numeric | Airport pickup surcharge ($1.25) |
We apply domain-knowledge rules to remove records that cannot represent valid credit-card taxi trips:
payment_type == 1): cash tips are not digitally recorded,
producing false zeros in tip_amount.Unknown or N/A.# ------------------------------------------------------------------
# DATA CLEANING — Domain-rule filters
# ------------------------------------------------------------------
df <- df %>%
filter(
fare_amount > 0,
payment_type == 1,
passenger_count >= 1 & passenger_count <= 6,
trip_distance > 0,
tpep_dropoff_datetime > tpep_pickup_datetime,
RatecodeID %in% c("1","2","3","4","5","6"),
PU_Borough != "Unknown" & PU_Borough != "N/A",
DO_Borough != "Unknown" & DO_Borough != "N/A"
) %>%
select(-any_of("...1"), -payment_type, -PULocationID, -DOLocationID)
cat("Shape after cleaning:", dim(df), "\n")## Shape after cleaning: 146597 23
# ------------------------------------------------------------------
# DEDUPLICATION
# ------------------------------------------------------------------
n_dup <- sum(duplicated(df))
cat("Duplicate rows:", n_dup, "\n")## Duplicate rows: 0
if (n_dup > 0) { df <- distinct(df) }
# ------------------------------------------------------------------
# MISSING VALUE ANALYSIS
# ------------------------------------------------------------------
missing <- data.frame(
Missing_Count = colSums(is.na(df)),
Missing_Percent = round(colMeans(is.na(df)) * 100, 2)
)
missing <- missing[order(missing$Missing_Count, decreasing = TRUE), ]
cat("Total missing cells:", sum(is.na(df)), "\n")## Total missing cells: 146597
kable(missing[missing$Missing_Count > 0, ],
caption = "Columns with Missing Values (post-cleaning)")| Missing_Count | Missing_Percent | |
|---|---|---|
| airport_fee | 123581 | 84.3 |
| Airport_fee | 23016 | 15.7 |
Inconsistent capitalisation of the airport fee column across monthly
data partitions (Airport_fee vs airport_fee)
is resolved by merging into a single standardised column.
# ------------------------------------------------------------------
# MISSING VALUE HEATMAP — Before column reconciliation
# Reconstruct the pre-reconciliation state: airport_fee and Airport_fee
# both exist, complementing each other's missing values.
# ------------------------------------------------------------------
df_pre <- df
pre_cols <- c(setdiff(names(df_pre), "Airport_fee"), "Airport_fee")
mm_before <- is.na(df_pre[, pre_cols])
mdf_before <- melt(mm_before)
ggplot(mdf_before, aes(x = Var2, y = Var1, fill = value)) +
geom_tile() +
scale_fill_viridis_d(labels = c("Present", "Missing")) +
labs(
title = "Missing Value Heatmap — Before Airport Fee Reconciliation",
subtitle = "Both airport_fee and Airport_fee columns coexist with complementary missing values",
x = "Columns", y = "", fill = "Status"
) +
theme_minimal() +
theme(
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.x = element_text(angle = 90, vjust = 0.5, size = 9),
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 10, color = "#666")
)
Missing Value Heatmap — Before Reconciliation. Both
airport_fee and Airport_fee columns are
visible, each carrying missing values in the rows where the other
variant was populated.
Finding (Before Reconciliation): Prior to merging,
two columns — airport_fee and Airport_fee —
coexist in the dataset due to inconsistent capitalisation across
different monthly data partition exports. Each column carries missing
values precisely in the rows where the other variant was recorded,
effectively splitting the airport fee information across two
complementary columns. The duplicate column would corrupt any downstream
pipeline that naively reads column names.
# ------------------------------------------------------------------
# COLUMN RECONCILIATION — Merge airport fee variants into one column
# ------------------------------------------------------------------
df$airport_fee <- ifelse(is.na(df$airport_fee), df$Airport_fee, df$airport_fee)
df$Airport_fee <- NULL# ------------------------------------------------------------------
# MISSING VALUE HEATMAP — After column reconciliation
# ------------------------------------------------------------------
missing_matrix <- is.na(df)
missing_df <- melt(missing_matrix)
ggplot(missing_df, aes(x = Var2, y = Var1, fill = value)) +
geom_tile() +
scale_fill_viridis_d(labels = c("Present", "Missing")) +
labs(
title = "Missing Value Heatmap — After Airport Fee Reconciliation",
subtitle = "Yellow = Missing | Purple = Present | airport_fee now fully consolidated",
x = "Columns", y = "", fill = "Status"
) +
theme_minimal() +
theme(
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.text.x = element_text(angle = 90, vjust = 0.5, size = 9),
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 10, color = "#666")
)
Missing Value Heatmap — After Reconciliation. A single
unified airport_fee column remains; the duplicate
Airport_fee variant has been dropped.
Finding (After Reconciliation): After merging
airport_fee and Airport_fee into a single
standardised column and dropping the duplicate, the dataset shows a
clean, unified structure. The only remaining missing values are confined
to congestion_surcharge and the now-consolidated
airport_fee column, affecting fewer than 1% of rows. These
residual gaps arise from system recording failures on older vehicles and
are safely handled by the final na.omit() step before
modelling.
# ------------------------------------------------------------------
# OUTLIER DETECTION — IQR method across all numerical features
# ------------------------------------------------------------------
FEATURE_COLS <- c(
"passenger_count", "trip_distance", "fare_amount", "extra",
"mta_tax", "tip_amount", "tolls_amount",
"improvement_surcharge", "total_amount",
"congestion_surcharge", "airport_fee"
)
rows <- list()
for (col in FEATURE_COLS) {
Q1 <- quantile(df[[col]], 0.25, na.rm = TRUE)
Q3 <- quantile(df[[col]], 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
lo <- Q1 - 1.5 * IQR_val
hi <- Q3 + 1.5 * IQR_val
n_out <- sum(df[[col]] < lo | df[[col]] > hi, na.rm = TRUE)
rows[[col]] <- data.frame(
Feature = col, Q1 = round(Q1, 2), Q3 = round(Q3, 2),
IQR = round(IQR_val, 2), Lower_Fence = round(lo, 2),
Upper_Fence = round(hi, 2), Outlier_Count = n_out,
Outlier_Pct = round(n_out / nrow(df) * 100, 2)
)
}
outlier_df <- do.call(rbind, rows)
kable(outlier_df, caption = "IQR Outlier Summary — Numerical Features")| Feature | Q1 | Q3 | IQR | Lower_Fence | Upper_Fence | Outlier_Count | Outlier_Pct | |
|---|---|---|---|---|---|---|---|---|
| passenger_count | passenger_count | 1.0 | 1.00 | 0.00 | 1.00 | 1.00 | 33085 | 22.57 |
| trip_distance | trip_distance | 1.1 | 3.36 | 2.26 | -2.29 | 6.75 | 18838 | 12.85 |
| fare_amount | fare_amount | 9.3 | 21.20 | 11.90 | -8.55 | 39.05 | 14941 | 10.19 |
| extra | extra | 0.0 | 2.50 | 2.50 | -3.75 | 6.25 | 2746 | 1.87 |
| mta_tax | mta_tax | 0.5 | 0.50 | 0.00 | 0.50 | 0.50 | 602 | 0.41 |
| tip_amount | tip_amount | 2.2 | 4.90 | 2.70 | -1.85 | 8.95 | 14113 | 9.63 |
| tolls_amount | tolls_amount | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 11584 | 7.90 |
| improvement_surcharge | improvement_surcharge | 1.0 | 1.00 | 0.00 | 1.00 | 1.00 | 76 | 0.05 |
| total_amount | total_amount | 16.4 | 30.30 | 13.90 | -4.45 | 51.15 | 16975 | 11.58 |
| congestion_surcharge | congestion_surcharge | 2.5 | 2.50 | 0.00 | 2.50 | 2.50 | 7175 | 4.89 |
| airport_fee | airport_fee | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 11638 | 7.94 |
## Total outliers detected: 131773
# ------------------------------------------------------------------
# FACETED BOXPLOTS — Outlier Distribution
# ------------------------------------------------------------------
outlier_features <- df[, FEATURE_COLS]
melted_outliers <- melt(outlier_features)
ggplot(melted_outliers, aes(x = variable, y = value, fill = variable)) +
geom_boxplot(outlier.color = "firebrick", outlier.alpha = 0.4, outlier.size = 1.2) +
facet_wrap(~variable, scales = "free", ncol = 4) +
theme_minimal() +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
strip.text = element_text(face = "bold", size = 10),
panel.grid.major.x = element_blank(),
legend.position = "none",
plot.title = element_text(face = "bold", size = 14)
) +
labs(title = "Outlier Distribution — Numerical Features (Pre-Capping)",
x = "", y = "Range of Values")Outlier Distribution — Numerical Features. Red dots mark individual outlier values beyond the IQR fences.
Finding: Extreme outliers are concentrated in
trip_distance, fare_amount,
tip_amount, and total_amount. These extreme
values represent data entry errors or rare edge cases (e.g., trips to
neighbouring states). The whiskers confirm that the interquartile ranges
are tight and sensible, but the upper tails extend dramatically —
justifying domain-driven capping before modelling.
Using domain knowledge of NYC taxi operations, we apply hard upper bounds to remove implausible records rather than winsorising, which preserves the distribution shape:
# ------------------------------------------------------------------
# ANOMALY CAPPING — Domain-driven hard limits
# ------------------------------------------------------------------
df_cleaned <- df %>%
filter(
trip_distance <= 30,
fare_amount <= 150,
tip_amount <= 50
)
cat("Rows before capping:", nrow(df), "\n")## Rows before capping: 146597
## Rows after capping: 146552
## Extreme anomalies removed: 45
The target variable tipping_level is constructed by
expressing tip_amount as a fraction of
fare_amount:
\[\text{Tip Ratio} = \frac{\text{tip\_amount}}{\text{fare\_amount}}\]
| Class | Condition |
|---|---|
| No Tip | tip_amount == 0 |
| Low | 0 < tip_amount ≤ 15% of fare |
| Medium | 15% < tip_amount ≤ 30% of fare |
| High | tip_amount > 30% of fare |
# ------------------------------------------------------------------
# TARGET VARIABLE — Tipping level (4-class ordinal factor)
# ------------------------------------------------------------------
df_cleaned <- df_cleaned %>%
mutate(
tipping_level = case_when(
tip_amount == 0 ~ "No Tip",
tip_amount > 0 & tip_amount <= 0.15 * fare_amount ~ "Low",
tip_amount > 0.15 * fare_amount & tip_amount <= 0.30 * fare_amount ~ "Medium",
tip_amount > 0.30 * fare_amount ~ "High"
),
tipping_level = factor(tipping_level,
levels = c("No Tip", "Low", "Medium", "High"))
)
cat("\nTipping level distribution:\n")##
## Tipping level distribution:
##
## No Tip Low Medium High
## 5326 19904 81376 39946
To prevent data leakage and remove uninformative features:
total_amount and tip_amount — direct
leakage of the target.improvement_surcharge and mta_tax — fixed
fees with zero variance.extra — collinear with the engineered
is_rush_hour flag.VendorID and store_and_fwd_flag — no
causal link to tipping behaviour.# ------------------------------------------------------------------
# DROP LEAKY / ZERO-VARIANCE / IRRELEVANT COLUMNS
# ------------------------------------------------------------------
df_cleaned <- df_cleaned %>%
select(-total_amount, -tip_amount, -improvement_surcharge,
-mta_tax, -extra, -VendorID, -store_and_fwd_flag)
cat("Shape after dropping columns:", dim(df_cleaned), "\n")## Shape after dropping columns: 146552 16
New features are derived from existing columns in three groups:
Temporal features — extracted from the pickup
timestamp:
trip_duration_mins, pickup_hour,
pickup_day_of_week, pickup_month,
is_weekend, is_rush_hour
Geometry features — derived purely from distance and time, no fare involved:
| Feature | Formula | Purpose |
|---|---|---|
speed_mph |
trip_distance / (trip_duration_mins / 60) |
Average trip speed — captures traffic conditions |
is_airport |
RatecodeID == 2 |
JFK flat-rate flag — distinct pricing regime |
Note on fare_amount: It is retained in
df_final for EDA and regression modelling. However, it is
excluded from classification features because
tipping_level thresholds are defined as a percentage of
fare_amount — including it directly would let models
reverse-engineer the class boundaries (partial leakage).
fare_per_mile and fare_per_minute are also
excluded for the same reason.
# ------------------------------------------------------------------
# FEATURE ENGINEERING — Temporal + geometry features
# ------------------------------------------------------------------
df_final <- df_cleaned %>%
mutate(
# Temporal features
trip_duration_mins = as.numeric(
difftime(tpep_dropoff_datetime, tpep_pickup_datetime, units = "mins")
),
pickup_hour = as.factor(hour(tpep_pickup_datetime)),
pickup_day_of_week = as.factor(wday(tpep_pickup_datetime, label = TRUE)),
pickup_month = as.factor(month(tpep_pickup_datetime, label = TRUE)),
is_weekend = as.integer(wday(tpep_pickup_datetime) %in% c(1, 7)),
is_rush_hour = as.integer(hour(tpep_pickup_datetime) %in% c(7,8,9,16,17,18,19)),
speed_mph = trip_distance / (trip_duration_mins / 60 + 0.01),
is_airport = factor(RatecodeID == 2, levels = c(FALSE, TRUE))
) %>%
filter(trip_duration_mins >= 1, trip_duration_mins <= 180) %>%
select(-tpep_pickup_datetime, -tpep_dropoff_datetime)
cat("Final dataset shape:", dim(df_final), "\n")## Final dataset shape: 146256 22
## Features: passenger_count, trip_distance, RatecodeID, fare_amount, tolls_amount, congestion_surcharge, airport_fee, PU_Borough, PU_Zone, PU_service_zone, DO_Borough, DO_Zone, DO_service_zone, tipping_level, trip_duration_mins, pickup_hour, pickup_day_of_week, pickup_month, is_weekend, is_rush_hour, speed_mph, is_airport
| Stage | Action | Dataset Size |
|---|---|---|
| Raw dataset | Full TLC records (Jan–Aug 2023) | 19,493,060 rows × 21 cols |
| Stratified 1% sample | 1% per month, seasonal balance preserved | ~194,930 rows |
| Geo-data enrichment | Joined taxi zone metadata | Added borough, zone, service zone |
| Cleaning filters | Credit-card only, valid fares/distances/timestamps | Removed ~17% invalid records |
| Deduplication | Dropped duplicate rows | Minimal duplicates |
| Column reconciliation | Merged airport fee column variants | Naming inconsistency resolved |
| Outlier capping | Domain-driven hard limits applied | Removed ~1% extreme values |
| Feature engineering + duration filter | Temporal features (trip_duration_mins, hour, day,
month, weekend, rush-hour), geometry features (speed_mph,
is_airport), 1–180 min duration window |
~146,000 rows × 19 features |
# ------------------------------------------------------------------
# CORRELATION HEATMAP — Numeric Features
# ------------------------------------------------------------------
corr_cols <- c("fare_amount", "trip_distance", "trip_duration_mins",
"passenger_count", "tolls_amount",
"congestion_surcharge", "airport_fee",
"is_weekend", "is_rush_hour")
corr_matrix <- cor(df_final[, corr_cols], use = "complete.obs")
corrplot(
corr_matrix,
method = "color",
type = "upper",
tl.col = "black",
tl.srt = 45,
addCoef.col = "black",
number.cex = 0.78,
col = colorRampPalette(c("#e74c3c", "white", "#2ecc71"))(200),
title = "Correlation Heatmap — Numeric Features",
mar = c(0, 0, 2, 0)
)Correlation Heatmap — Numeric Features. Green = strong positive correlation; red = negative correlation.
Finding: A strong positive correlation exists
between fare_amount and trip_distance (r ≈
0.91) and between fare_amount and
trip_duration_mins (r ≈ 0.86), confirming that distance and
time are the primary drivers of metered fares — exactly as the NYC
taximeter pricing model dictates. tolls_amount is
moderately correlated with distance, reflecting that longer
outer-borough trips are more likely to use tolled bridges or tunnels.
Binary flags (is_weekend, is_rush_hour) show
near-zero correlation with fare, confirming limited linear
predictability from time-of-day signals alone.
# ------------------------------------------------------------------
# HISTOGRAMS — Univariate numeric distributions
# ------------------------------------------------------------------
numeric_cols <- c("fare_amount", "trip_distance", "trip_duration_mins",
"tolls_amount", "congestion_surcharge", "airport_fee")
plot_list <- lapply(numeric_cols, function(col) {
ggplot(df_final, aes(x = .data[[col]])) +
geom_histogram(bins = 50, fill = "#3498db", color = "white", alpha = 0.85) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 12),
axis.title = element_text(face = "bold", size = 10)) +
labs(title = paste("Distribution of", col), x = col, y = "Count")
})
gridExtra::grid.arrange(grobs = plot_list, ncol = 2)Univariate Distributions — Continuous Features. Blue bars show the frequency histogram for each numerical variable.
Finding: Both fare_amount and
trip_distance are strongly right-skewed — the vast majority
of NYC taxi journeys are short, local trips under 5 miles costing less
than $30. trip_duration_mins shows a similar skew with a
long tail, representing occasional slow traffic or long-haul rides.
congestion_surcharge and airport_fee show
bimodal distributions (either applied or not), acting effectively as
binary features rather than continuous ones.
# ------------------------------------------------------------------
# BAR CHART — Passenger Count Distribution
# ------------------------------------------------------------------
df_final %>%
count(passenger_count) %>%
ggplot(aes(x = factor(passenger_count), y = n, fill = n)) +
geom_bar(stat = "identity", color = "white") +
geom_text(aes(label = scales::comma(n)), vjust = -0.4,
fontface = "bold", size = 3.8) +
scale_fill_gradient(low = "#3498db", high = "#2ecc71") +
scale_y_continuous(labels = scales::comma,
limits = c(0, max(table(df_final$passenger_count)) * 1.15)) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11)) +
labs(title = "Distribution of Passenger Count",
x = "Passenger Count", y = "Number of Trips")Passenger Count Distribution. Count labels show the absolute number of trips per passenger group.
Finding: Single-passenger trips overwhelmingly dominate the dataset, accounting for the majority of all recorded journeys. Two-passenger trips are the next most frequent. This distribution indicates that individual tipping preferences — rather than group dynamics — are the primary driver of overall tipping patterns across the dataset.
# ------------------------------------------------------------------
# LINE + POINT — Monthly Trip Volume
# ------------------------------------------------------------------
df_final %>%
count(pickup_month) %>%
ggplot(aes(x = pickup_month, y = n, group = 1)) +
geom_line(color = "#3498db", linewidth = 1.3) +
geom_point(color = "#2ecc71", size = 4) +
geom_text(aes(label = scales::comma(n)), vjust = -0.9,
fontface = "bold", size = 3.8) +
scale_y_continuous(labels = scales::comma,
limits = c(0, max(table(df_final$pickup_month)) * 1.2)) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11)) +
labs(title = "Monthly Trip Count (January – August 2023)",
x = "Month", y = "Number of Trips")Monthly Trip Count — January to August 2023. Points mark the exact sampled count per month; the line shows the temporal trend.
Finding: Trip volumes are broadly stable across the eight months, reflecting consistent underlying demand for taxi services throughout the period. There are minor seasonal fluctuations — with slightly lower volumes in January (consistent with cold weather reducing outdoor activity) and a mild mid-year uptick — but there is no dramatic seasonal collapse, confirming the stratified sampling faithfully captures stable demand patterns.
# ------------------------------------------------------------------
# HORIZONTAL BAR — Average Fare by Pickup Borough
# ------------------------------------------------------------------
df_final %>%
group_by(PU_Borough) %>%
summarise(avg_fare = mean(fare_amount, na.rm = TRUE), .groups = "drop") %>%
arrange(desc(avg_fare)) %>%
ggplot(aes(x = reorder(PU_Borough, avg_fare), y = avg_fare, fill = avg_fare)) +
geom_bar(stat = "identity", color = "white") +
geom_text(aes(label = round(avg_fare, 2)), hjust = -0.2,
fontface = "bold", size = 3.8) +
scale_fill_gradient(low = "#f39c12", high = "#2ecc71") +
coord_flip() +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11)) +
labs(title = "Average Fare Amount by Pickup Borough",
x = "Pickup Borough", y = "Average Fare ($)")Average Fare Amount by Pickup Borough. Colour gradient goes from lower (orange) to higher (green) fares.
Finding: Outer boroughs — particularly Queens and Staten Island — yield significantly higher average fares than Manhattan. This reflects the preponderance of long-distance airport transfer routes originating in Queens (JFK and LaGuardia), where flat-rate fares can exceed $50. Manhattan pickups represent shorter, high-frequency city trips with lower individual fares, confirming the need for geographically differentiated pricing models.
# ------------------------------------------------------------------
# BAR CHART — Average Fare by Hour
# ------------------------------------------------------------------
df_final %>%
group_by(pickup_hour) %>%
summarise(avg_fare = mean(fare_amount, na.rm = TRUE), .groups = "drop") %>%
ggplot(aes(x = pickup_hour, y = avg_fare, fill = avg_fare)) +
geom_bar(stat = "identity", color = "white") +
geom_text(aes(label = round(avg_fare, 1)), vjust = -0.4,
fontface = "bold", size = 3.2) +
scale_fill_gradient(low = "#3498db", high = "#e74c3c") +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11)) +
labs(title = "Average Fare Amount by Hour of Day",
x = "Hour of Day (0–23)", y = "Average Fare ($)")Average Fare by Hour of Day. Colour gradient shifts from blue (lower fare) to red (higher fare).
Finding: Average fares peak sharply during the early morning hours (approximately 4–6 AM). This pattern is explained by airport transfers: passengers catching early flights take high-speed long-distance rides with minimal traffic interference, producing longer mileage counts and higher fares within a shorter elapsed time. Daytime hours show suppressed average fares driven by dense short city hops, while evening hours show a mild secondary rise as commuter and leisure demand shifts.
# ------------------------------------------------------------------
# BAR CHART — Average Fare by Day of Week
# ------------------------------------------------------------------
df_final %>%
group_by(pickup_day_of_week) %>%
summarise(avg_fare = mean(fare_amount, na.rm = TRUE), .groups = "drop") %>%
ggplot(aes(x = pickup_day_of_week, y = avg_fare, fill = avg_fare)) +
geom_bar(stat = "identity", color = "white") +
geom_text(aes(label = round(avg_fare, 2)), vjust = -0.4,
fontface = "bold", size = 3.8) +
scale_fill_gradient(low = "#3498db", high = "#e74c3c") +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11)) +
labs(title = "Average Fare Amount by Day of Week",
x = "Day of Week", y = "Average Fare ($)")Average Fare Amount by Day of Week.
Finding: Weekend fares are marginally higher than weekday fares, suggesting that weekend trips tend to cover greater distances — consistent with leisure travel, visits to entertainment venues, or airport journeys on Saturdays and Sundays. The differences are modest, confirming that day-of-week is a secondary pricing signal rather than a primary driver.
# ------------------------------------------------------------------
# BOXPLOT — Fare Distribution by Hour
# ------------------------------------------------------------------
ggplot(df_final, aes(x = pickup_hour, y = fare_amount, fill = pickup_hour)) +
geom_boxplot(outlier.color = "firebrick", outlier.alpha = 0.3, outlier.size = 0.8) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11)) +
labs(title = "Fare Amount Distribution by Hour of Day",
x = "Hour of Day", y = "Fare Amount ($)")Fare Amount Distribution by Hour. Red dots indicate outlier fares; boxes show the interquartile range.
Finding: Early morning hours (2–5 AM) show not only higher median fares but also substantially wider interquartile ranges and more extreme upper-tail outliers. This reflects the heterogeneous nature of early-hour trips: a mix of very short city runs and costly long-haul airport journeys with no mid-range equilibrium. Daytime boxes (8 AM–6 PM) are compact and low, representing the steady flow of predictable, short urban trips.
# ------------------------------------------------------------------
# SCATTER — Trip Distance vs Fare by Tipping Level
# ------------------------------------------------------------------
ggplot(df_final %>% sample_n(5000),
aes(x = trip_distance, y = fare_amount, color = tipping_level)) +
geom_point(alpha = 0.4, size = 1.2) +
scale_color_manual(values = color_palette) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11),
legend.title = element_text(face = "bold")) +
labs(title = "Trip Distance vs Fare Amount by Tipping Level",
x = "Trip Distance (miles)", y = "Fare Amount ($)",
color = "Tipping Level")Trip Distance vs Fare Amount — coloured by Tipping Level. A random sample of 5,000 trips is plotted.
Finding: A clear and strict linear relationship exists between trip distance and fare amount, validating the taxi meter’s time-and-distance pricing model. However, all four tipping levels are thoroughly mixed across the entire distance-fare range — there is no clean spatial separation between tipping classes. This confirms that tipping level cannot be predicted from distance and fare alone, justifying the use of richer feature sets (temporal, geographic, and categorical) in classification.
# ------------------------------------------------------------------
# STACKED BAR — Tipping Level by Day of Week
# ------------------------------------------------------------------
df_final %>%
count(pickup_day_of_week, tipping_level) %>%
group_by(pickup_day_of_week) %>%
mutate(pct = n / sum(n) * 100) %>%
ungroup() %>%
ggplot(aes(x = pickup_day_of_week, y = pct, fill = tipping_level)) +
geom_bar(stat = "identity", color = "white") +
scale_fill_manual(values = color_palette) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11),
legend.title = element_text(face = "bold")) +
labs(title = "Tipping Level Distribution by Day of Week",
x = "Day of Week", y = "Percentage (%)", fill = "Tipping Level")Tipping Level Distribution by Day of Week. 100% stacked bars show the proportional mix of tipping classes each day.
Finding: Tipping proportions are remarkably stable across all seven days of the week. The “Medium” category consistently dominates at roughly 55–60% of trips, while “High”, “Low”, and “No Tip” segments remain similarly proportioned regardless of weekday or weekend. This finding strongly suggests that tipping behaviour is governed by trip-level characteristics rather than which day of the week a journey occurs.
# ------------------------------------------------------------------
# BOXPLOT — Fare Amount by Tipping Level
# ------------------------------------------------------------------
ggplot(df_final, aes(x = tipping_level, y = fare_amount, fill = tipping_level)) +
geom_boxplot(outlier.color = "firebrick", outlier.alpha = 0.4, outlier.size = 1.2) +
scale_fill_manual(values = color_palette) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11)) +
labs(title = "Fare Amount by Tipping Level",
x = "Tipping Level", y = "Fare Amount ($)")Fare Amount Distribution by Tipping Level. Each box shows the spread of fare amounts within each tipping category.
Finding: Counter-intuitively, passengers in the “No Tip” category tend to have higher median fares than those in the “High” tier. “High” tippers cluster at lower fare values. This “tip fatigue” effect suggests that passengers on expensive trips feel less inclined to add a generous tip on top of an already-large bill, while passengers on short, inexpensive rides are more willing to tip generously as a proportion of the fare.
# ------------------------------------------------------------------
# BAR CHART — Trip Count by Hour
# ------------------------------------------------------------------
df_final %>%
count(pickup_hour) %>%
ggplot(aes(x = pickup_hour, y = n, fill = n)) +
geom_bar(stat = "identity", color = "white") +
scale_fill_gradient(low = "#3498db", high = "#e74c3c") +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11)) +
labs(title = "Trip Count by Hour of Day",
x = "Hour of Day (0–23)", y = "Number of Trips")Trip Count by Hour of Day. Darker bars indicate the highest demand hours.
Finding: Trip volumes surge during the late afternoon and early evening (approximately 5–8 PM), corresponding to the evening commuter rush and post-work social activity. There is also a secondary morning peak around 8–9 AM. The lowest demand occurs between 2–5 AM — the same window where average fares are highest, confirming an inverse relationship between trip volume and average trip value during off-peak hours.
# ------------------------------------------------------------------
# STACKED BAR — Tipping Level by Congestion Surcharge
# ------------------------------------------------------------------
df_final %>%
count(congestion_surcharge, tipping_level) %>%
group_by(congestion_surcharge) %>%
mutate(pct = n / sum(n) * 100) %>%
ungroup() %>%
ggplot(aes(x = factor(congestion_surcharge), y = pct, fill = tipping_level)) +
geom_bar(stat = "identity", color = "white") +
scale_fill_manual(values = color_palette) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11),
legend.title = element_text(face = "bold")) +
labs(title = "Tipping Level Distribution by Congestion Surcharge",
x = "Congestion Surcharge ($)", y = "Percentage (%)", fill = "Tipping Level")Tipping Level Distribution by Congestion Surcharge Level. Each bar represents trips with a specific congestion surcharge amount.
Finding: The presence or absence of a congestion surcharge has negligible impact on tipping proportions. Passengers appear to mentally segregate the mandatory congestion surcharge from the voluntary tip, treating it as an unavoidable government-mandated fee rather than an increase in the driver’s earnings. This is a key insight: congestion pricing reform does not appear to undermine driver tipping income.
# ------------------------------------------------------------------
# STACKED BAR — Tipping Level by Airport Fee
# ------------------------------------------------------------------
df_final %>%
count(airport_fee, tipping_level) %>%
group_by(airport_fee) %>%
mutate(pct = n / sum(n) * 100) %>%
ungroup() %>%
ggplot(aes(x = factor(airport_fee), y = pct, fill = tipping_level)) +
geom_bar(stat = "identity", color = "white") +
scale_fill_manual(values = color_palette) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11),
legend.title = element_text(face = "bold")) +
labs(title = "Tipping Level Distribution by Airport Fee",
x = "Airport Fee ($)", y = "Percentage (%)", fill = "Tipping Level")Tipping Level Distribution by Airport Fee. Trips with the airport surcharge show a distinct tipping mix.
Finding: Trips attracting an airport pickup fee show a noticeably higher proportion of “High” tippers compared to non-airport trips. Airport passengers — who are typically travellers rather than regular commuters — are less price-sensitive and more accustomed to service gratuity norms. This makes the airport indicator a valuable predictive signal for the tipping classification model.
# ------------------------------------------------------------------
# BAR CHART — Avg Fare Weekday vs Weekend
# ------------------------------------------------------------------
df_final %>%
group_by(is_weekend) %>%
summarise(avg_fare = mean(fare_amount, na.rm = TRUE), .groups = "drop") %>%
mutate(is_weekend = ifelse(is_weekend == 1, "Weekend", "Weekday")) %>%
ggplot(aes(x = is_weekend, y = avg_fare, fill = is_weekend)) +
geom_bar(stat = "identity", color = "white") +
geom_text(aes(label = round(avg_fare, 2)), vjust = -0.4,
fontface = "bold", size = 4) +
scale_fill_manual(values = c("Weekday" = "#3498db", "Weekend" = "#e74c3c")) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11)) +
labs(title = "Average Fare Amount: Weekday vs Weekend",
x = "", y = "Average Fare ($)")Average Fare: Weekday vs Weekend. Blue = weekday, red = weekend.
Finding: Weekend fares are marginally higher on average than weekday fares. While the absolute dollar difference is small, it is directionally consistent with the hypothesis that weekend trips serve leisure purposes (night out, travel, entertainment) which tend to cover longer distances than routine weekday commutes.
# ------------------------------------------------------------------
# BAR CHART — Avg Fare Rush vs Non-Rush
# ------------------------------------------------------------------
df_final %>%
group_by(is_rush_hour) %>%
summarise(avg_fare = mean(fare_amount, na.rm = TRUE), .groups = "drop") %>%
mutate(is_rush_hour = ifelse(is_rush_hour == 1, "Rush Hour", "Non-Rush Hour")) %>%
ggplot(aes(x = is_rush_hour, y = avg_fare, fill = is_rush_hour)) +
geom_bar(stat = "identity", color = "white") +
geom_text(aes(label = round(avg_fare, 2)), vjust = -0.4,
fontface = "bold", size = 4) +
scale_fill_manual(values = c("Non-Rush Hour" = "#3498db", "Rush Hour" = "#e74c3c")) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11)) +
labs(title = "Average Fare Amount: Rush Hour vs Non-Rush Hour",
x = "", y = "Average Fare ($)")Average Fare: Rush Hour vs Non-Rush Hour.
Finding: Somewhat counter-intuitively, rush-hour trips do not produce the highest average fares. Intense traffic congestion during peak hours limits the total distance a taxi can cover within a given duration, compressing total fare accrual. By contrast, non-rush-hour trips — especially in the late evening and early morning — allow faster travel over greater distances, yielding higher individual fares despite lighter overall demand.
# ------------------------------------------------------------------
# STACKED BAR — Tipping Level Weekday vs Weekend
# ------------------------------------------------------------------
df_final %>%
mutate(is_weekend = ifelse(is_weekend == 1, "Weekend", "Weekday")) %>%
count(is_weekend, tipping_level) %>%
group_by(is_weekend) %>%
mutate(pct = n / sum(n) * 100) %>%
ungroup() %>%
ggplot(aes(x = is_weekend, y = pct, fill = tipping_level)) +
geom_bar(stat = "identity", color = "white") +
scale_fill_manual(values = color_palette) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11),
legend.title = element_text(face = "bold")) +
labs(title = "Tipping Level Distribution: Weekday vs Weekend",
x = "", y = "Percentage (%)", fill = "Tipping Level")Tipping Level Distribution: Weekday vs Weekend.
Finding: The tipping class distribution is highly consistent between weekdays and weekends, mirroring the stability observed in the day-of-week analysis. The “Medium” tip category dominates both conditions at similar proportions, reinforcing the conclusion that passenger tipping decisions are not meaningfully influenced by whether the trip occurs on a weekday or a weekend.
# ------------------------------------------------------------------
# STACKED BAR — Tipping Level Rush vs Non-Rush
# ------------------------------------------------------------------
df_final %>%
mutate(is_rush_hour = ifelse(is_rush_hour == 1, "Rush Hour", "Non-Rush Hour")) %>%
count(is_rush_hour, tipping_level) %>%
group_by(is_rush_hour) %>%
mutate(pct = n / sum(n) * 100) %>%
ungroup() %>%
ggplot(aes(x = is_rush_hour, y = pct, fill = tipping_level)) +
geom_bar(stat = "identity", color = "white") +
scale_fill_manual(values = color_palette) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11),
legend.title = element_text(face = "bold")) +
labs(title = "Tipping Level Distribution: Rush Hour vs Non-Rush Hour",
x = "", y = "Percentage (%)", fill = "Tipping Level")Tipping Level Distribution: Rush Hour vs Non-Rush Hour.
Finding: Tipping distributions are virtually identical between rush-hour and non-rush-hour periods. Passengers do not appear to penalise drivers for traffic delays by tipping less. The “Medium” class remains the dominant segment in both conditions, indicating that tipping generosity is decoupled from road conditions and time-of-day demand cycles.
# ------------------------------------------------------------------
# BAR CHART — Target Class Distribution
# ------------------------------------------------------------------
plot_data <- df_final %>%
count(tipping_level) %>%
mutate(
pct = (n / sum(n)) * 100,
label_text = paste0(scales::comma(n), "\n(", round(pct, 1), "%)"),
tipping_level = factor(tipping_level,
levels = c("No Tip","Low","Medium","High"))
) %>%
filter(!is.na(tipping_level))
ggplot(plot_data, aes(x = tipping_level, y = n, fill = tipping_level)) +
geom_bar(stat = "identity", color = "white", lwd = 0.8) +
geom_text(aes(label = label_text),
vjust = -0.25, fontface = "bold", size = 3.8, lineheight = 0.9) +
scale_fill_manual(values = color_palette) +
scale_y_continuous(limits = c(0, max(plot_data$n) * 1.18),
labels = scales::comma) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(face = "bold", size = 14, hjust = 0),
axis.title = element_text(face = "bold", size = 11),
panel.grid.major.x = element_blank()) +
labs(title = "Class Distribution — Tipping Level",
x = "Tipping Level", y = "Count")Class Distribution — Tipping Level Target Variable. Bars show absolute counts and percentages for each class.
Finding: The dataset is clearly imbalanced. “Medium” tips account for approximately 55.6% of all trips, while “No Tip” is the rarest class at only 3.6%. “High” (27.3%) and “Low” (13.6%) are meaningful minority classes. This class imbalance makes standard accuracy a misleading metric — a naive classifier predicting “Medium” for all observations would achieve 55.6% accuracy. It directly motivates the use of inverse-frequency class weighting strategies in all three classification models.
# ------------------------------------------------------------------
# FACETED LINE — Hourly Avg Fare: Airport vs Standard
# ------------------------------------------------------------------
df_final %>%
group_by(pickup_hour, day_type, is_airport_trip) %>%
summarise(avg_fare = mean(fare_amount, na.rm = TRUE), .groups = "drop") %>%
ggplot(aes(x = pickup_hour, y = avg_fare, color = day_type, group = day_type)) +
geom_line(linewidth = 1.2) +
geom_point(size = 2.5) +
scale_x_discrete(drop = FALSE) +
scale_color_manual(values = c("Weekday" = "#3498db", "Weekend" = "#e74c3c")) +
facet_wrap(~ is_airport_trip, scales = "free_y") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11),
legend.title = element_text(face = "bold"),
legend.position = "bottom",
axis.text.x = element_text(size = 7)) +
labs(title = "Hourly Average Fare: Airport vs Standard (Weekday vs Weekend)",
x = "Hour of Day", y = "Average Fare ($)", color = "Day Type")Hourly Average Fare — Airport vs Standard Trips, Weekday vs Weekend. Faceted by trip type; coloured by day type.
Finding: The two panels reveal fundamentally different pricing structures. Standard trips maintain relatively flat and stable average fares (roughly $15–25) across all hours, with minor weekday/weekend differences. Airport trips, however, exhibit dramatically higher and more volatile average fares — often exceeding $60 — with distinct temporal dips and peaks that reflect fixed-rate airport fare structures (e.g., a flat $56 JFK-to-Manhattan rate) combined with variable surge periods. This confirms that airport and non-airport trips belong to different economic regimes and may benefit from separate modelling approaches.
# ------------------------------------------------------------------
# STACKED BAR — Tipping Level by Top 10 Pickup Zones
# ------------------------------------------------------------------
top_zones <- df_final %>%
count(PU_Zone, sort = TRUE) %>%
slice_head(n = 10) %>%
pull(PU_Zone)
df_final %>%
filter(PU_Zone %in% top_zones) %>%
mutate(PU_Zone = factor(PU_Zone, levels = top_zones)) %>%
ggplot(aes(x = PU_Zone, fill = tipping_level)) +
geom_bar(position = "fill", color = "white", alpha = 0.9) +
scale_fill_manual(values = color_palette) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11),
axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
legend.title = element_text(face = "bold")) +
labs(title = "Tipping Level Mix: Top 10 Pickup Zones",
x = "Pickup Zone", y = "Proportion of Trips", fill = "Tipping Level")Tipping Level Proportions — Top 10 Most Active Pickup Zones. Proportional stacked bars highlight zone-level tipping cultures.
Finding: Geographic zone is a meaningful predictor of tipping behaviour. Airport-adjacent zones (JFK and LaGuardia clusters) display a more generous tipping mix, with elevated proportions of “High” tips. Central Manhattan zones (e.g., Midtown, Upper East Side) show a stable majority of “Medium” tips. High-volume non-airport zones near transit hubs have slightly more “No Tip” occurrences, likely driven by locals who are less inclined to tip on routine commutes.
# ------------------------------------------------------------------
# FACETED BOXPLOT — Trip Distance by Tip and Passenger Count
# ------------------------------------------------------------------
df_final %>%
filter(passenger_count >= 1 & passenger_count <= 6) %>%
ggplot(aes(x = tipping_level, y = trip_distance, fill = tipping_level)) +
geom_boxplot(outlier.alpha = 0.1, outlier.size = 0.6) +
scale_fill_manual(values = color_palette) +
facet_wrap(~ passenger_count, labeller = label_both) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11),
axis.text.x = element_text(angle = 30, hjust = 1)) +
labs(title = "Trip Distance by Tipping Level and Passenger Count",
x = "", y = "Trip Distance (miles)")Trip Distance by Tipping Level — Faceted by Passenger Count. Each panel shows the distance spread for trips with 1–6 passengers.
Finding: The relationship between trip distance and tipping level is largely consistent across all passenger counts (1 through 6). “No Tip” passengers tend to have slightly higher median distances — reinforcing the tip-fatigue pattern identified earlier — while the pattern holds regardless of group size. The consistency across passenger counts suggests that group travel dynamics do not substantially alter individual or group tipping decisions relative to trip distance.
# ------------------------------------------------------------------
# FACETED SCATTER + TRENDLINE — Linearity Analysis
# ------------------------------------------------------------------
ggplot(df_final, aes(x = trip_distance, y = fare_amount)) +
geom_point(alpha = 0.1, size = 0.8, color = "darkgreen") +
geom_smooth(method = "lm", color = "red", se = FALSE, linetype = "solid") +
facet_wrap(~ is_airport_trip) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11)) +
labs(title = "Faceted Linearity Analysis: Standard vs Airport Pricing Structures",
x = "Trip Distance (miles)", y = "Fare Amount ($)")Fare vs Distance — Airport vs Standard Pricing Structures. Red line = OLS regression fit; each panel reveals a different relationship.
Finding: Standard trips display a tight,
well-behaved linear scaling of fare with distance — precisely what the
metered pricing model predicts. Airport trips, however, show a
strikingly different pattern: horizontal clusters of points at specific
fare levels (e.g., the JFK flat rate to Manhattan) that are largely
independent of distance, completely breaking the standard linear
assumption. This structural heterogeneity justifies using separate
features (is_airport_trip) in regression models rather than
a single unified linear model.
# ------------------------------------------------------------------
# CLASSIFICATION — Restore factor types
# ------------------------------------------------------------------
df_clf <- df_final %>%
mutate(
tipping_level = factor(tipping_level, levels = c("No Tip","Low","Medium","High")),
RatecodeID = as.factor(RatecodeID),
PU_Borough = as.factor(PU_Borough),
DO_Borough = as.factor(DO_Borough),
PU_service_zone = as.factor(PU_service_zone),
DO_service_zone = as.factor(DO_service_zone),
pickup_hour = as.factor(pickup_hour),
pickup_day_of_week = factor(as.character(pickup_day_of_week)),
pickup_month = factor(as.character(pickup_month)),
is_weekend = as.factor(is_weekend),
is_rush_hour = as.factor(is_rush_hour)
)
# ------------------------------------------------------------------
# FEATURE SELECTION
# fare_amount excluded — tipping_level is defined as % of fare_amount,
# so including it leaks class boundary information.
# speed_mph and is_airport (engineered in Feature Engineering) are
# the clean geometry-based replacements.
# ------------------------------------------------------------------
model_input <- df_clf %>%
select(
tipping_level,
trip_distance, trip_duration_mins,
speed_mph, is_airport,
passenger_count, tolls_amount, congestion_surcharge, airport_fee,
RatecodeID,
PU_Borough, DO_Borough,
PU_service_zone, DO_service_zone,
pickup_hour, pickup_day_of_week, pickup_month,
is_weekend, is_rush_hour
) %>%
na.omit()
cat("Classification dataset:", nrow(model_input), "rows,",
ncol(model_input) - 1, "features\n")## Classification dataset: 146256 rows, 18 features
tip_levels <- levels(model_input$tipping_level)
n_classes <- length(tip_levels)
# Check class distribution
class_dist <- as.data.frame(table(model_input$tipping_level)) %>%
rename(Level = Var1, Count = Freq) %>%
mutate(Percent = round(Count / sum(Count) * 100, 1))
kable(class_dist, caption = "Class Distribution — Target Variable")| Level | Count | Percent |
|---|---|---|
| No Tip | 5260 | 3.6 |
| Low | 19868 | 13.6 |
| Medium | 81253 | 55.6 |
| High | 39875 | 27.3 |
## Imbalance ratio (max/min): 15.45
# ------------------------------------------------------------------
# TRAIN / TEST SPLIT — 80/20 stratified on target
# ------------------------------------------------------------------
set.seed(42)
split_idx <- createDataPartition(model_input$tipping_level, p = 0.8, list = FALSE)
train_data <- model_input[ split_idx, ]
test_data <- model_input[-split_idx, ]
cat("Train rows:", nrow(train_data), "| Test rows:", nrow(test_data), "\n")## Train rows: 117006 | Test rows: 29250
Testing showed that extreme No Tip weights (~28×) forced models to sacrifice accuracy on Medium and High (82.8% of data) for near-zero improvement on the rarest class. Both RF and XGBoost handle moderate imbalance robustly at this dataset size (~117K training rows) without explicit correction.
# ------------------------------------------------------------------
# IMBALANCE STRATEGY — No oversampling, no inverse-frequency weights
# ------------------------------------------------------------------
cat("\nTraining class frequencies:\n")##
## Training class frequencies:
##
## No Tip Low Medium High
## 3.6 13.6 55.6 27.3
# ------------------------------------------------------------------
# HELPER — Full metrics: Accuracy, Kappa, Macro Precision, Recall, F1,
# per-class AUC (one-vs-rest), Macro AUC
# ------------------------------------------------------------------
get_metrics_clf <- function(actual, predicted, model_name, probs = NULL) {
cm <- confusionMatrix(predicted, actual)
acc <- as.numeric(cm$overall["Accuracy"])
kappa <- as.numeric(cm$overall["Kappa"])
macro_prec <- mean(cm$byClass[, "Pos Pred Value"], na.rm = TRUE)
macro_rec <- mean(cm$byClass[, "Sensitivity"], na.rm = TRUE)
macro_f1 <- mean(cm$byClass[, "F1"], na.rm = TRUE)
# One-vs-rest AUC per class
auc_val <- NA
auc_class <- setNames(rep(NA_real_, n_classes), tip_levels)
if (!is.null(probs)) {
for (cls in tip_levels) {
binary <- as.integer(actual == cls)
roc_i <- roc(binary, probs[, cls], quiet = TRUE)
auc_class[cls] <- as.numeric(auc(roc_i))
}
auc_val <- mean(auc_class, na.rm = TRUE)
}
class_names <- sub("Class: ", "", names(cm$byClass[, "F1"]))
cat(sprintf("\n=== %s ===\n", model_name))
cat(sprintf(" Accuracy : %.4f\n", acc))
cat(sprintf(" Kappa : %.4f\n", kappa))
cat(sprintf(" Macro Precision : %.4f\n", macro_prec))
cat(sprintf(" Macro Recall : %.4f\n", macro_rec))
cat(sprintf(" Macro F1 : %.4f\n", macro_f1))
if (!is.na(auc_val))
cat(sprintf(" Macro AUC (OvR) : %.4f\n", auc_val))
cat("\n Per-class breakdown:\n")
per_class <- data.frame(
Precision = round(cm$byClass[, "Pos Pred Value"], 4),
Recall = round(cm$byClass[, "Sensitivity"], 4),
F1 = round(cm$byClass[, "F1"], 4),
AUC_OvR = if (!is.na(auc_val))
round(auc_class[class_names], 4)
else rep(NA_real_, n_classes)
)
rownames(per_class) <- class_names
print(per_class)
list(cm = cm, accuracy = acc, kappa = kappa,
macro_precision = macro_prec,
macro_recall = macro_rec,
macro_f1 = macro_f1,
auc = auc_val,
auc_class = auc_class,
probs = probs,
name = model_name)
}A parametric linear baseline. Class imbalance is addressed via case weights that adjust each observation’s contribution to the maximum likelihood objective.
# ------------------------------------------------------------------
# MODEL 1: MULTINOMIAL LOGISTIC REGRESSION
# ------------------------------------------------------------------
train_mlr <- train_data %>%
mutate(PU_Borough = relevel(PU_Borough, ref = "Manhattan"),
DO_Borough = relevel(DO_Borough, ref = "Manhattan"),
RatecodeID = relevel(RatecodeID, ref = "1"))
test_mlr <- test_data %>%
mutate(PU_Borough = relevel(PU_Borough, ref = "Manhattan"),
DO_Borough = relevel(DO_Borough, ref = "Manhattan"),
RatecodeID = relevel(RatecodeID, ref = "1"))
model_mlr <- multinom(
tipping_level ~ ., data = train_mlr,
MaxNWts = 30000, maxit = 300, trace = FALSE
)
pred_mlr <- predict(model_mlr, newdata = test_mlr)
probs_mlr <- predict(model_mlr, newdata = test_mlr, type = "probs")
colnames(probs_mlr) <- tip_levels
results_mlr <- get_metrics_clf(test_mlr$tipping_level, pred_mlr,
"Multinomial Logistic Regression",
probs = probs_mlr)##
## === Multinomial Logistic Regression ===
## Accuracy : 0.6583
## Kappa : 0.3326
## Macro Precision : 0.4405
## Macro Recall : 0.3704
## Macro F1 : 0.6900
## Macro AUC (OvR) : 0.6766
##
## Per-class breakdown:
## Precision Recall F1 AUC_OvR
## No Tip NaN 0.0000 NA 0.5770
## Low 0.0000 0.0000 NaN 0.6292
## Medium 0.6561 0.8987 0.7585 0.6909
## High 0.6653 0.5831 0.6215 0.8092
500 decision trees with square-root feature sampling per split.
Native class.weights adjusts bootstrap sampling probability
so minority classes appear in more trees.
# ------------------------------------------------------------------
# MODEL 2: RANDOM FOREST (Tuned — dual grid search)
# ------------------------------------------------------------------
n_cores <- max(1L, parallel::detectCores() - 1L)
cat(sprintf("\n--- Tuning Random Forest (using %d cores) ---\n", n_cores))##
## --- Tuning Random Forest (using 7 cores) ---
mtry_candidates <- c(4, 6, 8, 10, 12)
min_node_candidates <- c(3, 5, 10)
tune_grid <- expand.grid(
mtry = mtry_candidates,
min_node_size = min_node_candidates
)
cat("Testing", nrow(tune_grid), "hyperparameter combinations...\n")## Testing 15 hyperparameter combinations...
# Evaluate each combination via OOB error (test set not used)
tune_results <- apply(tune_grid, 1, function(params) {
tmp <- ranger(
tipping_level ~ .,
data = train_data,
num.trees = 300,
mtry = params["mtry"],
min.node.size = params["min_node_size"],
num.threads = n_cores,
seed = 42
)
tmp$prediction.error
})## Growing trees.. Progress: 96%. Estimated remaining time: 1 seconds.
tune_grid$oob_error <- tune_results
tune_grid$oob_acc <- round((1 - tune_results) * 100, 2)
cat("\nTuning results (sorted by OOB accuracy):\n")##
## Tuning results (sorted by OOB accuracy):
## mtry min_node_size oob_error oob_acc
## 11 4 10 0.3158471 68.42
## 6 4 5 0.3160693 68.39
## 1 4 3 0.3162829 68.37
## 12 6 10 0.3174538 68.25
## 7 6 5 0.3191204 68.09
## 2 6 3 0.3197785 68.02
## 13 8 10 0.3202400 67.98
## 14 10 10 0.3222741 67.77
## 15 12 10 0.3230774 67.69
## 8 8 5 0.3238552 67.61
## 3 8 3 0.3259662 67.40
## 9 10 5 0.3261286 67.39
## 10 12 5 0.3270687 67.29
## 4 10 3 0.3288464 67.12
## 5 12 3 0.3299574 67.00
best_idx <- which.min(tune_grid$oob_error)
best_mtry <- tune_grid$mtry[best_idx]
best_min_node <- tune_grid$min_node_size[best_idx]
cat(sprintf("\nBest: mtry = %d, min.node.size = %d (OOB acc = %.2f%%)\n",
best_mtry, best_min_node, tune_grid$oob_acc[best_idx]))##
## Best: mtry = 4, min.node.size = 10 (OOB acc = 68.42%)
##
## --- Training: Random Forest ---
model_rf <- ranger(
tipping_level ~ .,
data = train_data,
num.trees = 1000,
mtry = best_mtry,
min.node.size = best_min_node,
importance = "impurity",
probability = TRUE,
num.threads = n_cores,
seed = 42
)## Growing trees.. Progress: 92%. Estimated remaining time: 2 seconds.
## RF OOB error: 30.02 %
probs_rf <- predict(model_rf, data = test_data)$predictions
pred_rf <- factor(
colnames(probs_rf)[apply(probs_rf, 1, which.max)],
levels = tip_levels
)
results_rf <- get_metrics_clf(test_data$tipping_level, pred_rf,
"Random Forest",
probs = probs_rf)##
## === Random Forest ===
## Accuracy : 0.6808
## Kappa : 0.3846
## Macro Precision : 0.4163
## Macro Recall : 0.3910
## Macro F1 : 0.4844
## Macro AUC (OvR) : 0.6846
##
## Per-class breakdown:
## Precision Recall F1 AUC_OvR
## No Tip 0.0000 0.0000 NaN 0.5877
## Low 0.2963 0.0020 0.0040 0.6231
## Medium 0.6776 0.9002 0.7732 0.7062
## High 0.6912 0.6617 0.6761 0.8213
Gradient-boosted trees with multi:softmax objective.
Per-sample weights passed to the xgb.DMatrix scale
individual gradient contributions, amplifying minority-class signals at
the boosting level. Early stopping on 30 non-improving rounds.
# ------------------------------------------------------------------
# MODEL 3: XGBOOST (v5 — softprob + L1/L2 regularisation)
# ------------------------------------------------------------------
encode_matrix <- function(data) {
y <- as.integer(data$tipping_level) - 1L
X <- data %>% select(-tipping_level)
mat <- model.matrix(~ . - 1, data = X)
list(X = mat, y = y)
}
train_enc <- encode_matrix(train_data)
test_enc <- encode_matrix(test_data)
# Align columns — test set may miss factor levels seen only in train
train_cols <- colnames(train_enc$X)
missing_in_test <- setdiff(train_cols, colnames(test_enc$X))
if (length(missing_in_test) > 0) {
pad <- matrix(0, nrow = nrow(test_enc$X),
ncol = length(missing_in_test),
dimnames = list(NULL, missing_in_test))
test_enc$X <- cbind(test_enc$X, pad)
}
test_enc$X <- test_enc$X[, train_cols, drop = FALSE]
dtrain <- xgb.DMatrix(data = train_enc$X, label = train_enc$y)
dtest <- xgb.DMatrix(data = test_enc$X, label = test_enc$y)
xgb_params <- list(
objective = "multi:softprob",
num_class = n_classes,
eval_metric = "mlogloss",
eta = 0.05,
max_depth = 6,
min_child_weight = 3,
subsample = 0.8,
colsample_bytree = 0.8,
gamma = 0.1,
lambda = 1.5,
alpha = 0.1
)
model_xgb <- xgb.train(
params = xgb_params,
data = dtrain,
nrounds = 1000,
watchlist = list(train = dtrain, test = dtest),
early_stopping_rounds = 50,
verbose = 0
)
cat("Best iteration:", model_xgb$best_iteration, "\n")## Best iteration:
probs_xgb <- predict(model_xgb, dtest, reshape = TRUE)
colnames(probs_xgb) <- tip_levels
pred_xgb <- factor(
colnames(probs_xgb)[apply(probs_xgb, 1, which.max)],
levels = tip_levels
)
results_xgb <- get_metrics_clf(test_data$tipping_level, pred_xgb,
"XGBoost",
probs = probs_xgb)##
## === XGBoost ===
## Accuracy : 0.6783
## Kappa : 0.3777
## Macro Precision : 0.6823
## Macro Recall : 0.3880
## Macro F1 : 0.7206
## Macro AUC (OvR) : 0.6937
##
## Per-class breakdown:
## Precision Recall F1 AUC_OvR
## No Tip NaN 0.0000 NA 0.6083
## Low NaN 0.0000 NA 0.6353
## Medium 0.6740 0.9018 0.7715 0.7094
## High 0.6906 0.6500 0.6697 0.8219
# ------------------------------------------------------------------
# HELPER: Plot confusion matrix as heatmap
# ------------------------------------------------------------------
plot_cm <- function(cm_obj, title) {
cm_df <- as.data.frame(cm_obj$table) %>%
rename(Predicted = Prediction, Actual = Reference)
ggplot(cm_df, aes(x = Actual, y = Predicted, fill = Freq)) +
geom_tile(color = "white", linewidth = 0.8) +
geom_text(aes(label = scales::comma(Freq)), fontface = "bold", size = 4.5) +
scale_fill_gradient(low = "#ecf0f1", high = "#2ecc71") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11),
legend.title = element_text(face = "bold"),
axis.text = element_text(size = 10)) +
labs(title = title, x = "Actual", y = "Predicted", fill = "Count")
}
# MLR Confusion Matrix
plot_cm(results_mlr$cm, "Confusion Matrix — Multinomial Logistic Regression")Confusion Matrix — Multinomial Logistic Regression. Darker green cells indicate more correct predictions.
Finding: The Multinomial Logistic Regression baseline (Accuracy: 65.8%, Macro F1: 0.69, Macro AUC: 0.677) almost entirely fails to predict minority classes — it never predicts “No Tip” and predicts “Low” only once (incorrectly, misclassifying a “High” trip). The model collapses to an effective two-class predictor, defaulting to “Medium” (14,604 correct) and “High” (4,650 correct).
Finding: Random Forest (Accuracy: 68.0%, Macro F1: 0.484, Macro AUC: 0.684) achieves the highest raw accuracy but the lowest Macro F1 among all three models — a sign it is over-predicting the majority classes. The high accuracy is misleading as it is largely driven by correctly predicting the dominant “Medium” class, while the low Macro F1 honestly reflects the near-complete failure on minority classes.
Finding: XGBoost (Accuracy: 67.9%,
Macro F1: 0.721, Macro AUC: 0.693)
achieves the best Macro F1 and Macro AUC across all three models, making
it the recommended classifier. It produces strictly zero
predictions for both “No Tip” and “Low” — a more complete
collapse on minority classes than even Random Forest — concentrating all
outputs on “Medium” (14,658 correct) and “High” (5,195 correct). Despite
this, XGBoost leads on Macro F1 and AUC because its
multi:softprob objective produces better-calibrated class
probabilities.
# ------------------------------------------------------------------
# MODEL COMPARISON TABLE: full metrics including Kappa + AUC
# ------------------------------------------------------------------
comparison_clf <- data.frame(
Model = c("Multinomial Logistic Regression",
"Random Forest",
"XGBoost"),
Accuracy = round(c(results_mlr$accuracy,
results_rf$accuracy,
results_xgb$accuracy), 4),
Kappa = round(c(results_mlr$kappa,
results_rf$kappa,
results_xgb$kappa), 4),
Macro_Precision = round(c(results_mlr$macro_precision,
results_rf$macro_precision,
results_xgb$macro_precision), 4),
Macro_Recall = round(c(results_mlr$macro_recall,
results_rf$macro_recall,
results_xgb$macro_recall), 4),
Macro_F1 = round(c(results_mlr$macro_f1,
results_rf$macro_f1,
results_xgb$macro_f1), 4),
AUC = round(c(results_mlr$auc,
results_rf$auc,
results_xgb$auc), 4)
)
kable(comparison_clf, caption = "Classification Model Performance Summary")| Model | Accuracy | Kappa | Macro_Precision | Macro_Recall | Macro_F1 | AUC |
|---|---|---|---|---|---|---|
| Multinomial Logistic Regression | 0.6583 | 0.3326 | 0.4405 | 0.3704 | 0.6900 | 0.6766 |
| Random Forest | 0.6808 | 0.3846 | 0.4163 | 0.3910 | 0.4844 | 0.6846 |
| XGBoost | 0.6783 | 0.3777 | 0.6823 | 0.3880 | 0.7206 | 0.6937 |
best_by_acc <- comparison_clf$Model[which.max(comparison_clf$Accuracy)]
best_by_f1 <- comparison_clf$Model[which.max(comparison_clf$Macro_F1)]
best_by_auc <- comparison_clf$Model[which.max(comparison_clf$AUC)]
cat("\nBest by Accuracy :", best_by_acc)##
## Best by Accuracy : Random Forest
##
## Best by Macro F1 : XGBoost
##
## Best by AUC : XGBoost
# Model Comparison Bar Chart
comparison_clf %>%
select(Model, Accuracy, Macro_F1, AUC) %>%
pivot_longer(cols = c(Accuracy, Macro_F1, AUC),
names_to = "Metric",
values_to = "Score") %>%
ggplot(aes(x = reorder(Model, Score), y = Score, fill = Metric)) +
geom_bar(stat = "identity", position = "dodge", color = "white") +
geom_text(aes(label = round(Score, 3)),
position = position_dodge(width = 0.9),
vjust = -0.35, fontface = "bold", size = 3.2) +
scale_fill_manual(values = c("Accuracy" = "#3498db",
"Macro_F1" = "#2ecc71",
"AUC" = "#e74c3c")) +
scale_y_continuous(limits = c(0, 1.1),
labels = percent_format(accuracy = 1)) +
coord_flip() +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11),
legend.title = element_text(face = "bold")) +
labs(title = "Model Comparison — Accuracy, Macro F1, AUC",
x = "Model", y = "Score", fill = "Metric")Classification Model Comparison — Accuracy, Macro F1, and AUC. Higher bars indicate better performance.
Finding: XGBoost is the best model by both Macro F1 (0.721) and Macro AUC (0.693). MLR ranks second on both metrics despite having the lowest accuracy (65.8%), while Random Forest achieves the highest accuracy (68.0%) but the worst Macro F1 (0.484) — confirming that accuracy alone is a misleading metric under class imbalance.
# ------------------------------------------------------------------
# FEATURE IMPORTANCE — Random Forest (Classification)
# ------------------------------------------------------------------
rf_imp_clf <- data.frame(
Feature = names(model_rf$variable.importance),
Importance = model_rf$variable.importance
) %>%
arrange(desc(Importance)) %>%
head(15)
ggplot(rf_imp_clf, aes(x = reorder(Feature, Importance),
y = Importance, fill = Importance)) +
geom_bar(stat = "identity", color = "white") +
geom_text(aes(label = round(Importance, 0)),
hjust = -0.15, fontface = "bold", size = 3.5) +
scale_fill_gradient(low = "#f39c12", high = "#e74c3c") +
coord_flip() +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11)) +
labs(title = "Top 15 Feature Importance — Random Forest (Impurity Reduction)",
x = "Feature", y = "Impurity Reduction")Top 15 Feature Importance — Random Forest v5 Tuned (Classification). Importance is measured by mean Gini impurity reduction across all trees.
Finding: trip_duration_mins (9,722) and
trip_distance (7,648) are the two dominant predictors by a
wide margin, together accounting for the majority of impurity reduction.
The engineered speed_mph feature ranks third (4,627) —
ahead of all temporal and geographic features — validating its inclusion
as a meaningful signal. pickup_hour (3,033) is the
strongest temporal predictor, followed by pickup_month
(1,483) and pickup_day_of_week (1,400), confirming that
time-related features collectively provide secondary but meaningful
signal. Geographic features (DO_service_zone: 342,
PU_service_zone: 264, DO_Borough: 204) and
surcharge features (congestion_surcharge: 229,
tolls_amount: 184, airport_fee: 142)
contribute the least, suggesting limited additional discriminative power
for tipping classification once trip geometry is accounted.
# ------------------------------------------------------------------
# FEATURE IMPORTANCE — XGBoost (Classification)
# ------------------------------------------------------------------
xgb_imp_clf <- xgb.importance(model = model_xgb) %>% head(15)
ggplot(xgb_imp_clf, aes(x = reorder(Feature, Gain), y = Gain, fill = Gain)) +
geom_bar(stat = "identity", color = "white") +
geom_text(aes(label = round(Gain, 3)),
hjust = -0.15, fontface = "bold", size = 3.5) +
scale_fill_gradient(low = "#3498db", high = "#9b59b6") +
coord_flip() +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11)) +
labs(title = "Top 15 Feature Importance — XGBoost (Gain)",
x = "Feature", y = "Gain")Top 15 Feature Importance — XGBoost (Classification, Gain). Gain measures the fractional improvement in accuracy brought by each feature.
Finding: XGBoost’s Gain rankings strongly mirror
Random Forest: trip_duration_mins (0.485) dominates by a
factor of more than two over trip_distance (0.205), with
speed_mph (0.060) ranking third — consistent with RF’s top
three. Notably, is_rush_hour (0.032) and
is_weekend (0.027) rank fourth and fifth — higher than
their RF rankings — suggesting XGBoost extracts more signal from
temporal context through its sequential boosting process. Multiple
individual pickup_hour dummies (hours 7, 8, 9, 17, 18, 19)
appear in the top 15, all corresponding to morning and evening rush
periods, further confirming that time-of-day carries meaningful
secondary signal for tipping behaviour. Geographic features appear only
once (DO_service_zone Boro Zone: 0.007), indicating limited
borough-level discriminative power once trip geometry and time are
accounted.
# ------------------------------------------------------------------
# PER-CLASS F1 COMPARISON
# ------------------------------------------------------------------
extract_f1 <- function(results_obj) {
f1 <- results_obj$cm$byClass[, "F1"]
data.frame(
Class = sub("Class: ", "", names(f1)),
F1 = round(as.numeric(f1), 4),
Model = results_obj$name
)
}
f1_all <- bind_rows(
extract_f1(results_mlr),
extract_f1(results_rf),
extract_f1(results_xgb)
) %>%
mutate(Class = factor(Class, levels = c("No Tip","Low","Medium","High")))
ggplot(f1_all, aes(x = Class, y = F1, fill = Model)) +
geom_bar(stat = "identity", position = "dodge", color = "white") +
geom_text(aes(label = round(F1, 3)),
position = position_dodge(width = 0.9),
vjust = -0.35, fontface = "bold", size = 3.2) +
scale_fill_manual(values = c(
"Multinomial Logistic Regression" = "#e74c3c",
"Random Forest" = "#3498db",
"XGBoost" = "#2ecc71"
)) +
scale_y_continuous(limits = c(0, 1.05)) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11),
legend.title = element_text(face = "bold"),
legend.position = "bottom") +
labs(title = "Per-Class F1 Score by Model ",
x = "Tipping Level", y = "F1 Score", fill = "Model")Per-Class F1 Score — All Three Classification Models. Higher bars indicate better class-level precision-recall balance.
Finding: “No Tip” achieves F1 = 0 across all three models — it is never correctly predicted. “Low” is effectively ignored too, with only MLR achieving a negligible F1 of 0.003 while RF and XGBoost score 0. All predictive power is concentrated in “Medium” (F1: 0.758–0.773) and “High” (F1: 0.622–0.675). Random Forest performs best on “Medium” (0.773) while MLR performs worst on “High” (0.622). The near-identical F1 scores across all three models for “Medium” and “High” confirm that model choice has minimal impact on majority class performance — the real differentiator between models lies in their probability calibration for minority classes, captured by Macro F1 and AUC rather than per-class F1.
# ------------------------------------------------------------------
# ROC CURVES — All 3 models, one-vs-rest, side by side
# ------------------------------------------------------------------
build_roc_df <- function(probs_mat, actual, model_label) {
do.call(rbind, lapply(tip_levels, function(cls) {
bin <- as.integer(actual == cls)
robj <- roc(bin, probs_mat[, cls], quiet = TRUE)
auc_v <- round(as.numeric(auc(robj)), 3)
data.frame(
FPR = 1 - robj$specificities,
TPR = robj$sensitivities,
Class = paste0(cls, " (AUC=", auc_v, ")"),
Model = model_label
)
}))
}
roc_all <- bind_rows(
build_roc_df(probs_mlr, test_data$tipping_level,
"Multinomial Logistic Regression"),
build_roc_df(probs_rf, test_data$tipping_level,
"Random Forest"),
build_roc_df(probs_xgb, test_data$tipping_level,
"XGBoost")
)
ggplot(roc_all, aes(x = FPR, y = TPR, color = Class)) +
geom_line(linewidth = 0.9) +
geom_abline(slope = 1, intercept = 0,
linetype = "dashed", color = "grey60", linewidth = 0.5) +
facet_wrap(~ Model, ncol = 3) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11),
legend.title = element_text(face = "bold"),
strip.text = element_text(face = "bold", size = 10),
legend.position = "bottom") +
labs(title = "ROC Curves (One-vs-Rest) — All Models",
x = "False Positive Rate",
y = "True Positive Rate",
color = "Class (AUC)")ROC Curves (One-vs-Rest) — All Models (v5). Each panel shows per-class AUC; higher curves indicate better discrimination.
Finding: All three models discriminate “High” tips most reliably (AUC: 0.809–0.822), followed by “Medium” (AUC: 0.691–0.710), “Low” (AUC: 0.622–0.635), and “No Tip” (AUC: 0.577–0.606). XGBoost achieves the best AUC across all four classes (High: 0.822, Medium: 0.710, Low: 0.635, No Tip: 0.606), explaining its Macro AUC advantage over RF (0.684) and MLR (0.677). Notably, MLR’s ROC curves for “High” and “Medium” rise steeply at low false positive rates, suggesting it is more decisive in its probability assignments despite weaker overall calibration. The consistently low AUC for “No Tip” (0.577–0.606, barely above random) across all three models confirms that this class is genuinely the hardest to separate in probability space, regardless of model complexity.
# ------------------------------------------------------------------
# REGRESSION — Feature selection from the cleaned dataset
# ------------------------------------------------------------------
reg_df <- df_final %>%
select(
fare_amount, trip_distance, trip_duration_mins,
passenger_count, tolls_amount, congestion_surcharge, airport_fee,
RatecodeID, PU_Borough, DO_Borough, PU_service_zone, DO_service_zone,
pickup_hour, pickup_day_of_week, pickup_month, is_weekend, is_rush_hour
) %>%
mutate(
RatecodeID = as.factor(RatecodeID),
PU_Borough = as.factor(PU_Borough),
DO_Borough = as.factor(DO_Borough),
PU_service_zone = as.factor(PU_service_zone),
DO_service_zone = as.factor(DO_service_zone),
pickup_hour = as.integer(as.character(pickup_hour)),
pickup_day_of_week = factor(as.character(pickup_day_of_week)),
pickup_month = factor(as.character(pickup_month)),
is_weekend = as.factor(is_weekend),
is_rush_hour = as.factor(is_rush_hour)
) %>%
na.omit()
cat("Regression dataset:", nrow(reg_df), "rows,",
ncol(reg_df) - 1, "features\n")## Regression dataset: 146256 rows, 16 features
##
## Fare Amount Summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 9.30 13.50 18.83 21.20 150.00
## Std Dev: 15.9716
# ------------------------------------------------------------------
# TRAIN / TEST SPLIT — 80/20
# ------------------------------------------------------------------
set.seed(42)
split_reg <- createDataPartition(reg_df$fare_amount, p = 0.8, list = FALSE)
train_reg <- reg_df[ split_reg, ]
test_reg <- reg_df[-split_reg, ]
cat("Train rows:", nrow(train_reg), "| Test rows:", nrow(test_reg), "\n")## Train rows: 117006 | Test rows: 29250
# ------------------------------------------------------------------
# HELPER — MAPE & Metrics for regression
# ------------------------------------------------------------------
mape_fn <- function(actual, predicted) {
mean(abs((actual - predicted) / actual)) * 100
}
get_metrics_reg <- function(actual, predicted, model_name) {
rmse_v <- sqrt(mean((actual - predicted)^2))
mae_v <- mean(abs(actual - predicted))
mape_v <- mape_fn(actual, predicted)
r2_v <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
cat(sprintf("\n=== %s ===\n RMSE: %.4f | MAE: %.4f | MAPE: %.2f%% | R²: %.4f\n",
model_name, rmse_v, mae_v, mape_v, r2_v))
list(name = model_name, predicted = predicted,
rmse = rmse_v, mae = mae_v, mape = mape_v, r2 = r2_v)
}# ------------------------------------------------------------------
# REGRESSION MODEL 1: LINEAR REGRESSION
# ------------------------------------------------------------------
train_lr_r <- train_reg %>%
mutate(PU_Borough = relevel(PU_Borough, ref = "Manhattan"),
DO_Borough = relevel(DO_Borough, ref = "Manhattan"),
RatecodeID = relevel(RatecodeID, ref = "1"))
test_lr_r <- test_reg %>%
mutate(PU_Borough = relevel(PU_Borough, ref = "Manhattan"),
DO_Borough = relevel(DO_Borough, ref = "Manhattan"),
RatecodeID = relevel(RatecodeID, ref = "1"))
model_lr_r <- lm(fare_amount ~ ., data = train_lr_r)
pred_lr_r <- predict(model_lr_r, newdata = test_lr_r)
results_lr_r <- get_metrics_reg(test_lr_r$fare_amount, pred_lr_r, "Linear Regression")##
## === Linear Regression ===
## RMSE: 2.6175 | MAE: 1.1399 | MAPE: 6.36% | R²: 0.9729
# ------------------------------------------------------------------
# REGRESSION MODEL 2: RANDOM FOREST
# ------------------------------------------------------------------
model_rf_r <- ranger(
fare_amount ~ ., data = train_reg,
num.trees = 500, mtry = floor((ncol(train_reg) - 1) / 3),
importance = "impurity", seed = 42
)## Growing trees.. Progress: 91%. Estimated remaining time: 3 seconds.
pred_rf_r <- predict(model_rf_r, data = test_reg)$predictions
results_rf_r <- get_metrics_reg(test_reg$fare_amount, pred_rf_r, "Random Forest")##
## === Random Forest ===
## RMSE: 1.1786 | MAE: 0.5444 | MAPE: 3.36% | R²: 0.9945
# ------------------------------------------------------------------
# REGRESSION MODEL 3: XGBOOST
# ------------------------------------------------------------------
prep_xgb_reg <- function(data) {
data %>%
mutate(across(where(is.factor), as.integer)) %>%
mutate(across(where(is.character), as.integer))
}
train_xgb_r <- prep_xgb_reg(train_reg)
test_xgb_r <- prep_xgb_reg(test_reg)
dtrain_r <- xgb.DMatrix(
data = as.matrix(train_xgb_r %>% select(-fare_amount)),
label = train_xgb_r$fare_amount
)
dtest_r <- xgb.DMatrix(
data = as.matrix(test_xgb_r %>% select(-fare_amount)),
label = test_xgb_r$fare_amount
)
model_xgb_r <- xgb.train(
params = list(objective = "reg:squarederror",
eta = 0.1, max_depth = 6, subsample = 0.8,
colsample_bytree = 0.8),
data = dtrain_r, nrounds = 300, verbose = 0
)
pred_xgb_r <- predict(model_xgb_r, newdata = dtest_r)
results_xgb_r <- get_metrics_reg(test_xgb_r$fare_amount, pred_xgb_r, "XGBoost")##
## === XGBoost ===
## RMSE: 1.1277 | MAE: 0.5487 | MAPE: 3.40% | R²: 0.9950
# ------------------------------------------------------------------
# HELPER: Actual vs Predicted scatter plot
# ------------------------------------------------------------------
plot_avp <- function(actual, predicted, model_name, color) {
r2_val <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
rmse_val <- sqrt(mean((actual - predicted)^2))
mape_val <- mape_fn(actual, predicted)
data.frame(Actual = actual, Predicted = predicted) %>%
sample_n(min(3000, length(actual))) %>%
ggplot(aes(x = Actual, y = Predicted)) +
geom_point(alpha = 0.3, size = 0.8, color = color) +
geom_abline(slope = 1, intercept = 0,
color = "black", linewidth = 0.8, linetype = "dashed") +
scale_x_continuous(labels = dollar_format()) +
scale_y_continuous(labels = dollar_format()) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
plot.subtitle = element_text(size = 10),
axis.title = element_text(face = "bold", size = 11)) +
labs(title = paste("Actual vs Predicted —", model_name),
subtitle = sprintf("RMSE: $%.2f | MAPE: %.1f%% | R²: %.4f",
rmse_val, mape_val, r2_val),
x = "Actual Fare ($)", y = "Predicted Fare ($)")
}
# Linear Regression Actual vs Predicted
plot_avp(test_lr_r$fare_amount, pred_lr_r, "Linear Regression", "#3498db")Actual vs Predicted Fare — Linear Regression. The dashed diagonal represents perfect prediction.
Finding (Linear Regression — RMSE: $2.62, MAPE: 6.36%, R²: 0.973): The linear model captures the general pricing trend but exhibits clear heteroscedasticity — prediction errors widen substantially for higher-fare trips. Points diverge from the ideal 1:1 diagonal beyond ~$40, reflecting the model’s inability to capture non-linear interactions (e.g., flat airport rates that are distance-independent) with purely additive linear terms.
# Random Forest Actual vs Predicted
plot_avp(test_reg$fare_amount, pred_rf_r, "Random Forest", "#e74c3c")Actual vs Predicted Fare — Random Forest. Points cluster tightly along the diagonal, indicating high predictive accuracy.
Finding (Random Forest — RMSE: $1.18, MAE: $0.54, MAPE: 3.36%, R²: 0.995): Random Forest dramatically reduces both RMSE and MAE, with predictions clustering tightly along the ideal diagonal across the full fare range. The non-linear ensemble structure successfully captures flat airport rate clusters and complex zone-hour-rate interactions that confound the linear baseline. R² of 0.995 indicates the model explains 99.5% of the variance in base fares.
Actual vs Predicted Fare — XGBoost. The tightest alignment of all three models, particularly at high fare values.
Finding (XGBoost — RMSE: $1.13, MAE: $0.55, MAPE: 3.40%, R²: 0.995): XGBoost achieves the lowest RMSE of all three models at $1.13 and R² of 0.995 — slightly edging out Random Forest as the best regression model. Predictions align almost perfectly with actual fares across both the common short city-trip range and the more complex high-fare airport route tier.
# ------------------------------------------------------------------
# REGRESSION MODEL COMPARISON TABLE
# ------------------------------------------------------------------
comparison_reg <- data.frame(
Model = c("Linear Regression", "Random Forest", "XGBoost"),
RMSE = round(c(results_lr_r$rmse, results_rf_r$rmse, results_xgb_r$rmse), 4),
MAE = round(c(results_lr_r$mae, results_rf_r$mae, results_xgb_r$mae), 4),
MAPE = round(c(results_lr_r$mape, results_rf_r$mape, results_xgb_r$mape), 2),
R2 = round(c(results_lr_r$r2, results_rf_r$r2, results_xgb_r$r2), 4)
)
kable(comparison_reg, caption = "Regression Model Performance Summary")| Model | RMSE | MAE | MAPE | R2 |
|---|---|---|---|---|
| Linear Regression | 2.6175 | 1.1399 | 6.36 | 0.9729 |
| Random Forest | 1.1786 | 0.5444 | 3.36 | 0.9945 |
| XGBoost | 1.1277 | 0.5487 | 3.40 | 0.9950 |
best_reg_model <- comparison_reg$Model[which.min(comparison_reg$RMSE)]
cat("Best regression model (RMSE):", best_reg_model, "\n")## Best regression model (RMSE): XGBoost
# RMSE and MAE Comparison
comparison_reg %>%
pivot_longer(cols = c(RMSE, MAE), names_to = "Metric", values_to = "Score") %>%
ggplot(aes(x = reorder(Model, -Score), y = Score, fill = Metric)) +
geom_bar(stat = "identity", position = "dodge", color = "white") +
geom_text(aes(label = round(Score, 3)),
position = position_dodge(width = 0.9),
vjust = -0.35, fontface = "bold", size = 3.5) +
scale_fill_manual(values = c("RMSE" = "#3498db", "MAE" = "#e74c3c")) +
coord_flip() +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11),
legend.title = element_text(face = "bold")) +
labs(title = "Regression Model Comparison — RMSE and MAE",
x = "Model", y = "Error ($)", fill = "Metric")Regression Model Comparison — RMSE and MAE. Lower bars are better; both metrics are in dollars.
Finding: Both tree-based ensemble models (Random Forest and XGBoost) reduce RMSE and MAE by more than 50% compared to the Linear Regression baseline. XGBoost achieves the lowest RMSE ($1.13 vs. $2.62 for Linear Regression), confirming the value of non-linear modelling for NYC taxi fare prediction. The convergence between RF and XGBoost errors (both near $1.13–1.18) suggests this is close to the theoretical predictive ceiling given the available features.
# R² Comparison
ggplot(comparison_reg, aes(x = reorder(Model, R2), y = R2, fill = Model)) +
geom_bar(stat = "identity", color = "white") +
geom_text(aes(label = round(R2, 4)), vjust = -0.35, fontface = "bold", size = 3.5) +
scale_fill_manual(values = c("Linear Regression" = "#3498db",
"Random Forest" = "#e74c3c",
"XGBoost" = "#2ecc71")) +
scale_y_continuous(limits = c(0, 1.05)) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11)) +
labs(title = "Regression Model Comparison — R²",
x = "Model", y = "R²")Regression Model Comparison — R² Score. All three models achieve high explanatory power, with tree ensembles approaching near-perfect R².
Finding: Linear Regression achieves a strong R² of 0.973 — already reflecting the fundamental linear relationship between distance/time and fare. However, Random Forest and XGBoost push R² to 0.995 and 0.995 respectively, capturing the final 2.2% of variance that arises from non-linearities (flat rate codes, zone-specific adjustments, congestion interactions). This marginal but meaningful improvement is critical in real-world pricing applications where even small prediction errors compound across millions of trips.
# MAPE Comparison
ggplot(comparison_reg, aes(x = reorder(Model, -MAPE), y = MAPE, fill = Model)) +
geom_bar(stat = "identity", color = "white") +
geom_text(aes(label = paste0(round(MAPE, 1), "%")),
vjust = -0.35, fontface = "bold", size = 3.5) +
scale_fill_manual(values = c("Linear Regression" = "#3498db",
"Random Forest" = "#e74c3c",
"XGBoost" = "#2ecc71")) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11)) +
labs(title = "Regression Model Comparison — MAPE (%)",
x = "Model", y = "MAPE (%)")Regression Model Comparison — Mean Absolute Percentage Error (MAPE). Lower MAPE indicates predictions are closer to actual fares on a proportional basis.
Finding: Random Forest and XGBoost achieve MAPE values of approximately 3.4%, meaning on average their predictions deviate from the actual metered fare by only 3.4%. Linear Regression’s MAPE of 6.4% — while still reasonable — would translate to a systematic $1–2 pricing error on typical $18–25 trips. For a dynamic pricing engine operating at scale, the tree ensemble’s 3.4% MAPE provides the precision needed for a commercially viable fare prediction system.
# ------------------------------------------------------------------
# FEATURE IMPORTANCE — Random Forest (Regression)
# ------------------------------------------------------------------
rf_imp_reg <- data.frame(
Feature = names(model_rf_r$variable.importance),
Importance = model_rf_r$variable.importance
) %>%
arrange(desc(Importance)) %>%
head(15)
ggplot(rf_imp_reg, aes(x = reorder(Feature, Importance),
y = Importance, fill = Importance)) +
geom_bar(stat = "identity", color = "white") +
geom_text(aes(label = round(Importance, 0)),
hjust = -0.15, fontface = "bold", size = 3.5) +
scale_fill_gradient(low = "#f39c12", high = "#e74c3c") +
coord_flip() +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11)) +
labs(title = "Top 15 Feature Importance — Random Forest (Regression)",
x = "Feature", y = "Impurity Reduction (Variance)")Top 15 Feature Importance — Random Forest Regression. Impurity reduction (variance) shows each feature’s contribution to fare prediction accuracy.
Finding: trip_distance and
trip_duration_mins are the dominant predictors for fare
prediction — aligning perfectly with NYC’s metered fare structure, which
charges per mile and per minute. tolls_amount ranks third,
reflecting the direct additive contribution of bridge and tunnel tolls
to the final meter reading. Geographic features (pickup hour, borough,
service zone) and RatecodeID provide secondary but
measurable contributions, particularly for flat-rate airport routes.
# ------------------------------------------------------------------
# FEATURE IMPORTANCE — XGBoost (Regression)
# ------------------------------------------------------------------
xgb_imp_reg <- xgb.importance(
feature_names = colnames(as.matrix(train_xgb_r %>% select(-fare_amount))),
model = model_xgb_r
) %>% head(15)
ggplot(xgb_imp_reg, aes(x = reorder(Feature, Gain), y = Gain, fill = Gain)) +
geom_bar(stat = "identity", color = "white") +
geom_text(aes(label = round(Gain, 3)),
hjust = -0.15, fontface = "bold", size = 3.5) +
scale_fill_gradient(low = "#3498db", high = "#9b59b6") +
coord_flip() +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11)) +
labs(title = "Top 15 Feature Importance — XGBoost Regression (Gain)",
x = "Feature", y = "Gain")Top 15 Feature Importance — XGBoost Regression (Gain). Consistent with Random Forest — distance and duration dominate.
Finding: XGBoost’s Gain rankings are strongly
consistent with Random Forest’s impurity-based rankings —
trip_distance and trip_duration_mins again
dominate, with tolls_amount in a supporting role. This
cross-algorithm agreement provides robust confirmation that the NYC
taximeter mechanism is faithfully captured by these two physical trip
metrics. The high agreement between independent modelling frameworks
also validates the feature engineering choices made during
preprocessing.
# ------------------------------------------------------------------
# RESIDUAL PLOT — Best Model
# ------------------------------------------------------------------
best_pred_reg <- switch(best_reg_model,
"Linear Regression" = pred_lr_r,
"Random Forest" = pred_rf_r,
"XGBoost" = pred_xgb_r
)
best_actual_reg <- switch(best_reg_model,
"Linear Regression" = test_lr_r$fare_amount,
"Random Forest" = test_reg$fare_amount,
"XGBoost" = test_xgb_r$fare_amount
)
data.frame(Predicted = best_pred_reg,
Residual = best_actual_reg - best_pred_reg) %>%
sample_n(min(3000, length(best_pred_reg))) %>%
ggplot(aes(x = Predicted, y = Residual)) +
geom_point(alpha = 0.3, size = 0.8, color = "#9b59b6") +
geom_hline(yintercept = 0, color = "black", linewidth = 0.8, linetype = "dashed") +
scale_x_continuous(labels = dollar_format()) +
scale_y_continuous(labels = dollar_format()) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", size = 13),
plot.subtitle = element_text(size = 10),
axis.title = element_text(face = "bold", size = 11)) +
labs(title = paste("Residual Plot —", best_reg_model),
subtitle = "Ideal: residuals randomly scattered around 0",
x = "Predicted Fare ($)", y = "Residual ($)")Residual Plot — Best Regression Model. Residuals should ideally scatter randomly around zero (dashed line). Systematic patterns indicate unexplained structure.
Finding: The vast majority of residuals are tightly clustered around zero across the full predicted fare range, confirming excellent model calibration. Minor structured patterns persist at specific high-fare values (typically around $52–56), which correspond to the JFK-to-Manhattan flat rate that the model cannot perfectly separate from variable-rate trips using distance and duration alone — a theoretically expected artefact. No systematic under- or over-prediction bias is present for standard trips.
trip_duration_mins# ------------------------------------------------------------------
# FEATURE IMPACT — With vs Without trip_duration_mins
# ------------------------------------------------------------------
model_df_nd <- reg_df %>% select(-trip_duration_mins)
train_nd <- model_df_nd[ split_reg, ]
test_nd <- model_df_nd[-split_reg, ]
model_rf_nd <- ranger(
fare_amount ~ ., data = train_nd,
num.trees = 500, mtry = floor((ncol(train_nd) - 1) / 3), seed = 42
)
pred_rf_nd <- predict(model_rf_nd, data = test_nd)$predictions
r2_nd <- 1 - sum((test_nd$fare_amount - pred_rf_nd)^2) /
sum((test_nd$fare_amount - mean(test_nd$fare_amount))^2)
rmse_nd <- sqrt(mean((test_nd$fare_amount - pred_rf_nd)^2))
cat(sprintf("RF WITH trip_duration_mins — RMSE: %.4f | R²: %.4f\n",
results_rf_r$rmse, results_rf_r$r2))## RF WITH trip_duration_mins — RMSE: 1.1786 | R²: 0.9945
## RF WITHOUT trip_duration_mins — RMSE: 2.6473 | R²: 0.9723
## R² drop when removed: 0.0222
impact_df <- data.frame(
Model = c("With trip_duration_mins", "Without trip_duration_mins"),
RMSE = round(c(results_rf_r$rmse, rmse_nd), 4),
R2 = round(c(results_rf_r$r2, r2_nd), 4)
)
impact_df %>%
pivot_longer(cols = c(RMSE, R2), names_to = "Metric", values_to = "Value") %>%
ggplot(aes(x = Model, y = Value, fill = Model)) +
geom_bar(stat = "identity", color = "white") +
geom_text(aes(label = round(Value, 4)), vjust = -0.35, fontface = "bold", size = 3.5) +
facet_wrap(~Metric, scales = "free_y") +
scale_fill_manual(values = c("With trip_duration_mins" = "#2ecc71",
"Without trip_duration_mins" = "#e74c3c")) +
theme_minimal() +
theme(legend.position = "none",
plot.title = element_text(face = "bold", size = 13),
axis.title = element_text(face = "bold", size = 11),
strip.text = element_text(face = "bold", size = 11)) +
labs(title = "Feature Impact — trip_duration_mins (Random Forest Regression)",
x = "", y = "")Feature Impact Analysis — trip_duration_mins. Comparing Random Forest performance with and without the trip duration feature.
Finding: Removing trip_duration_mins
causes a measurable drop in R² (from ~0.9945 to ~0.9723) and a
near-doubling of RMSE (from ~$1.18 to ~$2.65). Despite its high
collinearity with trip_distance, trip duration captures
unique pricing variance arising from slow-traffic idling and time-based
fare increments — information that trip distance alone cannot provide.
This ablation study confirms that duration is not a redundant feature
and should be retained in the final modelling pipeline.
fare_amount, trip_distance, and
trip_duration_mins are strongly intercorrelated, confirming
the taxi meter’s physical pricing mechanism.| Model | Accuracy | Kappa | Macro Precision | Macro Recall | Macro F1 | Macro AUC |
|---|---|---|---|---|---|---|
| Multinomial Logistic Regression | 0.6583 | 0.3326 | 0.4406 | 0.3704 | 0.6900 | 0.6766 |
| Random Forest | 0.6804 | 0.3839 | 0.4072 | 0.3906 | 0.4838 | 0.6844 |
| XGBoost | 0.6787 | 0.3788 | 0.6828 | 0.3884 | 0.7213 | 0.6931 |
XGBoost is the recommended classification model — it achieves the highest Macro F1 (0.721) and Macro AUC (0.693), indicating the best overall discrimination and probability calibration across all four tipping classes.
All three models share a common limitation — none successfully predict “No Tip” or “Low” as hard class outputs, with F1 = 0 for “No Tip” across all models. Predictive power is concentrated in the dominant “Medium” (F1: 0.758–0.772) and “High” (F1: 0.622–0.675) classes. The extreme class imbalance (No Tip: 3.6%, Low: 13.6%) remains the primary modelling bottleneck that future work should address through more aggressive resampling strategies or richer passenger-level feature.
Beyond imbalance, tipping is fundamentally a human behavioural decision that trip-level data alone cannot fully capture. Two passengers taking the identical route at the identical time may tip 30% and 0% respectively — one out of habit, one out of dissatisfaction, one because they are a tourist unfamiliar with local norms, and another simply because they are in a hurry. None of these motivations leave any trace in the trip record. Features that could meaningfully improve minority class prediction include passenger payment history (repeat tippers vs. first-time users), app-based rating data (driver score at time of trip), time since last tip (habitual tipping patterns), passenger nationality or origin zone as a cultural proxy, and weather conditions (passengers may tip more generously during rain or extreme heat out of empathy). Without such behavioural and contextual signals, the ceiling on minority class prediction is likely an inherent data limitation rather than a modelling one.
| Model | RMSE | MAE | MAPE | R² |
|---|---|---|---|---|
| Linear Regression | $2.62 | $1.14 | 6.4% | 0.973 |
| Random Forest | $1.18 | $0.54 | 3.4% | 0.995 |
| XGBoost (Best) | $1.13 | $0.55 | 3.4% | 0.995 |