This project analyzes NYC taxi trips and their durations. The primary objective is to build a machine learning model that predicts trip duration accurately, which in turn yields reliable estimates of expected arrival times.
Developing an effective machine learning model for arrival time estimation offers several significant business advantages:
- Better Customer Experience: predicting trip durations helps passengers plan better, reducing wait times and travel uncertainty
- Efficient Dispatch and Routing: accurate predictions optimize taxi assignments and routes, reducing idle time and improving fleet use
- Quality Service: reliable estimates reduce rushed driving, leading to safer and more comfortable rides
- Cost Efficiency: optimized operations save fuel, vehicle wear, and overall expenses
- Competitive Edge: reliable predictions attract customers and boost loyalty, setting you apart in the market
- Smart Decision-Making: data analysis unveils peak times, routes, and demand patterns for informed choices
- Partnerships and Integration: accurate predictions open doors for collaborations with other services
- Effective Marketing: promote precise arrival times to attract more users and repeat business
- Efficient Workforce: predictive insights aid in allocating drivers as per demand, reducing downtime
- Data Value: monetize data by selling insights on traffic patterns and commuter behavior.
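First, let’s check whether the raw data contains any missing values; a check along these lines (trip_raw is the raw data frame used in the cleaning chunk below) produces the output that follows:
anyNA(trip_raw)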
## [1] TRUE
There are missing values. Let’s check which columns!
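The per-column counts below come from a check along these lines:
colSums(is.na(trip_raw))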
## VendorID lpep_pickup_datetime lpep_dropoff_datetime
## 32518 0 0
## store_and_fwd_flag RatecodeID PULocationID
## 0 32518 0
## DOLocationID passenger_count trip_distance
## 0 32518 0
## fare_amount extra mta_tax
## 0 0 0
## tip_amount tolls_amount ehail_fee
## 0 0 83691
## improvement_surcharge total_amount payment_type
## 0 0 32518
## trip_type congestion_surcharge
## 32518 32518
I don’t need these columns, so I’ll just remove them:
trip <- trip_raw %>%
  select_if(~ all(!is.na(.))) %>%  # keep only columns with no missing values
  select(-extra,
         -mta_tax,
         -tip_amount,
         -improvement_surcharge,
         -total_amount,
         -store_and_fwd_flag,
         -tolls_amount)  # drop fare components and flags that won't help predict duration
trip %>% tail()
Running the missing-value check on the cleaned data (presumably anyNA(trip)) now returns:
## [1] FALSE
Good, the data is now free of missing values!
I’m going to parse lpep_pickup_datetime and lpep_dropoff_datetime as datetimes, then rename them:
trip <- trip %>%
  mutate_if(is.character, ymd_hms) %>%  # ymd_hms() is from lubridate
  rename(PU_datetime = lpep_pickup_datetime,
         DO_datetime = lpep_dropoff_datetime)
trip %>% head()
I’m going to parse PULocationID and DOLocationID to factor.
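The conversion chunk isn’t shown; a minimal sketch of what it likely did (the glimpse output below confirms both columns end up as factors):
trip <- trip %>%
  mutate(across(c(PULocationID, DOLocationID), as.factor))
trip %>% glimpse()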
## Rows: 83,691
## Columns: 6
## $ PU_datetime <dttm> 2021-07-01 00:30:52, 2021-07-01 00:25:36, 2021-07-01 00…
## $ DO_datetime <dttm> 2021-07-01 00:35:36, 2021-07-01 01:01:31, 2021-07-01 00…
## $ PULocationID <fct> 74, 116, 97, 74, 42, 24, 75, 82, 74, 41, 74, 167, 197, 7…
## $ DOLocationID <fct> 168, 265, 33, 42, 244, 239, 243, 82, 42, 42, 236, 167, 2…
## $ trip_distance <dbl> 1.20, 13.69, 0.95, 1.24, 1.10, 1.90, 0.00, 0.66, 1.72, 1…
## $ fare_amount <dbl> 6.0, 42.0, 6.5, 6.5, 7.0, 8.0, 17.5, 5.0, 7.0, 7.5, 9.0,…
By looking at the data summary, I can detect weird records, or as people may call them, anomalies.
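The summary below presumably comes from:
trip %>% summary()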
## PU_datetime DO_datetime
## Min. :2008-12-31 23:12:53.00 Min. :2008-12-31 23:27:09.00
## 1st Qu.:2021-07-08 19:50:55.00 1st Qu.:2021-07-08 20:25:41.50
## Median :2021-07-16 11:58:17.00 Median :2021-07-16 12:26:04.00
## Mean :2021-07-16 10:39:52.05 Mean :2021-07-16 11:04:18.35
## 3rd Qu.:2021-07-24 12:48:20.00 3rd Qu.:2021-07-24 13:13:41.50
## Max. :2021-08-01 00:06:03.00 Max. :2021-08-01 19:55:44.00
##
## PULocationID DOLocationID trip_distance fare_amount
## 74 : 8770 74 : 3666 Min. : 0.00 Min. :-150.00
## 75 : 7713 75 : 3122 1st Qu.: 1.35 1st Qu.: 9.00
## 41 : 4761 42 : 2904 Median : 2.76 Median : 16.00
## 42 : 3229 41 : 2527 Mean : 194.35 Mean : 20.39
## 95 : 2486 236 : 1700 3rd Qu.: 6.20 3rd Qu.: 26.83
## 166 : 2391 238 : 1546 Max. :260517.93 Max. : 480.00
## (Other):54341 (Other):68226
There are 3 anomalies:
- trip_distance == 0. Who orders a taxi to stay in the same place?
- trip_distance == 260517.93. Did the taxi fly?
- fare_amount < 0. Did the passenger steal the driver’s money?
trip <- trip %>%
  filter(trip_distance >= 0.1,
         fare_amount >= 1) %>%
  killOutliers(trip_distance)  # This function is from my package
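killOutliers() comes from my personal package; if you don’t have it, an IQR-based stand-in along these lines should behave similarly (a sketch assuming the usual 1.5 × IQR fences; remove_iqr_outliers is an illustrative name, not the package’s actual implementation):
remove_iqr_outliers <- function(data, col) {
  col <- rlang::ensym(col)                      # capture the bare column name, like killOutliers() above
  x <- dplyr::pull(data, !!col)
  q <- stats::quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])                  # 1.5 * IQR
  dplyr::filter(data, !!col >= q[1] - fence, !!col <= q[2] + fence)
}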
trip %>% summary()
## PU_datetime DO_datetime
## Min. :2008-12-31 23:12:53.0 Min. :2008-12-31 23:27:09.00
## 1st Qu.:2021-07-08 20:59:30.0 1st Qu.:2021-07-08 21:34:01.25
## Median :2021-07-16 12:23:28.0 Median :2021-07-16 12:44:06.00
## Mean :2021-07-16 13:13:35.2 Mean :2021-07-16 13:35:26.66
## 3rd Qu.:2021-07-24 13:04:00.0 3rd Qu.:2021-07-24 13:28:10.00
## Max. :2021-08-01 00:06:03.0 Max. :2021-08-01 15:50:56.00
##
## PULocationID DOLocationID trip_distance fare_amount
## 74 : 8534 74 : 3434 Min. : 0.10 Min. : 2.0
## 75 : 7338 75 : 2837 1st Qu.: 1.41 1st Qu.: 8.5
## 41 : 4583 42 : 2746 Median : 2.64 Median : 15.0
## 42 : 3004 41 : 2409 Mean : 3.77 Mean : 17.6
## 95 : 2330 236 : 1643 3rd Qu.: 5.23 3rd Qu.: 23.5
## 166 : 2296 238 : 1511 Max. :13.92 Max. :160.0
## (Other):44943 (Other):58448
Cool, looking beautiful
There are 3 useful features I’m going to extract:
- trip_duration: the difference in minutes between PU_datetime and DO_datetime
- hour: the hour of PU_datetime
- is_weekend: whether PU_datetime falls on a weekend

trip <- trip %>%
  mutate(trip_duration = as.numeric(round(difftime(DO_datetime, PU_datetime, units = "mins"), 2)),
         is_weekend = as.factor(ifelse(wday(PU_datetime, week_start = 1) > 5, 1, 0)),  # week_start = 1 makes days 6-7 Saturday/Sunday; with lubridate's default (Sunday = 1), wday() > 5 would flag Friday/Saturday instead
         hour = as.factor(hour(PU_datetime)))
trip %>% head()
trip_duration is going to be the target variable.
I’ll remove the datetime columns and fare_amount:
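The removal chunk isn’t shown; presumably something like:
trip <- trip %>%
  select(-PU_datetime, -DO_datetime, -fare_amount)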
By looking at the target summary, I can detect weird records, or as people may call them, anomalies.
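The target summary below was likely produced by:
summary(trip$trip_duration)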
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 8.52 14.00 21.86 23.00 1438.77
There are 2 anomalies:
- trip_duration == 0. Did they teleport?
- trip_duration == 1438.77. Were they dragging the car?
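The cleaning chunk isn’t shown; judging from the post-cleaning summary below (min 1.00, max 44.72), it presumably repeated the earlier recipe:
trip <- trip %>%
  filter(trip_duration >= 1) %>%
  killOutliers(trip_duration)
summary(trip$trip_duration)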
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 8.33 13.73 15.84 21.30 44.72
Perfecto
It would be great if I could aggregate the average trip_duration for each PULocationID to DOLocationID pair and use it as a feature, along the lines of the sketch below:
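A sketch of what that aggregation could look like (avg_by_route and avg_route_duration are illustrative names):
avg_by_route <- trip %>%
  group_by(PULocationID, DOLocationID) %>%
  summarise(avg_route_duration = mean(trip_duration), .groups = "drop")

trip <- trip %>%
  left_join(avg_by_route, by = c("PULocationID", "DOLocationID"))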
But unfortunately, I don’t have enough data: most pickup/dropoff pairs occur too rarely for a reliable average :(
ggplot(trip, aes(x = trip_duration)) +
  geom_density(fill = "#56b4fc", col = "white", alpha = 0.5) +
  labs(x = "Trip Duration",
       y = "Density",
       title = "Distribution of Trip Duration") +
  theme_minimal()
Trip duration is unimodal but right-skewed, peaking around 8 minutes.
ggplot(trip, aes(x = trip_distance)) +
  geom_density(fill = "#56b4fc", col = "white", alpha = 0.5) +
  labs(x = "Trip Distance",
       y = "Density",
       title = "Distribution of Trip Distance") +
  theme_minimal()
Trip distance is also right-skewed, with a peak around 1.75 miles (the TLC reports trip_distance in miles).
plot_data <- trip %>% mutate(hour_range = getHourRange(as.integer(hour)))  # getHourRange() is from my personal package
ggplot(plot_data, aes(x = trip_distance, y = trip_duration, color = trip_duration)) +
  geom_point() +
  labs(x = "Trip Distance",
       y = "Trip Duration",
       title = "Trip Distance & Duration Relationship per Hour Range") +
  theme_gray() +
  facet_grid(. ~ hour_range) +
  theme(legend.position = "none")  # the legend is hidden, so no color label is needed
The hour ranges look mostly the same in terms of the relationship pattern.
plot_data <- trip %>%
  group_by(hour, is_weekend) %>%
  summarise(count = n()) %>%
  ungroup()
ggplot(plot_data, aes(x = hour, y = count)) +
  geom_bar(stat = "identity", aes(fill = count), show.legend = F) +
  labs(x = "Hour of Day",
       y = "Trips",
       title = "Number of Trips by Hour") +
  facet_grid(. ~ is_weekend,
             labeller = labeller(is_weekend = c("0" = "Weekday", "1" = "Weekend"))) +
  theme_minimal()
The number of trips is highest from around 8:00 to 18:00.
This will give us insight into which hours experience the most traffic congestion.
plot_data <- trip %>%
  group_by(hour, is_weekend) %>%
  summarise(ratio = mean(trip_duration / trip_distance)) %>%
  ungroup()
ggplot(plot_data, aes(x = hour, y = ratio)) +
  geom_bar(stat = "identity", aes(fill = ratio), show.legend = F) +
  labs(x = "Hour of Day",
       y = "Trip Duration per Distance",
       title = "Trip Duration per Distance by Hour") +
  facet_grid(. ~ is_weekend,
             labeller = labeller(is_weekend = c("0" = "Weekday", "1" = "Weekend"))) +
  theme_minimal()
Traffic congestion mostly occurs between 13:00 and 18:00.
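The chunk that defines indices isn’t shown; a minimal sketch assuming a random 80/20 train/test split:
set.seed(123)  # assumed seed; the original isn't shown
indices <- sample(nrow(trip), size = floor(0.8 * nrow(trip)))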
train_data <- trip[indices, ] %>% select(-PULocationID,
                                         -DOLocationID)
test_data <- trip[-indices, ] %>% select(-PULocationID,
                                         -DOLocationID)
X_train <- train_data %>% select(-trip_duration)
y_train <- train_data$trip_duration
X_test <- test_data %>% select(-trip_duration)
y_test <- test_data$trip_duration
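The model is fitted with lm(); the Call line in the summary output below confirms the formula:
model_lm <- lm(trip_duration ~ ., data = train_data)
summary(model_lm)
##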
## Call:
## lm(formula = trip_duration ~ ., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.147 -3.466 -0.827 2.662 37.326
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.797590 0.213754 17.766 < 2e-16 ***
## trip_distance 2.636803 0.008239 320.050 < 2e-16 ***
## is_weekend1 -0.014847 0.050868 -0.292 0.77039
## hour1 -0.328531 0.333683 -0.985 0.32485
## hour2 -0.389243 0.388882 -1.001 0.31687
## hour3 -1.181921 0.416058 -2.841 0.00450 **
## hour4 -1.300279 0.385308 -3.375 0.00074 ***
## hour5 -1.599021 0.329178 -4.858 1.19e-06 ***
## hour6 -0.600088 0.269575 -2.226 0.02601 *
## hour7 1.198183 0.243219 4.926 8.40e-07 ***
## hour8 2.230058 0.232150 9.606 < 2e-16 ***
## hour9 2.301029 0.229197 10.040 < 2e-16 ***
## hour10 2.126995 0.228501 9.308 < 2e-16 ***
## hour11 2.809875 0.229031 12.269 < 2e-16 ***
## hour12 3.077923 0.230163 13.373 < 2e-16 ***
## hour13 3.726444 0.231074 16.127 < 2e-16 ***
## hour14 4.274314 0.230670 18.530 < 2e-16 ***
## hour15 4.493975 0.230991 19.455 < 2e-16 ***
## hour16 4.578300 0.231953 19.738 < 2e-16 ***
## hour17 4.325192 0.232275 18.621 < 2e-16 ***
## hour18 3.189397 0.231089 13.802 < 2e-16 ***
## hour19 1.987397 0.234743 8.466 < 2e-16 ***
## hour20 1.446235 0.246563 5.866 4.50e-09 ***
## hour21 1.369564 0.253470 5.403 6.57e-08 ***
## hour22 0.671668 0.260220 2.581 0.00985 **
## hour23 0.557497 0.268400 2.077 0.03780 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.657 on 55748 degrees of freedom
## Multiple R-squared: 0.6517, Adjusted R-squared: 0.6515
## F-statistic: 4172 on 25 and 55748 DF, p-value: < 2.2e-16
Intercept (3.797590): the predicted trip duration, in minutes, for a trip of zero distance at the baseline levels of the categorical predictors (a weekday trip at hour 0).
trip_distance (2.636803): each additional mile of trip distance adds approximately 2.64 minutes to the predicted duration.
is_weekend1 (-0.014847): being a weekend trip has no significant effect on trip duration (p = 0.77039).
hour1 to hour23: each hourX coefficient is the change in predicted trip duration relative to the baseline hour (hour 0).
Residual Standard Error (5.657): the typical magnitude of the prediction errors, in minutes.
Multiple R-squared (0.6517): about 65.17% of the variability in trip duration is explained by the predictors.
Adjusted R-squared (0.6515): penalizes model complexity; being nearly identical to the multiple R-squared, it suggests the predictors aren’t superfluous.
F-statistic (4172) and p-value (< 2.2e-16): the model as a whole is statistically significant.
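For example, a 2-mile weekday trip picked up at 16:00 gets a predicted duration of about 3.80 + 2 × 2.64 + 4.58 ≈ 13.65 minutes.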
Residuals are the differences between actual and predicted trip durations (actual minus predicted).
- Min: the minimum residual is -34.147, meaning that for some trips the model overpredicted the duration by as much as 34.147 minutes. These are the largest negative differences between actual and predicted values.
- 1Q (First Quartile): the first quartile of the residuals is -3.466, so 25% of residuals fall below this value; for a quarter of trips, the model overpredicted by more than 3.466 minutes.
- Median: the median residual is -0.827, the middle value when the residuals are ordered from smallest to largest; half the residuals fall below it, so the model overpredicts slightly more often than it underpredicts.
- 3Q (Third Quartile): the third quartile is 2.662, so 75% of residuals fall below this value; only a quarter of trips were underpredicted by more than 2.662 minutes.
- Max: the maximum residual is 37.326, meaning that for some trips the model underpredicted the duration by as much as 37.326 minutes. These are the largest positive differences between actual and predicted values.
pred_lm <- predict(model_lm, X_test)
getRegressionErrors(pred_lm, y_test) # This function is from my personal package
## Target Min Max MAE MAPE MSE RMSE
## 1 44.68 4.134 36.954 31.381 5.602
Based on the evaluation metrics for the target variable:
- The minimum value is 1, indicating the smallest actual value observed in the target variable.
- The maximum value is 44.68, representing the largest actual value observed.
The model’s performance metrics are as follows:
MAE (Mean Absolute Error): The average absolute difference between the model’s predictions and the actual values is 4.134. This metric quantifies the magnitude of errors without considering their direction.
MAPE (Mean Absolute Percentage Error): The average percentage difference between the model’s predictions and the actual values is 36.954%. This metric provides insight into the relative error between predictions and actual values.
MSE (Mean Squared Error): The average squared difference between the model’s predictions and the actual values is 31.381. This metric emphasizes larger errors as it squares the differences.
RMSE (Root Mean Squared Error): The square root of the MSE is 5.602. RMSE expresses the typical error magnitude in the same units as the target (minutes) and, unlike MAE, penalizes large errors more heavily.
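getRegressionErrors() is from my personal package; the same numbers can be reproduced with the standard formulas in base R (a sketch, not the package’s exact code):
mae  <- mean(abs(y_test - pred_lm))                   # Mean Absolute Error
mape <- mean(abs((y_test - pred_lm) / y_test)) * 100  # Mean Absolute Percentage Error (%)
mse  <- mean((y_test - pred_lm)^2)                    # Mean Squared Error
rmse <- sqrt(mse)                                     # Root Mean Squared Error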
residuals <- y_test - pred_lm
ggplot(data = data.frame(y_test = y_test, pred_lm = pred_lm), aes(x = y_test, y = pred_lm)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  labs(title = "Predicted vs Actual Values",
       x = "Actual Values",
       y = "Predicted Values") +
  theme_minimal()

ggplot(data.frame(pred_lm, residuals),
       aes(x = pred_lm, y = residuals)) +
  geom_point() +
  geom_smooth() +
  labs(title = "Residuals Spread Pattern",
       x = "Predicted Values",
       y = "Residuals") +
  theme_minimal()
Cool. Why? Because:
- the predicted vs actual points follow the diagonal (a linear pattern)
- the residuals bounce randomly around 0
ggplot(data = data.frame(residuals), aes(x = residuals)) +
  geom_density(fill = "#56b4fc", col = "white", alpha = 0.5) +
  labs(title = "Density of Residuals",
       x = "Residuals",
       y = "Density") +  # geom_density() plots a density, not a frequency
  theme_minimal()
Cool. Why? Because the residuals are mostly centered around zero.
ggplot(data = data.frame(pred_lm, residuals), aes(x = pred_lm, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  labs(title = "Residuals vs Predicted Values",
       x = "Predicted Values",
       y = "Residuals") +
  theme_minimal()
Cool. Why? Because the residuals form a random cloud (no discernible pattern).
In conclusion, the model’s predictions exhibit a relatively small mean absolute error (MAE): on average, predictions are off by around 4.134 minutes from the actual values. The MAPE of 36.954% looks high, but that is expected when many trips are short, since small absolute errors translate into large percentages. The MSE and RMSE values give further insight into the spread and magnitude of errors, with the RMSE indicating a typical error of around 5.602 minutes. Overall, the model makes reasonable predictions, and further analysis and optimization could improve its performance.
In my next RPubs publication, I will extend this arrival time estimation project by incorporating more robust machine learning algorithms, including deep learning architectures. That document will showcase how these advanced techniques improve prediction accuracy and reliability. By exploring a broader range of methods, the goal is to enhance the project’s predictive power and provide a comprehensive comparison of different models. Stay tuned on my RPubs for insights into how deep learning and other powerful algorithms can elevate arrival time estimation.