This project is a continuation of my previous project, “Time Arrival Estimation with Machine Learning.” In this iteration, I aim to enhance the predictive model developed earlier by leveraging more advanced machine learning algorithms and deep learning architectures. The project revolves around analyzing NYC taxi trips and their durations. The primary objective is to build an effective machine learning model for predicting trip durations; accurate duration predictions translate directly into reliable estimates of expected arrival times.
Developing an effective machine learning model for time arrival estimation offers several significant business advantages:
- Better Customer Experience: Predicting trip durations helps passengers plan better, reducing wait times and travel uncertainty
- Efficient Dispatch and Routing: Accurate predictions optimize taxi assignments and routes, reducing idle time and improving fleet use
- Quality Service: Reliable estimates reduce rushed driving, leading to safer and more comfortable rides
- Cost Efficiency: Optimized operations save fuel, vehicle wear, and overall expenses
- Competitive Edge: Reliable predictions attract customers and boost loyalty, setting you apart in the market
- Smart Decision-Making: Data analysis unveils peak times, routes, and demand patterns for informed choices
- Partnerships and Integration: Accurate predictions open doors for collaborations with other services
- Effective Marketing: Promote precise arrival times to attract more users and repeat business
- Efficient Workforce: Predictive insights aid in allocating drivers as per demand, reducing downtime
- Data Value: Monetize data by selling insights on traffic patterns and commuter behavior
## [1] TRUE
There are missing values. Let’s check which columns!
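Neither check's code made it into this excerpt; the TRUE above presumably comes from anyNA(trip_raw), and the per-column counts below from something like this (my reconstruction):
colSums(is.na(trip_raw)) # NA count per column, printed below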
## VendorID lpep_pickup_datetime lpep_dropoff_datetime
## 32518 0 0
## store_and_fwd_flag RatecodeID PULocationID
## 0 32518 0
## DOLocationID passenger_count trip_distance
## 0 32518 0
## fare_amount extra mta_tax
## 0 0 0
## tip_amount tolls_amount ehail_fee
## 0 0 83691
## improvement_surcharge total_amount payment_type
## 0 0 32518
## trip_type congestion_surcharge
## 32518 32518
I don’t need these columns, so I’ll just remove them.
trip <- trip_raw %>%
select_if(~ all(!is.na(.))) %>% # drop every column that contains any NA
select(-extra,                  # also drop the fee/flag columns I won't use
-mta_tax,
-tip_amount,
-improvement_surcharge,
-total_amount,
-store_and_fwd_flag,
-tolls_amount)
trip %>% tail()
## [1] FALSE
Good, the data is now free of missing values!
I’m gonna parse lpep_pickup_datetime and lpep_dropoff_datetime to datetime and then rename them
trip <- trip %>%
mutate_if(is.character, ymd_hms) %>%       # parse the character datetime columns
rename(PU_datetime = lpep_pickup_datetime, # shorter names: pickup / dropoff
DO_datetime = lpep_dropoff_datetime)
trip %>% head()
I’m gonna parse PULocationID and DOLocationID to factor.
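That conversion chunk isn’t shown here; presumably it was something along these lines, followed by the glimpse() that prints the structure below (my sketch):
trip <- trip %>%
  mutate(PULocationID = as.factor(PULocationID), # location IDs are categories, not numbers
         DOLocationID = as.factor(DOLocationID))
glimpse(trip)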
## Rows: 83,691
## Columns: 6
## $ PU_datetime <dttm> 2021-07-01 00:30:52, 2021-07-01 00:25:36, 2021-07-01 00…
## $ DO_datetime <dttm> 2021-07-01 00:35:36, 2021-07-01 01:01:31, 2021-07-01 00…
## $ PULocationID <fct> 74, 116, 97, 74, 42, 24, 75, 82, 74, 41, 74, 167, 197, 7…
## $ DOLocationID <fct> 168, 265, 33, 42, 244, 239, 243, 82, 42, 42, 236, 167, 2…
## $ trip_distance <dbl> 1.20, 13.69, 0.95, 1.24, 1.10, 1.90, 0.00, 0.66, 1.72, 1…
## $ fare_amount <dbl> 6.0, 42.0, 6.5, 6.5, 7.0, 8.0, 17.5, 5.0, 7.0, 7.5, 9.0,…
Looking at the data summary, I can spot some weird records, or as some people call them, anomalies.
## PU_datetime DO_datetime
## Min. :2008-12-31 23:12:53.00 Min. :2008-12-31 23:27:09.00
## 1st Qu.:2021-07-08 19:50:55.00 1st Qu.:2021-07-08 20:25:41.50
## Median :2021-07-16 11:58:17.00 Median :2021-07-16 12:26:04.00
## Mean :2021-07-16 10:39:52.05 Mean :2021-07-16 11:04:18.35
## 3rd Qu.:2021-07-24 12:48:20.00 3rd Qu.:2021-07-24 13:13:41.50
## Max. :2021-08-01 00:06:03.00 Max. :2021-08-01 19:55:44.00
##
## PULocationID DOLocationID trip_distance fare_amount
## 74 : 8770 74 : 3666 Min. : 0.00 Min. :-150.00
## 75 : 7713 75 : 3122 1st Qu.: 1.35 1st Qu.: 9.00
## 41 : 4761 42 : 2904 Median : 2.76 Median : 16.00
## 42 : 3229 41 : 2527 Mean : 194.35 Mean : 20.39
## 95 : 2486 236 : 1700 3rd Qu.: 6.20 3rd Qu.: 26.83
## 166 : 2391 238 : 1546 Max. :260517.93 Max. : 480.00
## (Other):54341 (Other):68226
There are 3 anomalies:
- trip_distance == 0. Who orders a taxi to stay in the same place?
- trip_distance == 260517.93. Did the taxi fly?
- fare_amount < 0. Did the passenger steal the driver’s money?
trip <- trip %>%
filter(trip_distance >= 0.1,  # drop zero / near-zero distance trips
fare_amount >= 1) %>%         # drop negative and implausibly low fares
killOutliers(trip_distance)   # This function is from my personal package
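killOutliers() lives in my personal package. If you don’t have it, a rough IQR-based stand-in could look like this (a sketch, not the exact implementation):
# Hypothetical stand-in for killOutliers(): drop rows outside the 1.5 * IQR fences
killOutliers <- function(data, column) {
  values <- dplyr::pull(data, {{ column }})
  q <- quantile(values, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * IQR(values, na.rm = TRUE)
  dplyr::filter(data, {{ column }} >= q[1] - fence, {{ column }} <= q[2] + fence)
}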
trip %>% summary()
## PU_datetime DO_datetime
## Min. :2008-12-31 23:12:53.0 Min. :2008-12-31 23:27:09.00
## 1st Qu.:2021-07-08 20:59:30.0 1st Qu.:2021-07-08 21:34:01.25
## Median :2021-07-16 12:23:28.0 Median :2021-07-16 12:44:06.00
## Mean :2021-07-16 13:13:35.2 Mean :2021-07-16 13:35:26.66
## 3rd Qu.:2021-07-24 13:04:00.0 3rd Qu.:2021-07-24 13:28:10.00
## Max. :2021-08-01 00:06:03.0 Max. :2021-08-01 15:50:56.00
##
## PULocationID DOLocationID trip_distance fare_amount
## 74 : 8534 74 : 3434 Min. : 0.10 Min. : 2.0
## 75 : 7338 75 : 2837 1st Qu.: 1.41 1st Qu.: 8.5
## 41 : 4583 42 : 2746 Median : 2.64 Median : 15.0
## 42 : 3004 41 : 2409 Mean : 3.77 Mean : 17.6
## 95 : 2330 236 : 1643 3rd Qu.: 5.23 3rd Qu.: 23.5
## 166 : 2296 238 : 1511 Max. :13.92 Max. :160.0
## (Other):44943 (Other):58448
Cool, looking beautiful
There are 3 useful features I’m gonna extract:
- trip_duration: the difference in minutes between PU_datetime and DO_datetime
- hour: the hour of PU_datetime
- is_weekend: whether the PU_datetime falls on a weekend or not
trip <- trip %>%
mutate(trip_duration = as.numeric(round(difftime(DO_datetime, PU_datetime, units = "mins"), 2)), # minutes between pickup and dropoff
is_weekend = as.factor(ifelse(wday(PU_datetime, week_start = 1) > 5, 1, 0)),                     # 1 = Saturday/Sunday pickup
hour = as.factor(hour(PU_datetime)))                                                             # hour of day of the pickup
trip %>% head()
trip_duration is gonna be the target variable.
I’ll remove the datetime columns and fare_amount.
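That step isn’t shown either; a minimal sketch, assuming it was just a select():
# Drop the raw datetimes and fare_amount now that the engineered features exist
trip <- trip %>%
  select(-PU_datetime, -DO_datetime, -fare_amount)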
Looking at the target summary, I can again spot some weird records, or as some people call them, anomalies.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 8.52 14.00 21.86 23.00 1438.77
There are 2 anomalies:
- trip_duration == 0. Did they teleport?
- trip_duration == 1438.77. Were they dragging the car?
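The cleaning chunk isn’t shown; judging from the post-cleaning summary below, it was probably a filter plus the same outlier killer, something like:
# Hypothetical reconstruction of the duration cleaning step
trip <- trip %>%
  filter(trip_duration >= 1) %>%  # drop zero / sub-minute trips
  killOutliers(trip_duration)     # same IQR-style helper as before
summary(trip$trip_duration)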
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 8.33 13.73 15.84 21.30 44.72
Perfecto
It would be great if I could aggregate the average trip_duration of each PULocationID-to-DOLocationID pair and use it as a feature, with code along the lines of the sketch below.
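Roughly like this (an illustration only; it was never applied to the final feature set):
# Hypothetical route-level feature: mean duration per pickup/dropoff pair
route_avg <- trip %>%
  group_by(PULocationID, DOLocationID) %>%
  summarise(route_avg_duration = mean(trip_duration), .groups = "drop")
trip <- trip %>%
  left_join(route_avg, by = c("PULocationID", "DOLocationID"))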
But unfortunately, I don’t have enough data :(
ggplot(trip, aes(x = trip_duration)) +
geom_density(fill = "#56b4fc", col = "white", alpha = 0.5) +
labs(x = "Trip Duration",
y = "Density",
title = "Distribution of Trip Duration") +
theme_minimal()
Trip duration is roughly bell-shaped with a right skew; most trips cluster around 8 minutes.
ggplot(trip, aes(x = trip_distance)) +
geom_density(fill = "#56b4fc", col = "white", alpha = 0.5) +
labs(x = "Trip Distance",
y = "Density",
title = "Distribution of Trip Distance") +
theme_minimal()
Trip distance has a similar shape; most trips are around 1.75 miles.
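The next plot buckets the pickup hour with getHourRange(), a helper from my personal package. A rough stand-in using cut() might look like this (the exact bins are my assumption):
# Hypothetical stand-in for getHourRange(): bin hour of day into coarse ranges
getHourRange <- function(hour) {
  cut(hour,
      breaks = c(-1, 5, 11, 17, 23),
      labels = c("00-05", "06-11", "12-17", "18-23"))
}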
plot_data <- trip %>% mutate(hour_range = getHourRange(as.integer(as.character(hour)))) # getHourRange() is from my personal package; as.character first so the factor levels map back to the real hours
ggplot(plot_data, aes(x = trip_distance, y = trip_duration, color = trip_duration)) +
geom_point() +
labs(x = "Trip Distance",
y = "Trip Duration",
title = "Trip Distance & Duration Relationship per Hour Range",
color = "Is Weekend") +
theme_gray() +
facet_grid(. ~ hour_range) +
theme(legend.position = "none")They are mostly the same in terms of the the relationship pattern
plot_data <- trip %>%
group_by(hour, is_weekend) %>%
summarise(count = n()) %>%
ungroup()
ggplot(plot_data, aes(x = hour, y = count)) +
geom_bar(stat = "identity", aes(fill = count), show.legend = F) +
labs(x = "Hour of Day",
y = "Trips",
title = "Number of Trips by Hour") +
facet_grid(. ~ is_weekend,
labeller = labeller(is_weekend = c("0" = "Weekday", "1" = "Weekend"))) +
theme_minimal()
The number of trips is high from around 8:00 to 18:00.
This will give us insight into which hours experience the most traffic congestion.
plot_data <- trip %>%
group_by(hour, is_weekend) %>%
summarise(ratio = mean(trip_duration/trip_distance)) %>%
ungroup()
ggplot(plot_data, aes(x = hour, y = ratio)) +
geom_bar(stat = "identity", aes(fill = ratio), show.legend = F) +
labs(x = "Hour of Day",
y = "Trip Duration per Distance",
title = "Trip Duration per Distance by Hour") +
facet_grid(. ~ is_weekend,
labeller = labeller(is_weekend = c("0" = "Weekday", "1" = "Weekend"))) +
theme_minimal()
Traffic congestion mostly occurs between 13:00 and 18:00.
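The indices object used for the split below isn’t defined anywhere in this excerpt; a minimal sketch, assuming a plain random 80/20 split (the seed and proportion are my guesses):
# Hypothetical train/test split
set.seed(100)
indices <- sample(nrow(trip), size = floor(0.8 * nrow(trip)))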
train_data <- trip[indices, ] %>% select(-PULocationID,
-DOLocationID)
test_data <- trip[-indices, ] %>% select(-PULocationID,
-DOLocationID)
X_train <- train_data %>% select(-trip_duration)
y_train <- train_data$trip_duration
X_test <- test_data %>% select(-trip_duration)
y_test <- test_data$trip_duration
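The baseline model fit isn’t shown either; the summary below presumably comes from an ordinary least-squares fit along these lines:
# Hypothetical reconstruction of the baseline linear model
model_lm <- lm(trip_duration ~ ., data = train_data)
summary(model_lm)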
## Call:
## lm(formula = trip_duration ~ ., data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.147 -3.466 -0.827 2.662 37.326
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.797590 0.213754 17.766 < 2e-16 ***
## trip_distance 2.636803 0.008239 320.050 < 2e-16 ***
## is_weekend1 -0.014847 0.050868 -0.292 0.77039
## hour1 -0.328531 0.333683 -0.985 0.32485
## hour2 -0.389243 0.388882 -1.001 0.31687
## hour3 -1.181921 0.416058 -2.841 0.00450 **
## hour4 -1.300279 0.385308 -3.375 0.00074 ***
## hour5 -1.599021 0.329178 -4.858 1.19e-06 ***
## hour6 -0.600088 0.269575 -2.226 0.02601 *
## hour7 1.198183 0.243219 4.926 8.40e-07 ***
## hour8 2.230058 0.232150 9.606 < 2e-16 ***
## hour9 2.301029 0.229197 10.040 < 2e-16 ***
## hour10 2.126995 0.228501 9.308 < 2e-16 ***
## hour11 2.809875 0.229031 12.269 < 2e-16 ***
## hour12 3.077923 0.230163 13.373 < 2e-16 ***
## hour13 3.726444 0.231074 16.127 < 2e-16 ***
## hour14 4.274314 0.230670 18.530 < 2e-16 ***
## hour15 4.493975 0.230991 19.455 < 2e-16 ***
## hour16 4.578300 0.231953 19.738 < 2e-16 ***
## hour17 4.325192 0.232275 18.621 < 2e-16 ***
## hour18 3.189397 0.231089 13.802 < 2e-16 ***
## hour19 1.987397 0.234743 8.466 < 2e-16 ***
## hour20 1.446235 0.246563 5.866 4.50e-09 ***
## hour21 1.369564 0.253470 5.403 6.57e-08 ***
## hour22 0.671668 0.260220 2.581 0.00985 **
## hour23 0.557497 0.268400 2.077 0.03780 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.657 on 55748 degrees of freedom
## Multiple R-squared: 0.6517, Adjusted R-squared: 0.6515
## F-statistic: 4172 on 25 and 55748 DF, p-value: < 2.2e-16
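Before scoring the test set: getRegressionErrors() is from my personal package. If you don’t have it, a rough equivalent for the core metrics would be the sketch below (the package version also reports the target’s range, which explains the extra columns in the printed table):
# Hypothetical stand-in for getRegressionErrors(): core regression error metrics
getRegressionErrors <- function(pred, actual) {
  data.frame(
    MAE  = mean(abs(actual - pred)),
    MAPE = 100 * mean(abs((actual - pred) / actual)),
    MSE  = mean((actual - pred)^2),
    RMSE = sqrt(mean((actual - pred)^2))
  )
}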
pred_lm <- predict(model_lm, X_test)
getRegressionErrors(pred_lm, y_test) # This function is from my personal package
## Target Min Max MAE MAPE MSE RMSE
## 1 44.68 4.134 36.954 31.381 5.602
Not so good. But don’t worry, I’ll try another algorithm!
model_rf <- randomForest(x = X_train,
y = y_train,
ntree = 55,
mtry = ncol(X_train),
importance = T)
model_rf
##
## Call:
## randomForest(x = X_train, y = y_train, ntree = 55, mtry = ncol(X_train), importance = T)
## Type of random forest: regression
## Number of trees: 55
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 31.62449
## % Var explained: 65.56
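The test-set metrics below presumably come from predicting on the held-out data, along these lines:
# Hypothetical scoring step for the random forest
pred_rf <- predict(model_rf, X_test)
getRegressionErrors(pred_rf, y_test)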
## Target Min Max MAE MAPE MSE RMSE
## 1 44.68 3.949 29.498 31.481 5.611
It’s way better than my linear regression model in terms of MAPE. But I’ll also try a Feed-Forward Neural Network (FNN) architecture.
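The network needs purely numeric inputs, so the factor columns get one-hot encoded first. oneHotEncode() below is from my personal package; a rough base-R equivalent (an assumption on my part, not the package’s actual code) could be:
# Hypothetical stand-in for oneHotEncode(): expand each factor column into dummy columns
oneHotEncode <- function(data, columns) {
  for (col in columns) {
    dummies <- model.matrix(~ 0 + data[[col]])  # one column per factor level
    colnames(dummies) <- paste0(col, "_", levels(data[[col]]))
    data <- cbind(data, as.data.frame(dummies))
  }
  data[, setdiff(names(data), columns), drop = FALSE]  # drop the original factor columns
}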
factor_columns <- c("PULocationID", "DOLocationID", "is_weekend", "hour")
trip_encoded <- oneHotEncode(trip, factor_columns) %>% as.matrix() # oneHotEncode() is from my personal package
X_train <- trip_encoded[indices, -which(colnames(trip_encoded) == 'trip_duration')]
y_train <- trip_encoded[indices, "trip_duration"]
X_test <- trip_encoded[-indices, -which(colnames(trip_encoded) == 'trip_duration')]
y_test <- trip_encoded[-indices, "trip_duration"]
model_fnn <- keras_model_sequential(name = "ETA-Predictor_NYC-Taxi_1.0.0") %>%
layer_dense(units = 256,
activation = "relu",
input_shape = ncol(X_train),
name = "hidden_1") %>%
layer_dense(units = 128,
activation = "relu",
name = "hidden_2") %>%
layer_dense(units = 64,
activation = "relu",
name = "hidden_3") %>%
layer_dense(units = 1,
activation = "linear",
name = "output")
model_fnn %>% compile(
loss = "mean_squared_error",
optimizer = optimizer_adam(learning_rate = 0.01),
metrics = "mean_absolute_percentage_error" # I'm gonna use MAPE as the metric because the trip_duration is never gonna be zero
)
history <- model_fnn %>% fit(
X_train, y_train,
epochs = 18,
batch_size = 8192,
validation_data = list(X_test, y_test),
verbose = 1
)
model_fnn
## Model: "ETA-Predictor_NYC-Taxi_1.0.0"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## hidden_1 (Dense) (None, 256) 137216
## hidden_2 (Dense) (None, 128) 32896
## hidden_3 (Dense) (None, 64) 8256
## output (Dense) (None, 1) 65
## ================================================================================
## Total params: 178,433
## Trainable params: 178,433
## Non-trainable params: 0
## ________________________________________________________________________________
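Again, the test metrics below presumably come from scoring the held-out set, roughly:
# Hypothetical scoring step for the neural network
pred_fnn <- as.vector(predict(model_fnn, X_test))
getRegressionErrors(pred_fnn, y_test)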
## Target Min Max MAE MAPE MSE RMSE
## 1 44.68 3.213 25.254 20.603 4.539
Alhamdulillah! It’s way better than my random forest model in terms of all metrics! But I’ll still try and see how the KNN algorithm performs.
I will try combining the KNN and FNN models: I’ll add the predicted value from KNN as an extra feature and see how the FNN performs, as sketched below.
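The KNN step itself isn’t shown above; a minimal sketch of the idea, assuming the FNN package’s knn.reg() (the package name is a coincidence, it stands for Fast Nearest Neighbors) and an arbitrary k:
library(FNN) # knn.reg() for k-nearest-neighbour regression
# Hypothetical sketch: append KNN predictions as an extra input column.
# Strictly, the training-set column should come from out-of-fold predictions
# to avoid leakage; this only illustrates the idea.
knn_train <- knn.reg(train = X_train, test = X_train, y = y_train, k = 10)$pred
knn_test  <- knn.reg(train = X_train, test = X_test,  y = y_train, k = 10)$pred
X_train <- cbind(X_train, knn_pred = knn_train)
X_test  <- cbind(X_test,  knn_pred = knn_test)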
model_fnn_knn <- keras_model_sequential(name = "ETA-Predictor_NYC-Taxi_1.0.1") %>%
layer_dense(units = 256,
activation = "relu",
input_shape = ncol(X_train),
name = "hidden_1") %>%
layer_dense(units = 128,
activation = "relu",
name = "hidden_2") %>%
layer_dense(units = 64,
activation = "relu",
name = "hidden_3") %>%
layer_dense(units = 1,
activation = "linear",
name = "output")
model_fnn_knn %>% compile(
loss = "mean_squared_error",
optimizer = optimizer_adam(learning_rate = 0.01),
metrics = "mean_absolute_percentage_error"
)
history <- model_fnn_knn %>% fit(
X_train, y_train,
epochs = 30,
batch_size = 8192,
validation_data = list(X_test, y_test),
verbose = 1
)
model_fnn_knn
## Model: "ETA-Predictor_NYC-Taxi_1.0.1"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## hidden_1 (Dense) (None, 256) 137472
## hidden_2 (Dense) (None, 128) 32896
## hidden_3 (Dense) (None, 64) 8256
## output (Dense) (None, 1) 65
## ================================================================================
## Total params: 178,689
## Trainable params: 178,689
## Non-trainable params: 0
## ________________________________________________________________________________
So, the model that wins the model selection is:
## Model: "ETA-Predictor_NYC-Taxi_1.0.0"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## hidden_1 (Dense) (None, 256) 137216
## hidden_2 (Dense) (None, 128) 32896
## hidden_3 (Dense) (None, 64) 8256
## output (Dense) (None, 1) 65
## ================================================================================
## Total params: 178,433
## Trainable params: 178,433
## Non-trainable params: 0
## ________________________________________________________________________________
Congratulations to our winning model, model_fnn!
residuals_fnn <- y_test - pred_fnn
ggplot(data = data.frame(y_test = y_test, pred_fnn = pred_fnn), aes(x = y_test, y = pred_fnn)) +
geom_point() +
geom_abline(intercept = 0, slope = 1, color = "red") +
labs(title = "Predicted vs Actual Values",
x = "Actual Values",
y = "Predicted Values") +
theme_minimal()
ggplot(data.frame(pred_fnn, residuals_fnn),
aes(x = pred_fnn, y = residuals_fnn)) +
geom_point() +
geom_smooth() +
labs(title = "Residuals Spread Pattern",
x = "Predicted Values",
y = "Residuals") +
theme_minimal()
Cool. Why? Because:
- The predicted vs actual plot follows a roughly linear pattern
- The residuals bounce randomly around 0
ggplot(data = data.frame(residuals_fnn), aes(x = residuals_fnn)) +
geom_density(fill = "#56b4fc", col = "white", alpha = 0.5) +
labs(title = "Density of Residuals",
x = "Residuals",
y = "Frequency") +
theme_minimal()
Cool. Why? Because the residuals are mostly centered around zero.
ggplot(data = data.frame(pred_fnn, residuals_fnn), aes(x = pred_fnn, y = residuals_fnn)) +
geom_point() +
geom_hline(yintercept = 0, color = "red") +
labs(title = "Residuals vs Predicted Values",
x = "Predicted Values",
y = "Residuals") +
theme_minimal()
Cool. Why? Because the residuals form a random cloud with no discernible pattern.
Because my neural network model uses one-hot encoded features, I can’t directly run a multicollinearity check on it. So I’m gonna use my linear model just to check the multicollinearity of the original features.
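The GVIF table below most likely comes from car::vif() applied to the linear model (that call is my assumption):
library(car) # vif() reports GVIF when the model contains factor terms
vif(model_lm)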
## GVIF Df GVIF^(1/(2*Df))
## trip_distance 1.016746 1 1.008338
## is_weekend 1.006300 1 1.003145
## hour 1.023017 23 1.000495
Cool. Why? Because no feature has a GVIF >= 10.
In conclusion, after a comprehensive exploration of multiple algorithms, including linear regression, random forest, feed-forward neural networks (FNN), and k-nearest neighbors (KNN), the best model turned out to be the FNN. Its performance metrics, including MAE, MAPE, MSE, and RMSE, are highly favorable, suggesting it is ready for practical deployment. Moreover, the model satisfies the key regression assumptions: linearity, normality of residuals, homoscedasticity of residuals, and absence of multicollinearity among the input features. With its strong predictive capabilities and fulfillment of these assumptions, the model is well positioned for effective real-world application in time arrival estimation tasks.