Introduction

This project is a continuation of my previous project, “Time Arrival Estimation with Machine Learning.” In this iteration, I aim to improve the earlier predictive model by leveraging more advanced machine learning algorithms and deep learning architectures. The project analyzes NYC taxi trips and their durations, with the primary objective of building a machine learning model that predicts trip duration accurately. Accurate trip-duration predictions translate directly into reliable estimates of expected arrival times.

Developing an effective machine learning model for time arrival estimation offers several significant business advantages:

  1. Better Customer Experience: Predicting trip durations helps passengers plan better, reducing wait times and travel uncertainty
  2. Efficient Dispatch and Routing: Accurate predictions optimize taxi assignments and routes, reducing idle time and improving fleet use
  3. Quality Service: Reliable estimates reduce rushed driving, leading to safer and more comfortable rides
  4. Cost Efficiency: Optimized operations save fuel, vehicle wear, and overall expenses
  5. Competitive Edge: Reliable predictions attract customers and boost loyalty, setting the service apart in the market
  6. Smart Decision-Making: Data analysis unveils peak times, routes, and demand patterns for informed choices
  7. Partnerships and Integration: Accurate predictions open doors for collaborations with other services
  8. Effective Marketing: Promote precise arrival times to attract more users and repeat business
  9. Efficient Workforce: Predictive insights aid in allocating drivers as per demand, reducing downtime
  10. Data Value: Monetize data by selling insights on traffic patterns and commuter behavior.

Data Pre-processing

Import used libraries

library(dplyr)
library(lubridate)
library(Ardian) # My own package
library(GGally)
library(ggplot2)
library(caret)
library(randomForest)
library(MLmetrics)
library(tensorflow)
library(keras)
library(class)
library(FNN)
library(car)

Read the data

trip_raw <- read.csv("datasets/taxi_tripdata.csv")

Inspect the data

Top 6 rows

trip_raw %>% head()

Bottom 6 rows

trip_raw %>% tail()

Check duplicated rows

trip_raw %>% anyDuplicated()
## [1] 0

There are no duplicated rows

Check missing values

trip_raw %>% anyNA()
## [1] TRUE

There are missing values. Let’s check which columns!

Check missing values by columns

trip_raw %>% is.na() %>% colSums()
##              VendorID  lpep_pickup_datetime lpep_dropoff_datetime 
##                 32518                     0                     0 
##    store_and_fwd_flag            RatecodeID          PULocationID 
##                     0                 32518                     0 
##          DOLocationID       passenger_count         trip_distance 
##                     0                 32518                     0 
##           fare_amount                 extra               mta_tax 
##                     0                     0                     0 
##            tip_amount          tolls_amount             ehail_fee 
##                     0                     0                 83691 
## improvement_surcharge          total_amount          payment_type 
##                     0                     0                 32518 
##             trip_type  congestion_surcharge 
##                 32518                 32518

The columns that contain missing values aren’t needed for this analysis, so I’ll drop them, along with a few fare-related columns I won’t use

Remove unneeded columns

trip <- trip_raw %>%
  select_if(~ all(!is.na(.))) %>% 
  select(-extra,
         -mta_tax,
         -tip_amount,
         -improvement_surcharge,
         -total_amount,
         -store_and_fwd_flag,
         -tolls_amount)

trip %>% tail()

Check missing values again

trip %>% anyNA()
## [1] FALSE

Good, the data is now free of missing values!

Parse datetime columns

I’ll parse lpep_pickup_datetime and lpep_dropoff_datetime into datetime and then rename them

trip <- trip %>% 
  mutate_if(is.character, ymd_hms) %>% 
  rename(PU_datetime = lpep_pickup_datetime,
         DO_datetime = lpep_dropoff_datetime)

trip %>% head()

Parse categorical columns

I’ll convert PULocationID and DOLocationID to factors

trip <- trip %>% 
  mutate_at(vars(PULocationID, DOLocationID), as.factor)

Check data structures

trip %>% glimpse()
## Rows: 83,691
## Columns: 6
## $ PU_datetime   <dttm> 2021-07-01 00:30:52, 2021-07-01 00:25:36, 2021-07-01 00…
## $ DO_datetime   <dttm> 2021-07-01 00:35:36, 2021-07-01 01:01:31, 2021-07-01 00…
## $ PULocationID  <fct> 74, 116, 97, 74, 42, 24, 75, 82, 74, 41, 74, 167, 197, 7…
## $ DOLocationID  <fct> 168, 265, 33, 42, 244, 239, 243, 82, 42, 42, 236, 167, 2…
## $ trip_distance <dbl> 1.20, 13.69, 0.95, 1.24, 1.10, 1.90, 0.00, 0.66, 1.72, 1…
## $ fare_amount   <dbl> 6.0, 42.0, 6.5, 6.5, 7.0, 8.0, 17.5, 5.0, 7.0, 7.5, 9.0,…

Check the data summary

The data summary helps surface strange records, or what people may call anomalies

trip %>% summary()
##   PU_datetime                      DO_datetime                    
##  Min.   :2008-12-31 23:12:53.00   Min.   :2008-12-31 23:27:09.00  
##  1st Qu.:2021-07-08 19:50:55.00   1st Qu.:2021-07-08 20:25:41.50  
##  Median :2021-07-16 11:58:17.00   Median :2021-07-16 12:26:04.00  
##  Mean   :2021-07-16 10:39:52.05   Mean   :2021-07-16 11:04:18.35  
##  3rd Qu.:2021-07-24 12:48:20.00   3rd Qu.:2021-07-24 13:13:41.50  
##  Max.   :2021-08-01 00:06:03.00   Max.   :2021-08-01 19:55:44.00  
##                                                                   
##   PULocationID    DOLocationID   trip_distance        fare_amount     
##  74     : 8770   74     : 3666   Min.   :     0.00   Min.   :-150.00  
##  75     : 7713   75     : 3122   1st Qu.:     1.35   1st Qu.:   9.00  
##  41     : 4761   42     : 2904   Median :     2.76   Median :  16.00  
##  42     : 3229   41     : 2527   Mean   :   194.35   Mean   :  20.39  
##  95     : 2486   236    : 1700   3rd Qu.:     6.20   3rd Qu.:  26.83  
##  166    : 2391   238    : 1546   Max.   :260517.93   Max.   : 480.00  
##  (Other):54341   (Other):68226

There are three kinds of anomalies:

  1. trip_distance == 0. Who orders a taxi to stay in the same place?
  2. trip_distance == 260517.93. Did the taxi fly?
  3. fare_amount < 0. Did the passenger steal the driver’s money?

Handle anomalies

trip <- trip %>% 
  filter(trip_distance >= 0.1,
         fare_amount >= 1) %>% 
  killOutliers(trip_distance) # This function is from my package

trip %>% summary()
##   PU_datetime                     DO_datetime                    
##  Min.   :2008-12-31 23:12:53.0   Min.   :2008-12-31 23:27:09.00  
##  1st Qu.:2021-07-08 20:59:30.0   1st Qu.:2021-07-08 21:34:01.25  
##  Median :2021-07-16 12:23:28.0   Median :2021-07-16 12:44:06.00  
##  Mean   :2021-07-16 13:13:35.2   Mean   :2021-07-16 13:35:26.66  
##  3rd Qu.:2021-07-24 13:04:00.0   3rd Qu.:2021-07-24 13:28:10.00  
##  Max.   :2021-08-01 00:06:03.0   Max.   :2021-08-01 15:50:56.00  
##                                                                  
##   PULocationID    DOLocationID   trip_distance    fare_amount   
##  74     : 8534   74     : 3434   Min.   : 0.10   Min.   :  2.0  
##  75     : 7338   75     : 2837   1st Qu.: 1.41   1st Qu.:  8.5  
##  41     : 4583   42     : 2746   Median : 2.64   Median : 15.0  
##  42     : 3004   41     : 2409   Mean   : 3.77   Mean   : 17.6  
##  95     : 2330   236    : 1643   3rd Qu.: 5.23   3rd Qu.: 23.5  
##  166    : 2296   238    : 1511   Max.   :13.92   Max.   :160.0  
##  (Other):44943   (Other):58448

Much better: the distance and fare ranges now look reasonable
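
killOutliers() comes from my personal package. For readers without it, a minimal sketch of an IQR-based equivalent (an assumption about its behaviour, not the actual implementation) could look like this:

# Hypothetical IQR-based stand-in for killOutliers(); not the actual package code
killOutliers_sketch <- function(data, column) {
  values <- dplyr::pull(data, {{ column }})
  q <- quantile(values, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  dplyr::filter(data,
                {{ column }} >= q[1] - 1.5 * iqr,
                {{ column }} <= q[2] + 1.5 * iqr)
}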

Feature Engineering

Extract useful information from the datetime columns

There are three useful features I’ll extract:

  1. trip_duration: the difference in minutes between PU_datetime and DO_datetime
  2. hour: the hour of PU_datetime
  3. is_weekend: whether PU_datetime falls on a weekend

trip <- trip %>% 
  mutate(trip_duration = as.numeric(round(difftime(DO_datetime, PU_datetime, units = "mins"), 2)),
         is_weekend = as.factor(ifelse(wday(PU_datetime) %in% c(1, 7), 1, 0)), # Sunday = 1, Saturday = 7 with lubridate's default week start
         hour = as.factor(hour(PU_datetime)))

trip %>% head()

trip_duration will be the target variable

Check feature correlations

ggcorr(trip, label = T)

I will remove fare_amount because it’s highly correlated with trip_distance. Why drop fare_amount instead of trip_distance? Because fare_amount is itself an outcome of the trip (it is largely determined by distance and duration), so it wouldn’t be available as a reliable input at prediction time

Feature selection

Remove unused columns

I’ll remove the datetime columns and fare_amount

trip <- trip %>% select(-PU_datetime,
                        -DO_datetime,
                        -fare_amount)

trip %>% head()

Check target variable summary

The target summary helps surface strange records, or what people may call anomalies

trip$trip_duration %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    8.52   14.00   21.86   23.00 1438.77

There are two anomalies:

  1. trip_duration == 0. Did they teleport?
  2. trip_duration == 1438.77 (nearly a full day). Were they dragging the car?

Handle target anomalies

trip <- trip %>% 
  killOutliers(trip_duration) %>%
  filter(trip_duration >= 1)

Check target variable’s summary again

trip$trip_duration %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    8.33   13.73   15.84   21.30   44.72

Perfecto

Average Trip Duration

It would be great to aggregate the average trip_duration for each PULocationID to DOLocationID pair and use it as a feature, with code like the block below:

trip %>% 
  group_by(PULocationID, DOLocationID) %>% 
  summarise(average_duration = mean(trip_duration))

But unfortunately, many pickup–dropoff pairs have too few trips to yield a stable average, so I’ll skip this feature :(
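
A quick way to check how sparse the pickup–dropoff pairs are (illustrative; the threshold of 30 trips is arbitrary):

# Illustrative sparsity check for pickup–dropoff pairs (threshold is arbitrary)
trip %>% 
  count(PULocationID, DOLocationID, name = "n_trips") %>% 
  summarise(n_pairs        = n(),
            pairs_under_30 = sum(n_trips < 30),
            median_trips   = median(n_trips))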

Exploratory Data Analysis

Distribution of Trip Duration

ggplot(trip, aes(x = trip_duration)) +
  geom_density(fill = "#56b4fc", col = "white", alpha = 0.5) +
  labs(x = "Trip Duration",
       y = "Density",
       title = "Distribution of Trip Duration") +
  theme_minimal()

Trip duration is right-skewed, with the density peaking at around 8 minutes

Distribution of Trip Distance

ggplot(trip, aes(x = trip_distance)) +
  geom_density(fill = "#56b4fc", col = "white", alpha = 0.5) +
  labs(x = "Trip Distance",
       y = "Density",
       title = "Distribution of Trip Distance") +
  theme_minimal()

Trip distance is also right-skewed, with most trips around 1.75 miles

Trip Distance & Duration Relationship per Hour Range

plot_data <- trip %>% mutate(hour_range = getHourRange(as.integer(as.character(hour)))) # getHourRange() is from my personal package; convert the factor back to real hours first

ggplot(plot_data, aes(x = trip_distance, y = trip_duration, color = trip_duration)) +
  geom_point() +
  labs(x = "Trip Distance",
       y = "Trip Duration",
       title = "Trip Distance & Duration Relationship per Hour Range",
       color = "Is Weekend") +
  theme_gray() +
  facet_grid(. ~ hour_range) +
  theme(legend.position = "none")

The distance–duration relationship looks broadly similar across the hour ranges
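
getHourRange() also comes from my personal package; a hypothetical stand-in that buckets the hour of day into coarse ranges (the real implementation may differ) might look like:

# Hypothetical stand-in for getHourRange(): bucket the hour of day into coarse ranges
getHourRange_sketch <- function(hour) {
  cut(hour,
      breaks = c(-Inf, 5, 11, 17, Inf),
      labels = c("00-05", "06-11", "12-17", "18-23"))
}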

Number of Trips by Hour

plot_data <- trip %>% 
  group_by(hour, is_weekend) %>% 
  summarise(count = n()) %>%
  ungroup()

ggplot(plot_data, aes(x = hour, y = count)) +
  geom_bar(stat = "identity", aes(fill = count), show.legend = F) + 
  labs(x = "Hour of Day",
       y = "Trips",
       title = "Number of Trips by Hour") +
  facet_grid(. ~ is_weekend,
             labeller = labeller(is_weekend = c("0" = "Weekday", "1" = "Weekend"))) +
  theme_minimal()

The number of trips is highest between roughly 08:00 and 18:00

Average Trip Duration per Distance by Hour

This ratio gives insight into which hours experience the heaviest traffic congestion

plot_data <- trip %>% 
  group_by(hour, is_weekend) %>% 
  summarise(ratio = mean(trip_duration/trip_distance)) %>%
  ungroup()

ggplot(plot_data, aes(x = hour, y = ratio)) +
  geom_bar(stat = "identity", aes(fill = ratio), show.legend = F) + 
  labs(x = "Hour of Day",
       y = "Trip Duration per Distance",
       title = "Trip Duration per Distance by Hour") +
  facet_grid(. ~ is_weekend,
             labeller = labeller(is_weekend = c("0" = "Weekday", "1" = "Weekend"))) +
  theme_minimal()

Traffic congestion mostly occurs between 13:00 and 18:00

Data Splitting

Set random seed and training indices

set.seed(1)

indices <- createDataPartition(trip$trip_duration,
                               p = 0.8,
                               list = FALSE)

Train-test split

I’ll drop the location ID factors here because they have far too many levels for these models (randomForest, for instance, cannot handle factors with more than 53 levels); they come back later via one-hot encoding for the neural network

train_data <- trip[indices, ] %>% select(-PULocationID,
                                         -DOLocationID)
test_data <- trip[-indices, ] %>% select(-PULocationID,
                                         -DOLocationID)

X_train <- train_data %>% select(-trip_duration)
y_train <- train_data$trip_duration

X_test <- test_data %>% select(-trip_duration)
y_test <- test_data$trip_duration

Model Fitting

Linear Regression Algorithm

model_lm <- lm(formula = trip_duration ~ .,
               data = train_data)

model_lm %>% summary()
## 
## Call:
## lm(formula = trip_duration ~ ., data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.147  -3.466  -0.827   2.662  37.326 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.797590   0.213754  17.766  < 2e-16 ***
## trip_distance  2.636803   0.008239 320.050  < 2e-16 ***
## is_weekend1   -0.014847   0.050868  -0.292  0.77039    
## hour1         -0.328531   0.333683  -0.985  0.32485    
## hour2         -0.389243   0.388882  -1.001  0.31687    
## hour3         -1.181921   0.416058  -2.841  0.00450 ** 
## hour4         -1.300279   0.385308  -3.375  0.00074 ***
## hour5         -1.599021   0.329178  -4.858 1.19e-06 ***
## hour6         -0.600088   0.269575  -2.226  0.02601 *  
## hour7          1.198183   0.243219   4.926 8.40e-07 ***
## hour8          2.230058   0.232150   9.606  < 2e-16 ***
## hour9          2.301029   0.229197  10.040  < 2e-16 ***
## hour10         2.126995   0.228501   9.308  < 2e-16 ***
## hour11         2.809875   0.229031  12.269  < 2e-16 ***
## hour12         3.077923   0.230163  13.373  < 2e-16 ***
## hour13         3.726444   0.231074  16.127  < 2e-16 ***
## hour14         4.274314   0.230670  18.530  < 2e-16 ***
## hour15         4.493975   0.230991  19.455  < 2e-16 ***
## hour16         4.578300   0.231953  19.738  < 2e-16 ***
## hour17         4.325192   0.232275  18.621  < 2e-16 ***
## hour18         3.189397   0.231089  13.802  < 2e-16 ***
## hour19         1.987397   0.234743   8.466  < 2e-16 ***
## hour20         1.446235   0.246563   5.866 4.50e-09 ***
## hour21         1.369564   0.253470   5.403 6.57e-08 ***
## hour22         0.671668   0.260220   2.581  0.00985 ** 
## hour23         0.557497   0.268400   2.077  0.03780 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.657 on 55748 degrees of freedom
## Multiple R-squared:  0.6517, Adjusted R-squared:  0.6515 
## F-statistic:  4172 on 25 and 55748 DF,  p-value: < 2.2e-16

Linear Regression Model Evaluation

pred_lm <- predict(model_lm, X_test)

getRegressionErrors(pred_lm, y_test) # This function is from my personal package
##  Target Min   Max   MAE   MAPE    MSE  RMSE
##           1 44.68 4.134 36.954 31.381 5.602

Not great: MAPE is around 37%. But don’t worry, I’ll try another algorithm
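
getRegressionErrors() is also from my personal package; roughly, it corresponds to an MLmetrics-based summary like this sketch (an assumption about its behaviour, not the actual code):

# Rough MLmetrics-based sketch of getRegressionErrors(); assumed behaviour, not the package code
getRegressionErrors_sketch <- function(pred, actual) {
  data.frame(
    target_min = min(actual),
    target_max = max(actual),
    MAE  = MLmetrics::MAE(pred, actual),
    MAPE = MLmetrics::MAPE(pred, actual) * 100, # reported as a percentage
    MSE  = MLmetrics::MSE(pred, actual),
    RMSE = MLmetrics::RMSE(pred, actual)
  )
}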

Random Forest Algorithm

model_rf <- randomForest(x = X_train,
                         y = y_train,
                         ntree = 55,
                         mtry = ncol(X_train),
                         importance = T)

model_rf
## 
## Call:
##  randomForest(x = X_train, y = y_train, ntree = 55, mtry = ncol(X_train),      importance = T) 
##                Type of random forest: regression
##                      Number of trees: 55
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 31.62449
##                     % Var explained: 65.56

Random Forest Model Evaluation

pred_rf <- predict(model_rf, X_test)

getRegressionErrors(pred_rf, y_test)
##  Target Min   Max   MAE   MAPE    MSE  RMSE
##           1 44.68 3.949 29.498 31.481 5.611

MAPE improves noticeably over the linear regression model (about 29.5% vs 37%), although RMSE is about the same. Next, I’ll try a feed-forward neural network architecture

Deep Learning

One Hot Encoding

factor_columns <- c("PULocationID", "DOLocationID", "is_weekend", "hour")
trip_encoded <- oneHotEncode(trip, factor_columns) %>% as.matrix() # oneHotEncode() is from my personal package

X_train <- trip_encoded[indices, -which(colnames(trip_encoded) == 'trip_duration')]
y_train <- trip_encoded[indices, "trip_duration"]

X_test <- trip_encoded[-indices, -which(colnames(trip_encoded) == 'trip_duration')]
y_test <- trip_encoded[-indices, "trip_duration"]
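
oneHotEncode() is from my personal package; a rough equivalent using caret’s dummyVars() (an assumption about what the function does, not the actual code) would be:

# Illustrative stand-in for oneHotEncode() using caret; assumed behaviour
dummy_spec       <- dummyVars(~ ., data = trip, fullRank = FALSE)
trip_encoded_alt <- predict(dummy_spec, newdata = trip) %>% as.matrix()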

Feed-Forward Neural Network

model_fnn <- keras_model_sequential(name = "ETA-Predictor_NYC-Taxi_1.0.0") %>%
  layer_dense(units = 256,
              activation = "relu",
              input_shape = ncol(X_train),
              name = "hidden_1") %>%
  layer_dense(units = 128,
              activation = "relu",
              name = "hidden_2") %>%
  layer_dense(units = 64,
              activation = "relu",
              name = "hidden_3") %>%
  layer_dense(units = 1,
              activation = "linear",
              name = "output")

model_fnn %>% compile(
  loss = "mean_squared_error",
  optimizer = optimizer_adam(learning_rate = 0.01),
  metrics = "mean_absolute_percentage_error" # I'm gonna use MAPE as the metric because the trip_duration is never gonna be zero
)

history <- model_fnn %>% fit(
  X_train, y_train,
  epochs = 18,
  batch_size = 8192,
  validation_data = list(X_test, y_test),
  verbose = 1
)

model_fnn
## Model: "ETA-Predictor_NYC-Taxi_1.0.0"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #     
## ================================================================================
##  hidden_1 (Dense)                   (None, 256)                     137216      
##  hidden_2 (Dense)                   (None, 128)                     32896       
##  hidden_3 (Dense)                   (None, 64)                      8256        
##  output (Dense)                     (None, 1)                       65          
## ================================================================================
## Total params: 178,433
## Trainable params: 178,433
## Non-trainable params: 0
## ________________________________________________________________________________

Feed-Forward Neural Network Model Evaluation

pred_fnn <- predict(model_fnn, X_test) %>% as.vector()

getRegressionErrors(pred_fnn, y_test)
##  Target Min   Max   MAE   MAPE    MSE  RMSE
##           1 44.68 3.213 25.254 20.603 4.539

Alhamdulillah! It beats the random forest model on every metric. Next, I’ll see how the KNN algorithm performs

Further Model Fitting

K-Nearest Neighbour Regressor Algorithm

pred_knn_train <- knn.reg(train = X_train,
                          test = X_train,
                          y = y_train,
                          k = 5)$pred

pred_knn_test <- knn.reg(train = X_train,
                         test = X_test,
                         y = y_train,
                         k = 5)$pred

K-Nearest Neighbour Evaluation

getRegressionErrors(pred_knn_test, y_test)
##  Target Min   Max   MAE   MAPE   MSE RMSE
##           1 44.68 3.461 26.676 24.21 4.92

Decent, but my FNN model is still better. What if I combine them?

Model Combining

I will try combining the KNN and FNN models: I’ll feed the KNN predictions in as an extra feature and see how the FNN performs (a simple form of stacking). Note that predicting the training set with the training set itself means each point’s own value leaks into its KNN feature, so the train and test versions of this feature behave slightly differently

Add KNN prediction as a feature

X_train <- cbind(X_train, pred_knn = pred_knn_train)
X_test <- cbind(X_test, pred_knn = pred_knn_test)

Feed-Forward Neural Network with KNN prediction

model_fnn_knn <- keras_model_sequential(name = "ETA-Predictor_NYC-Taxi_1.0.1") %>%
  layer_dense(units = 256,
              activation = "relu",
              input_shape = ncol(X_train),
              name = "hidden_1") %>%
  layer_dense(units = 128,
              activation = "relu",
              name = "hidden_2") %>%
  layer_dense(units = 64,
              activation = "relu",
              name = "hidden_3") %>%
  layer_dense(units = 1,
              activation = "linear",
              name = "output")

model_fnn_knn %>% compile(
  loss = "mean_squared_error",
  optimizer = optimizer_adam(learning_rate = 0.01),
  metrics = "mean_absolute_percentage_error"
)

history <- model_fnn_knn %>% fit(
  X_train, y_train,
  epochs = 30,
  batch_size = 8192,
  validation_data = list(X_test, y_test),
  verbose = 1
)

model_fnn_knn
## Model: "ETA-Predictor_NYC-Taxi_1.0.1"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #     
## ================================================================================
##  hidden_1 (Dense)                   (None, 256)                     137472      
##  hidden_2 (Dense)                   (None, 128)                     32896       
##  hidden_3 (Dense)                   (None, 64)                      8256        
##  output (Dense)                     (None, 1)                       65          
## ================================================================================
## Total params: 178,689
## Trainable params: 178,689
## Non-trainable params: 0
## ________________________________________________________________________________

New Feed-Forward Neural Network Model Evaluation

pred_fnn_knn <- predict(model_fnn_knn, X_test) %>% as.vector()

getRegressionErrors(pred_fnn_knn, y_test)
##  Target Min   Max   MAE   MAPE    MSE RMSE
##           1 44.68 3.296 24.798 22.282 4.72

Not better than my original FNN model.

Model Selection

So, the winning model is:

model_fnn
## Model: "ETA-Predictor_NYC-Taxi_1.0.0"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #     
## ================================================================================
##  hidden_1 (Dense)                   (None, 256)                     137216      
##  hidden_2 (Dense)                   (None, 128)                     32896       
##  hidden_3 (Dense)                   (None, 64)                      8256        
##  output (Dense)                     (None, 1)                       65          
## ================================================================================
## Total params: 178,433
## Trainable params: 178,433
## Non-trainable params: 0
## ________________________________________________________________________________

Congratulations to model_fnn!

Regression Assumptions

1. Linearity

residuals_fnn <- y_test - pred_fnn

ggplot(data = data.frame(y_test = y_test, pred_fnn = pred_fnn), aes(x = y_test, y = pred_fnn)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  labs(title = "Predicted vs Actual Values",
       x = "Actual Values",
       y = "Predicted Values") +
  theme_minimal()

ggplot(data.frame(pred_fnn, residuals_fnn),
       aes(x = pred_fnn, y = residuals_fnn)) +
  geom_point() +
  geom_smooth() +
  labs(title = "Residuals Spread Pattern",
       x = "Predicted Values",
       y = "Residuals") +
  theme_minimal()

Looks good. Why? Because:

  1. The predicted vs actual plot follows a roughly linear pattern
  2. The residuals bounce randomly around 0

2. Normality of Residuals

ggplot(data = data.frame(residuals_fnn), aes(x = residuals_fnn)) +
  geom_density(fill = "#56b4fc", col = "white", alpha = 0.5) +
  labs(title = "Density of Residuals",
       x = "Residuals",
       y = "Frequency") +
  theme_minimal()

Looks good: the residual density is roughly bell-shaped and centered around zero

3. Homoscedasticity of Residuals

ggplot(data = data.frame(pred_fnn, residuals_fnn), aes(x = pred_fnn, y = residuals_fnn)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  labs(title = "Residuals vs Predicted Values",
       x = "Predicted Values",
       y = "Residuals") +
  theme_minimal()

Looks good: the residuals form a random cloud with no discernible pattern, so the variance is roughly constant

4. No Multicollinearity

Because my neural network model uses one-hot-encoded features, I can’t run a multicollinearity check on it directly. So I’ll use the linear model to check the multicollinearity of the original features instead

vif(model_lm)
##                   GVIF Df GVIF^(1/(2*Df))
## trip_distance 1.016746  1        1.008338
## is_weekend    1.006300  1        1.003145
## hour          1.023017 23        1.000495

Looks good: no feature has a (G)VIF of 10 or more

Conclusion

In conclusion, after exploring linear regression, random forest, feed-forward neural networks (FNN), and k-nearest neighbours (KNN), the plain FNN model performed best. Its error metrics on the test set (MAE ≈ 3.2 minutes, MAPE ≈ 25%, RMSE ≈ 4.5 minutes) are the strongest of the models tried and look reasonable for practical use. The model also satisfies the key regression assumptions: linearity, normality of residuals, homoscedasticity of residuals, and absence of multicollinearity. With this performance and these checks in place, the model is well positioned for real-world time arrival estimation.
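
As a next step toward deployment, the selected model can be persisted and reloaded with the keras R interface; a minimal sketch (the file path and new_trips_encoded input are illustrative placeholders):

# Minimal sketch: persist the winning model for reuse (path is illustrative)
save_model_tf(model_fnn, "models/eta_predictor_fnn")

# Later, in a serving or scoring script (new_trips_encoded is a matrix prepared the same way as X_train)
model_loaded <- load_model_tf("models/eta_predictor_fnn")
eta_minutes  <- predict(model_loaded, new_trips_encoded) %>% as.vector()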