Introduction

This project analyzes NYC taxi trips and their durations. The primary objective is to build an effective machine learning model that predicts trip duration; with accurate duration predictions, we can in turn provide reliable estimates of a trip’s expected arrival time.

Developing an effective machine learning model for time arrival estimation offers several significant business advantages:

  1. Better Customer Experience: Predicting trip durations helps passengers plan better, reducing wait times and travel uncertainty
  2. Efficient Dispatch and Routing: Accurate predictions optimize taxi assignments and routes, reducing idle time and improving fleet use
  3. Quality Service: Reliable estimates reduce rushed driving, leading to safer and more comfortable rides
  4. Cost Efficiency: Optimized operations save fuel, vehicle wear, and overall expenses
  5. Competitive Edge: Reliable predictions attract customers and boost loyalty, setting you apart in the market
  6. Smart Decision-Making: Data analysis unveils peak times, routes, and demand patterns for informed choices
  7. Partnerships and Integration: Accurate predictions open doors for collaborations with other services
  8. Effective Marketing: Promote precise arrival times to attract more users and repeat business
  9. Efficient Workforce: Predictive insights aid in allocating drivers as per demand, reducing downtime
  10. Data Value: Monetize data by selling insights on traffic patterns and commuter behavior.

Data Pre-processing

Import used libraries

library(dplyr)     # data wrangling
library(lubridate) # datetime parsing
library(ggplot2)   # plotting (also attached by GGally, but loaded explicitly here)
library(Ardian)    # My own package (killOutliers, getHourRange, getRegressionErrors)
library(GGally)    # ggcorr()
library(caret)     # createDataPartition()
library(class)     # k-NN utilities
library(FNN)       # fast k-NN
library(car)       # vif()

Read the data

trip_raw <- read.csv("datasets/taxi_tripdata.csv")

Inspect the data

Top 6 rows

trip_raw %>% head()

Bottom 6 rows

trip_raw %>% tail()

Check duplicated rows

trip_raw %>% anyDuplicated()
## [1] 0

There are no duplicated rows

Check missing values

trip_raw %>% anyNA()
## [1] TRUE

There are missing values. Let’s check which columns!

Check missing values by columns

trip_raw %>% is.na() %>% colSums()
##              VendorID  lpep_pickup_datetime lpep_dropoff_datetime 
##                 32518                     0                     0 
##    store_and_fwd_flag            RatecodeID          PULocationID 
##                     0                 32518                     0 
##          DOLocationID       passenger_count         trip_distance 
##                     0                 32518                     0 
##           fare_amount                 extra               mta_tax 
##                     0                     0                     0 
##            tip_amount          tolls_amount             ehail_fee 
##                     0                     0                 83691 
## improvement_surcharge          total_amount          payment_type 
##                     0                     0                 32518 
##             trip_type  congestion_surcharge 
##                 32518                 32518

I don’t need the columns that contain missing values, nor several of the fare-related columns, so I’ll just remove them all

Remove unneeded columns

trip <- trip_raw %>%
  select_if(~ all(!is.na(.))) %>% 
  select(-extra,
         -mta_tax,
         -tip_amount,
         -improvement_surcharge,
         -total_amount,
         -store_and_fwd_flag,
         -tolls_amount)

trip %>% tail()

Check missing values again

trip %>% anyNA()
## [1] FALSE

Good, the data is now free of missing values!

Parse datetime columns

I’m gonna parse lpep_pickup_datetime and lpep_dropoff_datetime to datetime and then rename them

trip <- trip %>% 
  mutate_if(is.character, ymd_hms) %>% 
  rename(PU_datetime = lpep_pickup_datetime,
         DO_datetime = lpep_dropoff_datetime)

trip %>% head()

Parse categorical columns

I’m gonna parse PULocationID and DOLocationID to factor

trip <- trip %>% 
  mutate_at(vars(PULocationID, DOLocationID), as.factor)

Check data structures

trip %>% glimpse()
## Rows: 83,691
## Columns: 6
## $ PU_datetime   <dttm> 2021-07-01 00:30:52, 2021-07-01 00:25:36, 2021-07-01 00…
## $ DO_datetime   <dttm> 2021-07-01 00:35:36, 2021-07-01 01:01:31, 2021-07-01 00…
## $ PULocationID  <fct> 74, 116, 97, 74, 42, 24, 75, 82, 74, 41, 74, 167, 197, 7…
## $ DOLocationID  <fct> 168, 265, 33, 42, 244, 239, 243, 82, 42, 42, 236, 167, 2…
## $ trip_distance <dbl> 1.20, 13.69, 0.95, 1.24, 1.10, 1.90, 0.00, 0.66, 1.72, 1…
## $ fare_amount   <dbl> 6.0, 42.0, 6.5, 6.5, 7.0, 8.0, 17.5, 5.0, 7.0, 7.5, 9.0,…

Check the data summary

By looking at the data summary, I can detect weird records, or what people may call anomalies

trip %>% summary()
##   PU_datetime                      DO_datetime                    
##  Min.   :2008-12-31 23:12:53.00   Min.   :2008-12-31 23:27:09.00  
##  1st Qu.:2021-07-08 19:50:55.00   1st Qu.:2021-07-08 20:25:41.50  
##  Median :2021-07-16 11:58:17.00   Median :2021-07-16 12:26:04.00  
##  Mean   :2021-07-16 10:39:52.05   Mean   :2021-07-16 11:04:18.35  
##  3rd Qu.:2021-07-24 12:48:20.00   3rd Qu.:2021-07-24 13:13:41.50  
##  Max.   :2021-08-01 00:06:03.00   Max.   :2021-08-01 19:55:44.00  
##                                                                   
##   PULocationID    DOLocationID   trip_distance        fare_amount     
##  74     : 8770   74     : 3666   Min.   :     0.00   Min.   :-150.00  
##  75     : 7713   75     : 3122   1st Qu.:     1.35   1st Qu.:   9.00  
##  41     : 4761   42     : 2904   Median :     2.76   Median :  16.00  
##  42     : 3229   41     : 2527   Mean   :   194.35   Mean   :  20.39  
##  95     : 2486   236    : 1700   3rd Qu.:     6.20   3rd Qu.:  26.83  
##  166    : 2391   238    : 1546   Max.   :260517.93   Max.   : 480.00  
##  (Other):54341   (Other):68226

There are 3 anomalies:

  1. trip_distance == 0. Who the hell orders a taxi to stay in the same place?
  2. trip_distance == 260517.93. Did the taxi fly?
  3. fare_amount < 0. Did he/she steal the driver’s money?

Handle anomalies

trip <- trip %>% 
  filter(trip_distance >= 0.1,
         fare_amount >= 1) %>% 
  killOutliers(trip_distance) # This function is from my package
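
killOutliers() comes from my package and isn’t shown in this document; a minimal sketch of what it might do, assuming it drops rows outside the usual 1.5 * IQR fences of the given column (the actual implementation may differ):

# Hypothetical sketch of killOutliers(): drop rows whose value in `column`
# falls outside the 1.5 * IQR fences
killOutliers_sketch <- function(data, column) {
  values <- dplyr::pull(data, {{ column }})
  q <- quantile(values, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  dplyr::filter(data, {{ column }} >= q[1] - fence,
                      {{ column }} <= q[2] + fence)
}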

trip %>% summary()
##   PU_datetime                     DO_datetime                    
##  Min.   :2008-12-31 23:12:53.0   Min.   :2008-12-31 23:27:09.00  
##  1st Qu.:2021-07-08 20:59:30.0   1st Qu.:2021-07-08 21:34:01.25  
##  Median :2021-07-16 12:23:28.0   Median :2021-07-16 12:44:06.00  
##  Mean   :2021-07-16 13:13:35.2   Mean   :2021-07-16 13:35:26.66  
##  3rd Qu.:2021-07-24 13:04:00.0   3rd Qu.:2021-07-24 13:28:10.00  
##  Max.   :2021-08-01 00:06:03.0   Max.   :2021-08-01 15:50:56.00  
##                                                                  
##   PULocationID    DOLocationID   trip_distance    fare_amount   
##  74     : 8534   74     : 3434   Min.   : 0.10   Min.   :  2.0  
##  75     : 7338   75     : 2837   1st Qu.: 1.41   1st Qu.:  8.5  
##  41     : 4583   42     : 2746   Median : 2.64   Median : 15.0  
##  42     : 3004   41     : 2409   Mean   : 3.77   Mean   : 17.6  
##  95     : 2330   236    : 1643   3rd Qu.: 5.23   3rd Qu.: 23.5  
##  166    : 2296   238    : 1511   Max.   :13.92   Max.   :160.0  
##  (Other):44943   (Other):58448

Cool, looking beautiful

Feature Engineering

Extract useful information from the datetime columns

There are 3 useful pieces of information I’m gonna extract:

  1. trip_duration: The minute difference between PU_datetime and DO_datetime
  2. hour: The hour of the PU_datetime
  3. is_weekend: Whether the PU_datetime falls on a weekend

trip <- trip %>% 
  mutate(trip_duration = as.numeric(round(difftime(DO_datetime, PU_datetime, units = "mins"), 2)),
         # week_start = 1 makes Monday = 1, so values above 5 are Saturday and Sunday
         is_weekend = as.factor(ifelse(wday(PU_datetime, week_start = 1) > 5, 1, 0)),
         hour = as.factor(hour(PU_datetime)))

trip %>% head()

trip_duration is gonna be the target variable

Check features correlations

ggcorr(trip, label = T)

I will remove fare_amount because it’s highly correlated with trip_distance. Why remove fare_amount rather than trip_distance? Because, as far as I know, fare_amount is itself a dependent variable: it is largely determined by the distance travelled, so it adds little information that trip_distance doesn’t already provide

Feature selection

Remove unused columns

I’ll remove the datetime columns and fare_amount

trip <- trip %>% select(-PU_datetime,
                        -DO_datetime,
                        -fare_amount)

trip %>% head()

Check target variable summary

By looking at the target summary, I can detect weird records, or what people may call anomalies

trip$trip_duration %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    8.52   14.00   21.86   23.00 1438.77

There are 2 anomalies:

  1. trip_duration == 0. Did they teleport?
  2. trip_duration == 1438.77. Were they dragging the car?

Handle target anomalies

trip <- trip %>% 
  killOutliers(trip_duration) %>%
  filter(trip_duration >= 1)

Check target variable’s summary again

trip$trip_duration %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    8.33   13.73   15.84   21.30   44.72

Perfecto

Average Trip Duration

It would be great if I could aggregate the average trip_duration for each PULocationID to DOLocationID pair and use it as a feature, with the code below:

trip %>% 
  group_by(PULocationID, DOLocationID) %>% 
  summarise(average_duration = mean(trip_duration))

But unfortunately, I don’t have enough data: many pickup-dropoff pairs have too few trips for their averages to be reliable :(
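
For reference, if there were enough trips per pair, the aggregated averages could be joined back onto the data as a feature, roughly like this (a sketch only; it is not used in the rest of this analysis, and in a real model the averages should be computed on the training split only to avoid target leakage):

# Sketch only: attach the per-route average duration as a new feature
route_avg <- trip %>% 
  group_by(PULocationID, DOLocationID) %>% 
  summarise(average_duration = mean(trip_duration), .groups = "drop")

trip_with_avg <- trip %>% 
  left_join(route_avg, by = c("PULocationID", "DOLocationID"))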

Exploratory Data Analysis

Distribution of Trip Duration

ggplot(trip, aes(x = trip_duration)) +
  geom_density(fill = "#56b4fc", col = "white", alpha = 0.5) +
  labs(x = "Trip Duration",
       y = "Density",
       title = "Distribution of Trip Duration") +
  theme_minimal()

The trip duration distribution is unimodal, peaking at around 8 minutes, with a longer right tail

Distribution of Trip Distance

ggplot(trip, aes(x = trip_distance)) +
  geom_density(fill = "#56b4fc", col = "white", alpha = 0.5) +
  labs(x = "Trip Distance",
       y = "Density",
       title = "Distribution of Trip Distance") +
  theme_minimal()

The trip distance distribution is also unimodal, peaking at around 1.75 miles, with a longer right tail

Trip Distance & Duration Relationship per Hour Range

plot_data <- trip %>% mutate(hour_range = getHourRange(as.integer(as.character(hour)))) # getHourRange() is from my personal package (as.character() first, so we get the hour itself rather than the factor level index)

ggplot(plot_data, aes(x = trip_distance, y = trip_duration, color = trip_duration)) +
  geom_point() +
  labs(x = "Trip Distance",
       y = "Trip Duration",
       title = "Trip Distance & Duration Relationship per Hour Range",
       color = "Is Weekend") +
  theme_gray() +
  facet_grid(. ~ hour_range) +
  theme(legend.position = "none")

The distance-duration relationship follows largely the same pattern across the hour ranges
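
getHourRange() also comes from my package; a minimal sketch of what it might look like, assuming it simply bins the hour of day into coarse labelled ranges (the exact ranges are an assumption):

# Hypothetical sketch of getHourRange(): bin hour-of-day into coarse ranges
getHourRange_sketch <- function(hour) {
  cut(hour,
      breaks = c(-1, 5, 11, 17, 23),
      labels = c("00-05", "06-11", "12-17", "18-23"))
}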

Number of Trips by Hour

plot_data <- trip %>% 
  group_by(hour, is_weekend) %>% 
  summarise(count = n()) %>%
  ungroup()

ggplot(plot_data, aes(x = hour, y = count)) +
  geom_bar(stat = "identity", aes(fill = count), show.legend = F) + 
  labs(x = "Hour of Day",
       y = "Trips",
       title = "Number of Trips by Hour") +
  facet_grid(. ~ is_weekend,
             labeller = labeller(is_weekend = c("0" = "Weekday", "1" = "Weekend"))) +
  theme_minimal()

The number of trips is high between roughly 08:00 and 18:00

Average Trip Duration per Distance by Hour

This gives insight into which hours experience the heaviest traffic congestion: a high duration-to-distance ratio means slower average travel

plot_data <- trip %>% 
  group_by(hour, is_weekend) %>% 
  summarise(ratio = mean(trip_duration/trip_distance)) %>%
  ungroup()

ggplot(plot_data, aes(x = hour, y = ratio)) +
  geom_bar(stat = "identity", aes(fill = ratio), show.legend = F) + 
  labs(x = "Hour of Day",
       y = "Trip Duration per Distance",
       title = "Trip Duration per Distance by Hour") +
  facet_grid(. ~ is_weekend,
             labeller = labeller(is_weekend = c("0" = "Weekday", "1" = "Weekend"))) +
  theme_minimal()

Traffic congestion mostly occurs between 13:00 and 18:00

Cross Validation

Set random seed and training indices

set.seed(1)

indices <- createDataPartition(trip$trip_duration,
                               p = 0.8,
                               list = FALSE)

Train test split

# The location IDs are dropped here; they are not used as predictors in this model
train_data <- trip[indices, ] %>% select(-PULocationID,
                                         -DOLocationID)
test_data <- trip[-indices, ] %>% select(-PULocationID,
                                         -DOLocationID)

X_train <- train_data %>% select(-trip_duration)
y_train <- train_data$trip_duration

X_test <- test_data %>% select(-trip_duration)
y_test <- test_data$trip_duration

Model Fitting

Linear Regression Algorithm

model_lm <- lm(formula = trip_duration ~ .,
               data = train_data)

model_lm %>% summary()
## 
## Call:
## lm(formula = trip_duration ~ ., data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.147  -3.466  -0.827   2.662  37.326 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.797590   0.213754  17.766  < 2e-16 ***
## trip_distance  2.636803   0.008239 320.050  < 2e-16 ***
## is_weekend1   -0.014847   0.050868  -0.292  0.77039    
## hour1         -0.328531   0.333683  -0.985  0.32485    
## hour2         -0.389243   0.388882  -1.001  0.31687    
## hour3         -1.181921   0.416058  -2.841  0.00450 ** 
## hour4         -1.300279   0.385308  -3.375  0.00074 ***
## hour5         -1.599021   0.329178  -4.858 1.19e-06 ***
## hour6         -0.600088   0.269575  -2.226  0.02601 *  
## hour7          1.198183   0.243219   4.926 8.40e-07 ***
## hour8          2.230058   0.232150   9.606  < 2e-16 ***
## hour9          2.301029   0.229197  10.040  < 2e-16 ***
## hour10         2.126995   0.228501   9.308  < 2e-16 ***
## hour11         2.809875   0.229031  12.269  < 2e-16 ***
## hour12         3.077923   0.230163  13.373  < 2e-16 ***
## hour13         3.726444   0.231074  16.127  < 2e-16 ***
## hour14         4.274314   0.230670  18.530  < 2e-16 ***
## hour15         4.493975   0.230991  19.455  < 2e-16 ***
## hour16         4.578300   0.231953  19.738  < 2e-16 ***
## hour17         4.325192   0.232275  18.621  < 2e-16 ***
## hour18         3.189397   0.231089  13.802  < 2e-16 ***
## hour19         1.987397   0.234743   8.466  < 2e-16 ***
## hour20         1.446235   0.246563   5.866 4.50e-09 ***
## hour21         1.369564   0.253470   5.403 6.57e-08 ***
## hour22         0.671668   0.260220   2.581  0.00985 ** 
## hour23         0.557497   0.268400   2.077  0.03780 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.657 on 55748 degrees of freedom
## Multiple R-squared:  0.6517, Adjusted R-squared:  0.6515 
## F-statistic:  4172 on 25 and 55748 DF,  p-value: < 2.2e-16

Model Interpretation

Coefficients Interpretation

  • Intercept (3.797590): The predicted trip duration (in minutes) when trip_distance is zero and the other predictors are at their reference levels (hour 0 on a weekday).

  • trip_distance (2.636803): For every additional mile of trip distance, the predicted trip duration increases by approximately 2.64 minutes, holding the other predictors constant.

  • is_weekend1 (-0.014847): Whether the trip happens on a weekend has no significant effect on trip duration (p = 0.77039).

  • hour1 to hour23: Each “hourX” coefficient is the change in predicted trip duration for that pickup hour relative to the reference hour (hour 0); see the worked example below.
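
As a quick sanity check, here is a worked example applying the coefficients above to a hypothetical trip (3 miles, weekday, pickup at 15:00); the figures come straight from the model summary:

# Hypothetical trip: 3 miles, weekday, pickup at 15:00
# Manually: 3.7976 + 2.6368 * 3 + 4.4940 ≈ 16.2 minutes
new_trip <- data.frame(trip_distance = 3,
                       is_weekend = factor(0, levels = c(0, 1)),
                       hour = factor(15, levels = levels(train_data$hour)))
predict(model_lm, new_trip)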

Model Performance

  • Residual Standard Error (5.657): The typical size of the residuals, roughly 5.7 minutes.

  • Multiple R-squared (0.6517): About 65.17% of the variability in trip duration is explained by the predictors.

  • Adjusted R-squared (0.6515): R-squared penalized for the number of predictors; it is nearly identical to the multiple R-squared, so the model is not over-parameterized.

  • F-statistic (4172) and p-value (< 2.2e-16): The model as a whole is statistically significant.

Residuals

Residuals are the differences between actual and predicted trip durations (actual minus predicted).

  • Min: The minimum residual is -34.147, the largest over-prediction: in the worst case the model predicted about 34 minutes more than the actual trip duration.

  • 1Q (First Quartile): The first quartile of the residuals is -3.466, so for 25% of trips the model over-predicted by more than about 3.5 minutes.

  • Median: The median residual is -0.827; half of the residuals lie below this value and half above. The slightly negative median means the model over-predicts a little more often than it under-predicts.

  • 3Q (Third Quartile): The third quartile of the residuals is 2.662, so 75% of residuals fall below this value. The middle half of the residuals lies between -3.466 and 2.662 minutes, meaning most predictions are within a few minutes of the actual duration.

  • Max: The maximum residual is 37.326, the largest under-prediction: in the worst case the model predicted about 37 minutes less than the actual trip duration.

Model Evaluation

pred_lm <- predict(model_lm, X_test)

getRegressionErrors(pred_lm, y_test) # This function is from my personal package
##  Target Min   Max   MAE   MAPE    MSE  RMSE
##           1 44.68 4.134 36.954 31.381 5.602
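
getRegressionErrors() comes from my package; a rough sketch of the equivalent calculations, assuming MAPE is reported as a percentage (the package function may differ in its details):

# Hypothetical equivalent of getRegressionErrors(): standard regression metrics
mae  <- mean(abs(y_test - pred_lm))
mape <- mean(abs((y_test - pred_lm) / y_test)) * 100  # assumes MAPE is in percent
mse  <- mean((y_test - pred_lm)^2)
rmse <- sqrt(mse)
data.frame(Min = min(y_test), Max = max(y_test),
           MAE = mae, MAPE = mape, MSE = mse, RMSE = rmse)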

Based on the evaluation metrics for the target variable:

  • The minimum value is 1, indicating the smallest actual value observed in the target variable.
  • The maximum value is 44.68, representing the largest actual value observed.

The model’s performance metrics are as follows:

  • MAE (Mean Absolute Error): The average absolute difference between the model’s predictions and the actual values is 4.134 minutes. This metric quantifies the magnitude of errors without considering their direction.

  • MAPE (Mean Absolute Percentage Error): The average percentage difference between the model’s predictions and the actual values is 36.954%. This metric reflects the relative error; note that very short trips inflate it, since even a small absolute error is a large percentage of a one- or two-minute trip.

  • MSE (Mean Squared Error): The average squared difference between the model’s predictions and the actual values is 31.381. This metric emphasizes larger errors as it squares the differences.

  • RMSE (Root Mean Squared Error): The square root of the MSE is 5.602. RMSE expresses the typical prediction error in the original units (minutes) and, like MSE, weights large errors more heavily.

Regression Assumptions

1. Linearity

residuals <- y_test - pred_lm

ggplot(data = data.frame(y_test = y_test, pred_lm = pred_lm), aes(x = y_test, y = pred_lm)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  labs(title = "Predicted vs Actual Values",
       x = "Actual Values",
       y = "Predicted Values") +
  theme_minimal()

ggplot(data.frame(pred_lm, residuals),
       aes(x = pred_lm, y = residuals)) +
  geom_point() +
  geom_smooth() +
  labs(title = "Residuals Spread Pattern",
       x = "Predicted Values",
       y = "Residuals") +
  theme_minimal()

Cool. Why? Because:

  1. The predicted vs actual values follow a roughly linear pattern along the diagonal
  2. The residuals bounce randomly around 0

2. Normality of Residuals

ggplot(data = data.frame(residuals), aes(x = residuals)) +
  geom_density(fill = "#56b4fc", col = "white", alpha = 0.5) +
  labs(title = "Density of Residuals",
       x = "Residuals",
       y = "Density") +
  theme_minimal()

Cool. Why? Because the residuals are roughly bell-shaped and centered around zero
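
A Q-Q plot is a complementary, somewhat stricter way to check normality; a quick sketch:

# Q-Q plot: points close to the reference line suggest approximately normal residuals
ggplot(data.frame(residuals), aes(sample = residuals)) +
  stat_qq() +
  stat_qq_line(color = "red") +
  labs(title = "Q-Q Plot of Residuals") +
  theme_minimal()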

3. Homoscedasticity of Residuals

ggplot(data = data.frame(pred_lm, residuals), aes(x = pred_lm, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  labs(title = "Residuals vs Predicted Values",
       x = "Predicted Values",
       y = "Residuals") +
  theme_minimal()

Cool. Why? Because the residuals form a random cloud around zero with no discernible pattern (in particular, no funnel shape)
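
To back up the visual check with a formal one, a Breusch-Pagan test could be run; a sketch using the lmtest package (not loaded above, so this is an addition to the original workflow):

# Breusch-Pagan test: a small p-value would indicate heteroscedasticity
library(lmtest)
bptest(model_lm)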

4. No Multicollinearity

vif(model_lm)
##                   GVIF Df GVIF^(1/(2*Df))
## trip_distance 1.016746  1        1.008338
## is_weekend    1.006300  1        1.003145
## hour          1.023017 23        1.000495

Cool. Why? Because no feature has a VIF of 10 or more (for the factors, look at GVIF^(1/(2*Df))); all values are close to 1

Conclusion

In conclusion, the model’s predictions exhibit a relatively small mean absolute error (MAE), indicating that, on average, the predictions are off by around 4.134 minutes from the actual values. The MAPE value of 36.954% suggests that the average percentage difference between predictions and actual values is moderate. The MSE and RMSE values provide further insight into the spread and magnitude of errors, with RMSE indicating that the model’s predictions have a typical error of around 5.602 minutes. Overall, while the model seems to be making reasonable predictions, further analysis and optimization could potentially improve its performance.

In my next Rpubs publication, I will extend this Time Arrival Estimation project by incorporating more robust machine learning algorithms, including Deep Learning architectures. The document will showcase how these advanced techniques improve prediction accuracy and reliability. By exploring a broader range of methods, the goal is to enhance the project’s predictive power and provide a comprehensive assessment of different models. Stay tuned on my Rpubs for insights into how Deep Learning and other potent algorithms can elevate time arrival estimation.