Optimize Trip Duration

Libraries

library(tidyverse)   # data wrangling and ggplot2
library(ggplot2)     # grammar-of-graphics plotting
library(scales)      # axis and label formatting helpers
library(corrplot)    # correlation matrix plots
library(data.table)  # fast data manipulation
library(tibble)      # modern data frames
library(lubridate)   # date/time parsing and arithmetic
library(leaflet)     # interactive maps

Background

Urban transportation undergoes rapid and significant evolution every year. Since the advent of the Internet and smartphones, we have become increasingly connected and able to plan and optimize our travel. Along with that, large amounts of data are collected and used to improve the efficiency of existing transportation systems. Transportation companies like Uber use this data to revolutionize the taxi industry, lay the groundwork for controlled transportation structures, and build innovative systems like car-pooling.

In this project, we address the problem of optimizing travel times for taxi vehicles. It can be framed as the Traveling Salesman Problem (TSP), a well-known problem in computer science: find the shortest route that visits a given set of locations. Because the problem is computationally hard, optimization techniques are needed to find near-optimal solutions in reasonable time. Before optimizing, we first use machine learning to estimate the travel time between each pair of pick-up and drop-off locations. Then we use evolutionary algorithms, such as ant colony optimization and genetic algorithms, to find the best itinerary for the vehicles in the data set.
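
To make this concrete, below is a minimal sketch in R of the quantity such algorithms would minimise, assuming the estimated pairwise travel times are available as a matrix (the function name and the toy data are purely illustrative, not part of the actual pipeline):

# Total duration of a candidate visiting order, given a matrix of estimated
# travel times (in seconds) between every pair of locations
route_duration <- function(order, duration_matrix) {
  legs <- cbind(head(order, -1), tail(order, -1))  # consecutive (from, to) pairs
  sum(duration_matrix[legs])                       # add up the duration of each leg
}

# toy example with 4 locations and random estimated durations
set.seed(1)
d <- matrix(sample(100:900, 16), nrow = 4)
diag(d) <- 0
route_duration(c(1, 3, 2, 4), d)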

The data I worked with is the NYC Taxi and Limousine Commission Trip Record data from Kaggle.

train <- read_csv("data_input/train.csv")
train

This dataset has 11 variables:

  • id - a unique identifier for each trip
  • vendor_id - a code indicating the provider associated with the trip record
  • pickup_datetime - date and time when the meter was engaged
  • dropoff_datetime - date and time when the meter was disengaged
  • passenger_count - the number of passengers in the vehicle (driver entered value)
  • pickup_longitude - the longitude where the meter was engaged
  • pickup_latitude - the latitude where the meter was engaged
  • dropoff_longitude - the longitude where the meter was disengaged
  • dropoff_latitude - the latitude where the meter was disengaged
  • store_and_fwd_flag - this flag indicates whether the trip record was held in vehicle memory before being sent to the vendor because the vehicle did not have a connection to the server - Y = store and forward; N = not a store and forward trip
  • trip_duration - duration of the trip in seconds

Preprocessing Data

Missing Values

train %>% 
  is.na() %>% 
  colSums()
##                 id          vendor_id    pickup_datetime   dropoff_datetime 
##                  0                  0                  0                  0 
##    passenger_count   pickup_longitude    pickup_latitude  dropoff_longitude 
##                  0                  0                  0                  0 
##   dropoff_latitude store_and_fwd_flag      trip_duration 
##                  0                  0                  0

There are no missing values. Next, we need to adjust the data types.

glimpse(train)
## Rows: 1,458,644
## Columns: 11
## $ id                 <chr> "id2875421", "id2377394", "id3858529", "id350467...
## $ vendor_id          <dbl> 2, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 1, ...
## $ pickup_datetime    <dttm> 2016-03-14 17:24:55, 2016-06-12 00:43:35, 2016-...
## $ dropoff_datetime   <dttm> 2016-03-14 17:32:30, 2016-06-12 00:54:38, 2016-...
## $ passenger_count    <dbl> 1, 1, 1, 1, 1, 6, 4, 1, 1, 1, 1, 4, 2, 1, 1, 1, ...
## $ pickup_longitude   <dbl> -73.98215, -73.98042, -73.97903, -74.01004, -73....
## $ pickup_latitude    <dbl> 40.76794, 40.73856, 40.76394, 40.71997, 40.79321...
## $ dropoff_longitude  <dbl> -73.96463, -73.99948, -74.00533, -74.01227, -73....
## $ dropoff_latitude   <dbl> 40.76560, 40.73115, 40.71009, 40.70672, 40.78252...
## $ store_and_fwd_flag <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", "N"...
## $ trip_duration      <dbl> 455, 663, 2124, 429, 435, 443, 341, 1551, 255, 1...
train <- train %>%
  mutate(pickup_datetime = ymd_hms(pickup_datetime),    # make sure timestamps are POSIXct
         dropoff_datetime = ymd_hms(dropoff_datetime),
         vendor_id = factor(vendor_id),                 # treat codes as categorical
         passenger_count = factor(passenger_count))
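
As a quick sanity check, we can confirm that the factor conversion worked as intended:

levels(train$vendor_id)
table(train$passenger_count)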

EDA

Visualisations of feature distributions and their relations are key to understanding a data set, and they often open up new lines of inquiry. I always recommend examining the data from as many different perspectives as possible to notice even subtle trends and correlations.

In this section we will begin by having a look at the distributions of the individual data features.

We start with a map of NYC and overlay a manageable number of pickup coordinates to get a general overview of the locations and distances in question. For this visualisation we use the leaflet package, which includes a variety of cool tools for interactive maps. In this map you can zoom and pan through the pickup locations:

set.seed(1234)
foo <- sample_n(train, 8e3)

leaflet(data = foo) %>% addProviderTiles("Esri.NatGeoWorldMap") %>%
  addCircleMarkers(~ pickup_longitude, ~pickup_latitude, radius = 1,
                   color = "blue", fillOpacity = 0.3)

The map gives us an idea of what some of our distributions could look like. Let's start by plotting the target feature trip_duration:

train %>%
  ggplot(aes(trip_duration)) +
  geom_histogram(fill = "red", bins = 150) +
  scale_x_log10() +
  scale_y_sqrt()

Note the logarithmic x-axis and square-root y-axis.

We find:

The majority of rides follow a rather smooth distribution that looks almost log-normal, with a peak just short of 1000 seconds, i.e. about 17 minutes.

There are several suspiciously short rides with less than 10 seconds duration.
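
As a quick check (the 10-second cutoff is simply the threshold mentioned above), we can count how many such rides there are:

train %>%
  filter(trip_duration < 10) %>%
  nrow()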

Additionally, there is a strange delta-shaped peak in trip_duration just before the 1e5-second mark, and even a few trips far above it:

train %>%
  arrange(desc(trip_duration)) %>%
  select(trip_duration, pickup_datetime, dropoff_datetime, everything()) %>%
  head(10)

Those records would correspond to 24-hour trips and beyond, with a maximum of almost 12 days. I know that rush hour can be bad, but those values are a little unbelievable.
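
A quick unit conversion puts these outliers in perspective (assuming, as documented above, that trip_duration is in seconds):

1e5 / 3600                        # the peak near 1e5 seconds is roughly 28 hours
max(train$trip_duration) / 86400  # the longest trip expressed in days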

Over the year, the distributions of pickup_datetime and dropoff_datetime look like this:

p1 <- train %>%
  ggplot(aes(pickup_datetime)) +
  geom_histogram(fill = "red", bins = 120) +
  labs(x = "Pickup dates")
p1

p2 <- train %>%
  ggplot(aes(dropoff_datetime)) +
  geom_histogram(fill = "blue", bins = 120) +
  labs(x = "Dropoff dates")
p2

Pickup date/time vs trip_duration

How does the variation in trip numbers throughout the day and the week affect the average trip duration? Do quieter days and hours lead to faster trips? Here we include the vendor_id as an additional feature. Furthermore, for the hours of the day we add a smoothing layer to indicate the extent of the variation and its uncertainties:

dotw <- train %>%
  mutate(wday = wday(pickup_datetime, label = TRUE, week_start = 1)) %>%
  group_by(wday, vendor_id) %>%
  summarise(median_duration = median(trip_duration)/60) %>%
  ggplot(aes(wday, median_duration, color = vendor_id)) +
  geom_point(size = 4) +
  labs(x = "Day of the week", y = "Median trip duration [min]")
dotw

hotd <- train %>%
  mutate(hpick = hour(pickup_datetime)) %>%
  group_by(hpick, vendor_id) %>%
  summarise(median_duration = median(trip_duration)/60) %>%
  ggplot(aes(hpick, median_duration, color = vendor_id)) +
  geom_smooth(method = "loess", span = 1/2) +
  geom_point(size = 4) +
  labs(x = "Hour of the day", y = "Median trip duration [min]") +
  theme(legend.position = "none")
hotd