49999 New York taxi trips

In this project, we will analyze a random sample of 49999 New York journeys made in 2013. We will also use regression trees and random forests to build a model that can predict the locations and times when the biggest fares can be earned.

library(ggplot2)
library(dplyr)
library(tidyr)
library(ggplot2)
library(viridis)
library(tree)
library(lubridate)
library(randomForest)
library(ggmap)

taxi <- read.csv(file = "datasets/taxi.csv")
head(taxi)

##                          medallion      pickup_datetime pickup_longitude
## 1 4D24F4D8EF35878595044A52B098DFD2 2013-01-13T10:23:00Z        -73.94646
## 2 A49C37EB966E7B05E69523D1CB7BE303 2013-01-13T04:52:00Z        -73.99827
## 3 1E4B72A8E623888F53A9693C364AC05A 2013-01-13T10:47:00Z        -73.95346
## 4 F7E4E9439C46B8AD5B16AB9F1B3279D7 2013-01-13T11:14:00Z        -73.98137
## 5 A9DC75D59E0EA27E1ED328E8BE8CD828 2013-01-13T11:24:00Z        -73.96800
## 6 19BF1BB516C4E992EA3FBAEDA73D6262 2013-01-13T10:51:00Z        -73.98502
##   pickup_latitude trip_time_in_secs fare_amount tip_amount
## 1        40.77273               600         8.0        2.5
## 2        40.74041               840        18.0        0.0
## 3        40.77586                60         3.5        0.7
## 4        40.72473               720        11.5        2.3
## 5        40.76000               240         6.5        0.0
## 6        40.76341               540         8.5        1.7

# Renaming the location variables
# dropping any journeys with zero fares and zero tips
# and creating the total variable as the log sum of fare and tip
taxi <- taxi %>%
   rename(long = pickup_longitude, lat = pickup_latitude)  %>% 
   filter(fare_amount > 0 | tip_amount > 0) %>%
   mutate(total = log(fare_amount + tip_amount) )

Manhattan

While the dataset contains taxi trips from all over New York City, the bulk of the trips are to and from Manhattan

# Reducing the data to taxi trips starting in Manhattan
# Manhattan is bounded by the rectangle with 
# latitude from 40.70 to 40.83 and 
# longitude from -74.025 to -73.93
taxi <- taxi  %>% 
    filter(between(lat,  40.70, 40.83) & 
           between(long, -74.025, -73.93))

Where does the journey begin?

I’ll use ggmap package together with ggplot2 to visualize where in Manhattan people tend to start their taxi journeys.

# Retrieving a stored map object which originally was created by
# nycmap <- get_map("manhattan", zoom = 12, color = "bw")
manhattan <- readRDS("datasets/manhattan.rds")

# Drawing a density map with the number of journey start locations
ggmap(manhattan, darken=0.5)+
    scale_fill_viridis(option='plasma') +
    geom_bin2d(data = taxi, aes(x = long, y = lat), bins = 60, alpha = 0.6) +
    labs(x = 'Longitude', y = 'Latitude', fill = 'Journeys')

Predicting taxi fares using a tree

The map from the previous task showed that the journeys are highly concentrated in the business and tourist areas

We’re now going to use a regression tree to predict the total fare with lat and long being the predictors. The tree algorithm will try to find cutpoints in those predictors that results in the decision tree with the best predictive capability.

# Fitting a tree to lat and long
fitted_tree <- tree(total ~ lat + long, data = taxi)

# draw a diagram of the tree structure
plot(fitted_tree)
text(fitted_tree)

More predictors

The tree above looks a bit frugal, it only includes one split: It predicts that trips where lat < 40.7237 are more expensive, which makes sense as it is downtown Manhattan. But that’s it. It didn’t even include long as tree deemed that it didn’t improve the predictions

# Generate the three new time variables
taxi <- taxi %>% 
    mutate(hour = hour(pickup_datetime),
           wday = wday(pickup_datetime, label = TRUE),
           month = month(pickup_datetime, label = TRUE))

## Registered S3 method overwritten by 'cli':
##   method     from
##   print.tree tree

One more tree

I’ll fit a new regression tree where we include the new time variables.

fitted_tree <- tree(total ~ lat + long + hour + wday + month, data = taxi)

# draw a diagram of the tree structure
plot(fitted_tree)
text(fitted_tree)

# Summarizing the performance of the tree
summary(fitted_tree)

## 
## Regression tree:
## tree(formula = total ~ lat + long + hour + wday + month, data = taxi)
## Variables actually used in tree construction:
## [1] "lat"
## Number of terminal nodes:  2 
## Residual mean deviance:  0.3041 = 13910 / 45760 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.61900 -0.37880 -0.04244  0.00000  0.32660  2.69900

One tree is not enough

The regression tree has not changed after including the three time variables. This is likely because latitude is still the most promising first variable to split the data on, and after that split, the other variables are not informative enough to be included. A random forest model, where many different trees are fitted to subsets of the data, may well include the other variables in some of the trees that make it up.

# Fitting a random forest
fitted_forest <- randomForest(total ~ lat + long + hour + wday + month,
    data=taxi, ntree=80, sampsize=10000)

# Printing the fitted_forest object
fitted_forest

## 
## Call:
##  randomForest(formula = total ~ lat + long + hour + wday + month,      data = taxi, ntree = 80, sampsize = 10000) 
##                Type of random forest: regression
##                      Number of trees: 80
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 0.3029521
##                     % Var explained: 1.75

Plotting the predicted fare

Now, let’s take a look at the predictions of fitted_forest projected back onto Manhattan.

# Extracting the prediction from forest_fit
taxi$pred_total <- fitted_forest$predicted

# Plotting the predicted mean trip prices from according to the random forest
ggmap(manhattan, darken=0.5) +
    scale_fill_viridis(option = 'plasma') +
    stat_summary_2d(data=taxi, aes(x = long, y = lat, z = pred_total),
                    fun = mean, alpha = 0.6, bins = 60) +
    labs(x = 'Longitude', y = 'Latitude', fill = 'Log fare+tip')

Plotting the actual fare

Looking at the map with the predicted fares we see that fares in downtown Manhattan are predicted to be high, while midtown is lower. Let’s compare the map with the predicted fares with a new map showing the mean fares according to the data.

# Function that returns the mean *if* there are 15 or more datapoints
mean_if_enough_data <- function(x) { 
    ifelse( length(x) >= 15, mean(x), NA) 
}

# Plotting the mean trip prices from the data
ggmap(manhattan, darken=0.5) +
    stat_summary_2d(data=taxi, aes(x = long, y = lat, z = total),
                    fun = mean_if_enough_data,
                    alpha = 0.6, bins = 60) +
  scale_fill_viridis(option = 'plasma') +
  labs(x = 'Longitude', y = 'Latitude', fill = 'Log fare+tip')

Conclusion

By looking at the map, we can say that a taxi driver would earn more money if he picked up passengers downtown

Predicting Taxi Fares with Random Forests