Previously, I analyzed a dataset of Uber trip data in New York City from 2014. The results of that analysis indicated usage peaked in the borough of Manhattan around the beginning and end of the workday during the workweek. So naturally I jumped at the opportunity to test this pattern further when presented with a project using a 50,000-observation sample of New York City taxi data from 2013.
However, this time I wanted to try my hand at some machine learning techniques, specifically regression trees and random forests.
The taxi.csv dataset can be downloaded here. Note that once downloaded, you may need to move it to your working directory.
# Loading the tidyverse
library(tidyverse)
# Reading in the taxi data
taxi <- read_csv("taxi.csv")
# Taking a look at the first few rows in taxi
head(taxi)
Parsed with column specification:
cols(
medallion = col_character(),
pickup_datetime = col_datetime(format = ""),
pickup_longitude = col_double(),
pickup_latitude = col_double(),
trip_time_in_secs = col_double(),
fare_amount = col_double(),
tip_amount = col_double()
)
| medallion | pickup_datetime | pickup_longitude | pickup_latitude | trip_time_in_secs | fare_amount | tip_amount |
|---|---|---|---|---|---|---|
| 4D24F4D8EF35878595044A52B098DFD2 | 2013-01-13 10:23:00 | -73.94646 | 40.77273 | 600 | 8.0 | 2.5 |
| A49C37EB966E7B05E69523D1CB7BE303 | 2013-01-13 04:52:00 | -73.99827 | 40.74041 | 840 | 18.0 | 0.0 |
| 1E4B72A8E623888F53A9693C364AC05A | 2013-01-13 10:47:00 | -73.95346 | 40.77586 | 60 | 3.5 | 0.7 |
| F7E4E9439C46B8AD5B16AB9F1B3279D7 | 2013-01-13 11:14:00 | -73.98137 | 40.72473 | 720 | 11.5 | 2.3 |
| A9DC75D59E0EA27E1ED328E8BE8CD828 | 2013-01-13 11:24:00 | -73.96800 | 40.76000 | 240 | 6.5 | 0.0 |
| 19BF1BB516C4E992EA3FBAEDA73D6262 | 2013-01-13 10:51:00 | -73.98502 | 40.76341 | 540 | 8.5 | 1.7 |
While this dataset has some great information, including dates, times, and geospatial data, it still needs a little cleaning to serve our models well. The longitude and latitude coordinates are simply renamed to something more compact, and observations are filtered to drop journeys with zero fares and zero tips. The big processing here is the creation of the variable total, defined as the log() of the sum of the fare amount and tip amount. log() is used to dial back the effects of any large outliers, such as someone taking a taxi from New York City to DC or leaving an unusually large tip. By using a logarithmic transformation we are not removing these outliers, just mitigating the risk of their effects on modeling.
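To see the effect concretely, here is a tiny illustration with made-up fare values (these numbers are not from the dataset):
# Hypothetical fares with one extreme outlier
fares <- c(3.5, 6.5, 11.5, 18.0, 250)
range(fares)       # min and max span a roughly 70x ratio
range(log(fares))  # on the log scale the spread is far tighter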
# Renaming the location variables,
# dropping any journeys with zero fares and zero tips,
# and creating the total variable as the log sum of fare and tip
taxi <- taxi %>%
  rename(lat = pickup_latitude, long = pickup_longitude) %>%
  filter(fare_amount > 0 | tip_amount > 0) %>%
  mutate(total = log(fare_amount + tip_amount))
As mentioned previously, analysis was done on a dataset of 4.5 million observations of Uber trips in New York City, and it was clear Manhattan saw the bulk of trips. Given this, and the fact that I wanted to keep runtimes short, we will focus on data originating in Manhattan. This is accomplished by simply filtering to trips that originated within the coordinates defined below, which encompass Manhattan.
# Reducing the data to taxi trips starting in Manhattan
# Manhattan is bounded by the rectangle with
# latitude from 40.70 to 40.83 and
# longitude from -74.025 to -73.93
taxi <- taxi %>%
filter(between(lat, 40.70, 40.83) & between(long, -74.025, -73.93))
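As a quick sanity check (an addition, not part of the original analysis), the coordinate ranges of the filtered data should now sit inside the bounding box:
# Verifying the filter: both ranges should fall within the box above
range(taxi$lat)
range(taxi$long)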
The ggmap and viridis packages are loaded for this step; both allow for strong, colorful visualizations. Along with these packages, a stored map of Manhattan is read in from file as well; the link to the manhattan.rds file can be found here. Using ggmap and applying the beautiful plasma theme from viridis, we can see the hotspots for trip origination. Downtown and Midtown see more trip originations than Uptown. For reference, the grayish area in the middle, which coincides with a decline in intensity, is Central Park.
# Loading in ggmap and viridis for nice colors
library(ggmap)
library(viridis)
# Retrieving a stored map object which originally was created by
# manhattan <- get_map("manhattan", zoom = 12, color = "bw")
manhattan <- readRDS("manhattan.rds")
# Drawing a density map with the number of journey start locations
ggmap(manhattan, darken = 0.5) +
  scale_fill_viridis(option = 'plasma') +
  geom_bin2d(data = taxi, aes(x = long, y = lat), bins = 60, alpha = 0.6) +
  labs(x = 'Longitude', y = 'Latitude', fill = 'Journeys')
Our attempt at modeling taxi fares in New York City starts out simple, with a single regression tree using just two independent variables: longitude and latitude. The resulting tree may be a little cumbersome to interpret, but it's simpler than it seems. The tree tells us that when trips originate below the latitude threshold of 40.7237, the total variable tends to be higher. Notice too that the tree makes no splits on longitude, deeming that information irrelevant. Simply put, this single regression tree says the largest fares originate Downtown, and longitude is of little importance.
# Loading in the tree package
library(tree)
# Fitting a tree to lat and long
fitted_tree <- tree(total~lat+long, taxi)
# Draw a diagram of the tree structure
plot(fitted_tree)
text(fitted_tree)
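If the plotted tree is hard to read, the split rules can also be printed as text; this is a small optional step, not part of the original walkthrough.
# Printing the tree to read the split rule (e.g., lat < 40.7237) directly
print(fitted_tree)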
Our single regression tree with just coordinates as predictors fails to control for other variables that likely affect our model. As seen in the analysis of Uber data, space plays a role, but so does time. As such, we load the lubridate package and create some new independent variables to control for time when modeling total, our outcome variable. These new time-oriented variables are then plugged into our regression tree.
# Loading in the lubridate package
library(lubridate)
# Generate the three new time variables
taxi <- taxi %>%
mutate(hour = hour(pickup_datetime),
wday = wday(pickup_datetime, label = TRUE),
month = month(pickup_datetime, label = TRUE))
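# An added optional check (not in the original walkthrough):
# peek at the new time columns on a few rows
head(select(taxi, pickup_datetime, hour, wday, month))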
# Fitting a tree with total as the outcome and
# lat, long, hour, wday, and month as predictors
fitted_tree <- tree(total~lat+long+hour+wday+month, taxi)
# Draw a diagram of the tree structure
plot(fitted_tree)
text(fitted_tree)
# Summarizing the performance of the tree
summary(fitted_tree)
Regression tree:
tree(formula = total ~ lat + long + hour + wday + month, data = taxi)
Variables actually used in tree construction:
[1] "lat"
Number of terminal nodes: 2
Residual mean deviance: 0.3041 = 13910 / 45760
Distribution of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.61900 -0.37880 -0.04244 0.00000 0.32660 2.69900
It turns out the addition of time variables to our regression tree yielded no difference: the output is exactly the same. This confirms that the latitude where trips originate is quite important, but it doesn't mean our other variables have no impact. Rather, the model is telling us that latitude is certainly a great first place to split the tree, and that the other variables are not informative enough to earn splits of their own. Still, it seems risky to base decisions on two runs of a single regression tree. To mitigate this risk we can run additional tree models. The concept is similar to diversifying a portfolio, or not putting all of your eggs in one basket: the feedback and predictions we get from many trees will likely bear more fruit than those from just one.
Here we run randomForest() with all of our variables, growing 80 trees on samples of 10,000 observations each. To compare performance between our single tree and our forest we can look at two statistics in the output: the "Residual mean deviance" of fitted_tree is comparable to the "Mean of squared residuals" of fitted_forest. Both measure residuals, the differences between predictions and outcomes, which can be characterized as errors. fitted_forest does have a slightly lower reading, indicating its mild superiority in modeling this particular data.
Essentially, neither fitted_tree nor fitted_forest is a good predictive model in this scenario. This does not mean the techniques themselves are poor; rather, it indicates the models need more informative data to predict better. In fact, as can be seen from the fitted_forest output, the model explains only ~3% of the variance in our total variable. Having data on something like passenger income or propensity to tip would likely yield a more predictive model.
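One optional addition before fitting (not in the original run): because randomForest() resamples rows at random, setting a seed first makes the resampling repeatable across runs. The seed value below is arbitrary.
# Setting a seed for reproducible resampling
set.seed(42)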
# Loading in the randomForest package
library(randomForest)
# Fitting a random forest
fitted_forest <- randomForest(total~lat+long+hour+wday+month, taxi, ntree = 80, sampsize = 10000)
# Printing the fitted_forest object
fitted_forest
Call:
randomForest(formula = total ~ lat + long + hour + wday + month, data = taxi, ntree = 80, sampsize = 10000)
Type of random forest: regression
Number of trees: 80
No. of variables tried at each split: 1
Mean of squared residuals: 0.2995075
% Var explained: 2.86
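As a hedged follow-up (not shown in the original output), the randomForest package can also report how much each predictor contributed to the splits, which should confirm that latitude does most of the work here:
# Inspecting variable importance (IncNodePurity for regression forests)
importance(fitted_forest)
varImpPlot(fitted_forest)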
Despite the relatively poor predictive ability of our random forest model, we can still visualize its predictions across Manhattan. Here the aforementioned ggmap and viridis packages are put to use as before, but this time stat_summary_2d() is used instead of geom_bin2d(). This lets the summary function mean() be applied to our z variable, pred_total, so the map shows the average predicted value of the outcome variable, total, in each bin. Notice also that the mapping ignores our time variables, showing the prediction as a function of geography, not time.
# Extracting the prediction from fitted_forest
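# Note: for randomForest objects, $predicted holds out-of-bag predictions,
# so each row is predicted only by trees that never sampled it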
taxi$pred_total <- fitted_forest$predicted
# Plotting the predicted mean trip prices according to the random forest
ggmap(manhattan, darken = 0.5) +
  scale_fill_viridis(option = 'plasma') +
  stat_summary_2d(data = taxi, aes(x = long, y = lat, z = pred_total), bins = 60, alpha = 0.6, fun = mean) +
  labs(x = 'Longitude', y = 'Latitude', fill = 'Predicted')
With our fitted_forest predictions mapped, we can map the actual data as well for comparison. The predicted mapping indicates the largest total amounts originate farther south on the island, in Downtown Manhattan. Intuitively this makes sense, as the "hottest" area of the predicted mapping coincides with Wall Street and the Financial District, where passenger incomes are likely among the highest. Our mapping of actual totals gives some credence to this pattern from fitted_forest; it would be fair to assess the two maps as similar but not congruent.
# Function that returns the mean *if* there are 15 or more datapoints
mean_if_enough_data <- function(x) {
ifelse( length(x) >= 15, mean(x), NA)
}
# Plotting the mean trip prices from the data
ggmap(manhattan, darken = 0.5) +
  scale_fill_viridis(option = 'plasma') +
  stat_summary_2d(data = taxi, aes(x = long, y = lat, z = total), bins = 60, alpha = 0.6, fun = mean_if_enough_data) +
  labs(x = 'Longitude', y = 'Latitude', fill = 'Total')
While random forest models can be an incredible predictive tool, they are only as good as the data that fuels them. In this instance, we were able to get a model that reasonably predicts which area of Manhattan the largest fares originate from: Wall Street. But the model's ability to accurately predict fares across the area is not strong enough to support granular decisions. Having more information on passenger traits could be useful. Or, instead of individual information, using geospatial data to map coordinates to rents or the average compensation of offices at each location could give us better predictive data.
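As a final sketch, one simple way to put a number on that limited accuracy is the root mean squared error of the out-of-bag predictions on the log scale; it should come out near the square root of the "Mean of squared residuals" reported above.
# RMSE of the out-of-bag predictions (log scale)
sqrt(mean((taxi$total - taxi$pred_total)^2))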