Uber Pickups in New York City

Shikun Li

2016-11-28 17:46

Introduction

This is an exploratory analysis, generated with R, of Uber pick-up statistics from six Uber bases in New York City. The data and their description can be found at https://www.kaggle.com/fivethirtyeight/uber-pickups-in-new-york-city.

Cohort Analysis

The data used here are the aggregated daily Uber trip statistics for January and February 2015, from the file Uber-Jan-Feb-FOIL.csv.

A glimpse of the dataset:

dispatching_base_number date active_vehicles trips
B02512 1/1/2015 190 1132
B02765 1/1/2015 225 1765
B02764 1/1/2015 3427 29421
B02682 1/1/2015 945 7679
B02617 1/1/2015 1228 9537
B02598 1/1/2015 870 6903
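
The original analysis was done in R. As an illustration only, the following Python sketch parses rows of this shape into typed records. It assumes the file is comma-separated with the header shown above and that dates are written month/day/year, as the sample suggests; the function name load_rows and the record keys are made up for this example.

```python
import csv
import io
from datetime import datetime

# The six rows shown above, written as CSV text. The real file
# (Uber-Jan-Feb-FOIL.csv) is ASSUMED to share this header and layout.
SAMPLE = """dispatching_base_number,date,active_vehicles,trips
B02512,1/1/2015,190,1132
B02765,1/1/2015,225,1765
B02764,1/1/2015,3427,29421
B02682,1/1/2015,945,7679
B02617,1/1/2015,1228,9537
B02598,1/1/2015,870,6903
"""


def load_rows(handle):
    """Parse CSV rows into dicts with typed fields (hypothetical helper)."""
    rows = []
    for rec in csv.DictReader(handle):
        rows.append({
            "base": rec["dispatching_base_number"].strip(),
            # ASSUMPTION: dates are month/day/year, as in the sample rows.
            "date": datetime.strptime(rec["date"].strip(), "%m/%d/%Y").date(),
            "active_vehicles": int(rec["active_vehicles"]),
            "trips": int(rec["trips"]),
        })
    return rows


rows = load_rows(io.StringIO(SAMPLE))
print(len(rows), rows[0])
```

To read the real file, the same function could be called on open("Uber-Jan-Feb-FOIL.csv", newline="") instead of the in-memory sample.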

Before creating cohorts for Uber trips, we first plot the daily statistics per Uber base.

The plots show that the scales of these Uber bases are very different. B02764 is the largest base. B02598, B02617 and B02682 are of medium scale, while B02512 and B02765 are much smaller. This suggests grouping the bases into three cohorts: Large, Medium and Small.

Because of the large difference in scale, it is difficult to compare Uber bases directly using trips and active_vehicles. Instead, we can define a new statistic that reflects the efficiency of the vehicles, the pick-up ratio, that is, the average number of trips per active vehicle:

\[ratio = \frac{trips}{active\_vehicles}\]

This measurement might also be something that Uber bases care about when maximizing their revenue.
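
As a worked example of the pick-up ratio, the Python sketch below applies the definition to the six rows for 1/1/2015 shown in the glimpse of the dataset. It is an illustration rather than the R code behind the report, and the helper name pickup_ratio is made up here.

```python
# (base, active_vehicles, trips) for 1/1/2015, copied from the dataset glimpse.
SAMPLE = [
    ("B02512", 190, 1132),
    ("B02765", 225, 1765),
    ("B02764", 3427, 29421),
    ("B02682", 945, 7679),
    ("B02617", 1228, 9537),
    ("B02598", 870, 6903),
]


def pickup_ratio(trips, active_vehicles):
    """Average number of trips per active vehicle (hypothetical helper)."""
    if active_vehicles <= 0:
        raise ValueError("active_vehicles must be positive")
    return trips / active_vehicles


ratios = {base: pickup_ratio(trips, vehicles) for base, vehicles, trips in SAMPLE}

# Print the bases from the highest to the lowest ratio on that day.
for base, value in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{base}: {value:.2f}")
```

Although the trip counts on this day differ by more than a factor of 25 between the largest and the smallest base, the ratios all lie between roughly 6 and 9, which is what makes the bases comparable on this scale.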

The plot of the pick-up ratio again shows a weekly pattern similar to that in the previous plots, and similar groups of Uber bases. This suggests that we can create cohorts for Uber trips according to both the day of the week and the Uber base group.

After creating the cohorts, we aggregate the pick-up ratio by weekday and cohort:

weekday Large Medium Small
Monday 7.923580 8.334522 7.014960
Tuesday 7.970542 8.387070 7.050534
Wednesday 7.974457 8.506846 7.012574
Thursday 8.586777 9.077049 7.572259
Friday 8.915295 9.359747 7.793988
Saturday 10.482902 10.560899 8.738895
Sunday 9.210226 9.310403 7.775599
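
The cohort table above can be sketched as a group-by over weekday and base group. The Python example below is a hedged illustration, not the author's R code: the cohort membership follows the grouping described earlier, but the aggregation rule (a plain mean of the daily ratios in each cell) is an assumption, since the text does not say how the ratios were aggregated. The function name cohort_table is made up here.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Cohort membership as described in the text.
COHORT = {
    "B02764": "Large",
    "B02598": "Medium", "B02617": "Medium", "B02682": "Medium",
    "B02512": "Small", "B02765": "Small",
}

# (base, date, active_vehicles, trips): the six sample rows for 1/1/2015.
SAMPLE = [
    ("B02512", date(2015, 1, 1), 190, 1132),
    ("B02765", date(2015, 1, 1), 225, 1765),
    ("B02764", date(2015, 1, 1), 3427, 29421),
    ("B02682", date(2015, 1, 1), 945, 7679),
    ("B02617", date(2015, 1, 1), 1228, 9537),
    ("B02598", date(2015, 1, 1), 870, 6903),
]


def cohort_table(rows):
    """Mean pick-up ratio per (weekday, cohort) cell.

    ASSUMPTION: each cell is the unweighted mean of the daily ratios.
    """
    cells = defaultdict(list)
    for base, day, vehicles, trips in rows:
        key = (WEEKDAYS[day.weekday()], COHORT[base])
        cells[key].append(trips / vehicles)
    return {key: mean(values) for key, values in cells.items()}


table = cohort_table(SAMPLE)
for (weekday, cohort), value in sorted(table.items()):
    print(weekday, cohort, round(value, 3))
```

The sample covers a single Thursday, so it only fills the Thursday row and will not reproduce the figures in the table, which are based on the full two months of data.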

The pick-up ratio is lowest at the start of the week (Monday to Wednesday), rises through Thursday and Friday, peaks on Saturday and falls back on Sunday. The larger cohorts generally have a higher ratio, which hints at a positive relation between ratio and active_vehicles, although the ordering is not strictly by size: Medium is slightly above Large on every weekday. Large and Medium follow a very similar pattern, and Small has a lower ratio than the other two throughout.

Hypothesis: marginal effect of active_vehicles on ratio is the same among Uber base groups

There seems to be a positive correlation between pick-up ratio and active_vehicles.

It is interesting to know whether the marginal effect of active_vehicles on ratio is the same across Uber base groups. Assume a linear relation between ratio and active_vehicles, and take a random-intercept multilevel model as the null hypothesis model, since the scatter plot clearly shows that different groups start from different levels.

Null hypothesis: multilevel model with random intercepts and fixed slope \[ratio =\beta_{0i} + \beta_1 active\_vehicles + \epsilon\]

Alternative hypothesis: multilevel model with random intercepts and random slopes \[ratio =\beta_{0i} + \beta_{1i} active\_vehicles + \epsilon\]
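
Written out with explicit indices, the two models can be read as follows. This is a sketch under the usual linear mixed model assumptions, which the text does not state: i is taken to index the Uber base group, j the daily observations within it, and the symbols \gamma_0, \gamma_1, u_{0i}, u_{1i}, \sigma^2 and \Sigma are notation introduced here for illustration.

\[ratio_{ij} = \beta_{0i} + \beta_{1i} active\_vehicles_{ij} + \epsilon_{ij}, \qquad \epsilon_{ij} \sim N(0, \sigma^2)\]

\[\beta_{0i} = \gamma_0 + u_{0i}, \qquad \beta_{1i} = \gamma_1 + u_{1i}, \qquad (u_{0i}, u_{1i})^\top \sim N(0, \Sigma)\]

Under the null hypothesis u_{1i} = 0 for every group, so \beta_{1i} reduces to the common slope \beta_1 = \gamma_1. The alternative therefore adds two parameters to the null model: the variance of u_{1i} and its covariance with u_{0i}.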

Run and compare the two models:

Model AIC BIC deviance DIC
Null hypothesis 1153.9 1169.4 1145.9 1133.954
Alternative hypothesis 1118.5 1141.8 1106.5 1093.024

In practice, the deviance information criterion (DIC) is often preferred for model selection among multilevel models. DIC, along with the other goodness-of-fit measures (AIC, BIC and deviance), is lower for the alternative model, so we reject the null hypothesis and conclude that the marginal effect of active_vehicles on ratio is not the same among Uber base groups.
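
The figures in the model comparison table can be cross-checked with a little arithmetic. The Python sketch below is an illustration, not part of the original R analysis. It uses the identity AIC = deviance + 2k to recover the number of estimated parameters k of each model, and then computes a likelihood-ratio statistic from the two deviances. The likelihood-ratio test is an addition here, not the criterion the text relies on, and it assumes both deviances come from maximum-likelihood fits. Because the extra variance parameter is tested on the boundary of its range, the plain chi-square p-value is usually regarded as conservative.

```python
import math

# Values copied from the model comparison table.
FITS = {
    "null": {"AIC": 1153.9, "BIC": 1169.4, "deviance": 1145.9},
    "alternative": {"AIC": 1118.5, "BIC": 1141.8, "deviance": 1106.5},
}


def implied_params(fit):
    """Number of parameters k implied by AIC = deviance + 2k."""
    return round((fit["AIC"] - fit["deviance"]) / 2)


def chi2_sf_even_df(x, df):
    """Upper tail of a chi-square distribution with an even number of
    degrees of freedom, using the closed-form finite sum."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * terms


k_null = implied_params(FITS["null"])
k_alt = implied_params(FITS["alternative"])

# Likelihood-ratio statistic: the drop in deviance from null to alternative.
lr_stat = FITS["null"]["deviance"] - FITS["alternative"]["deviance"]
df = k_alt - k_null
p_value = chi2_sf_even_df(lr_stat, df)

print(f"k_null={k_null}, k_alt={k_alt}, LR={lr_stat:.1f}, df={df}, p={p_value:.2e}")
```

The implied parameter counts differ by two, which matches a model that adds a random-slope variance and its covariance with the random intercept, and the resulting p-value is far below 0.001, in line with the conclusion drawn from DIC.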

From the scatter plot with the fitted lines, we can see that smaller Uber bases tend to see a larger increase in pick-up ratio from each additional active vehicle.

Visualization

From the previous analysis, we get the result of a heterogeneous marginal effect of active_vehicles on ratio, but the causal relationship is unclear without more information. As Uber bases might serve different areas of New York City, visualizing the trips on a map might give some indication. Since the location information for Uber trips in 2015 is coarse-grained, the 2014 trip data, which have more precise locations, are used.
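
A heatmap of pick-ups amounts to counting trips per grid cell on the map. The Python sketch below illustrates that binning step only; it is not the code behind the maps in this report. The coordinates are hypothetical points made up for the example, not rows from the Uber data, and the cell size of 0.01 degrees and the function name bin_pickups are likewise assumptions.

```python
import math
from collections import Counter

# HYPOTHETICAL (latitude, longitude) points, made up for illustration only.
# They are NOT taken from the Uber dataset.
POINTS = [
    (40.7580, -73.9855),
    (40.7585, -73.9850),
    (40.7128, -74.0060),
    (40.7306, -73.9866),
]


def bin_pickups(points, cell_deg=0.01):
    """Count points per square grid cell of side cell_deg degrees."""
    counts = Counter()
    for lat, lon in points:
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg))
        counts[cell] += 1
    return counts


grid = bin_pickups(POINTS)
for cell, n in grid.most_common():
    print(cell, n)
```

Running the same count separately for each base would give one grid per base, which is the kind of per-base comparison the following maps present.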

Small

B02512

B02765

B02765 is missing from the 2014 dataset.

Medium

B02598

B02617

B02682

Large

B02764

A previous study found that lower-income areas have fewer pickups. The heatmaps above show that Uber base B02764 covers a wider part of the city, while the other bases concentrate in certain (possibly wealthier) areas. Assuming the pick-up pattern is similar between 2014 and 2015, this might explain the difference in the marginal effect of active_vehicles.