Introduction
This is an exploratory analysis, produced in R, of Uber pick-up statistics from six Uber bases in New York City. The data and their description are available at https://www.kaggle.com/fivethirtyeight/uber-pickups-in-new-york-city.
Cohort Analysis
The data used here are aggregated daily Uber trip statistics for January and February 2015 (Uber-Jan-Feb-FOIL.csv).
A glimpse of the dataset:
dispatching_base_number | date | active_vehicles | trips |
---|---|---|---|
B02512 | 1/1/2015 | 190 | 1132 |
B02765 | 1/1/2015 | 225 | 1765 |
B02764 | 1/1/2015 | 3427 | 29421 |
B02682 | 1/1/2015 | 945 | 7679 |
B02617 | 1/1/2015 | 1228 | 9537 |
B02598 | 1/1/2015 | 870 | 6903 |
Before creating cohorts for Uber trips, we first plot the daily statistics per Uber base.
The plots show that the scale differs greatly between Uber bases. B02764 is the largest base. B02598, B02617 and B02682 are medium-sized, while B02512 and B02765 are much smaller. This suggests grouping the bases into three cohorts: Large, Medium and Small.
Because of the large difference in scale, it is difficult to compare Uber bases directly on trips and active_vehicles. Instead, we can define a new statistic that reflects vehicle efficiency, the pick-up ratio, i.e. the average number of trips per active vehicle: \[ratio = \frac{trips}{active\_vehicles}\] This measure may also be one that Uber bases care about when maximizing their revenue.
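As a quick illustration of the pick-up ratio defined above, the sketch below applies it to the six sample rows shown in the dataset glimpse. It is written in Python with the standard library only, not the R code behind this report, and the function name pickup_ratio is a hypothetical helper introduced here.

```python
# Sample rows copied from the dataset glimpse above:
# (dispatching_base_number, date, active_vehicles, trips)
rows = [
    ("B02512", "1/1/2015", 190, 1132),
    ("B02765", "1/1/2015", 225, 1765),
    ("B02764", "1/1/2015", 3427, 29421),
    ("B02682", "1/1/2015", 945, 7679),
    ("B02617", "1/1/2015", 1228, 9537),
    ("B02598", "1/1/2015", 870, 6903),
]


def pickup_ratio(trips, active_vehicles):
    """Average number of trips per active vehicle (hypothetical helper)."""
    return trips / active_vehicles


# One ratio per base; the sample covers a single day, so bases are unique keys.
ratios = {
    base: pickup_ratio(trips, vehicles)
    for base, _date, vehicles, trips in rows
}

for base, ratio in sorted(ratios.items()):
    print(f"{base}: {ratio:.2f}")
```

On this single day the ratios of the six bases lie much closer together than their raw trip counts, which is the reason for comparing bases on the ratio rather than on trips.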
The plot of the pick-up ratio shows a weekly pattern similar to that in the previous plots, and similar groups of Uber bases. This suggests that we can create cohorts for Uber trips according to both the day of the week and the Uber base group.
After creating the cohorts, we aggregate the pick-up ratio within each cohort and tabulate the result:
weekday | Large | Medium | Small |
---|---|---|---|
Monday | 7.923580 | 8.334522 | 7.014960 |
Tuesday | 7.970542 | 8.387070 | 7.050534 |
Wednesday | 7.974457 | 8.506846 | 7.012574 |
Thursday | 8.586777 | 9.077049 | 7.572259 |
Friday | 8.915295 | 9.359747 | 7.793988 |
Saturday | 10.482902 | 10.560899 | 8.738895 |
Sunday | 9.210226 | 9.310403 | 7.775599 |
The pick-up ratio starts the week near its lowest level, stays roughly flat through Wednesday, and then rises to a peak on Saturday. ratio appears to be positively related to active_vehicles, in that the Large and Medium groups have a higher ratio than the Small group on every weekday, although Medium is slightly above Large throughout. Large and Medium follow a very similar pattern.
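The construction of the cohort table above can be sketched as follows, again in standard-library Python rather than the original R. The group assignment follows the scale comparison earlier in this section. Averaging the daily ratios within each cohort is an assumption, since the exact aggregation is not spelled out here. Only the six sample rows from the dataset glimpse are used, so the output covers a single Thursday and will not reproduce the table values.

```python
from collections import defaultdict
from datetime import datetime

# Base-to-group mapping, following the Large / Medium / Small split above.
BASE_GROUP = {
    "B02764": "Large",
    "B02598": "Medium", "B02617": "Medium", "B02682": "Medium",
    "B02512": "Small", "B02765": "Small",
}
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# (dispatching_base_number, date, active_vehicles, trips)
rows = [
    ("B02512", "1/1/2015", 190, 1132),
    ("B02765", "1/1/2015", 225, 1765),
    ("B02764", "1/1/2015", 3427, 29421),
    ("B02682", "1/1/2015", 945, 7679),
    ("B02617", "1/1/2015", 1228, 9537),
    ("B02598", "1/1/2015", 870, 6903),
]


def cohort_table(rows):
    """Mean pick-up ratio per (weekday, group) cohort.

    Assumption: daily ratios are averaged with equal weight.
    """
    cells = defaultdict(list)
    for base, date, vehicles, trips in rows:
        # datetime.weekday() returns 0 for Monday through 6 for Sunday.
        weekday = WEEKDAYS[datetime.strptime(date, "%m/%d/%Y").weekday()]
        cells[(weekday, BASE_GROUP[base])].append(trips / vehicles)
    return {cohort: sum(v) / len(v) for cohort, v in cells.items()}


table = cohort_table(rows)
for (weekday, group), mean_ratio in sorted(table.items()):
    print(f"{weekday:<9} {group:<6} {mean_ratio:.3f}")
```

Run on the full two-month file, the same grouping would yield one cell per weekday and group, as in the table.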
Hypothesis: marginal effect of active_vehicles on ratio is the same among Uber base groups
There seems to be a positive correlation between the pick-up ratio and active_vehicles. It is interesting to ask whether the marginal effect of active_vehicles on ratio is the same across Uber base groups. We assume a linear relation between ratio and active_vehicles and take a random-intercepts multilevel model as the null model, since the scatter plot makes clear that different groups start from different levels.
Null hypothesis: multilevel model with random intercepts and fixed slope \[ratio =\beta_{0i} + \beta_1 active\_vehicles + \epsilon\]
Alternative hypothesis: multilevel model with random intercepts and random slopes \[ratio =\beta_{0i} + \beta_{1i} active\_vehicles + \epsilon\]
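Written out with an observation index, the two specifications might look as follows. The symbols \(u_{0i}\), \(u_{1i}\), \(\tau_0^2\), \(\tau_1^2\), \(\tau_{01}\) and \(\sigma^2\) are notation introduced here for illustration, and the normality of the random effects is the usual assumption for such models rather than something stated in this report. Here \(i\) indexes the grouping unit and \(j\) the observations within it.

Null hypothesis, with \(\beta_{0i} = \beta_0 + u_{0i}\):
\[ratio_{ij} = (\beta_0 + u_{0i}) + \beta_1\, active\_vehicles_{ij} + \epsilon_{ij}, \qquad u_{0i} \sim N(0, \tau_0^2), \quad \epsilon_{ij} \sim N(0, \sigma^2)\]

Alternative hypothesis, with additionally \(\beta_{1i} = \beta_1 + u_{1i}\):
\[ratio_{ij} = (\beta_0 + u_{0i}) + (\beta_1 + u_{1i})\, active\_vehicles_{ij} + \epsilon_{ij}, \qquad \begin{pmatrix} u_{0i} \\ u_{1i} \end{pmatrix} \sim N\left(\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \tau_0^2 & \tau_{01} \\ \tau_{01} & \tau_1^2 \end{pmatrix}\right)\]

Under this parameterization the null model has four parameters (\(\beta_0\), \(\beta_1\), \(\tau_0^2\), \(\sigma^2\)), and the alternative adds two more (\(\tau_1^2\) and \(\tau_{01}\)).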
Run and compare the two models:
Model | AIC | BIC | deviance | DIC |
---|---|---|---|---|
Null hypothesis | 1153.9 | 1169.4 | 1145.9 | 1133.954 |
Alternative hypothesis | 1118.5 | 1141.8 | 1106.5 | 1093.024 |
In practice, the deviance information criterion (DIC) is often preferred for model selection among multilevel models. DIC, like the other goodness-of-fit measures, is lower for the alternative model, so we reject the null hypothesis in favor of the alternative and conclude that the marginal effect of active_vehicles on ratio is not the same across Uber base groups.
From the scatter plot with the fitted lines, we can see that smaller Uber bases tend to gain more trips from adding an extra vehicle.
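As a cross-check on the comparison table, the sketch below recovers the number of parameters each model implies from the relation AIC = deviance + 2k, and then forms a likelihood-ratio statistic from the two deviances. This test is a complement to the criteria reported above, not part of the original analysis. It assumes the reported deviances come from maximum-likelihood fits, and the chi-square reference distribution is only approximate when variance components are tested.

```python
import math

# Values copied from the model comparison table above.
models = {
    "Null hypothesis": {"AIC": 1153.9, "BIC": 1169.4, "deviance": 1145.9},
    "Alternative hypothesis": {"AIC": 1118.5, "BIC": 1141.8, "deviance": 1106.5},
}


def implied_parameters(model):
    """Number of parameters k implied by AIC = deviance + 2k."""
    return round((model["AIC"] - model["deviance"]) / 2)


k_null = implied_parameters(models["Null hypothesis"])
k_alt = implied_parameters(models["Alternative hypothesis"])

# Likelihood-ratio statistic: drop in deviance from the null to the alternative.
lr_stat = (models["Null hypothesis"]["deviance"]
           - models["Alternative hypothesis"]["deviance"])
df = k_alt - k_null

# For df = 2 the chi-square upper tail has the closed form exp(-x / 2);
# other degrees of freedom are deliberately not handled in this sketch.
p_value = math.exp(-lr_stat / 2) if df == 2 else None

print(f"k_null={k_null}, k_alt={k_alt}, LR={lr_stat:.1f}, df={df}, p={p_value:.2e}")
```

The implied parameter counts (4 and 6) match a random-intercepts model and a model that adds a random slope together with its covariance with the intercept. The large drop in deviance points in the same direction as the information criteria.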
Visualization
The previous analysis shows that the marginal effect of active_vehicles on ratio is heterogeneous across groups, but the causal relationship is unclear without more information. As Uber bases might serve different areas of New York City, visualizing the trips on a map might give us some indication. Since the location information for Uber trips in 2015 is coarse-grained, trip data from 2014, which have more precise location information, are used instead.
[Heatmaps of 2014 pick-up locations, one per Uber base, grouped as follows]
Small: B02512, B02765 (B02765 is missing in the dataset)
Medium: B02598, B02617, B02682
Large: B02764
A previous study found that lower-income areas have fewer pickups. The above heatmaps show that Uber base B02764 covers a wider part of the city, while the others concentrate in certain (possibly wealthier) areas. Assuming that the pick-up pattern is similar between 2014 and 2015, this evidence might explain the difference in the marginal effect of active_vehicles.