All bikeshare system data are obtained from the Capital Bikeshare website, as per their license agreement. The full usage data set is quite large (~4 GB), so plan for long run times, or adjust the code accordingly, when running it.
Data come from OpenStreetMap and from Capital Bikeshare (as described above). At the request of OSM, here’s the attribution: “© OpenStreetMap contributors”. Data derived from OpenStreetMap should remain under the Open Database License.
Thanks to RColorBrewer for providing the color scales.
Location of each station as of the end of 2020
Departures in 2020: colors are by quintiles.
Net flow appears more muted (less extreme) in 2020 than in 2019.
Net flow 2019
Net flow 2020
Median ride durations are longer from bike stations that are farther away from the city center or metro stations. In 2020, ride durations may be slightly longer on average compared to pre-pandemic years. Also, in 2020, bike stations in the suburbs near metro stations see longer median durations than in 2019, which suggests the possibility of changes in aggregate rider behavior away from using bikeshare to commute via metro during the pandemic.
Median Ride Durations, 2019
Median Ride Durations, 2020
We also see a stark difference in the distribution of standard deviations in 2020, which suggests that over the course of the year, stations experienced much wider variation in the durations of rides departing from them than in pre-pandemic years.
The changes in median duration from each station play out between the 10th and 12th weeks of 2020, where the distribution of ride lengths becomes more spread out quite quickly.
Destination Parity is a measure of how evenly rides from a single station are distributed across its destinations. The index tells us whether all rides from a bikeshare station are concentrated among just a few destination stations or spread out evenly among all actual destinations. For those familiar with measuring income distribution, the principle and calculation are the same: the Gini Index. In our case, a score of 0 indicates perfect equality among rides going to destination stations, and a score of 1 means that virtually all rides end up at a single station.
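As a rough illustration of the calculation described above, the following is a minimal sketch of a Gini index over the ride counts from one origin station. The function name `gini`, the rank-based formula, and the choice to return `NA` when a station has fewer than two destinations are assumptions made for illustration; the report's own code may differ.

```r
# Gini index of non-negative ride counts, one entry per destination station.
# Assumption: only "actual" destinations (count > 0) enter the calculation.
gini <- function(counts) {
  x <- sort(counts[counts > 0])      # ascending order
  n <- length(x)
  if (n < 2) return(NA_real_)        # no distribution to compare (assumed convention)
  # Rank-based formula: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
  2 * sum(seq_len(n) * x) / (n * sum(x)) - (n + 1) / n
}

even_rides         <- gini(c(25, 25, 25, 25))  # rides spread evenly
concentrated_rides <- gini(c(97, 1, 1, 1))     # rides concentrated at one destination
print(c(even = even_rides, concentrated = concentrated_rides))
```

With only four destinations the index cannot exceed 0.75, so the score approaches 1 only when rides are concentrated and the number of destinations is large.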
Most stations score between 0.5 and 0.75, meaning that there’s a sizable inequity among destination stations on a yearly basis.
Departure Inequity and Arrival Inequity indices correlate pretty well, as one might expect.
The percent of departures that end up at the top 5% of destination stations seems to be fairly predictive of the Gini measurement, at least at the station-year level.
But standard deviation doesn’t seem to be as good a measurement as the Gini.
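The top-5% share mentioned above could be computed along these lines. This is a sketch under assumptions: the function name `top_share` is hypothetical, "top 5%" is taken to mean the busiest 5% of a station's actual destinations, and the number of stations is rounded up so at least one is always included.

```r
# Share of a station's departures that end at its busiest destinations.
# Assumption: the top group is ceiling(5% of actual destinations), minimum 1.
top_share <- function(counts, top = 0.05) {
  x <- sort(counts[counts > 0], decreasing = TRUE)
  k <- max(1, ceiling(top * length(x)))
  sum(x[seq_len(k)]) / sum(x)
}

# Made-up example: 30 destinations, 100 rides, two dominant destinations.
example_counts <- c(50, 22, rep(1, 28))
example_share  <- top_share(example_counts)
print(example_share)
```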
We also notice a slight upward trend in 2020 between Destination Gini and median duration, which is opposite the trends of pre-pandemic years.
What does the regression look like?
It looks like departure parity does help predict median ride duration, with higher Gini coefficients suggesting shorter median durations.
However, the patterns of member usage at the station appear to be a much better predictor.
This pattern holds true across all years, but the year 2020 saw significantly lower median ride durations, holding all other factors constant.
##
## Call:
## lm(formula = dur_med ~ dep_ineq + member_pct + metro + as.factor(year),
## data = sum_station_yr)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1147.7  -136.7   -27.9   116.9  3404.5
##
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)          2237.14      52.04  42.987  < 2e-16 ***
## dep_ineq             -255.94      61.74  -4.146 3.52e-05 ***
## member_pct          -1672.84      36.02 -46.447  < 2e-16 ***
## metroTRUE             -75.58      14.64  -5.162 2.66e-07 ***
## as.factor(year)2018    69.57      16.65   4.178 3.05e-05 ***
## as.factor(year)2019   161.75      16.54   9.780  < 2e-16 ***
## as.factor(year)2020   -96.95      17.84  -5.434 6.10e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 263.3 on 2205 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.5578, Adjusted R-squared: 0.5566
## F-statistic: 463.5 on 6 and 2205 DF, p-value: < 2.2e-16
In 2020, the percentage of rides from each station going to a station within 250 meters of a metro station appears lower than in previous years.
This is corroborated by a bivariate regression, which shows that the share of rides in 2020 going to a “near-metro” bikeshare station is about 8 percentage points lower (p<0.001), on average, than in 2017, across all departing bikeshare stations.
##
## Call:
## lm(formula = metro_end_pct ~ as.factor(year), data = sum_station_yr)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.30313 -0.08713 -0.00514  0.06986  0.70552
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)          0.298244   0.006450  46.236   <2e-16 ***
## as.factor(year)2018  0.004885   0.008934   0.547    0.585
## as.factor(year)2019 -0.003768   0.008736  -0.431    0.666
## as.factor(year)2020 -0.081100   0.008592  -9.440   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1418 on 2208 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.06303, Adjusted R-squared: 0.06175
## F-statistic: 49.51 on 3 and 2208 DF, p-value: < 2.2e-16
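To make the size of the 2020 effect concrete, the coefficients printed above can be read off directly. The sketch below assumes that `metro_end_pct` is stored as a fraction between 0 and 1 and that 2017 is the omitted baseline year, so the intercept is the 2017 mean; the variable names are illustrative only.

```r
# Coefficients copied from the regression output above.
baseline_2017 <- 0.298244    # intercept: mean near-metro share in the baseline year
effect_2020   <- -0.081100   # coefficient on as.factor(year)2020

share_2020      <- baseline_2017 + effect_2020          # implied 2020 mean share
point_change    <- 100 * effect_2020                    # change in percentage points
relative_change <- 100 * effect_2020 / baseline_2017    # change relative to the baseline

print(round(c(share_2020 = share_2020,
              point_change = point_change,
              relative_change = relative_change), 3))
```

The drop of roughly 8 percentage points corresponds to a relative decline of a bit more than a quarter of the baseline share.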
In 2020, we also see a more accentuated relationship between membership and metro-going percentages: the higher the proportion of members among users who check out a bike at a station, the more likely rides from that station are headed to another station near a metro.
The regression below shows that, in addition to the year, membership ratios are also statistically significant in predicting going-to-metro ratios: holding the year effect constant, a 10-percentage-point increase in a station’s membership ratio is associated, on average, with about a 2.4-percentage-point increase in the share of departing rides that end up close to a metro station.
##
## Call:
## lm(formula = metro_end_pct ~ member_pct + as.factor(year), data = sum_station_yr)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.33091 -0.08638 -0.01481  0.06814  0.66909
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)          0.115360   0.015420   7.481 1.06e-13 ***
## member_pct           0.240644   0.018567  12.961  < 2e-16 ***
## as.factor(year)2018 -0.004315   0.008643  -0.499  0.61768
## as.factor(year)2019 -0.025094   0.008582  -2.924  0.00349 **
## as.factor(year)2020 -0.027868   0.009246  -3.014  0.00261 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1367 on 2207 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.1293, Adjusted R-squared: 0.1277
## F-statistic: 81.94 on 4 and 2207 DF, p-value: < 2.2e-16
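The membership effect can be worked out from the `member_pct` coefficient printed above. This short sketch assumes both `member_pct` and `metro_end_pct` are fractions between 0 and 1, which the residual range suggests but the output does not state outright.

```r
# Coefficient copied from the regression output above.
b_member     <- 0.240644   # change in near-metro share per unit of member_pct
delta_member <- 0.10       # a 10-percentage-point rise in the membership ratio

# Implied change in the near-metro share, in percentage points.
metro_point_change <- 100 * b_member * delta_member
print(metro_point_change)
```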
There are notable day-to-day variations in daily ride patterns from stations.
Taking a somewhat random sample of 10 stations, we see considerable day-to-day variation in departures, compared to the modeled or averaged value (smoothed line).
The following indicators demonstrate that, while yearly averages of station-level indicators may be useful, they hide noticeable variations in day-to-day aggregate behaviors.
Ideally, we’d like to visualize the above graph for all 600+ stations. But since we can’t do that easily, we can use a numeric approximation for the amount of ‘zig-zaggy-ness’ in each of the day-to-day lines. For this, I use standard deviation to measure variations in day-to-day figures at each station. A low standard deviation means that the indicator doesn’t change much over the course of the year, while higher standard deviations indicate more volatility in usage patterns.
The key finding here is that all indicators have standard deviations reasonably above zero, enough to justify a further look into daily aggregate usage patterns.
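A per-station standard deviation of a daily indicator can be sketched as follows. The data frame `daily` and its values are made up for illustration, and the column names are assumptions rather than the names used in the report's data.

```r
# Toy daily departures for two hypothetical stations over five days.
daily <- data.frame(
  station    = rep(c("A", "B"), each = 5),
  departures = c(10, 10, 10, 10, 10,   # station A: perfectly steady
                 2, 18, 5, 20, 5)      # station B: zig-zags around the same mean
)

# One standard deviation per station: higher means more day-to-day volatility.
sd_by_station <- tapply(daily$departures, daily$station, sd)
print(sd_by_station)
```

Both stations average 10 departures a day, so a yearly mean would not distinguish them, while the standard deviation does.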
Daily Median Ride Duration
Net Flow
Member Percentage
Departures
Arrivals
Number of Destination Stations
Number of Arrival Stations
Median durations from stations appear higher on weekends throughout the year. In the maps below, we also see that median durations are longer across the whole area, but most notably in the suburbs.
Do key statistics change across different levels of flow in or out of the station?
In the years prior to the pandemic, there were considerably more than 3 million rides over the course of each year, or over 8,000 rides per day, on average.
However, in 2020, there were only around 2.25 million rides.
In pre-pandemic years, the months of April through October see the highest monthly rides. In 2020, the months of January and February had monthly ride tallies in line with pre-pandemic years, but April of 2020 saw a drastic decrease in the number of rides compared to pre-pandemic years. Monthly rides recovered after the spring dip but stayed below pre-pandemic levels.
Ridership in 2020 across the days of the week is markedly different from pre-pandemic years. In 2020, the most rides occur on weekends, while in previous years, the days with the highest average daily rides were weekdays. Furthermore, the percent of rides that are taken by members is lower than that of pre-pandemic years, and the difference is most stark on the weekends.
When breaking out by month, we see that in the early months of 2020, before the outbreak in the US, the weekly riding patterns were superficially on par with those in recent years. However, ridership patterns changed drastically after the onset of the pandemic. The biggest differences between 2020 and pre-pandemic years occur in the early months of the pandemic: April and May.
The busiest times of day are the morning and afternoon rush hour periods – and this pattern has largely held so far in 2020.
The percent of users that are members is quite high during rush hour periods, but this trend appears more muddled in 2020.
When accounting for the day of week, we do see a noticeable uptick in the member percentage during the morning rush hour on weekdays even during the pandemic.
2019
2020
Rainy Days
Temperature Days
Max Temperature
Precipitation
We notice a general non-linear pattern: the relationship between max temperature and the number of rides is generally positive and linear, except when the temperature reaches ~30 degrees (Celsius), after which the relationship weakens or even becomes negative.
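One way to capture a slope that changes around 30 degrees is a linear model with a hinge term. The sketch below uses simulated data, not the bikeshare data, and the knot at 30 °C, the variable names, and the simulated slopes are all assumptions chosen to mimic the pattern described above.

```r
set.seed(42)

# Simulated daily maximum temperatures and ride counts (illustrative only).
tmax  <- runif(300, 0, 38)
rides <- 2000 + 250 * tmax - 400 * pmax(tmax - 30, 0) + rnorm(300, sd = 300)

# The hinge term pmax(tmax - 30, 0) is zero below 30 degrees and linear above,
# so its coefficient measures how much the slope changes past the knot.
fit <- lm(rides ~ tmax + pmax(tmax - 30, 0))

slope_below <- coef(fit)[[2]]                    # rides per degree below 30
slope_above <- coef(fit)[[2]] + coef(fit)[[3]]   # rides per degree above 30
print(c(below = slope_below, above = slope_above))
```

A smoother such as a spline or LOESS curve would avoid fixing the knot in advance; the hinge is used here only because it keeps the two slopes easy to read.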
We also notice that in 2020, there’s a distinctly lower number of rides.
A similar pattern emerges for median duration as with number of rides, except rides in 2020 were longer in aggregate terms.
For all non-pandemic years, there seems to be a weak relationship between max temperature and the equity of duration distributions. However, in 2020, higher temperatures are associated with a very unevenly distributed ride length distribution, suggesting that the longer median ride length (noted above) is driven by a very uneven distribution of ride lengths.
An indicator relationship seems to be appropriate for precipitation (either there was rain over n mm or there wasn’t). A linear relationship doesn’t seem to be appropriate: there is a lot of variation in the 0-5 mm range.
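The indicator approach to precipitation can be sketched as below. The threshold of 1 mm stands in for the unspecified cutoff n, and the precipitation and ride values are made up for illustration.

```r
# Hypothetical cutoff (the "n" above) and made-up daily observations.
threshold_mm <- 1
precip_mm    <- c(0, 0.3, 1.2, 0, 7.5, 4.9)
rides        <- c(9000, 8800, 6100, 9300, 5200, 5900)

# TRUE when daily precipitation exceeds the cutoff, FALSE otherwise.
rainy <- precip_mm > threshold_mm

# With a single indicator, the coefficient is the difference in mean rides
# between rainy and non-rainy days.
rain_fit    <- lm(rides ~ rainy)
rain_effect <- coef(rain_fit)[["rainyTRUE"]]
print(rain_effect)
```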