Philadelphia Bikeshare Rebalancing Prediction

MUSA 508, Lab #5

Author

Nissim Lebovits

Published

November 15, 2023

1 Summary

Rebalancing is a common and important problem facing bike share systems around the globe. Below, we consider the value of a range of linear models incorporating spatiotemporal data to predict future rideshare demand for the Indego bikeshare network in Philadelphia. We find that spatiotemporal features have high predictive power for this use case and enable us to construct a model worth deploying. Although the predictions are still hampered by non-spatially random error, indicating the possibility of further improving the model, we recommend deploying the model as a useful tool in ride share balancing.

2 Introduction

Show the code

library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(tidyverse)
library(sf)
library(tmap)
library(sfdep)
library(gganimate)
library(caret)

options(tmap.mode = 'plot', scipen = 999, tigris_use_cache = TRUE)

tmap_options(basemaps = "Esri.WorldGrayCanvas") #set global tmap basemap

crs <- "epsg:2272"

plotTheme <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  #panel.border=element_rect(colour="#F0F0F0"),
  # Format the grid
  panel.grid.major=element_line(colour="#D0D0D0",size=.2),
  axis.ticks=element_blank())

mapTheme <- theme(plot.title =element_text(size=12),
                  plot.subtitle = element_text(size=8),
                  plot.caption = element_text(size = 6),
                  axis.line=element_blank(),
                  axis.text.x=element_blank(),
                  axis.text.y=element_blank(),
                  axis.ticks=element_blank(),
                  axis.title.x=element_blank(),
                  axis.title.y=element_blank(),
                  panel.background=element_blank(),
                  panel.border=element_blank(),
                  panel.grid.major=element_line(colour = 'transparent'),
                  panel.grid.minor=element_blank(),
                  legend.direction = "vertical", 
                  legend.position = "right",
                  plot.margin = margin(1, 1, 1, 1, 'cm'),
                  legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))

palette7 <- c("#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c", "#063970", "#04254c")
palette6 <- c("#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c", "#063970")
palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette2 <- c("#6baed6","#08519c")



stations_path <- "indego_data/indego-trips-2022-q4.csv"

suppressMessages(
stations <- read_csv(stations_path) %>%
  filter(!is.na(start_lon),
         !is.na(start_lat),
         !is.na(end_lon),
         !is.na(end_lat)) %>%
  st_as_sf(coords = c('start_lon', 'start_lat'), crs = 4326) %>%
  st_transform(crs = crs) %>%
                mutate(date = as.Date(strptime(start_time, "%m/%d/%Y %H:%M")),
                       week = week(date),
                       dotw = wday(date),
                       interval60 = as.POSIXct(start_time, format = "%m/%d/%Y %H"))
)



### PHL bounds and grid----------------------------------------
phl_path <- "https://opendata.arcgis.com/datasets/405ec3da942d4e20869d4e1449a2be48_0.geojson"
phl <- st_read(phl_path, quiet = TRUE) %>%
          st_transform(crs = crs)

### GET ACS

phlCensus <- get_acs(geography = "tract", 
                          variables = c("B01003_001", 
                                        "B19013_001", 
                                        "B02001_002", 
                                        "B08013_001",
                                        "B08012_001", 
                                        "B08301_001", 
                                        "B08301_010", 
                                        "B01002_001"), 
                          year = 2021, 
                          state = "PA", 
                          geometry = TRUE, 
                          county="Philadelphia",
                          output = "wide") %>%
                  rename(Total_Pop =  B01003_001E,
                         Med_Inc = B19013_001E,
                         Med_Age = B01002_001E,
                         White_Pop = B02001_002E,
                         Travel_Time = B08013_001E,
                         Num_Commuters = B08012_001E,
                         Means_of_Transport = B08301_001E,
                         Total_Public_Trans = B08301_010E) %>%
                  select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
                         Means_of_Transport, Total_Public_Trans,
                         Med_Age,
                         GEOID, geometry) %>%
                  mutate(Percent_White = White_Pop / Total_Pop,
                         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
                         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport) %>%
                  mutate(
                      nb = st_knn(geometry, 5), # need to impute census values for tracts with pop ~ 0
                      wt = st_weights(nb),
                      Med_Age = ifelse(is.na(Med_Age), purrr::map_dbl(find_xj(Med_Age, nb), mean, na.rm = TRUE), Med_Age),
                      Med_Inc = ifelse(is.na(Med_Inc), purrr::map_dbl(find_xj(Med_Inc, nb), mean, na.rm = TRUE), Med_Inc),
                      Travel_Time = ifelse(is.na(Travel_Time), purrr::map_dbl(find_xj(Travel_Time, nb), mean, na.rm = TRUE), Travel_Time),
                      Percent_White = ifelse(is.na(Percent_White), purrr::map_dbl(find_xj(Percent_White, nb), mean, na.rm = TRUE), Percent_White),
                      Mean_Commute_Time = ifelse(is.na(Mean_Commute_Time), purrr::map_dbl(find_xj(Mean_Commute_Time, nb), mean, na.rm = TRUE), Mean_Commute_Time),
                      Percent_Taking_Public_Trans = ifelse(is.na(Percent_Taking_Public_Trans), purrr::map_dbl(find_xj(Percent_Taking_Public_Trans, nb), mean, na.rm = TRUE), Percent_Taking_Public_Trans))

phlTracts <- 
  phlCensus %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  select(GEOID, geometry) %>% 
  st_sf %>%
  st_transform(crs = crs)


dat_census <- st_join(stations, phlTracts, join = st_intersects, left = TRUE) %>%
                      rename(Origin.Tract = GEOID) %>%
                      mutate(start_lon = unlist(map(geometry, 1)),
                             start_lat = unlist(map(geometry, 2))) %>%
                      as.data.frame() %>%
                      select(-geometry) %>%
                      st_as_sf(., coords = c("end_lon", "end_lat"), crs = crs) %>%
                      st_join(., phlTracts %>%
                                st_transform(crs=crs),
                              join=st_intersects,
                              left = TRUE) %>%
                      rename(Destination.Tract = GEOID)  %>%
                      mutate(end_lon = unlist(map(geometry, 1)),
                             end_lat = unlist(map(geometry, 2)))%>%
                      as.data.frame() %>%
                      select(-geometry)


### import weather--------------------------------------------------------------
weather.Panel <- 
  riem_measures(station = "PHL", date_start = "2022-10-01", date_end = "2023-01-05") %>%
  dplyr::select(valid, tmpf, p01i, sknt)%>%
  replace(is.na(.), 0) %>%
    mutate(date = as.Date(substr(valid,1,13)),
           week = week(date),
           dotw = wday(date),
           interval60 = ymd_h(substr(valid,1,13)))%>%
    group_by(interval60) %>%
    summarize(Temperature = max(tmpf),
              Precipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))


weekend <- c(1, 7)

Bike sharing systems have emerged as a popular, eco-friendly transportation alternative in cities worldwide. These systems provide a network of bicycles and docking stations, allowing users to pick up a bike from one location and drop it off at another, facilitating short, point-to-point trips. However, one significant challenge in these systems is maintaining a balanced distribution of bikes across the network. This is where bike re-balancing becomes crucial.

Due to daily commuting patterns and other socio-economic factors, certain areas in a city may experience a surplus of bikes while others face a shortage. For instance, residential areas might see a high number of bikes in the morning as people commute to work, leaving these stations depleted by evening. Conversely, business districts may have an excess of bikes during the day, which need to be redistributed for the evening commute home. Without effective re-balancing, the utility and efficiency of the bike share system are significantly reduced, leading to customer dissatisfaction and a decline in usage.

Below, we propose and evaluate a rebalancing strategy for Indego, the bike share system in Philadelphia, based on a spatiotemporal predictive model. Launched in 2015, Indego has 1,400 bikes distributed across 130 stations, mostly concentrated in Center City, West Philadelphia, and South Philadelphia.¹ Using Indego data from the 4th quarter of 2022, as well as supplemental weather data, we propose a predictive model optimized for overnight rebalancing with trucks, which aims to minimize expected customer dissatisfaction the next day.² Based on our predictions, an integer program could be used to optimize redistribution routes.³

Show the code

ggplot() +
  geom_sf(data = phlTracts) +
  geom_sf(data = stations %>%
            group_by(start_station) %>%
            summarize(count = n()),
          color = palette7[5],
          alpha = 0.4) +
  labs(title = "Distribution of Indego Stations in Philadelphia",
       subtitle = "December 2022")+
  mapTheme

3 Methods

3.1 Data

3.1.1 Overview

For our analysis, we rely on three datasets: 1) Indego station and ridership data from Q4 of 2022, 2) weather data collected at Philadelphia International Airport during those same dates and accessed via the riem package in R, and 3) American Community Survey data from 2021.

3.1.2 Exploratory Analysis

Examining our bike share data, we notice clear temporal trends: peaks during certain hours of the day and a longer-time decline in trips across the quarter.

Show the code

ggplot(dat_census %>%
         group_by(interval60) %>%
         tally())+
  geom_line(aes(x = interval60, y = n))+
  labs(title="Bike share trips per Hour",
       subtitle = "Philadelphia, Oct. - Dec., 2022",
       x="Date", 
       y="Number of trips")+
  plotTheme

Breaking these trends down, we notice peaks at certain times of day and certain days of the week. As may be expected, morning and afternoon rush hours are associated with the highest ridership volumes during the week. Overall ridership is lower on the weekend, and peaks during the midday instead of morning or afternoon.

Show the code

dat_census %>% mutate(hour = hour(interval60)) %>%
ggplot(aes(x = hour, color = as.factor(dotw)))+
     geom_freqpoly(binwidth = 1)+
  labs(title="Bike share trips in Philadelphia, by day of the week, Oct. - Dec., 2022",
       x="Hour", 
       y="Trip Counts",
       color = "Day of the Week")+
     plotTheme

Show the code

ggplot(dat_census %>% 
         mutate(hour = hour(interval60),
                weekend = ifelse(dotw %in% weekend, "Weekend", "Weekday")))+
     geom_freqpoly(aes(x = hour, color = weekend), binwidth = 1)+
  labs(title="Bike share trips in Philadelphia, Oct. - Dec., 2022, \nweekend vs. weekday",
       x="Hour", 
       y="Trip Counts",
       color = "Weekend?")+
     plotTheme

Although the average number of trips per hour per station remains around 1.3 no matter the time of day, a higher proportion of stations experience ridership demands exceeding this during the AM and PM rush as compared to overnight or during the middle of the day.

Show the code

dat_census %>%
        mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
         group_by(interval60, start_station, time_of_day) %>%
         tally()%>%
  group_by(start_station, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_density(aes(mean_trips, color = time_of_day, fill = time_of_day), alpha = 0.3)+
  labs(title="Mean Number of Hourly Trips Per Station. Philadelphia, Oct. - Dec., 2022",
       x="Number of trips", 
       y="Density",
       color = "Time of Day",
       fill = "Time of Day")+
  #scale_x_log10()
 facet_wrap(~time_of_day)+
  plotTheme

Additionally, these trends are born out spatially, as we observe that the highest ridership per station clusters in the downtown and University City areas during the afternoon rush hour.

Show the code

ggplot()+
  geom_sf(data = phlTracts %>%
          st_transform(crs=crs))+
  geom_point(data = dat_census %>% 
            mutate(hour = hour(interval60),
                weekend = ifelse(dotw > 5, "Weekend", "Weekday"),
                time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
              group_by(start_station, start_lat, start_lon, weekend, time_of_day) %>%
              tally(),
            aes(x=start_lon, y = start_lat, color = n), 
            fill = "transparent", alpha = 0.7, size = 0.7)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bike share trips per hr by station. Philadelphia, Oct. - Dec., 2022")+
  mapTheme

3.1.3 Feature Engineering

To build our model, we incorporate various predictors, including weather and spatial and temporal lag. Our main contribution in terms of feature engineering is to add temporal lags. This is important, as we find that they are strongly correlated with our dependent variable, Trip_Count.

Show the code

study.panel <- 
  expand.grid(interval60=unique(dat_census$interval60), 
              start_station = unique(dat_census$start_station)) %>%
  left_join(., dat_census %>%
              select(start_station, Origin.Tract, start_lon, start_lat)%>%
              distinct() %>%
              group_by(start_station) %>%
              slice(1))

ride.panel <- 
  dat_census %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval60, start_station, Origin.Tract, start_lon, start_lat) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  left_join(weather.Panel, by = "interval60") %>%
  ungroup() %>%
  filter(!is.na(start_station)) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60)) %>%
  filter(!is.na(Origin.Tract))

ride.panel <- ride.panel %>%
                st_as_sf(coords = c("start_lon", "start_lat"), crs = crs) %>%
                mutate(start_lon = as.numeric(st_coordinates(.)[,1]),
                       start_lat = as.numeric(st_coordinates(.)[,2])) %>%
                st_join((phlCensus %>% st_transform(crs = crs)))

# holidays are nov 23, dec 23, 24, 25, 30, 31
# how do I return the yday for the dates above?
holidays_22 <- c(327, 148, 357, 358, 359, 364, 365)


ride.panel <- 
  ride.panel %>% 
  arrange(start_station, interval60) %>% 
  mutate(lagHour = dplyr::lag(Trip_Count, 1),
         lag2Hours = dplyr::lag(Trip_Count, 2),
         lag3Hours = dplyr::lag(Trip_Count, 3),
         lag4Hours = dplyr::lag(Trip_Count, 4),
         lag12Hours = dplyr::lag(Trip_Count, 12),
         lag1day = dplyr::lag(Trip_Count,24),
         holiday = ifelse(yday(interval60) %in% holidays_22, 1, 0)) %>%
   mutate(day = yday(interval60)) %>%
   mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
                                 dplyr::lag(holiday, 2) == 1 ~ "PlustTwoDays",
                                 dplyr::lag(holiday, 3) == 1 ~ "PlustThreeDays",
                                 dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
                                 dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
                                 dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays",
                                TRUE ~ "Zero")) %>%
  mutate(nb = st_knn(st_jitter(geometry), 3),
         wt = st_weights(nb),
         Trip_Count_Lag = st_lag(Trip_Count, nb, wt))

as.data.frame(ride.panel) %>%
    group_by(interval60) %>% 
    summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
    gather(Variable, Value, -interval60, -Trip_Count) %>%
    mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
                                                "lag12Hours","lag1day")))%>%
    group_by(Variable) %>%  
    summarize(correlation = round(cor(Value, Trip_Count),2)) %>%
    kbl(caption = "Correlations between Engineered Features and Dependent Variable") %>%
    kable_paper(bootstrap_options = "striped", full_width = F)

Correlations between Engineered Features and Dependent Variable
Variable	correlation
lagHour	0.90
lag2Hours	0.75
lag3Hours	0.58
lag4Hours	0.42
lag12Hours	-0.31
lag1day	0.83

Finally, we note clear spatial and temporal autocorrelation in our data: this animated map, for example, indicates that data cluster at certain times of the day in certain places.

Show the code

ride.sum <- ride.panel %>%
                mutate(date = ymd(as.Date(interval60))) %>%
                group_by(date, start_station) %>%
                summarize(count = sum(Trip_Count)) %>% 
                  ungroup() %>%
                  mutate(start_lon = as.numeric(st_coordinates(.)[,1]),
                         start_lat = as.numeric(st_coordinates(.)[,2])) %>%
                st_drop_geometry()

p <- ggplot() +
      geom_sf(data = phlTracts) + 
      geom_point(data = st_drop_geometry(ride.sum),
                 aes(x = start_lon,
                     y = start_lat,
                     color = count),
  fill = "transparent", alpha = 0.4)+
  scale_colour_viridis(direction = 1,
  discrete = FALSE, option = "D") +
  transition_states(date,
                    transition_length = 5, 
                    state_length = 1) +
  enter_fade() +
  exit_fade() +
  labs(title = "Station Usage in Philadelphia, Q4 2022",
       subtitle = "Day of year: {closest_state}",
       color = "Trip Count") +
  mapTheme

animate(p)

3.2 Spatiotemporal Prediction

For our analysis, we use a range of OLS regressions. Given the spatial autocorrelation noted above, OLS may not be the ideal approach; a spatial lag or geographically weighted regression, for example, might improve performance and generalizability. Overall, however, for the purposes of this task, OLS is sufficient.

4 Results

Show the code

#### model-------------------------------------------------
ride.Train <- filter(ride.panel, week >= 47 | week == 1)
ride.Test <- filter(ride.panel, week < 47 & week != 1)

head(ride.panel %>% filter(week < 26))

reg1 <- 
  lm(Trip_Count ~  hour(interval60) + dotw + Temperature,  data=ride.Train)

reg2 <- 
  lm(Trip_Count ~  start_station + dotw + Temperature,  data=ride.Train)

reg3 <- 
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation, 
     data=ride.Train)

reg4 <- 
  lm(Trip_Count ~  start_station +  hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day, 
     data=ride.Train)

reg5 <- 
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday, 
     data=ride.Train)

reg6 <- 
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday + Trip_Count_Lag, 
     data=ride.Train)

reg7 <- 
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday + 
                   Trip_Count_Lag + Total_Pop + Med_Inc + Percent_Taking_Public_Trans, 
     data=ride.Train)

#### predict-------------------------------------------------
ride.Test.weekNest <- 
  ride.Test %>%
  nest(-week) 

model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}

week_predictions <- 
  ride.Test.weekNest %>% 
    mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
           BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
           CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
           DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
           ETime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg5, .f = model_pred),
           FTime_Space_FE_timeLags_holidayLags_tripCountLags = map(.x = data, fit = reg6, .f = model_pred),
           GTime_Space_FE_timeLags_holidayLags_tripCountLags_socioecon = map(.x = data, fit = reg7, .f = model_pred)) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))



folds <- 100

control <- trainControl(method="cv", number=folds)


set.seed(123)

model_cv <- train(Trip_Count ~ start_station + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday + 
                   Trip_Count_Lag + Total_Pop + Med_Inc + Percent_Taking_Public_Trans, 
                  data=na.omit(ride.panel),
                  method="lm",
                  trControl=control)

cv_df <- data.frame(
  model = "GTime_Space_FE_timeLags_holidayLags_tripCountLags_socioecon",
  folds = folds,
  rmse = model_cv$results$RMSE,
  mae = model_cv$results$MAE
)

Using 100-fold cross validation, we find that the MAE for our best model is 0.52`–not bad. This suggests that, for a given station at a given hour, our model will, on average, correctly predict demand to within one bike.

Show the code

cv_df %>%
    kbl(caption = "Cross-Validation Results") %>%
    kable_paper(bootstrap_options = "striped", full_width = F)

Cross-Validation Results
model	folds	rmse	mae
GTime_Space_FE_timeLags_holidayLags_tripCountLags_socioecon	100	0.866072	0.5245892

5 Discussion

Of the six models we consider, we find that those incorporating both time and space features, as well as the lag of various time features, perform the best. Interesting, temporal features seem more important than spatial features; incorporating time lags into the model yielded significant improvement, while incorporating spatial lag resulted in only a marginal further improvement. The best models are those that incorporate time lag; the inclusion of holiday lags, spatial lag, and socioeconomic variables yield little to no improvement. It’s likely that these features are in some ways already implicit in the pure spatial and temporal features of the dataset; given the highly spatially-autocorrelated distribution of income, for instance, this variable is already baked into the location data.

Show the code

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette7) +
    labs(title = "Mean Absolute Errors by model specification and week") +
  plotTheme +
  theme(legend.position = "bottom")

One potential challenge posed by our models is that they consistently underpredict rather than overpredict. When trying to meet rideshare demand, this poses a problem; if Indego consistently delivers too many bikes, they will never miss out on ridership, but if they consistently underprepare, they will lose potential riders and even possibly cost themselves long-term subscribers.

Show the code

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station)) %>%
    dplyr::select(interval60, start_station, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -start_station) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + 
      geom_line(size = 1.1) + 
      facet_wrap(~Regression, ncol=1) +
      labs(title = "Predicted/Observed bike share time series", subtitle = "Philadelphia; A test set of 3 Months",  x = "Hour", y= "Station Trips") +
      plotTheme

We also observe that the errors in our model are not spatially random, indicating that we have not fully accounted for spatial fixed effects in our model. In particular, they cluster in areas of high demand. Consistent underprediction in areas of high demand undermines the viability of our model, as these are precisely the areas in which we would hope to have the most accurate predictions.

Show the code

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon),
           dotw = map(data, pull, dotw) ) %>%
    select(interval60, start_station, start_lon, 
           start_lat, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "GTime_Space_FE_timeLags_holidayLags_tripCountLags_socioecon")%>%
  mutate(weekend = ifelse(dotw %in% weekend, "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station, weekend, time_of_day, start_lon, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = phlTracts %>%
          st_transform(crs=crs), color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", size = 1, alpha = 0.4)+
  scale_colour_viridis(direction = 1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors, Test Set")+
  mapTheme

If we plot observed vs. predicted for different times of day during the week and weekend, some patterns begin to emerge.

Underprediction: Across almost all plots, the predicted values tend to be lower than the observed values, especially as the number of observed trips increases. This is indicated by the data points lying above the 45-degree line (which represents perfect prediction). In other words, our model struggles more with the higher-demand stations.
Time-of-Day Effects: There might be different patterns of underprediction based on the time of day. For example, during the AM Rush and PM Rush, the difference between observed and predicted seems larger than during midday or overnight. This could suggest that the model does not capture the peak travel times as accurately.
Weekday vs. Weekend: There could be a difference in the prediction quality between weekdays and weekends. The weekend plots might show more scatter away from the line than the weekdays, indicating a higher variance in the accuracy of predictions during weekends.
Model Limitations: The scatter and the deviation from the line could indicate that the model is missing some factors that influence the number of trips, especially during specific times like rush hours or weekends when travel patterns could be significantly different from the usual.
Outliers: There are some data points that are far away from the majority of the data, especially in the higher range of observed trips. These outliers suggest that there are extreme cases where the model fails to predict the correct number of trips by a large margin.

In order to improve our model, we could attempt to account for some of these issues with further feature engineering. For example, we might add binary variables to mark peak hours, attempt to factor in special events, or even consider switching to different models. One simple upgrade here could be using a penalized model such as a ridge, lasso, or elastic net to account for what is effectively overfitting. Other approaches, such as a generalized additive model (GAM) or a tree-based model (e.g., a Random Forest) might also help.

Show the code

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon),
           dotw = map(data, pull, dotw)) %>%
    select(interval60, start_station, start_lon, 
           start_lat, Observed, Prediction, Regression,
           dotw) %>%
  unnest() %>%
  filter(Regression == "GTime_Space_FE_timeLags_holidayLags_tripCountLags_socioecon")%>%
  mutate(weekend = ifelse(dotw %in% weekend, "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
    geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
    geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day ~ weekend)+
  labs(title="Observed vs Predicted",
       x="Observed trips", 
       y="Predicted trips")+
  plotTheme

Lastly, we find that our model does not generalize perfectly; errors are higher in wealthier and whiter neighborhoods, and lower in neighborhoods that are more reliant on public transportation (in reality, these are likely all the same neighborhoods, as these variables are highly colinear). This introduces an interesting optimization problem: do we prefer to optimize for an equitable distribution of error, whereby we seek to avoid burdening low-income, minority neighborhoods with disproportionate rates of error, or do we prefer to minimize error in high-demand areas (which are predominately white and wealthy)?

Show the code

week_predictions %>%
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station),
           start_lat = map(data, pull, start_lat),
           start_lon = map(data, pull, start_lon),
           dotw = map(data, pull, dotw),
           Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
           Med_Inc = map(data, pull, Med_Inc),
           Percent_White = map(data, pull, Percent_White)) %>%
    select(interval60, start_station, start_lon,
           start_lat, Observed, Prediction, Regression,
           dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
    unnest(cols = c(interval60, start_station, start_lon, start_lat, Observed, Prediction, dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White)) %>%
  filter(Regression == "GTime_Space_FE_timeLags_holidayLags_tripCountLags_socioecon") %>%
  mutate(weekend = ifelse(dotw %in% weekend, "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "AM Rush") %>%
  group_by(start_station, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-start_station, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  #geom_sf(data = chicagoCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a function of socio-economic variables",
       y="Mean Absolute Error (Trips)")+
  plotTheme

All that said, the predictive power of our model is good. With an MAE around 0.5, we can expect to be fairly close (within roughly 1 bike) to the accurate demand for bikes at a given station at any given hour. The major challenge to address is underprediction at high-demand stations; this is both a question of model improvements and rebalancing priorities. Not only do we want to maximize model accuracy, but we also need to consider at which stations we most care about accuracy. This shades into a cost-benefit analysis.

Further improvement of the model depends on two factors: improved feature engineering and more computing power. Accounting for features like rush hour with binary variables, for example, might help. The best thing, however, would likely be to basically throw as much data into the model as possible. Simply, training on more historical data would allow for a more sophisticated model that could, for example, distinguish between different holidays that might prompt different levels of commuting (Veterans’ Day versus Christmas Eve, for instance), or simply learn from more historic data. Likewise, more sophisticated models (random forest, Lasso regression, etc.) might increase model accuracy. One key component would be switching from an OLS regression to one that explicitly incorporated spatial dependencies, such as spatial lag regression or geographically-weighted regression, in addition to incorporating temporal features. (This, however, is beyond the scope of this assignment and likely would require more computing power than we have access to). In truth, though, the best approach is probably the best machine learning approach in general: throw as much data as possible into the model. Training on the roughly 7 years of available Indego data, rather than the current three months’ worth, would probably do more than anything else to increase model accuracy and generalizability–but would also be prohibitively computational intensive, at least in the context of this assignment.

6 Conclusion

We find that our model is effective in predicting rideshare demand and would be useful in informing rebalancing efforts for Indego. Overall, the model’s accuracy is high enough to be useful–certainly in comparison to the naive approach of responding to demand as it emerges. That said, the model still suffers from non-random spatial distribution of errors. Specifically, it does not generalize well to high-demand areas, which is a substantial issue given that these are exactly the areas where the impact of underprediction is biggest. Thus, while the model is still a solid predictive tool worth implementing, there are important opportunities to improve it by better accounting for spatial fixed effects.

Footnotes

https://en.wikipedia.org/wiki/Indego↩︎
https://www.rideindego.com/about/data/↩︎
https://people.orie.cornell.edu/shane/pubs/BSOvernight.pdf↩︎