Bike share is a growing form of green transportation across the United States and around the world. Cities are implementing bike-share programs to increase mobility and reduce dependence on cars. This study focuses on Philadelphia's bike share, operated by Indego. For Philadelphia to maximize bike share usage across the city, the system must account for rebalancing. For example, a city may want to move bikes into the dense central district before the morning rush to meet demand, or rebalance by offering rewards or incentives to riders who move bikes from one place to another. A demand-prediction model allows the operator to understand where and when to place bikes to meet the highest demand.

From Indego's perspective, the model must capture both the spatial and the temporal scale. Over a one-year window, bike share usage rises in the summer and falls on weekends, but these trends do not tell the entire story. Within a single day, usage is uneven: it is typically higher during rush hour and lower in the middle of the day, and these patterns are clearly reflected in the data. Across a week, weekdays typically see more bike share usage than weekends, and events such as holidays may also play a role. Spatial factors are also significant: bike share is used more in areas with bike lane infrastructure, and demand is higher in densely populated regions and in regions with more commercial activity. Finally, weather plays a significant role: precipitation and colder temperatures significantly decrease usage. These temporal, spatial, and weather-related attributes are essential to how Indego should rebalance its resources. Understanding these trends will maximize the city's bike share usage, and Indego will consequently understand how to allocate its resources and where to advertise its services.
library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(gifski)
library(kableExtra)
library(RSocrata)
library(rnaturalearth)
library(dplyr)
library(RColorBrewer)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(ggcorrplot)
library(jtools) # for regression model plots
library(broom)
library(rmarkdown)
library(rgeos)
library(readxl)
library(rgdal)
library(sp)
library(raster)
library(cowplot)
library(gganimate)
plotTheme <- theme(
plot.title =element_text(size=12),
plot.subtitle = element_text(size=8),
plot.caption = element_text(size = 6),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
axis.title.y = element_text(size = 10),
# Set the entire chart region to blank
panel.background=element_blank(),
plot.background=element_blank(),
#panel.border=element_rect(colour="#F0F0F0"),
# Format the grid
panel.grid.major=element_line(colour="#D0D0D0",size=.2),
axis.ticks=element_blank())
mapTheme <- theme(plot.title =element_text(size=12),
plot.subtitle = element_text(size=8),
plot.caption = element_text(size = 6),
axis.line=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
panel.background=element_blank(),
panel.border=element_blank(),
panel.grid.major=element_line(colour = 'transparent'),
panel.grid.minor=element_blank(),
legend.direction = "vertical",
legend.position = "right",
plot.margin = margin(1, 1, 1, 1, 'cm'),
legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))
palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette2 <- c("#6baed6","#08519c")
# qBr returns quantile break labels for a variable; rnd = FALSE keeps decimals
qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]], 0),
                          c(.01, .2, .4, .6, .8), na.rm = TRUE))
  } else if (rnd == FALSE) {
    as.character(formatC(quantile(df[[variable]],
                                  c(.01, .2, .4, .6, .8), na.rm = TRUE),
                         digits = 3))
  }
}
q5 <- function(variable) {as.factor(ntile(variable, 5))}
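For reference, here is a hypothetical usage sketch showing how q5() and qBr() pair up in a quintile map; someData and someVar are placeholder names, not objects from this analysis.
# Hypothetical example: q5() bins the fill into quintiles and qBr() supplies
# the legend labels. `someData` and `someVar` are placeholders.
ggplot() +
  geom_sf(data = someData, aes(fill = q5(someVar))) +
  scale_fill_manual(values = palette5,
                    labels = qBr(someData, "someVar"),
                    name = "someVar\n(Quintile Breaks)") +
  mapTheme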
# nn_function: mean distance from each origin point to its k nearest neighbors
nn_function <- function(measureFrom, measureTo, k) {
  measureFrom_Matrix <- as.matrix(measureFrom)
  measureTo_Matrix <- as.matrix(measureTo)
  nn <- get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>%
    dplyr::select(-thisPoint) %>%
    pull()
  return(output)
}
# Indego station kiosks from the API; trips join to stations on station id
stations <- st_read("https://kiosks.bicycletransit.workers.dev/phl") %>%
  st_transform("EPSG:6565")
trips <- st_read("C:/Users/Kyle McCarthy/Documents/MUSA_PubPolicy/indigo.csv") %>%
  rename(id = start_station)
trips$id <- as.integer(trips$id)
stations <- right_join(stations, trips)
full_boundary <- st_read("C:/Users/Kyle McCarthy/Documents/MUSA_PubPolicy/hw5/Neighborhoods_Philadelphia (1).geojson") %>%
st_transform("ESRI:102729")
phil_boundary1 <- st_read("C:/Users/Kyle McCarthy/Documents/MUSA_PubPolicy/hw5/Neighborhoods_Philadelphia (1).geojson") %>%
st_transform("ESRI:102729")
phil_boundary1 <- filter(phil_boundary1, listname %in% c(
  "Cedar Park", "University City", "Rittenhouse", "Society Hill", "Fairmount",
  "Old City", "Center City East", "Fitler Square", "Powelton", "Mantua",
  "Spruce Hill", "Graduate Hospital", "Logan Square", "Washington Square West",
  "Grays Ferry", "Point Breeze", "Newbold", "Queen Village", "Hawthorne",
  "Callowhill", "Spring Garden", "Chinatown", "Dickinson Narrows",
  "East Passyunk", "Greenwich", "Pennsport", "Passyunk Square",
  "Woodland Terrace", "West Powelton", "Francisville", "West Poplar",
  "East Poplar", "Northern Liberties", "Garden Court", "Walnut Hill",
  "Bella Vista", "West Passyunk"))
phil_boundary <-
phil_boundary1 %>%
summarise(shape_area= sum(shape_area)) %>%
st_transform("ESRI:102729")
zip <- st_read( "http://data.phl.opendata.arcgis.com/datasets/b54ec5210cee41c3a884c9086f7af1be_0.geojson")%>%
filter(CODE == "19104") %>%
st_transform("ESRI:102729")
A subset of Philadelphia neighborhoods was selected to focus the analysis on the regions where bike share is most popular.
ggplot() +
  geom_sf(data = full_boundary, colour = "#FFFFFF") +
  geom_sf(data = phil_boundary, colour = "black", fill = "grey80") +
  labs(title = "Study Region within Philadelphia") +
  theme_classic() + theme(legend.position = "none")
trip2 <- stations %>%
mutate(interval60 = floor_date(mdy_hm(start_time), unit = "hour"),
interval15 = floor_date(mdy_hm(start_time), unit = "15 mins"),
week = week(interval60),
dotw = wday(interval60, label=TRUE))
#glimpse(stations)
trip2 <-
trip2 %>% st_transform("ESRI:102729")
weather.Panel <-
riem_measures(station = "PHL", date_start = "2019-10-01", date_end = "2020-01-01") %>%
dplyr::select(valid, tmpf, p01i, sknt)%>%
replace(is.na(.), 0) %>%
mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
group_by(interval60) %>%
summarize(Temperature = max(tmpf),
Precipitation = sum(p01i),
Wind_Speed = max(sknt)) %>%
  # Temperatures of exactly 0 are treated here as missing readings and
  # imputed with 42°F, a typical value for the study period
  mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
A fishnet was created so that grid cells, rather than individual stations, serve as the unit of analysis. A map showing the count of trips within each fishnet cell is shown below. Cells are hexagons with a 900 ft cell size (the projected coordinate system, ESRI:102729, is in feet).
fishnet <-
st_make_grid(phil_boundary, cellsize = 900, square = FALSE) %>%
.[phil_boundary] %>%
st_sf()%>%
mutate(uniqueID = rownames(.))
#bikenet
bikenet <-
dplyr::select(trip2) %>%
mutate(Trip_Count = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(Trip_Count = replace_na(Trip_Count, 0),
uniqueID = rownames(.),
cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE))
ggplot() +
geom_sf(data = bikenet, aes(fill = Trip_Count)) +
scale_fill_viridis() +
labs(title = "Count of Trips October-December 2019")+
theme_map()
bikenet2 <-
st_intersection(trip2, bikenet)
ride.panel <-
bikenet2 %>%
mutate(Trip_Counter = 1) %>%
group_by(interval60, uniqueID) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
left_join(weather.Panel, by = "interval60") %>%
mutate(week = week(interval60),
dotw = wday(interval60, label = TRUE)) %>%
st_sf()
The training set was designated as weeks 41-43 (October), while the testing set covered weeks 44-45. Engineered features include proximity to bike lanes, tree canopy, and SEPTA stations. Time-lag features were also engineered to capture weekday and rush-hour patterns; an illustrative sketch of explicit weekday/rush-hour dummies follows.
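# Illustrative sketch only: explicit weekday and rush-hour dummies for the
# patterns described above. ride.panel.sketch, isWeekend, isRushHour, and the
# 7-9am / 4-6pm peak windows are assumptions; the regressions below do not
# use these columns.
ride.panel.sketch <-
  ride.panel %>%
  mutate(hour_of_day = hour(interval60),
         isWeekend = ifelse(dotw %in% c("Sat", "Sun"), 1, 0),
         isRushHour = ifelse(isWeekend == 0 &
                               (hour_of_day %in% 7:9 | hour_of_day %in% 16:18),
                             1, 0))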
# ride.panel <-
# ride.panel%>%
# mutate(distbike = round(st_distance(ride.panel, bikepath))*100, 2)
septa <- st_read("https://opendata.arcgis.com/datasets/8c6e2575c8ad46eb887e6bb35825e1a6_0.geojson")%>%
st_transform("ESRI:102729")
tree <- st_read("https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+ppr_tree_canopy_points_2015&filename=ppr_tree_canopy_points_2015&format=geojson&skipfields=cartodb_id")%>%
st_transform("ESRI:102729")
bike_lanes <- st_read("C:/Users/Kyle McCarthy/Documents/MUSA_PubPolicy/bikelane/bikepathspts_clip.shp")
st_c <- st_coordinates
ride.panel <-
ride.panel %>%
arrange(uniqueID, interval60) %>%
group_by(uniqueID) %>%
mutate(lagHour = dplyr::lag(Trip_Count,1),
lag2Hours = dplyr::lag(Trip_Count,2),
lag3Hours = dplyr::lag(Trip_Count,3),
lag4Hours = dplyr::lag(Trip_Count,4),
lag12Hours = dplyr::lag(Trip_Count,12),
lag1day = dplyr::lag(Trip_Count,24)) %>%
ungroup()%>%
na.omit()
ride.panel <-
ride.panel %>%
mutate(
bike_nn1 = nn_function(st_c(st_centroid(ride.panel)), st_c(bike_lanes), 1),
bike_nn2 = nn_function(st_c(st_centroid(ride.panel)), st_c(bike_lanes), 2),
bike_nn3 = nn_function(st_c(st_centroid(ride.panel)), st_c(bike_lanes), 3),
bike_nn4 = nn_function(st_c(st_centroid(ride.panel)), st_c(bike_lanes), 4),
bike_nn5 = nn_function(st_c(st_centroid(ride.panel)), st_c(bike_lanes), 5))
ride.panel <-
ride.panel %>%
mutate(
septa_nn1 = nn_function(st_c(st_centroid(ride.panel)), st_c(septa), 1),
septa_nn2 = nn_function(st_c(st_centroid(ride.panel)), st_c(septa), 2),
septa_nn3 = nn_function(st_c(st_centroid(ride.panel)), st_c(septa), 3),
septa_nn4 = nn_function(st_c(st_centroid(ride.panel)), st_c(septa), 4),
septa_nn5 = nn_function(st_c(st_centroid(ride.panel)), st_c(septa), 5))
ride.panel <-
ride.panel %>%
mutate(
tree_nn20 = nn_function(st_c(st_centroid(ride.panel)), st_c(tree), 20),
tree_nn40 = nn_function(st_c(st_centroid(ride.panel)), st_c(tree), 40),
tree_nn60 = nn_function(st_c(st_centroid(ride.panel)), st_c(tree), 60),
tree_nn80 = nn_function(st_c(st_centroid(ride.panel)), st_c(tree), 80),
tree_nn100 = nn_function(st_c(st_centroid(ride.panel)), st_c(tree), 100))
ride.Train <- filter(ride.panel, week == 41 | week == 42 | week == 43)
ride.Test <- filter(ride.panel, week == 44 | week == 45)
st_drop_geometry(rbind(
mutate(ride.Train, Legend = "Training"),
mutate(ride.Test, Legend = "Testing"))) %>%
group_by(Legend, interval60) %>%
summarize(Trip_Count = sum(Trip_Count)) %>%
ungroup() %>%
ggplot(aes(interval60, Trip_Count, colour = Legend)) + geom_line() +
scale_colour_manual(values = palette2) +
labs(title="Rideshare trips by week: November-December",
x="Day", y="Trip Count")
ride.panel.na <-
ride.panel%>%
na.omit()
correlation.long <-
st_drop_geometry(ride.panel.na) %>%
dplyr::select(-dotw, - interval60, -uniqueID, - septa_nn3, - septa_nn4, -septa_nn5, -tree_nn20, -tree_nn40, -tree_nn80, -tree_nn100, -bike_nn1, -bike_nn2, -bike_nn4, -bike_nn5) %>%
gather(Variable, Value, -Trip_Count)
correlation.long$Value <- as.double(correlation.long$Value )
correlation.cor <-
correlation.long %>%
group_by(Variable) %>%
summarize(correlation = cor(Value, Trip_Count, use = "complete.obs"))
ggplot(correlation.long, aes(Value, Trip_Count)) +
geom_point(size = 0.1) +
geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
geom_smooth(method = "lm", se = FALSE, colour = "red") +
facet_wrap(~Variable, ncol = 3, scales = "free") +
labs(title = "Trip Count as a Function of Indicators") +
theme()
## Moran's I analysis
A Moran's I analysis was performed on a random subset of the data, taken to keep processing time manageable (my computer was unable to load the full result). The output below maps the trip count, the local Moran's I statistic, significant hotspots, and p-values. Note that spatial autocorrelation exists in the data, indicated by the difference between polygons at the center of the study area and those at the edges.
ride.panel <- ride.panel %>%
st_transform("ESRI:102729")%>%
filter(week == 41 | week == 42 | week == 43 | week == 44 | week == 45)
bikenet <-
st_join(bikenet, ride.panel)%>%
filter(week == 41 | week == 42 | week == 43 | week == 44 | week == 45)
bikenetsample <- sample_n(bikenet, 10000)%>%
na.omit()
bikenetsample.nb <- poly2nb(as_Spatial(bikenetsample), queen=TRUE)
bikenetsample.weights <- nb2listw(bikenetsample.nb, style="W", zero.policy=TRUE)
bikenetsample.localMorans <-
cbind(
as.data.frame(localmoran(bikenetsample$Trip_Count.x, bikenetsample.weights)),
as.data.frame(bikenetsample)) %>%
st_sf() %>%
dplyr::select(Trip_Count = Trip_Count.x,
Local_Morans_I = Ii,
P_Value = `Pr(z != E(Ii))`) %>%
mutate(Significant_Hotspots = ifelse(P_Value <= 0.05, 1, 0)) %>%
gather(Variable, Value, -geometry)
vars <- unique(bikenetsample.localMorans$Variable)
varList <- list()
for(i in vars){
varList[[i]] <-
ggplot() +
geom_sf(data = phil_boundary)+
geom_sf(data = filter(bikenetsample.localMorans, Variable == i),
aes(fill = Value), colour=NA) +
scale_fill_viridis(name="") +
labs(title=i) +
theme_map() + theme(legend.position="right")}
do.call(grid.arrange, c(varList, ncol = 2, top = "Local Moran's I statistics, Indego counts from a random sample"))
The animation below depicts bike share usage on October 8th, 2019, a Tuesday. There are distinct rises in bike share use throughout the day, as depicted on the map, and the periods of highest usage correspond with rush-hour times.
week45 <-
filter(bikenet2, week == 41 | week == 42 | week == 43 | week == 44 | week == 45)
week45 <- sample_n(week45, 5000)
week45.panel <-
expand.grid(
interval15 = unique(week45$interval15),
uniqueID = unique(bikenet2$uniqueID))%>%
na.omit()
week45.panel <- sample_n(week45.panel, 5000)
ride.animation.data <-
mutate(week45, Trip_Counter = 1) %>%
group_by(interval15, uniqueID) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
ungroup() %>%
    # st_join performs a spatial join on geometry; a `by` argument is not used
    st_join(bikenetsample) %>%
st_sf() %>%
mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
Trip_Count > 0 & Trip_Count <= 3 ~ "1-3 trips",
Trip_Count > 3 & Trip_Count <= 6 ~ "4-6 trips",
Trip_Count > 6 & Trip_Count <= 10 ~ "7-10 trips",
Trip_Count > 10 ~ "11+ trips")) %>%
    mutate(Trips = fct_relevel(Trips, "0 trips", "1-3 trips", "4-6 trips",
                               "7-10 trips", "11+ trips"))
ride.animation.data <-
ride.animation.data%>%
mutate(day = day(interval15))
ride.animation.data <-
ride.animation.data%>%
filter(week == 41 & day == 8)
ride.animation.data <- sample_n(ride.animation.data, 1000)
rideshare_animation <-
ggplot() +
geom_sf(data = phil_boundary)+
geom_sf(data = fishnet, fill = "#070000")+
    geom_sf(data = ride.animation.data, aes(fill = Trips), colour = "#F7FA15") +
    scale_fill_manual(values = palette5) +
    labs(title = "Bikeshare Animated Map: October 8th Through the Day",
         subtitle = "15 minute intervals: {current_frame}") +
transition_manual(interval15) +
theme_map()+ theme(legend.position = "none")
animate(rideshare_animation, duration=20, renderer = gifski_renderer())
anim_save("rideshare_local", rideshare_animation, duration=20, renderer = gifski_renderer())
Precipitation decreases bike share usage, as the bar chart below depicts.
st_drop_geometry(ride.panel) %>%
group_by(interval60) %>%
summarize(Trip_Count = mean(Trip_Count),
Precipitation = first(Precipitation)) %>%
  mutate(isPrecip = ifelse(Precipitation > 0, "Rain/Snow", "No Precipitation")) %>%
group_by(isPrecip) %>%
summarize(Mean_Trip_Count = mean(Trip_Count)) %>%
ggplot(aes(isPrecip, Mean_Trip_Count, fill = isPrecip)) + geom_bar(stat = "identity") +
labs(title="Does Ridership Vary with Precipitation?",
x="Precipitation", y="Mean Trip Count") + theme_cowplot()+ theme(legend.position = "NONE")
Typically, bike share usage increases as temperature increases; however, the strength of this relationship appears to vary from week to week.
st_drop_geometry(ride.panel) %>%
group_by(interval60) %>%
summarize(Trip_Count = mean(Trip_Count),
Temperature = first(Temperature)) %>%
mutate(week = week(interval60)) %>%
ggplot(aes(Temperature, Trip_Count)) +
geom_point() + geom_smooth(method = "lm", se= FALSE) +
facet_wrap(~week, ncol=7) +
labs(title="Trip Count as a fuction of Temperature by week",
x="Temperature", y="Mean Trip Count") + theme()+ theme_cowplot()
# reg1: time fixed effects only; reg2: space fixed effects only;
# reg3: space and time; reg4: adds time lags; reg5: adds weather and
# spatial amenity features
reg1 <- lm(Trip_Count ~ hour(interval60) + dotw + Temperature, data = ride.Train)
reg2 <- lm(Trip_Count ~ uniqueID + dotw + Temperature, data = ride.Train)
reg3 <- lm(Trip_Count ~ uniqueID + hour(interval60) + dotw + Temperature,
           data = ride.Train)
reg4 <- lm(Trip_Count ~ uniqueID + hour(interval60) + dotw + Temperature +
             lagHour + lag2Hours + lag3Hours + lag12Hours + lag1day,
           data = ride.Train)
reg5 <- lm(Trip_Count ~ uniqueID + hour(interval60) + dotw + lagHour +
             lag2Hours + lag3Hours + lag12Hours + lag1day + septa_nn2 +
             Precipitation + Wind_Speed + Temperature + tree_nn20 + bike_nn5,
           data = ride.Train)
The purrr family of functions is used to loop through a set of nested data frames and efficiently compare across the different model specifications. The five regressions above are compared by looping through each model for each test week and storing the resulting predictions as list columns. These results are then gathered and summarized, adding columns for the observed counts, the absolute errors, and their summaries. To calculate MAE, map_dbl, a variant of map that returns a numeric vector, loops through Absolute_Error and takes the mean; the same approach yields the standard deviation of absolute error, sd_AE, a useful measure of generalizability (Steif, 2020). The final result is depicted as a bar chart comparing mean absolute error across models for weeks 44 and 45.
ride.panel <-
ride.panel %>%
arrange(uniqueID, interval60) %>%
group_by(uniqueID) %>%
mutate(lagHour = dplyr::lag(Trip_Count,1),
lag2Hours = dplyr::lag(Trip_Count,2),
lag3Hours = dplyr::lag(Trip_Count,3),
lag4Hours = dplyr::lag(Trip_Count,4),
lag12Hours = dplyr::lag(Trip_Count,12),
lag1day = dplyr::lag(Trip_Count,24)) %>%
ungroup()%>%
na.omit()
ride.Train <- filter(ride.panel, week == 41 | week == 42 | week == 43)
ride.Test <- filter(ride.panel, week == 44 | week == 45)
ride.Test.weekNest <-
as.data.frame(ride.Test) %>%
nest(-week)
ride.Test.weekNest
## # A tibble: 2 × 2
## week data
## <dbl> <list>
## 1 44 <tibble [6,557 × 29]>
## 2 45 <tibble [6,571 × 29]>
model_pred <- function(dat, fit){
  predict(fit, newdata = dat)
}
week_predictions <-
ride.Test.weekNest %>%
mutate(A_Time_FE = map(.x = data, fit = reg1, .f = model_pred),
B_Space_FE = map(.x = data, fit = reg2, .f = model_pred),
C_Space_Time_FE = map(.x = data, fit = reg3, .f = model_pred),
D_Space_Time_Lags = map(.x = data, fit = reg4, .f = model_pred),
E_Space_Time_Lags_Features =map(.x = data, fit = reg5, .f = model_pred))
week_predictions <- week_predictions %>%
gather(Regression, Prediction, -data, -week) %>%
mutate(Observed = map(data, pull, Trip_Count),
Absolute_Error = map2(Observed, Prediction, ~ abs(.x - .y)),
MAE = map_dbl(Absolute_Error, mean),
sd_AE = map_dbl(Absolute_Error, sd))
week_predictions %>%
dplyr::select(week, Regression, MAE) %>%
gather(Variable, MAE, -Regression, -week) %>%
ggplot(aes(week, MAE)) +
geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
scale_fill_manual(values = palette5) +
labs(title = "Mean Absolute Errors by model specification and week") + theme()
Below is a graph generated from the purrr workflow, comparing the predicted values to the observed values at an hourly interval.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
uniqueID = map(data, pull, uniqueID)) %>%
dplyr::select(interval60, uniqueID, Observed, Prediction, Regression) %>%
unnest() %>%
gather(Variable, Value, -Regression, -interval60, -uniqueID) %>%
group_by(Regression, Variable, interval60) %>%
summarize(Value = mean(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) + geom_line(size = 1.1) +
facet_wrap(~Regression, ncol=1) +
scale_colour_manual(values = palette2) +
labs(title = "Mean Predicted/Observed ride share by hourly interval",
x = "Hour", y= "Rideshare Trips") + theme()
Depicted below is a map of the mean absolute error, showing spatially where error is highest. The error is highest toward the center of the study area, which tends to be more densely populated. Consequently, the regression model could be further improved by accounting for these high-density spatial regions.
error.byWeek <-
filter(week_predictions, Regression == "D_Space_Time_Lags") %>%
unnest() %>% st_sf() %>%
dplyr::select(uniqueID, Absolute_Error, week, geometry) %>%
gather(Variable, Value, -uniqueID, -week, -geometry) %>%
group_by(Variable, uniqueID, week) %>%
summarize(MAE = mean(Value))
fishnet1<-
st_join(fishnet, error.byWeek)%>%
na.omit()
error.byWeek <-
error.byWeek%>%
na.omit()
ggplot() +
geom_sf(data = phil_boundary) +
geom_sf(data = fishnet1, aes(fill = MAE))+
facet_wrap(~week) +
scale_fill_viridis() +
labs(title = "Mean Average Error by Week")+
theme_map()
Finally, a k-fold cross-validation was used to test how the model performs across space. The mean absolute error for each fold is mapped below.
Note that higher error is associated with Philadelphia's Central Business District.
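The object error_by_reg_and_fold used below comes from a cross-validation step whose code is not shown here. The following is a minimal sketch of one way to produce it, under the assumption that folds are random groups of fishnet cells and the model-5 predictors are used; the uniqueID fixed effect is dropped, since held-out cells would appear as unseen factor levels at prediction time.
# Minimal sketch (assumed; the original CV code is not shown): assign each
# fishnet cell to one of ~100 random folds, hold each fold out in turn,
# fit on the remaining cells, and record one MAE per fold.
set.seed(1234)
fold_table <-
  fishnet %>%
  st_drop_geometry() %>%
  mutate(cvID = sample(1:100, size = n(), replace = TRUE))

ride.panel.cv <-
  ride.panel %>%
  st_drop_geometry() %>%
  left_join(fold_table, by = "uniqueID")

fold_errors <-
  map_dfr(unique(ride.panel.cv$cvID), function(fold){
    train <- filter(ride.panel.cv, cvID != fold)
    test  <- filter(ride.panel.cv, cvID == fold)
    fit <- lm(Trip_Count ~ hour(interval60) + dotw + lagHour + lag2Hours +
                lag3Hours + lag12Hours + lag1day + septa_nn2 + Precipitation +
                Wind_Speed + Temperature + tree_nn20 + bike_nn5, data = train)
    tibble(cvID = fold,
           Regression = "Random k-fold CV: Spatial Process",
           MAE = mean(abs(test$Trip_Count - predict(fit, newdata = test)),
                      na.rm = TRUE))
  })

# Attach each fold's MAE to its member cells so the errors can be mapped
error_by_reg_and_fold <-
  fishnet %>%
  left_join(fold_table, by = "uniqueID") %>%
  left_join(fold_errors, by = "cvID")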
fishnet2<-
st_join(fishnet, error_by_reg_and_fold)%>%
na.omit()
fishnet2 %>%
ggplot() +
geom_sf(data = phil_boundary)+
geom_sf(aes(fill = MAE)) +
scale_fill_viridis() +
labs(title = "Errors by K-Fold Spatial Cross Validation Regression") +
theme_map() + theme(legend.position="right")
st_drop_geometry(error_by_reg_and_fold) %>%
group_by(Regression) %>%
summarize(Mean_MAE = round(mean(MAE), 2),
SD_MAE = round(sd(MAE), 2)) %>%
kable() %>%
kable_styling("striped", full_width = F) %>%
row_spec(1, color = "black", background = "#64E4F4")
| Regression | Mean_MAE | SD_MAE |
|---|---|---|
| Random k-fold CV: Spatial Process | 0.25 | 0.17 |
Overall the model is decent but can undoubtedly be improved. It performs better on temporal scales than on spatial scales. The purrr comparison plots indicate that feature engineering played a large role in building a better model: predicted outcomes were closer to observed outcomes in the specifications with more engineered features, which account for temporal, spatial, and weather effects.

Spatially, the model struggles most in Philadelphia's Central Business District. If Indego relied on this model, there would likely be a shortage of bikes in higher-density regions, which is especially problematic during rush hour when demand is exceptionally high, and would ultimately reduce Indego's revenue. That said, although the mean absolute error is higher toward the center of the study area, it is only about 1 trip, which is not incredibly high. Temporally, the model predicts daytime usage well but underpredicts at times of peak usage, so if it were put into production there would be a shortage of bikes during rush hour. The model predicts at about the same rate for weeks 44 and 45, and, consistent with the cross-validation results, its mean absolute error is only about 1.

The model can be improved through continued feature engineering. Features capturing population density would address the spatial errors, and distance to commercial areas is another candidate, since these are destinations people are more likely to visit and therefore need more available bikes; unfortunately, the appropriate data was not available. Features must also be engineered further to account for high rush-hour demand. The methodology itself need not change and is capable of producing an accurate model, but its inputs should be refined to create a model ready for production. Lastly, a proper model is essential so that Indego can rebalance its resources effectively: a useful model ensures the company understands when and where to allocate resources, including where to distribute the majority of its bikes, where to place new stations, and which customers to target with marketing. A sketch of one such population-density feature follows.
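As an illustration of the population-density feature suggested above, here is a hedged sketch using tidycensus (already loaded). The ACS variable choice (B01003_001, total population), the 2019 vintage, and the centroid-based join are my assumptions, and a Census API key must be configured.
# Sketch of a population-density feature (assumptions noted above). Each
# fishnet cell inherits the density of the census tract containing its centroid.
phillyPop <-
  get_acs(geography = "tract", variables = "B01003_001", year = 2019,
          state = "PA", county = "Philadelphia", geometry = TRUE) %>%
  st_transform("ESRI:102729") %>%
  mutate(popDensity = estimate / as.numeric(st_area(.)))

fishnetPop <-
  st_centroid(fishnet) %>%
  st_join(dplyr::select(phillyPop, popDensity)) %>%
  st_drop_geometry()

# Join onto the panel by uniqueID so popDensity could enter future regressions
ride.panel <- left_join(ride.panel, fishnetPop, by = "uniqueID")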