Introduction

Bike share is a growing form of green transportation across the United States and around the world. Cities are implementing bike-share programs, which increase urban mobility and reduce the need for a car. This study focuses on Philadelphia's bike share program, Indego. For Philadelphia to maximize bike share usage across the city, it must account for rebalancing. For example, a city may want to move bikes into the densely populated central district before rush hour to meet demand, or rebalance by offering rewards or incentives to riders who move bikes from place to place. A bike share prediction model allows the operator to understand where and when to place bikes to meet the highest demand.

Indego must consider both the spatial and the temporal scale. At the temporal scale of a year, bike share usage rises in the summer; across a week, weekdays typically see heavier usage than weekends, and events such as holidays may also play a role. These trends alone do not tell the entire story: within a single day, usage is not consistent, and is typically higher during rush hour and lower in the middle of the day. These patterns are clearly reflected in the data. Spatial factors are also significant: bike share is used more in areas with bike lane infrastructure, and demand is likely higher in densely populated regions and in those with more commercial industry. Finally, weather plays a significant role: precipitation and colder temperatures significantly decrease usage.

These temporal, spatial, and weather-related attributes are essential to how Indego should rebalance its resources. Understanding these trends will maximize the city's bike share usage; in turn, Indego will know how to allocate its resources and where to advertise its services.

Set Up

library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(rnaturalearth)
library(RColorBrewer)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(ggcorrplot)
library(jtools)     # for regression model plots
library(broom)
library(rmarkdown)
library(rgeos)
library(readxl)
library(rgdal)
library(sp)
library(raster)
library(cowplot)
library(gganimate)




plotTheme <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  #panel.border=element_rect(colour="#F0F0F0"),
  # Format the grid
  panel.grid.major=element_line(colour="#D0D0D0",size=.2),
  axis.ticks=element_blank())
mapTheme <- theme(plot.title =element_text(size=12),
                  plot.subtitle = element_text(size=8),
                  plot.caption = element_text(size = 6),
                  axis.line=element_blank(),
                  axis.text.x=element_blank(),
                  axis.text.y=element_blank(),
                  axis.ticks=element_blank(),
                  axis.title.x=element_blank(),
                  axis.title.y=element_blank(),
                  panel.background=element_blank(),
                  panel.border=element_blank(),
                  panel.grid.major=element_line(colour = 'transparent'),
                  panel.grid.minor=element_blank(),
                  legend.direction = "vertical", 
                  legend.position = "right",
                  plot.margin = margin(1, 1, 1, 1, 'cm'),
                  legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))
palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette2 <- c("#6baed6","#08519c")

qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]], 0),
                          c(.01, .2, .4, .6, .8), na.rm = TRUE))
  } else if (rnd == FALSE) {
    as.character(formatC(quantile(df[[variable]],
                                  c(.01, .2, .4, .6, .8), na.rm = TRUE),
                         digits = 3))
  }
}

q5 <- function(variable) {as.factor(ntile(variable, 5))}

nn_function <- function(measureFrom, measureTo, k) {
  # Coordinate matrices for FNN
  measureFrom_Matrix <- as.matrix(measureFrom)
  measureTo_Matrix <- as.matrix(measureTo)
  # Distances from each origin point to its k nearest destination points
  nn <-
    get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%  # mean of the k distances
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint) %>%
    pull()
  
  return(output)  
}

The Data

stations <- st_read("https://kiosks.bicycletransit.workers.dev/phl")%>% st_transform("EPSG:6565")
## Reading layer `phl' from data source `https://kiosks.bicycletransit.workers.dev/phl' using driver `GeoJSON'
## Simple feature collection with 144 features and 33 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -75.22399 ymin: 39.88994 xmax: -75.12994 ymax: 39.99179
## geographic CRS: WGS 84
trips <- st_read("C:/Users/Kyle McCarthy/Documents/MUSA_PubPolicy/indigo.csv")%>%
  rename(id = start_station) 
## Reading layer `indigo' from data source `C:\Users\Kyle McCarthy\Documents\MUSA_PubPolicy\indigo.csv' using driver `CSV'
## Warning: no simple feature geometries present: returning a data.frame or tbl_df
trips$id <- as.integer(trips$id)

stations <- right_join(stations, trips)   # join trip records to station geometries by id

full_boundary <- st_read("C:/Users/Kyle McCarthy/Documents/MUSA_PubPolicy/hw5/Neighborhoods_Philadelphia (1).geojson") %>%
  st_transform("ESRI:102729") 
## Reading layer `Neighborhoods_Philadelphia (1)' from data source `C:\Users\Kyle McCarthy\Documents\MUSA_PubPolicy\hw5\Neighborhoods_Philadelphia (1).geojson' using driver `GeoJSON'
## Simple feature collection with 158 features and 8 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -75.28027 ymin: 39.867 xmax: -74.95576 ymax: 40.13799
## geographic CRS: WGS 84
phil_boundary1 <- full_boundary %>%
  filter(listname %in% c("Cedar Park", "University City", "Rittenhouse", "Society Hill",
                         "Fairmount", "Old City", "Center City East", "Fitler Square",
                         "Powelton", "Mantua", "Spruce Hill", "Graduate Hospital",
                         "Logan Square", "Washington Square West", "Grays Ferry",
                         "Point Breeze", "Newbold", "Queen Village", "Hawthorne",
                         "Callowhill", "Spring Garden", "Chinatown", "Dickinson Narrows",
                         "East Passyunk", "Greenwich", "Pennsport", "Passyunk Square",
                         "Woodland Terrace", "West Powelton", "Francisville",
                         "West Poplar", "East Poplar", "Northern Liberties",
                         "Garden Court", "Walnut Hill", "Bella Vista", "West Passyunk"))




phil_boundary <-
  phil_boundary1 %>%
  summarise(shape_area= sum(shape_area)) %>%
  st_transform("ESRI:102729")


zip <- st_read( "http://data.phl.opendata.arcgis.com/datasets/b54ec5210cee41c3a884c9086f7af1be_0.geojson")%>%
  filter(CODE == "19104") %>%
  st_transform("ESRI:102729")
## Reading layer `Zipcodes_Poly' from data source `http://data.phl.opendata.arcgis.com/datasets/b54ec5210cee41c3a884c9086f7af1be_0.geojson' using driver `GeoJSON'
## Simple feature collection with 48 features and 5 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -75.28031 ymin: 39.86747 xmax: -74.95575 ymax: 40.13793
## geographic CRS: WGS 84

Study Region

A subset of Philadelphia neighborhoods was selected to analyze the regions where bike share is most popular.

ggplot() + 
  geom_sf(data = full_boundary, colour = "#FFFFFF") +
  geom_sf(data = phil_boundary, colour = "black") +
  labs(title = "Study Region within Philadelphia") + 
  theme_classic() + theme(legend.position = "none") 

trip2 <- stations %>%
  mutate(interval60 = floor_date(mdy_hm(start_time), unit = "hour"),
         interval15 = floor_date(mdy_hm(start_time), unit = "15 mins"),
         week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  st_transform("ESRI:102729")
#glimpse(stations)

Importing Weather Data

weather.Panel <- 
  riem_measures(station = "PHL", date_start = "2019-10-01", date_end = "2020-01-01") %>%
  dplyr::select(valid, tmpf, p01i, sknt) %>%
  replace(is.na(.), 0) %>%
  mutate(interval60 = ymd_h(substr(valid, 1, 13))) %>%
  group_by(interval60) %>%
  summarize(Temperature = max(tmpf),       # degrees Fahrenheit
            Precipitation = sum(p01i),     # hourly precipitation, inches
            Wind_Speed = max(sknt)) %>%    # knots
  mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))  # impute missing (zeroed) readings

Bike Share Trips per Hour in Philadelphia

Note the effect of weather on bike share in Philadelphia: as temperatures fall through the winter, usage decreases drastically.

ggplot(trip2 %>%
         group_by(interval60) %>%
         tally())+
  geom_line(aes(x = interval60, y = n))+
  labs(title="Bike share trips per hr. Philadelphia",
       x="Date", 
       y="Number of trips")+
  theme_classic()

Fishnet

A fishnet grid was created so that grid cells, rather than individual stations, serve as the unit of analysis. A map showing the count of trips within each cell is shown below. Cells are hexagonal and roughly 900 feet across (the projection, ESRI:102729, uses feet).

fishnet <- 
  st_make_grid(phil_boundary, cellsize = 900, square = FALSE) %>%
  .[phil_boundary] %>%
  st_sf()%>%
  mutate(uniqueID = rownames(.))


#bikenet
bikenet <- 
  dplyr::select(trip2) %>% 
  mutate(Trip_Count = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(Trip_Count = replace_na(Trip_Count, 0),
         uniqueID = rownames(.),
         cvID = sample(round(nrow(fishnet) / 24), size=nrow(fishnet), replace = TRUE))

ggplot() +
  geom_sf(data = bikenet, aes(fill = Trip_Count)) +
  scale_fill_viridis() +
  labs(title = "Count of Trips October-December 2019")+ 
  theme_map()

Ride Panel

bikenet2 <-
  st_intersection(trip2, bikenet)

ride.panel <- 
  bikenet2 %>%
    mutate(Trip_Counter = 1) %>%
      group_by(interval60, uniqueID) %>%
      summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
      left_join(weather.Panel, by = "interval60") %>%
          mutate(week = week(interval60),
                dotw = wday(interval60, label = TRUE)) %>%
          st_sf()

More Engineered Features: Training and Testing Sets

The training set comprises weeks 41-43 (October), and the testing set weeks 44-45 (early November). Engineered features include proximity to bike lanes, tree canopy, and SEPTA stations, along with time-lag features to capture weekday and rush-hour patterns.

# ride.panel <- 
#   ride.panel%>%
#   mutate(distbike = round(st_distance(ride.panel, bikepath))*100, 2)

septa <- st_read("https://opendata.arcgis.com/datasets/8c6e2575c8ad46eb887e6bb35825e1a6_0.geojson")%>%
  st_transform("ESRI:102729")
## Reading layer `b151c01f-c134-4291-aa32-4ba46b748ae3202041-1-q1f0k3.d5lrh' from data source `https://opendata.arcgis.com/datasets/8c6e2575c8ad46eb887e6bb35825e1a6_0.geojson' using driver `GeoJSON'
## Simple feature collection with 28 features and 45 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -75.25973 ymin: 39.94989 xmax: -75.07788 ymax: 40.02299
## geographic CRS: WGS 84
tree <- st_read("https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+ppr_tree_canopy_points_2015&filename=ppr_tree_canopy_points_2015&format=geojson&skipfields=cartodb_id")%>%
   st_transform("ESRI:102729")
## Reading layer `OGRGeoJSON' from data source `https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+ppr_tree_canopy_points_2015&filename=ppr_tree_canopy_points_2015&format=geojson&skipfields=cartodb_id' using driver `GeoJSON'
## Simple feature collection with 2480 features and 2 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -75.26242 ymin: 39.88838 xmax: -74.95928 ymax: 40.13635
## geographic CRS: WGS 84
bike_lanes <- st_read("C:/Users/Kyle McCarthy/Documents/MUSA_PubPolicy/bikelane/bikepathspts_clip.shp")
## Reading layer `bikepathspts_clip' from data source `C:\Users\Kyle McCarthy\Documents\MUSA_PubPolicy\bikelane\bikepathspts_clip.shp' using driver `ESRI Shapefile'
## Simple feature collection with 3391 features and 8 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: -75.22771 ymin: 39.91895 xmax: -75.11801 ymax: 39.99007
## geographic CRS: WGS 84
st_c <- st_coordinates   # shorthand used in the nn_function calls below

ride.panel <- 
  ride.panel %>% 
    arrange(uniqueID, interval60) %>% 
    group_by(uniqueID) %>% 
    mutate(lagHour = dplyr::lag(Trip_Count,1),
           lag2Hours = dplyr::lag(Trip_Count,2),
           lag3Hours = dplyr::lag(Trip_Count,3),
           lag4Hours = dplyr::lag(Trip_Count,4),
           lag12Hours = dplyr::lag(Trip_Count,12),
           lag1day = dplyr::lag(Trip_Count,24)) %>% 
   ungroup()%>%
  na.omit()

ride.panel <-
  ride.panel %>% 
  mutate(
    bike_nn1 = nn_function(st_c(st_centroid(ride.panel)), st_c(bike_lanes), 1),
    bike_nn2 = nn_function(st_c(st_centroid(ride.panel)), st_c(bike_lanes), 2),
    bike_nn3 = nn_function(st_c(st_centroid(ride.panel)), st_c(bike_lanes), 3), 
    bike_nn4 = nn_function(st_c(st_centroid(ride.panel)), st_c(bike_lanes), 4),
    bike_nn5 = nn_function(st_c(st_centroid(ride.panel)), st_c(bike_lanes), 5))

ride.panel <-
  ride.panel %>% 
  mutate(
    septa_nn1 = nn_function(st_c(st_centroid(ride.panel)), st_c(septa), 1),
    septa_nn2 = nn_function(st_c(st_centroid(ride.panel)), st_c(septa), 2),
    septa_nn3 = nn_function(st_c(st_centroid(ride.panel)), st_c(septa), 3), 
    septa_nn4 = nn_function(st_c(st_centroid(ride.panel)), st_c(septa), 4),
    septa_nn5 = nn_function(st_c(st_centroid(ride.panel)), st_c(septa), 5))

ride.panel <-
  ride.panel %>% 
  mutate(
    tree_nn20 = nn_function(st_c(st_centroid(ride.panel)), st_c(tree), 20),
    tree_nn40 = nn_function(st_c(st_centroid(ride.panel)), st_c(tree), 40),
    tree_nn60 = nn_function(st_c(st_centroid(ride.panel)), st_c(tree), 60), 
    tree_nn80 = nn_function(st_c(st_centroid(ride.panel)), st_c(tree), 80),
    tree_nn100 = nn_function(st_c(st_centroid(ride.panel)), st_c(tree), 100))

ride.Train <- filter(ride.panel, week %in% 41:43)
ride.Test <- filter(ride.panel, week %in% 44:45)


st_drop_geometry(rbind(
  mutate(ride.Train, Legend = "Training"), 
  mutate(ride.Test, Legend = "Testing"))) %>%
    group_by(Legend, interval60) %>% 
      summarize(Trip_Count = sum(Trip_Count)) %>%
      ungroup() %>% 
      ggplot(aes(interval60, Trip_Count, colour = Legend)) + geom_line() +
        scale_colour_manual(values = palette2) +
        labs(title="Rideshare trips by week: November-December",
             x="Day", y="Trip Count")     

Exploratory Analysis Depicting the Correlation of Model Variables

ride.panel.na <- 
  ride.panel%>%
  na.omit()

correlation.long <-
  st_drop_geometry(ride.panel.na) %>%
    dplyr::select(-dotw, - interval60, -uniqueID, - septa_nn3, - septa_nn4, -septa_nn5, -tree_nn20, -tree_nn40, -tree_nn80, -tree_nn100, -bike_nn1, -bike_nn2, -bike_nn4, -bike_nn5) %>%
    gather(Variable, Value, -Trip_Count)

correlation.long$Value <- as.double(correlation.long$Value )


correlation.cor <-
  correlation.long %>%
    group_by(Variable) %>%
    summarize(correlation = cor(Value, Trip_Count, use = "complete.obs"))
    
ggplot(correlation.long, aes(Value, Trip_Count)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.cor, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "red") +
  facet_wrap(~Variable, ncol = 3, scales = "free") +
  labs(title = "Trip Count as a Function of Indicators") +
  theme()

Moran's I Analysis

A Moran's I analysis was performed. To keep processing time manageable, a random subset of the data was taken (my computer was unable to load the full result). The output maps the trip count, the local Moran's I statistic, significant hotspots, and p-values. Note that spatial autocorrelation exists within the data, indicated by the difference between the study area's center and its edges.


ride.panel <- ride.panel %>%
  st_transform("ESRI:102729") %>%
  filter(week %in% 41:45)

bikenet <- 
  st_join(bikenet, ride.panel) %>%
  filter(week %in% 41:45)
  

bikenetsample <- sample_n(bikenet, 10000)%>%
  na.omit()

bikenetsample.nb <- poly2nb(as_Spatial(bikenetsample), queen=TRUE)
bikenetsample.weights <- nb2listw(bikenetsample.nb, style="W", zero.policy=TRUE)

bikenetsample.localMorans <- 
  cbind(
    as.data.frame(localmoran(bikenetsample$Trip_Count.x, bikenetsample.weights)),
    as.data.frame(bikenetsample)) %>% 
    st_sf() %>%
      dplyr::select(Trip_Count = Trip_Count.x, 
                    Local_Morans_I = Ii, 
                    P_Value = `Pr(z > 0)`) %>%
      mutate(Significant_Hotspots = ifelse(P_Value <= 0.05, 1, 0)) %>%
      gather(Variable, Value, -geometry)
  
vars <- unique(bikenetsample.localMorans$Variable)
varList <- list()

for(i in vars){
  varList[[i]] <- 
    ggplot() +
    geom_sf(data = phil_boundary)+
      geom_sf(data = filter(bikenetsample.localMorans, Variable == i), 
              aes(fill = Value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      theme_map() + theme(legend.position="right")}

do.call(grid.arrange, c(varList, ncol = 2, top = "Local Moran's I statistics, Indego counts from a random sample"))

Animation

The animation below depicts bike share usage on October 8th, 2019. There are distinct rises in usage throughout the day, shown by the highlighted cells on the map, and higher usage corresponds with rush-hour times; note that this date was a Tuesday.

# Despite the name, this sample spans weeks 41-45
week45 <-
  filter(bikenet2, week %in% 41:45)
week45 <- sample_n(week45, 5000)

week45.panel <-
  expand.grid(
    interval15 = unique(week45$interval15),
    uniqueID = unique(bikenet2$uniqueID))%>%
  na.omit()

week45.panel <- sample_n(week45.panel, 5000)


ride.animation.data <-
  mutate(week45, Trip_Counter = 1) %>%
    group_by(interval15, uniqueID) %>%
    summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>% 
    ungroup() %>% 
    st_join(bikenetsample, by=c("uniqueID" = "uniqueID.x")) %>%
    st_sf() %>%
    mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
                             Trip_Count > 0 & Trip_Count <= 3 ~ "1-3 trips",
                             Trip_Count > 3 & Trip_Count <= 6 ~ "4-6 trips",
                             Trip_Count > 6 & Trip_Count <= 10 ~ "7-10 trips",
                             Trip_Count > 10 ~ "11+ trips")) %>%
    mutate(Trips = fct_relevel(Trips, "0 trips", "1-3 trips", "4-6 trips",
                               "7-10 trips", "11+ trips"))


ride.animation.data <- 
  ride.animation.data%>%
  mutate(day = day(interval15))
ride.animation.data <- 
  ride.animation.data%>%
  filter(week == 41 & day == 8)

ride.animation.data <- sample_n(ride.animation.data, 1000)
  
  
 
rideshare_animation <-
   ggplot() +
   geom_sf(data = phil_boundary) + 
   geom_sf(data = fishnet, fill = "#070000") +
   geom_sf(data = ride.animation.data, aes(fill = Trips), colour = "#F7FA15") +
   scale_fill_manual(values = palette5) +
   labs(title = "Bike Share Animated Map: October 8th Throughout the Day",
        subtitle = "15 minute intervals: {current_frame}") +
   transition_manual(interval15) +
   theme_map() + theme(legend.position = "none")
 
 animate(rideshare_animation, duration=20, renderer = gifski_renderer())

 anim_save("rideshare_local", rideshare_animation, duration=20, renderer = gifski_renderer())

Rain/Snow vs. No Precipitation

Precipitation decreases bike share usage, as the bar chart below depicts.

st_drop_geometry(ride.panel) %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Precipitation = first(Precipitation)) %>%
  mutate(isPrecip = ifelse(Precipitation > 0, "Rain/Snow", "No Precipitation")) %>%
  group_by(isPrecip) %>%
  summarize(Mean_Trip_Count = mean(Trip_Count)) %>%
    ggplot(aes(isPrecip, Mean_Trip_Count, fill = isPrecip)) + geom_bar(stat = "identity") +
      labs(title="Does Ridership Vary with Precipitation?",
           x="Precipitation", y="Mean Trip Count") + theme_cowplot() + theme(legend.position = "none")

Temperature vs. Trip Count

Typically, bike share usage increases with temperature. However, the strength of this relationship varies from week to week.

st_drop_geometry(ride.panel) %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Temperature = first(Temperature)) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Temperature, Trip_Count)) + 
    geom_point() + geom_smooth(method = "lm", se= FALSE) +
    facet_wrap(~week, ncol=7) + 
    labs(title="Trip Count as a fuction of Temperature by week",
         x="Temperature", y="Mean Trip Count") + theme()+ theme_cowplot()

Regressions

reg1 <- lm(Trip_Count ~  hour(interval60) + dotw + Temperature,  data=ride.Train)

reg2 <- lm(Trip_Count ~  uniqueID + dotw + Temperature,  data=ride.Train)

reg3 <- lm(Trip_Count ~  uniqueID + hour(interval60) + dotw + Temperature , 
                         data=ride.Train)

reg4 <- lm(Trip_Count ~  uniqueID +  hour(interval60) + dotw + Temperature +
                         lagHour + lag2Hours + lag3Hours + lag12Hours + lag1day, 
                         data=ride.Train)

reg5 <- lm(Trip_Count ~  uniqueID +  hour(interval60) +  dotw + lagHour + lag2Hours + lag3Hours + lag12Hours + lag1day + septa_nn2 + Precipitation  + Wind_Speed + Temperature + tree_nn20 + bike_nn5,  data=ride.Train)

Comparing Models with purrr

The purrr family of functions is used to loop through a set of nested data frames and efficiently compare the five regression models above. The workflow nests the test set by week, then loops through each model for each week, storing the predictions as list columns. These results are then gathered, and four new columns are generated (Observed, Absolute_Error, MAE, and sd_AE). To calculate the mean absolute error (MAE, the mean of |observed - predicted|), map_dbl, a variant of map, loops through Absolute_Error and takes its mean; the same function computes the standard deviation of the absolute error, sd_AE, a useful measure of generalizability (Steif, 2020). The final result is a grouped bar chart comparing MAE by model for test weeks 44 and 45.

ride.panel <- 
  ride.panel %>% 
    arrange(uniqueID, interval60) %>% 
    group_by(uniqueID) %>% 
    mutate(lagHour = dplyr::lag(Trip_Count,1),
           lag2Hours = dplyr::lag(Trip_Count,2),
           lag3Hours = dplyr::lag(Trip_Count,3),
           lag4Hours = dplyr::lag(Trip_Count,4),
           lag12Hours = dplyr::lag(Trip_Count,12),
           lag1day = dplyr::lag(Trip_Count,24)) %>% 
   ungroup()%>%
  na.omit()

ride.Train <- filter(ride.panel, week %in% 41:43)
ride.Test <- filter(ride.panel, week %in% 44:45)


ride.Test.weekNest <- 
  as.data.frame(ride.Test) %>%
  nest(-week) 

ride.Test.weekNest
## # A tibble: 2 x 2
##    week data                 
##   <dbl> <list>               
## 1    44 <tibble [6,789 x 29]>
## 2    45 <tibble [6,781 x 29]>
model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}

week_predictions <- 
  ride.Test.weekNest %>% 
    mutate(A_Time_FE = map(.x = data, fit = reg1, .f = model_pred),
           B_Space_FE = map(.x = data, fit = reg2, .f = model_pred),
           C_Space_Time_FE = map(.x = data, fit = reg3, .f = model_pred),
           D_Space_Time_Lags = map(.x = data, fit = reg4, .f = model_pred),
           E_Space_Time_Lags_Features =map(.x = data, fit = reg5, .f = model_pred))

week_predictions <- week_predictions %>%  
    gather(Regression, Prediction, -data, -week) %>% 
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean),
           sd_AE = map_dbl(Absolute_Error, sd))

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette5) +
    labs(title = "Mean Absolute Errors by model specification and week") + theme()

purrr Prediction Results

Below is a graph generated from the purrr workflow, comparing mean predicted and observed trip counts at each hourly interval.

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         uniqueID = map(data, pull, uniqueID)) %>%
  dplyr::select(interval60, uniqueID, Observed, Prediction, Regression) %>%
  unnest() %>%
  gather(Variable, Value, -Regression, -interval60, -uniqueID) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = mean(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + geom_line(size = 1.1) + 
      facet_wrap(~Regression, ncol=1) +
      scale_colour_manual(values = palette2) +
      labs(title = "Mean Predicted/Observed ride share by hourly interval", 
           x = "Hour", y= "Rideshare Trips") + theme()

Mapping Prediction Error

Below is a map of the mean absolute error, depicting spatially where error is highest. The error is highest toward the center of the study area, which tends to be more populated. Consequently, the regression model could be further improved by accounting for these high-density regions.
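As a hedged illustration of that improvement, a population-density feature could be joined to the panel using tidycensus (already loaded above). This is a minimal sketch, assuming a Census API key is configured; the ACS variable ID (B01003_001, total population), the 2019 five-year vintage, and the centroid join are assumptions, not part of the original analysis.

# Sketch only: tract-level population density as a candidate predictor.
# Assumes census_api_key() has been set; B01003_001 = ACS total population.
pop <- get_acs(geography = "tract", variables = "B01003_001",
               state = "PA", county = "Philadelphia",
               year = 2019, geometry = TRUE) %>%
  st_transform("ESRI:102729") %>%
  mutate(pop_density = estimate / as.numeric(st_area(geometry)))  # persons per sq ft

# Tag each panel row with the density of the tract containing its centroid
ride.panel$pop_density <-
  st_centroid(ride.panel) %>%
  st_join(pop["pop_density"]) %>%
  st_drop_geometry() %>%
  pull(pop_density)

pop_density could then be added to reg5 as an additional predictor.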

error.byWeek <-
  filter(week_predictions, Regression == "D_Space_Time_Lags") %>% 
  unnest() %>% st_sf() %>%
  dplyr::select(uniqueID, Absolute_Error, week, geometry) %>%
  gather(Variable, Value, -uniqueID, -week, -geometry) %>%
    group_by(Variable, uniqueID, week) %>%
    summarize(MAE = mean(Value))

fishnet1<- 
  st_join(fishnet, error.byWeek)%>%
  na.omit()

error.byWeek <- 
  error.byWeek%>%
  na.omit()

  ggplot() +
  geom_sf(data = phil_boundary) +
  geom_sf(data = fishnet1, aes(fill = MAE))+
  facet_wrap(~week) +
  scale_fill_viridis() +
  labs(title = "Mean Average Error by Week")+
  theme_map()

Cross Validation

Finally, k-fold cross-validation was used to test how the model performs across space. The mean absolute error for each hold-out fold is mapped and tabulated below.
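The chunk that ran the cross-validation was not echoed in the knit. Below is a minimal sketch of a leave-one-group-out routine that would produce the error_by_reg_and_fold object used in the next sections, following the crossValidate pattern in Steif (2020); the fold id (cvID), the lm specification, and the predictor list are assumptions inferred from the rest of this document.

# Sketch only: hold out one cvID fold at a time, train on the rest,
# and collect out-of-fold predictions (after Steif, 2020)
crossValidate <- function(dataset, id, dependentVariable, indVariables) {
  allPredictions <- data.frame()
  for (thisFold in unique(dataset[[id]])) {
    cat("This hold out fold is", thisFold, "\n")
    fold.train <- filter(dataset, dataset[[id]] != thisFold)
    fold.test  <- filter(dataset, dataset[[id]] == thisFold)
    regression <- lm(reformulate(indVariables, dependentVariable),
                     data = st_drop_geometry(fold.train))
    fold.test$Prediction <- predict(regression, st_drop_geometry(fold.test))
    allPredictions <- rbind(allPredictions, fold.test)
  }
  st_sf(allPredictions)
}

error_by_reg_and_fold <-
  crossValidate(dataset = na.omit(bikenet), id = "cvID",
                dependentVariable = "Trip_Count.x",
                indVariables = c("Temperature", "Precipitation", "Wind_Speed",
                                 "lagHour", "lag1day")) %>%
  group_by(cvID) %>%
  summarize(MAE = mean(abs(Prediction - Trip_Count.x), na.rm = TRUE)) %>%
  mutate(Regression = "Random k-fold CV: Spatial Process")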


K-Fold Regression Map

Note that higher error is associated with Philadelphia’s Central Business District.

fishnet2<- 
  st_join(fishnet, error_by_reg_and_fold)%>%
  na.omit()



fishnet2 %>%
  ggplot() +
  geom_sf(data = phil_boundary)+
    geom_sf(aes(fill = MAE)) +
    scale_fill_viridis() +
    labs(title = "Errors by K-Fold Spatial Cross Validation Regression") +
    theme_map() + theme(legend.position="right")

K-Fold Regression Table

st_drop_geometry(error_by_reg_and_fold) %>%
  group_by(Regression) %>% 
    summarize(Mean_MAE = round(mean(MAE), 2),
              SD_MAE = round(sd(MAE), 2)) %>%
  kable() %>%
    kable_styling("striped", full_width = F) %>%
    row_spec(1, color = "black", background = "#64E4F4") 
Regression                           Mean_MAE   SD_MAE
Random k-fold CV: Spatial Process        0.25     0.17

Conclusion

Overall, the model is decent but can undoubtedly be improved. It performs better on the temporal scale than on the spatial scale. The purrr comparisons indicate that feature engineering played a large role in producing a better model: predicted outcomes were closer to observed outcomes in the models with more engineered features, which account for temporal, spatial, and weather effects.

Spatially, the model struggles most in Center City, within Philadelphia's Central Business District. If Indego relied on this model, there would likely be a shortage of bikes in higher-density regions. This is especially problematic during rush hour, when demand is highest, and would reduce Indego's revenue. Still, although the mean absolute error is higher toward the center of the study area, it is only about 1 trip, which is not incredibly high. Temporally, the model predicts bike share usage well through most of the day, but not at times of peak usage; if put into production as-is, it would leave a shortage of bikes during rush hour. The model performs at about the same level for weeks 44 and 45, and, as with the cross-validation results, its mean absolute error is only about 1 trip.

The model can be improved through continued feature engineering. Features capturing population density would address the spatial errors, and distance to commercial industry is another candidate, since these are regions people are more likely to visit and where more bikes should be available; unfortunately, the appropriate data was not available (a hypothetical sketch follows below). Features must also be engineered to better capture peak rush-hour demand. The methodology itself need not change and is capable of producing an accurate model, but the model's inputs should be refined before it is put into production. Lastly, a proper model is essential so that Indego can rebalance its resources effectively: a useful model tells the company when and where to allocate bikes, where to add new stations, and which customers to market to so that more people use the service.
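As a sketch of that distance-to-commercial feature, the existing nn_function could be reused, under the assumption that a point layer of commercial corridors were available; "commercial_points.shp" below is a hypothetical file, not a real source.

# Sketch only: mean distance from each cell centroid to its 3 nearest
# commercial points ("commercial_points.shp" is hypothetical)
commercial <- st_read("commercial_points.shp") %>%
  st_transform("ESRI:102729")

ride.panel <- ride.panel %>%
  mutate(commercial_nn3 = nn_function(st_c(st_centroid(ride.panel)),
                                      st_c(commercial), 3))

commercial_nn3 would then enter reg5 alongside septa_nn2, tree_nn20, and bike_nn5.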