Spatial network analysis

Spatial networks are a specific type of network in which nodes and edges are embedded in space. Each node in the network represents a location in space, and each edge represents a connection between these locations that can have various interpretations, such as a physical connection (like a road or a flight route), a communication link, or a social relationship.

This type of network is very important in fields such as transportation, logistics, urban planning, telecommunication, and epidemiology, where the spatial distribution of nodes and connections significantly influences the dynamics of the system.

The key characteristic that differentiates spatial networks from other types of networks (like social networks, biological networks, etc.) is the inclusion of spatial information, like geographic location or physical distance, in the network’s structure. This makes them particularly suitable for modeling and understanding systems where space and distance play an important role.

Bikeshare in New York

We will analyze Bikeshare use in New York city using network analysis. Spatial network is not too different from other network analysis. The most important is to identify how to represent spatial features as nodes and edges in a network. Here, we represent bike share stations as nodes and trips as edges. The locations of bike share stations represent the main spatial feature we need to consider. We will see how space influences the network as we move along the tutorial.

library(tidyverse)
library(igraph)

Download the file 201806-citibike-tripdata.csv from Canvas into your project folder. We clean the column names. The S and E at the beginning of column names below represent starting and ending points of bike trips.

tripdata <- read_csv("201806-citibike-tripdata.csv") %>% 
  rename(Slat = `start station latitude`,
         Slon = `start station longitude`,
         Elat = `end station latitude`,
         Elon = `end station longitude`,
         Sstid = `start station id`,
         Estid = `end station id`,
         Estname = `end station name`,
         Sstname = `start station name`)
Rows: 1953052 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (4): start station name, end station name, name_localizedValue0, usertype
dbl  (10): tripduration, start station id, start station latitude, start sta...
dttm  (2): starttime, stoptime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Some networks have self-loops. This is a case when a node is connected to itself. In our case, it’s when a bike trip begins and end at the same station. Since we are interested in the relationship between stations at different locations, we are filtering out self loops.

diffdesttrips <- tripdata[tripdata$Estid != tripdata$Sstid, ] # to make sure there are no loops or self-connections.

We convert a data frame to a graph. Note directed = T. In our case, we are differentiating between stations being a point of origin and a point of destination. We want to explore which stations are popular as origin and which ones are popular as destinations.

trips_graph <- diffdesttrips %>% 
        select(Sstid,Estid) %>%
        graph.data.frame(directed = T)

trips_graph
IGRAPH 4803d64 DN-- 773 1909858 -- 
+ attr: name (v/c)
+ edges from 4803d64 (vertex names):
 [1] 72->173  72->477  72->457  72->379  72->459  72->446  72->212  72->458 
 [9] 72->212  72->514  72->514  72->465  72->173  72->524  72->173  72->212 
[17] 72->382  72->462  72->3141 72->456  72->519  72->328  72->426  72->3659
[25] 72->359  72->525  72->405  72->458  72->457  72->3163 72->426  72->328 
[33] 72->376  72->3459 72->304  72->457  72->486  72->527  72->490  72->459 
[41] 72->459  72->3178 72->3178 72->515  72->459  72->388  72->500  72->508 
[49] 72->173  72->426  72->368  72->3258 72->532  72->478  72->514  72->533 
[57] 72->519  72->423  72->3457 72->525  72->525  72->490  72->529  72->405 
+ ... omitted several edges

Properties of the graph

vcount(trips_graph)
[1] 773
ecount(trips_graph)
[1] 1909858
is.directed(trips_graph)
[1] TRUE
V(trips_graph) %>% head()
+ 6/773 vertices, named, from 4803d64:
[1] 72  79  82  83  119 120
E(trips_graph) %>% head()
+ 6/1909858 edges from 4803d64 (vertex names):
[1] 72->173 72->477 72->457 72->379 72->459 72->446

Creating a unique list of station IDs

tmp1 <- diffdesttrips %>%
  group_by(Sstid) %>%
  summarise(
    stname = first(Sstname),
    lon = first(Slon),
    lat = first(Slat))%>%
  rename(stid = Sstid)

tmp2 <- diffdesttrips %>%
  group_by(Estid) %>%
  summarise(
    stname = first(Estname),
    lon = first(Elon),
    lat = first(Elat)) %>%
  rename(stid = Estid)

station_locs <- rbind(tmp1, tmp2) %>% unique()

We visualize a sample of the graph using ggraph

set.seed(200) # For reproducibility because of randomisation below
station_sample <- sample(V(trips_graph), 20)
sub_trips <- induced_subgraph(trips_graph, station_sample)


# plot using ggraph
library(ggraph)
ggraph(sub_trips, layout = 'kk') + 
    geom_edge_fan(show.legend = FALSE) +
    geom_node_point()
Warning: Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` in the `default_aes` field and elsewhere instead.

Degrees and their distribution

The degree of a node is the number of connections or edges it has with other nodes. For a directed network (where connections have a direction, like “A follows B” on Twitter), we differentiate between “in-degree” (number of incoming connections) and “out-degree” (number of outgoing connections).

The degree of a node is a fundamental property and can give insights into the role and importance of specific nodes in the network. For instance, nodes with a high degree could be seen as central or important because they have many connections. In a social network, a person (node) with a high degree might be a social influencer, while in a web graph, a page (node) with a high degree might be a major hub that many other pages link to. In our case, we are interested in stations that are popular in bike trips.

degree(trips_graph, mode = 'out') %>%
  as_tibble %>%
  ggplot()+
  geom_density(aes(x=value))

geom_density is a smoothed version of a histogram, useful for continuous data with large numbers.

Exercise

What do you interpret about degrees and their distributions from the graph?

In particular, you want to look for the outliers in the distribution. For example to find the stations with only end trips but no start trips, i.e stations that are solely destinations

V(trips_graph)$name[degree(trips_graph, mode="out") == 0 & degree(trips_graph, mode="in") > 0]
 [1] "3245" "3267" "3275" "3192" "3276" "3184" "3214" "3651" "3198" "3199"
[11] "3203"

We can also visualise these stations that are popular origins on the map. However, we need to attach the values to the station locations tibble.

tmp <- degree(trips_graph, mode = 'out') %>%
  as_tibble() %>%
  rename(Outdegree = value)%>%
  mutate(stid = V(trips_graph)$name %>% as.numeric()) 


station_locs <- station_locs %>% 
  left_join(tmp, by='stid')

library(leaflet)


Npal <- colorNumeric(
   palette = "Reds", n = 5,
   domain = station_locs$Outdegree
 )
m1 <- station_locs %>%
    leaflet()  %>%
    addProviderTiles(providers$Stamen.TonerLite, group = "Basemap") %>%
    addCircles(
     lng = station_locs$lon,
     lat = station_locs$lat,
     radius = (station_locs$Outdegree - mean(station_locs$Outdegree))/sd(station_locs$Outdegree) * 30,
    fillOpacity = .6,
    fillColor = Npal(station_locs$Outdegree),
    group = 'Stations',
    stroke=FALSE) %>%
  addLegend("topleft", 
            pal = Npal, 
            values = ~Outdegree,
            title = "Out degree/Trip Starts",
            opacity = 1
  )
 
 m1

You can also find out which stations are predominantly origin stations and which are predominantly destinations. I use 20% more as a cut off (arbitrary).

V(trips_graph)$name[degree(trips_graph, mode="out") / degree(trips_graph, mode="in") > 1.2]
 [1] "216"  "258"  "266"  "289"  "339"  "356"  "391"  "397"  "399"  "443" 
[11] "469"  "3152" "3161" "3164" "3177" "3302" "3316" "3366" "3536" "3539"
[21] "3542" "3620" "3623"

Centrality measures

We use page_rank technique to identify addresses of central stations. Page Rank was how initial versions of Google searches ranked web pages related to particular search queries.

tmp  <- page_rank(trips_graph)$vector %>% 
    boxplot.stats() %>%
    .[["out"]] %>% ## Box plot identifies the oultiers, outside the whiskers. The default value is 1.5 outside the box.
     names() %>% 
    as.numeric()

station_locs %>% 
  filter(stid %in% tmp) %>%
  select('stname')
# A tibble: 17 × 1
   stname                       
   <chr>                        
 1 Broadway & E 14 St           
 2 Christopher St & Greenwich St
 3 Carmine St & 6 Ave           
 4 Centre St & Chambers St      
 5 Broadway & E 22 St           
 6 West St & Chambers St        
 7 W 21 St & 6 Ave              
 8 W 41 St & 8 Ave              
 9 8 Ave & W 33 St              
10 W 33 St & 7 Ave              
11 E 17 St & Broadway           
12 Broadway & W 60 St           
13 12 Ave & W 40 St             
14 Pershing Square North        
15 Central Park S & 6 Ave       
16 South End Ave & Liberty St   
17 Kent Ave & N 7 St            

Community Detection

Finding a community structure in a network is to identify clusters of nodes that are more strongly connected within the cluster than to nodes outside the cluster. There are many algorithms that identify community structures. Some work on weighted networks and some work on digraphs. In any case it is better to convert multigraph (graph with two nodes having multiple edges) into a simple graph with edge weights, representing the number of trips.

#we sum number of edges between stations
E(trips_graph)$weight <- 1
station_graph  <- simplify(trips_graph, edge.attr.comb="sum")
ecount(station_graph) == ecount(trips_graph)
[1] FALSE
sum(E(station_graph)$weight) == ecount(trips_graph)
[1] TRUE
all.equal(vcount(station_graph), vcount(trips_graph))
[1] TRUE
is.directed(station_graph)
[1] TRUE
Exercise

What do the reposnses to the above logical queries mean?

Walktrap algorithm finds the communities by random walks in the graph. The intuition is that random walks are more likely to be within a community that across communities.

wlktrp_mmbr <- data.frame (clstrmm  = cluster_walktrap(station_graph)$membership %>% as.factor, 
                          stid = V(station_graph)$name %>% as.numeric()
                          ) %>% as_tibble()


wlktrp_mmbr$clstrmm %>% summary()
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 
 80 218 207  61   3 195   1   1   1   1   1   1   1   1   1 

There seem to be 5 major communities in the spatial network. We can visualize the communities.

station_locs <- station_locs %>%
                inner_join(wlktrp_mmbr, by='stid')


library(ggmap)
The legacy packages maptools, rgdal, and rgeos, underpinning this package
will retire shortly. Please refer to R-spatial evolution reports on
https://r-spatial.org/r/2023/05/15/evolution4.html for details.
This package is now running under evolution status 0 
ℹ Google's Terms of Service: <https://mapsplatform.google.com>
ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
library(sf)
Linking to GEOS 3.11.2, GDAL 3.6.2, PROJ 9.2.0; sf_use_s2() is TRUE
bnds <- station_locs %>% st_as_sf(coords=c("lon",'lat')) %>% st_bbox %>% unname

nybg <- get_stamenmap(bbox=bnds, maptype = 'toner-lite', zoom = 11)
ℹ Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.
library(RColorBrewer)
numColors <- levels(station_locs$clstrmm) %>% length()
myColors <- colorRampPalette(brewer.pal(8,"Dark2"))(numColors) # Expanding the number of colors available from 8 

  ggmap(nybg)+
  geom_point(aes(x=lon, y=lat, color=clstrmm), alpha=.9, data=station_locs)+
  scale_color_manual(values = myColors)+ 
   scale_x_continuous("", breaks=NULL)+
   scale_y_continuous("", breaks=NULL)+
   theme(panel.background = element_rect(fill='white',colour='white'), legend.position = "none") +
  labs(colour = "Communities")
Scale for x is already present.
Adding another scale for x, which will replace the existing scale.
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf

Exercise

What are your interpretations from the above visualization about NY’s general behavior regarding bike trips?

This tutorial borrows heavily and expands on PLAN672 tutorial on spatial network analysis.