library(tidyverse)
library(igraph)Spatial network analysis
Spatial networks are a specific type of network in which nodes and edges are embedded in space. Each node in the network represents a location in space, and each edge represents a connection between these locations that can have various interpretations, such as a physical connection (like a road or a flight route), a communication link, or a social relationship.
This type of network is very important in fields such as transportation, logistics, urban planning, telecommunication, and epidemiology, where the spatial distribution of nodes and connections significantly influences the dynamics of the system.
The key characteristic that differentiates spatial networks from other types of networks (like social networks, biological networks, etc.) is the inclusion of spatial information, like geographic location or physical distance, in the network’s structure. This makes them particularly suitable for modeling and understanding systems where space and distance play an important role.
Bikeshare in New York
We will analyze Bikeshare use in New York city using network analysis. Spatial network is not too different from other network analysis. The most important is to identify how to represent spatial features as nodes and edges in a network. Here, we represent bike share stations as nodes and trips as edges. The locations of bike share stations represent the main spatial feature we need to consider. We will see how space influences the network as we move along the tutorial.
Download the file 201806-citibike-tripdata.csv from Canvas into your project folder. We clean the column names. The S and E at the beginning of column names below represent starting and ending points of bike trips.
tripdata <- read_csv("201806-citibike-tripdata.csv") %>%
rename(Slat = `start station latitude`,
Slon = `start station longitude`,
Elat = `end station latitude`,
Elon = `end station longitude`,
Sstid = `start station id`,
Estid = `end station id`,
Estname = `end station name`,
Sstname = `start station name`)Rows: 1953052 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): start station name, end station name, name_localizedValue0, usertype
dbl (10): tripduration, start station id, start station latitude, start sta...
dttm (2): starttime, stoptime
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Some networks have self-loops. This is a case when a node is connected to itself. In our case, it’s when a bike trip begins and end at the same station. Since we are interested in the relationship between stations at different locations, we are filtering out self loops.
We convert a data frame to a graph. Note directed = T. In our case, we are differentiating between stations being a point of origin and a point of destination. We want to explore which stations are popular as origin and which ones are popular as destinations.
IGRAPH 4803d64 DN-- 773 1909858 --
+ attr: name (v/c)
+ edges from 4803d64 (vertex names):
[1] 72->173 72->477 72->457 72->379 72->459 72->446 72->212 72->458
[9] 72->212 72->514 72->514 72->465 72->173 72->524 72->173 72->212
[17] 72->382 72->462 72->3141 72->456 72->519 72->328 72->426 72->3659
[25] 72->359 72->525 72->405 72->458 72->457 72->3163 72->426 72->328
[33] 72->376 72->3459 72->304 72->457 72->486 72->527 72->490 72->459
[41] 72->459 72->3178 72->3178 72->515 72->459 72->388 72->500 72->508
[49] 72->173 72->426 72->368 72->3258 72->532 72->478 72->514 72->533
[57] 72->519 72->423 72->3457 72->525 72->525 72->490 72->529 72->405
+ ... omitted several edges
Properties of the graph
+ 6/773 vertices, named, from 4803d64:
[1] 72 79 82 83 119 120
+ 6/1909858 edges from 4803d64 (vertex names):
[1] 72->173 72->477 72->457 72->379 72->459 72->446
Creating a unique list of station IDs
tmp1 <- diffdesttrips %>%
group_by(Sstid) %>%
summarise(
stname = first(Sstname),
lon = first(Slon),
lat = first(Slat))%>%
rename(stid = Sstid)
tmp2 <- diffdesttrips %>%
group_by(Estid) %>%
summarise(
stname = first(Estname),
lon = first(Elon),
lat = first(Elat)) %>%
rename(stid = Estid)
station_locs <- rbind(tmp1, tmp2) %>% unique()We visualize a sample of the graph using ggraph
set.seed(200) # For reproducibility because of randomisation below
station_sample <- sample(V(trips_graph), 20)
sub_trips <- induced_subgraph(trips_graph, station_sample)
# plot using ggraph
library(ggraph)
ggraph(sub_trips, layout = 'kk') +
geom_edge_fan(show.legend = FALSE) +
geom_node_point()Warning: Using the `size` aesthetic in this geom was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` in the `default_aes` field and elsewhere instead.
Degrees and their distribution
The degree of a node is the number of connections or edges it has with other nodes. For a directed network (where connections have a direction, like “A follows B” on Twitter), we differentiate between “in-degree” (number of incoming connections) and “out-degree” (number of outgoing connections).
The degree of a node is a fundamental property and can give insights into the role and importance of specific nodes in the network. For instance, nodes with a high degree could be seen as central or important because they have many connections. In a social network, a person (node) with a high degree might be a social influencer, while in a web graph, a page (node) with a high degree might be a major hub that many other pages link to. In our case, we are interested in stations that are popular in bike trips.
geom_density is a smoothed version of a histogram, useful for continuous data with large numbers.
What do you interpret about degrees and their distributions from the graph?
In particular, you want to look for the outliers in the distribution. For example to find the stations with only end trips but no start trips, i.e stations that are solely destinations
[1] "3245" "3267" "3275" "3192" "3276" "3184" "3214" "3651" "3198" "3199"
[11] "3203"
We can also visualise these stations that are popular origins on the map. However, we need to attach the values to the station locations tibble.
m1 <- station_locs %>%
leaflet() %>%
addProviderTiles(providers$Stamen.TonerLite, group = "Basemap") %>%
addCircles(
lng = station_locs$lon,
lat = station_locs$lat,
radius = (station_locs$Outdegree - mean(station_locs$Outdegree))/sd(station_locs$Outdegree) * 30,
fillOpacity = .6,
fillColor = Npal(station_locs$Outdegree),
group = 'Stations',
stroke=FALSE) %>%
addLegend("topleft",
pal = Npal,
values = ~Outdegree,
title = "Out degree/Trip Starts",
opacity = 1
)
m1You can also find out which stations are predominantly origin stations and which are predominantly destinations. I use 20% more as a cut off (arbitrary).
[1] "216" "258" "266" "289" "339" "356" "391" "397" "399" "443"
[11] "469" "3152" "3161" "3164" "3177" "3302" "3316" "3366" "3536" "3539"
[21] "3542" "3620" "3623"
Centrality measures
We use page_rank technique to identify addresses of central stations. Page Rank was how initial versions of Google searches ranked web pages related to particular search queries.
tmp <- page_rank(trips_graph)$vector %>%
boxplot.stats() %>%
.[["out"]] %>% ## Box plot identifies the oultiers, outside the whiskers. The default value is 1.5 outside the box.
names() %>%
as.numeric()
station_locs %>%
filter(stid %in% tmp) %>%
select('stname')# A tibble: 17 × 1
stname
<chr>
1 Broadway & E 14 St
2 Christopher St & Greenwich St
3 Carmine St & 6 Ave
4 Centre St & Chambers St
5 Broadway & E 22 St
6 West St & Chambers St
7 W 21 St & 6 Ave
8 W 41 St & 8 Ave
9 8 Ave & W 33 St
10 W 33 St & 7 Ave
11 E 17 St & Broadway
12 Broadway & W 60 St
13 12 Ave & W 40 St
14 Pershing Square North
15 Central Park S & 6 Ave
16 South End Ave & Liberty St
17 Kent Ave & N 7 St
Community Detection
Finding a community structure in a network is to identify clusters of nodes that are more strongly connected within the cluster than to nodes outside the cluster. There are many algorithms that identify community structures. Some work on weighted networks and some work on digraphs. In any case it is better to convert multigraph (graph with two nodes having multiple edges) into a simple graph with edge weights, representing the number of trips.
[1] FALSE
[1] TRUE
[1] TRUE
[1] TRUE
What do the reposnses to the above logical queries mean?
Walktrap algorithm finds the communities by random walks in the graph. The intuition is that random walks are more likely to be within a community that across communities.
wlktrp_mmbr <- data.frame (clstrmm = cluster_walktrap(station_graph)$membership %>% as.factor,
stid = V(station_graph)$name %>% as.numeric()
) %>% as_tibble()
wlktrp_mmbr$clstrmm %>% summary() 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
80 218 207 61 3 195 1 1 1 1 1 1 1 1 1
There seem to be 5 major communities in the spatial network. We can visualize the communities.
The legacy packages maptools, rgdal, and rgeos, underpinning this package
will retire shortly. Please refer to R-spatial evolution reports on
https://r-spatial.org/r/2023/05/15/evolution4.html for details.
This package is now running under evolution status 0
ℹ Google's Terms of Service: <https://mapsplatform.google.com>
ℹ Please cite ggmap if you use it! Use `citation("ggmap")` for details.
Linking to GEOS 3.11.2, GDAL 3.6.2, PROJ 9.2.0; sf_use_s2() is TRUE
bnds <- station_locs %>% st_as_sf(coords=c("lon",'lat')) %>% st_bbox %>% unname
nybg <- get_stamenmap(bbox=bnds, maptype = 'toner-lite', zoom = 11)ℹ Map tiles by Stamen Design, under CC BY 3.0. Data by OpenStreetMap, under ODbL.
library(RColorBrewer)
numColors <- levels(station_locs$clstrmm) %>% length()
myColors <- colorRampPalette(brewer.pal(8,"Dark2"))(numColors) # Expanding the number of colors available from 8
ggmap(nybg)+
geom_point(aes(x=lon, y=lat, color=clstrmm), alpha=.9, data=station_locs)+
scale_color_manual(values = myColors)+
scale_x_continuous("", breaks=NULL)+
scale_y_continuous("", breaks=NULL)+
theme(panel.background = element_rect(fill='white',colour='white'), legend.position = "none") +
labs(colour = "Communities")Scale for x is already present.
Adding another scale for x, which will replace the existing scale.
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
What are your interpretations from the above visualization about NY’s general behavior regarding bike trips?
This tutorial borrows heavily and expands on PLAN672 tutorial on spatial network analysis.