This notebook explores commuting data across municipalities in the Netherlands.

Utrecht

First, we will look at commuting data in a single province, Utrecht.

Load data, and select only municipalities in Utretcht:

## Reading layer `OGRGeoJSON' from data source `/Users/siyan/Documents/2018/T7 making maps/Lab10/gem_2016.geojson' using driver `GeoJSON'
## Warning: st_crs<- : replacing crs does not reproject data; use st_transform
## for that
## Simple feature collection with 479 features and 121 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 10425.16 ymin: 306846.2 xmax: 278026.1 ymax: 621876.3
## epsg (SRID):    28992
## proj4string:    +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +towgs84=565.4171,50.3319,465.5524,-0.398957,0.343988,-1.87740,4.0725 +units=m +no_defs
## Reading layer `OGRGeoJSON' from data source `/Users/siyan/Documents/2018/T7 making maps/Lab10/provinces.geojson' using driver `GeoJSON'
## Simple feature collection with 12 features and 3 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 3.307938 ymin: 50.75037 xmax: 7.227498 ymax: 53.57642
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs

The last line performs a spatial intersect and selects municipalities whose centroids fall inside of the Utrecht boundary.

Next, load our commuting data:

## Parsed with column specification:
## cols(
##   source = col_integer(),
##   sink = col_integer(),
##   count = col_integer(),
##   weight = col_integer()
## )

Plotting commute journeys

# plot all commute source-sink pairs, with line width weighted by interaction
commute <- commute %>%
  filter(sink != source) %>%
  filter(interaction > 20) %>%
  mutate(lineWidth = interaction / max(interaction) * 10)

ggplot(commute) + geom_sf(aes(size = lineWidth)) + scale_size_identity()

Modeling spatial relations

We can try to model the commuting relationship based on other factors. Here, we use the distance between municipalities, the number of residents in the source municipality, and the number of jobs in the sink municipality.

residents <- read_csv("residents.csv") %>% select(id, weight)
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   count = col_integer(),
##   weight = col_integer()
## )
jobs <- read_csv("jobs.csv") %>% select(id, weight)
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   count = col_integer(),
##   weight = col_integer()
## )
distMatrix <- st_distance(x = munis.centroid, y = munis.centroid)
rownames(distMatrix) <- munis.centroid$id
colnames(distMatrix) <- munis.centroid$id

dist <- list(
source = rownames(distMatrix)[row(distMatrix)] %||% row(distMatrix), sink = colnames(distMatrix)[col(distMatrix)] %||% col(distMatrix), distance = distMatrix
) %>%
map_dfc(as.vector) %>%
mutate(source = as.numeric(source)) %>% mutate(sink = as.numeric(sink))

# join dist, residents and jobs into commute
commute <- commute %>%
  left_join(dist) %>%
  left_join(residents, by = c('source' = 'id')) %>% 
  rename(residents = weight) %>%
  left_join(jobs, by = c('sink' = 'id')) %>% 
  rename(jobs = weight)
## Joining, by = c("source", "sink")
# quick check: interaction should decline as distance increases
ggplot(commute, aes(distance, interaction)) + geom_point()

ggplot(commute, aes(log(distance), log(interaction))) + geom_point()

Let’s try to fit a linear model first:

lm(data = commute, formula = interaction ~ residents + jobs + distance) %>% summary()
## 
## Call:
## lm(formula = interaction ~ residents + jobs + distance, data = commute)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -15770  -1150   -309    922  40734 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.553e+03  5.241e+02   4.872 1.47e-06 ***
## residents    3.234e-02  3.423e-03   9.449  < 2e-16 ***
## jobs         9.666e-02  4.960e-03  19.486  < 2e-16 ***
## distance    -1.965e-01  2.255e-02  -8.713  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4632 on 515 degrees of freedom
## Multiple R-squared:  0.5032, Adjusted R-squared:  0.5003 
## F-statistic: 173.9 on 3 and 515 DF,  p-value: < 2.2e-16

The fitted linear model has R-squared value of only 0.50, so it does not fit too well. This is not surprising, since the first graph in the above chunk doesn’t look like it has a linear relationship.

What about a log-log plot?

ggplot(commute, aes(interaction)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

model <- glm(data = commute, formula = interaction ~ log(residents) + log(jobs) + log(distance), family = poisson())

r2 <- function(empirical, fitted) { 
  return(cor(empirical, fitted)^2)
}
r2(commute$interaction, fitted(model))
## [1] 0.8639754
tidy(model)

From the histogram, we can see that there are many small connections, and few large ones. Thus, it makes sense to use a Poisson regression model. The r-squared value of 0.86 is much better than the that of the linear model!

As expected, the log(residents) and log(jobs) coefficients are positive while the log(distance) coefficient is negative. Residents and jobs have coefficients of close to 1, which means their effect is close to linear - in general, a doubling of the number of jobs at a municipality corresponds to a doubling of number of commutes to it. The cofficient of distance is lower than -1, so it has an exponential relationship with the number of commutes. When distance is doubled, the interaction will drop by more than half.

Plotting the fitted values:

commute <- commute %>%
  mutate(fitted = fitted(model)) %>% 
  mutate(residual = interaction - fitted) %>% 
  mutate(residualSign = sign(residual))

commute %>%
  mutate(lineWidth = fitted / max(fitted) * 10) %>% 
  ggplot() + geom_sf(aes(size = lineWidth)) + scale_size_identity()

commute %>%
  mutate(lineWidth = abs(residual) / max(residual) * 10) %>%
  ggplot() + geom_sf(aes(size = lineWidth, color = factor(residualSign), alpha = 0.4)) + scale_size_identity()

The fitted values plot looks similar to the earlier plot. The residuals plot shows that the commutes with the highest interaction tend to be under-estimated in our model. Hence, there are likely other factors that affect the interaction that we did not account for.

All provinces

Now, we can explore commuting relationships in the whole of the Netherlands.

Re-load our data since it was previously filtered to Urecht only:

## Reading layer `OGRGeoJSON' from data source `/Users/siyan/Documents/2018/T7 making maps/Lab10/gem_2016.geojson' using driver `GeoJSON'
## Warning: st_crs<- : replacing crs does not reproject data; use st_transform
## for that
## Simple feature collection with 479 features and 121 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 10425.16 ymin: 306846.2 xmax: 278026.1 ymax: 621876.3
## epsg (SRID):    28992
## proj4string:    +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +towgs84=565.4171,50.3319,465.5524,-0.398957,0.343988,-1.87740,4.0725 +units=m +no_defs
# commuting data
commute <- read_csv("commuting.csv") %>% 
  select(source, sink, weight) %>% 
  rename(interaction = weight) %>%
  filter(source!=sink & interaction>750)
## Parsed with column specification:
## cols(
##   source = col_integer(),
##   sink = col_integer(),
##   count = col_integer(),
##   weight = col_integer()
## )
commute <- commute %>%
  left_join(munis.centroid, by = c("source" = "id")) %>%
  left_join(munis.centroid, by = c("sink" = "id")) 
commute <- commute %>%
  rowwise() %>%
  mutate(geometry = st_combine(c(geometry.x,geometry.y)) %>% 
           st_cast("LINESTRING")) %>%
  select(-geometry.x, -geometry.y) %>% st_as_sf(crs = 28992)

commute <- commute %>%
  mutate(lineWidth = interaction / max(interaction) * 10)

ggplot(commute) + geom_sf(aes(size = lineWidth)) + scale_size_identity()

From the plot, we can see that there are some more “major” municipalities which have high interactions with other municipalities near them. Short-distance commutes is also much more prevalent than long-distance commutes.

Summary table

incoming <- commute %>% 
  group_by(sink) %>%
  rename(id = sink) %>%
  summarise(sum_incoming = sum(interaction))

st_geometry(incoming) <- NULL

incoming <- incoming %>%
  full_join(munis)
## Joining, by = "id"
incoming[order(incoming$sum_incoming, decreasing = TRUE), ] %>% head(5)

The top 3 municipalities with most incoming commuters are Amsterdam, Rotterdam and Utrecht.

Model fitting

We can fit a spatial interaction model using the same factors as above (distance, no. of residents in source, no. of jobs in sink).

distMatrix <- st_distance(x = munis.centroid, y = munis.centroid)
rownames(distMatrix) <- munis.centroid$id
colnames(distMatrix) <- munis.centroid$id

dist <- list(
source = rownames(distMatrix)[row(distMatrix)] %||% row(distMatrix), sink =
    colnames(distMatrix)[col(distMatrix)] %||% col(distMatrix), distance = distMatrix
    ) %>%
  map_dfc(as.vector) %>%
  mutate(source = as.numeric(source)) %>% mutate(sink = as.numeric(sink))

# join dist, residents and jobs into commute
commute <- commute %>%
  left_join(dist) %>%
  left_join(residents, by = c('source' = 'id')) %>% 
  rename(residents = weight) %>%
  left_join(jobs, by = c('sink' = 'id')) %>% 
  rename(jobs = weight)
## Joining, by = c("source", "sink")
# quick check: interaction should decline as distance increases
ggplot(commute, aes(distance, interaction)) + geom_point()

ggplot(commute, aes(log(distance), log(interaction))) + geom_point()

ggplot(commute, aes(interaction)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

model <- glm(data = commute, formula = interaction ~ log(residents) + log(jobs) + log(distance), family = poisson())

r2 <- function(empirical, fitted) { 
  return(cor(empirical, fitted)^2)
}
r2(commute$interaction, fitted(model))
## [1] 0.2824222
tidy(model)

The r-squared for this model is only 0.28, hence it is not a good model. The absolute value of the estimates are all smaller than the values for the Utrecht model. Hence, there are likely many other factors affecting people’s commuting patterns, outside of the ones we used. Also, as our interaction values are abruptly cut off at 750, the model fitting may have been affected.

commute <- commute %>%
  mutate(fitted = fitted(model)) %>% 
  mutate(residual = interaction - fitted) %>% 
  mutate(residualSign = sign(residual))

commute %>%
  mutate(lineWidth = fitted / max(fitted) * 10) %>% 
  ggplot() + geom_sf(aes(size = lineWidth)) + scale_size_identity() + geom_sf(data=munis)

commute %>%
  mutate(lineWidth = abs(residual) / max(residual) * 10) %>%
  ggplot() + geom_sf(aes(size = lineWidth, color = factor(residualSign))) + scale_size_identity()

commute %>%
  mutate(lineWidth = fitted / max(fitted) * 10) %>% 
  ggplot() + geom_sf(aes(size = lineWidth)) + scale_size_identity() + geom_sf(data=munis)

From the fitted and residual plots, most of the commutes with high interaction are significantly under-estimated.

fitted_incoming <- commute %>% 
  group_by(sink) %>%
  rename(id = sink) %>%
  summarise(sum_incoming = sum(fitted))

st_geometry(fitted_incoming) <- NULL

fitted_incoming <- fitted_incoming %>%
  full_join(munis)
## Joining, by = "id"
fitted_incoming[order(fitted_incoming$sum_incoming, decreasing = TRUE), ] %>% head(5)

However, the model still predicts the top few municipalities with most incoming commuters correctly.