Lab

In this lab, we will be looking at travel patterns in Netherlands, starting with Utrecht. We aim to perform statistical analysis on spatial ‘interaction’ data in R by creating models to predict how people commute.

First, we start with reading in the data.

Reading layer `OGRGeoJSON' from data source `/Users/arroyo/Desktop/stuff/Term 5/02221/Week 11/02221-Lab10/gem_2016.geojson' using driver `GeoJSON'
Simple feature collection with 479 features and 121 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: 10425.16 ymin: 306846.2 xmax: 278026.1 ymax: 621876.3
epsg (SRID):    28992
proj4string:    +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +towgs84=565.4171,50.3319,465.5524,-0.398957,0.343988,-1.87740,4.0725 +units=m +no_defs

We only care about the province of Utrecht for this part of the lab. So let’s select only those municipalitites that fall within this province.

provinces <- st_read("provinces.geojson") %>%
  st_transform(crs = 28992)
Reading layer `OGRGeoJSON' from data source `/Users/arroyo/Desktop/stuff/Term 5/02221/Week 11/02221-Lab10/provinces.geojson' using driver `GeoJSON'
Simple feature collection with 12 features and 3 fields
geometry type:  POLYGON
dimension:      XY
bbox:           xmin: 3.307938 ymin: 50.75037 xmax: 7.227498 ymax: 53.57642
epsg (SRID):    4326
proj4string:    +proj=longlat +datum=WGS84 +no_defs
utrecht <- provinces %>%
  filter(name == 'Utrecht')
munis <- munis %>% 
  filter(st_intersects(utrecht, 
                       st_centroid(munis),
                       sparse = F))

In the above step, we read in the geojson file of all the provinces in Netherlands. Next, we used filter to choose only the areas labelled as “Utrecht”. Finally, we find those municipalities where their centroid intersect the areas labelled as “Utrecht”

Now with the necessary municipalities, we can read in our other data.

commute <- read_csv("commuting.csv") %>%
  select(source, sink, weight) %>%
  rename(interaction = weight)
Parsed with column specification:
cols(
  source = col_integer(),
  sink = col_integer(),
  count = col_integer(),
  weight = col_integer()
)

Again, this set of data is for the entire country. Let’s filter it to only data in Utrecht

munis <- munis %>%
  mutate(id = as.numeric(str_replace(GM_CODE, "GM", "")))
commute <- commute %>% 
  filter(source %in% munis$id & sink %in% munis$id)

Now that we finally have all the data we need, we can do a plot to see how the interactions look like visually. First, we store all the centroid of the different municipalities. Then, we join both the source and the sink centroid to our commute table. Finally, we create lines between every pair of source and sink.

munis.centroid <- st_centroid(munis) %>%
  select(id)
commute <- commute %>%
  left_join(munis.centroid, by = c("source" = "id")) %>%
  left_join(munis.centroid, by = c("sink" = "id")) %>%
  rowwise() %>%
  mutate(geometry = st_combine(c(geometry.x, geometry.y)) %>%
         st_cast("MULTILINESTRING")) %>%
  select(-geometry.x, -geometry.y) %>%
  st_as_sf(crs = 28992)
ggplot(munis) + 
  geom_sf() + 
  geom_sf(data = commute) +
  theme_void() +
  ggtitle("Travel Patterns in Urecht")

The above map shows where people have been travelling to and travelling from. However, this does not show the volume of travel. We need to bind the width of each line to number of people commuting over that line. We will also filter out smaller connections and commuters that never leave their own municipality.

commute <- commute %>%
  filter(sink != source) %>%
  filter(interaction > 20) %>%
  mutate(lineWidth = interaction/max(interaction)*10)
ggplot(munis) + 
  geom_sf() + 
  geom_sf(data = commute, aes(size = lineWidth)) +
  scale_size_identity() + 
  theme_void() + 
  ggtitle("Travel Patterns in Urecht")

Now the plot shows more information. From the plot, it seems like it has one large employment centre with several smaller centres. Now that we can plot the actual number of commuters, let’s see if we can model the commuting relationship based the distance between municipalities, the number of residents and the number of jobs in each municipality.

residents <- read_csv("residents.csv") %>%
  select(id, weight)
Parsed with column specification:
cols(
  id = col_integer(),
  count = col_integer(),
  weight = col_integer()
)
jobs <- read_csv("jobs.csv") %>%
  select(id, weight)
Parsed with column specification:
cols(
  id = col_integer(),
  count = col_integer(),
  weight = col_integer()
)
dist <- st_distance(x = munis.centroid, y = munis.centroid)

The dist matrix is a bit hard to use right now. We need to convert it to a table with three columns (from, to, and distance between the two points)

rownames(dist) <- munis.centroid$id
colnames(dist) <- munis.centroid$id
dist <- list(
  source = rownames(dist)[row(dist)] %||% row(dist),
  sink = colnames(dist)[col(dist)] %||% col(dist),
  distance = dist
) %>%
  map_dfc(as.vector) %>%
  mutate(source = as.numeric(source)) %>%
  mutate(sink = as.numeric(sink))

Now that we have dist ready, we can join all the data together and plot it.

commute <- commute %>%
  left_join(dist) %>%
  left_join(residents, by = c('source' = 'id')) %>%
  rename(residents = weight) %>%
  left_join(jobs, by = c('sink' = 'id')) %>%
  rename(jobs = weight)
Joining, by = c("source", "sink")
ggplot(commute, aes(interaction, distance)) + 
  geom_point() + 
  ggtitle("Relationship Between Distance and Number of Interactions")

The above plot does not look linear. Let’s try using the log function.

ggplot(commute, aes(log(interaction), log(distance))) + 
  geom_point() +
  ggtitle("Logarithmic Relationships Between Distance and Number of Interactions")

Now we can create a naive linear model for it.

lm(data = commute, formula = interaction ~ residents + jobs + distance) %>% summary()

Call:
lm(formula = interaction ~ residents + jobs + distance, data = commute)

Residuals:
   Min     1Q Median     3Q    Max 
-15770  -1150   -309    922  40734 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.553e+03  5.241e+02   4.872 1.47e-06 ***
residents    3.234e-02  3.423e-03   9.449  < 2e-16 ***
jobs         9.666e-02  4.960e-03  19.486  < 2e-16 ***
distance    -1.965e-01  2.255e-02  -8.713  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4632 on 515 degrees of freedom
Multiple R-squared:  0.5032,    Adjusted R-squared:  0.5003 
F-statistic: 173.9 on 3 and 515 DF,  p-value: < 2.2e-16

From the summary above, we can see that the R-square value is only around 0.5. That is really low. It seems that the linear model does not fit the data well. Perhaps we need to look at the distribution of commuters in order to choose a better model.

ggplot(commute, aes(interaction)) + 
  geom_histogram() + 
  ggtitle("Histogram of Number of Commuters Over Number of Interactions") + 
  xlab("Number of Interactions Between Two Municipality") + 
  ylab("Count")

As we can see from the above plot, the distribution of commuters is very right skewed, seemingly following a poisson distribution. We should try using a poison regression model instead.

model <- glm(data = commute, formula = interaction ~ log(residents) + log(jobs) + log(distance), family = poisson())
r2 <- function(empirical, fitted) {
  return(cor(empirical, fitted)^2)
}
r2(commute$interaction, fitted(model))
[1] 0.8639754

Looking at the R-squared value of our new model, it is fitting really well at 0.864.

tidy(model)

From the coefficients of our new model, we can see, an one percent increase in number of residents, the number of interactions would increase by 0.859/100. Similarly, an one percent increase in number of jobs would result in a 0.984/100 more interactions. On the other hand, an one percent increase in distance would result in a decrease of 1.5/100 interactions.

Now, let’s plot our fitted model and see how it looks.

commute <- commute %>% 
  mutate(fitted = fitted(model)) %>%
  mutate(residual = interaction - fitted) %>%
  mutate(residualSign = sign(residual))
commute <- commute %>% 
  mutate(lineWidth = fitted/ max(fitted) * 10)
  
ggplot(munis) + 
  geom_sf() +
  geom_sf(data = commute, aes(size = lineWidth)) +
  scale_size_identity() + 
  theme_void() + 
  ggtitle("Predicted Travel Patterns in Utrecht")

Comparing this with our earlier plot of the actual values, we can see that the model fits really well. The two plots look really similar.

Now let’s look at the residual to see the difference between the fitted model and the actual data.

commute <- commute %>% mutate(lineWidth = abs(residual) / max(residual) * 10)
ggplot(munis) +
  geom_sf() + 
  geom_sf(data = commute, aes(size = lineWidth, color = factor(residualSign))) +
  scale_size_identity() + 
  theme_void() +
  guides(color=guide_legend(title="Sign of Residual")) +
  ggtitle("Residual of Model")

NA

Assignment

Now, let’s extend our analysis to the entire of Netherlands. First, we read in the data again.

munis <- st_read("gem_2016.geojson", crs = 28992)
Reading layer `OGRGeoJSON' from data source `/Users/arroyo/Desktop/stuff/Term 5/02221/Week 11/02221-Lab10/gem_2016.geojson' using driver `GeoJSON'
st_crs<- : replacing crs does not reproject data; use st_transform for that
Simple feature collection with 479 features and 121 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: 10425.16 ymin: 306846.2 xmax: 278026.1 ymax: 621876.3
epsg (SRID):    28992
proj4string:    +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +towgs84=565.4171,50.3319,465.5524,-0.398957,0.343988,-1.87740,4.0725 +units=m +no_defs
munis <- munis %>% 
  filter(WATER == 'NEE') %>%
  select(GM_CODE, GM_NAAM)
commute <- read_csv("commuting.csv") %>%
  select(source, sink, weight) %>%
  rename(interaction = weight)
Parsed with column specification:
cols(
  source = col_integer(),
  sink = col_integer(),
  count = col_integer(),
  weight = col_integer()
)

Next, we filter and clean the data.

munis <- munis %>%
  mutate(id = as.numeric(str_replace(GM_CODE, "GM", "")))
commute <- commute %>% 
  filter(source %in% munis$id & sink %in% munis$id)

Now, we join the commute data to the municipalities, and create lines between every source, sink pair. While we are at it, let’s filter the data to show only the more significant interactions.

munis.centroid <- st_centroid(munis) %>%
  select(id)
commute <- commute %>%
  left_join(munis.centroid, by = c("source" = "id")) %>%
  left_join(munis.centroid, by = c("sink" = "id")) %>%
  rowwise() %>%
  mutate(geometry = st_combine(c(geometry.x, geometry.y)) %>%
         st_cast("MULTILINESTRING")) %>%
  select(-geometry.x, -geometry.y) %>%
  st_as_sf(crs = 28992)
commute <- commute %>%
  filter(sink != source) %>%
  filter(interaction > 750) %>%
  mutate(lineWidth = interaction/max(interaction)*10)

Now, let’s plot the data and see how it looks like.

ggplot(munis) + 
  geom_sf() + 
  geom_sf(data = commute, aes(size = lineWidth)) +
  scale_size_identity() + 
  theme_void() + 
  ggtitle("Travel Patterns in Netherlands")

The plot looks right. Right now, we could create a summary table to see which municipalities have the most number of people travelling there.

commute.summary <- commute %>% 
  group_by(sink) %>% 
  summarise(incoming = sum(interaction))  %>%
  as.data.frame() %>%
  select(c(sink, incoming)) %>%
  arrange(desc(incoming))
  
munis.name <- munis %>%
  as.data.frame() %>%
  select(c(GM_NAAM, id))
merge(commute.summary, munis.name, by.x = "sink", by.y = "id", sort = FALSE)

Wow~ The top three most popular destinations are Amsterdam, Rotterdam and Utrecht! This makes sense. Amsterdam is the capital of Netherlands, Rotterdam is Europe’s largest port and Utrecht is the fourth largest city in Netherlands.

Now, let’s try running a spatial interaction model for Netherlands. Again, we need the data for distance between municipalities, number of jobs and number of residents.

dist <- st_distance(x = munis.centroid, y = munis.centroid)
rownames(dist) <- munis.centroid$id
colnames(dist) <- munis.centroid$id
dist <- list(
  source = rownames(dist)[row(dist)] %||% row(dist),
  sink = colnames(dist)[col(dist)] %||% col(dist),
  distance = dist
) %>%
  map_dfc(as.vector) %>%
  mutate(source = as.numeric(source)) %>%
  mutate(sink = as.numeric(sink))
commute <- commute %>%
  left_join(dist) %>%
  left_join(residents, by = c('source' = 'id')) %>%
  rename(residents = weight) %>%
  left_join(jobs, by = c('sink' = 'id')) %>%
  rename(jobs = weight)
Joining, by = c("source", "sink")

With all the data we need, we can run the model.

model <- glm(data = commute, formula = interaction ~ log(residents) + log(jobs) + log(distance), family = poisson())
r2 <- function(empirical, fitted) {
  return(cor(empirical, fitted)^2)
}
r2(commute$interaction, fitted(model))
[1] 0.2824222

The R square value is really low. The model looks bad.

tidy(model)

Looking at the coefficients of this model, it suggests that, an one percent increase in number of residents, should cause a 0.49/100 increase in number of interactions. Similarly, an one percent increase in number of jobs would result in 0.644/100 more interactions. On the other hand, an one percent increase in distance would result in a decrease of 1.12/100 interactions.

It is hard to explain why the numbers are so different compared to the model for Utrecht by looking at the numbers alone. Perhaps, there are other significant factors that are in play here, which we are missing out in our model. For instance, it could be more difficult to move between cities as compared to moving within the same city. Factors like these could affect the accuracy of our model

Perhaps we could look at the residuals of the model and try and see what’s going on.

commute <- commute %>% 
  mutate(fitted = fitted(model)) %>%
  mutate(residual = interaction - fitted) %>%
  mutate(residualSign = sign(residual))
commute <- commute %>% 
  mutate(lineWidth = abs(residual)/ max(residual) * 10)
ggplot(munis) +
  geom_sf() + 
  geom_sf(data = commute, aes(size = lineWidth, color = factor(residualSign))) +
  scale_size_identity() + 
  theme_void() +
  guides(color=guide_legend(title="Sign of Residual")) +
  ggtitle("Residual of Model")

Looking at the residual plot, it seems that our model is really far from actual empirical data. Another possible reason why our model failed, could be because of how we are representing our data. In our model, we only take into account the distance between municipalities from their centroids. This suggests that all the people and jobs are located in the center of each municipalities. Across larger land masses, the discrepancies due to this estimation, become a lot more significant. In reality, jobs and people might be more spreaded out, or located closer to the edge of a municipality.

---
title: "02221 Lab 10"
author: "Rayson Lim"
output: html_notebook
---

## Lab
In this lab, we will be looking at travel patterns in Netherlands, starting with Utrecht. We aim to perform statistical analysis on spatial 'interaction' data in R by creating models to predict how people commute. 

First, we start with reading in the data. 

```{r, echo = FALSE, warning=FALSE, message=FALSE}
library(sf)
library(devtools)
library(tidyverse)
library(broom)
library(ggplot2)

munis <- st_read("gem_2016.geojson", crs = 28992)
munis <- munis %>% 
  filter(WATER == 'NEE') %>%
  select(GM_CODE, GM_NAAM)
```

We only care about the province of Utrecht for this part of the lab. So let's select only those municipalitites that fall within this province. 
```{r}
provinces <- st_read("provinces.geojson") %>%
  st_transform(crs = 28992)
utrecht <- provinces %>%
  filter(name == 'Utrecht')
munis <- munis %>% 
  filter(st_intersects(utrecht, 
                       st_centroid(munis),
                       sparse = F))
```
In the above step, we read in the geojson file of all the provinces in Netherlands. Next, we used filter to choose only the areas labelled as "Utrecht". Finally, we find those municipalities where their centroid intersect the areas labelled as "Utrecht"

Now with the necessary municipalities, we can read in our other data. 
```{r}
commute <- read_csv("commuting.csv") %>%
  select(source, sink, weight) %>%
  rename(interaction = weight)
```
Again, this set of data is for the entire country. Let's filter it to only data in Utrecht
```{r}
munis <- munis %>%
  mutate(id = as.numeric(str_replace(GM_CODE, "GM", "")))
commute <- commute %>% 
  filter(source %in% munis$id & sink %in% munis$id)
```

Now that we finally have all the data we need, we can do a plot to see how the interactions look like visually. 
First, we store all the centroid of the different municipalities. Then, we join both the source and the sink centroid to our commute table. Finally, we create lines between every pair of source and sink.
```{r}
munis.centroid <- st_centroid(munis) %>%
  select(id)

commute <- commute %>%
  left_join(munis.centroid, by = c("source" = "id")) %>%
  left_join(munis.centroid, by = c("sink" = "id")) %>%
  rowwise() %>%
  mutate(geometry = st_combine(c(geometry.x, geometry.y)) %>%
         st_cast("MULTILINESTRING")) %>%
  select(-geometry.x, -geometry.y) %>%
  st_as_sf(crs = 28992)

ggplot(munis) + 
  geom_sf() + 
  geom_sf(data = commute) +
  theme_void() +
  ggtitle("Travel Patterns in Urecht")
```
The above map shows where people have been travelling to and travelling from. However, this does not show the volume of travel. We need to bind the width of each line to number of people commuting over that line. We will also filter out smaller connections and commuters that never leave their own municipality.

```{r}
commute <- commute %>%
  filter(sink != source) %>%
  filter(interaction > 20) %>%
  mutate(lineWidth = interaction/max(interaction)*10)

ggplot(munis) + 
  geom_sf() + 
  geom_sf(data = commute, aes(size = lineWidth)) +
  scale_size_identity() + 
  theme_void() + 
  ggtitle("Travel Patterns in Urecht")
```
Now the plot shows more information. From the plot, it seems like it has one large employment centre with several smaller centres. 
Now that we can plot the actual number of commuters, let's see if we can model the commuting relationship based the distance between municipalities, the number of residents and the number of jobs in each municipality. 
```{r}
residents <- read_csv("residents.csv") %>%
  select(id, weight)
jobs <- read_csv("jobs.csv") %>%
  select(id, weight)
dist <- st_distance(x = munis.centroid, y = munis.centroid)
```
The dist matrix is a bit hard to use right now. We need to convert it to a table with three columns (from, to, and distance between the two points)
```{r}
rownames(dist) <- munis.centroid$id
colnames(dist) <- munis.centroid$id
dist <- list(
  source = rownames(dist)[row(dist)] %||% row(dist),
  sink = colnames(dist)[col(dist)] %||% col(dist),
  distance = dist
) %>%
  map_dfc(as.vector) %>%
  mutate(source = as.numeric(source)) %>%
  mutate(sink = as.numeric(sink))
```

Now that we have dist ready, we can join all the data together and plot it.

```{r}
commute <- commute %>%
  left_join(dist) %>%
  left_join(residents, by = c('source' = 'id')) %>%
  rename(residents = weight) %>%
  left_join(jobs, by = c('sink' = 'id')) %>%
  rename(jobs = weight)

ggplot(commute, aes(interaction, distance)) + 
  geom_point() + 
  ggtitle("Relationship Between Distance and Number of Interactions")
```
The above plot does not look linear. Let's try using the log function. 
```{r}
ggplot(commute, aes(log(interaction), log(distance))) + 
  geom_point() +
  ggtitle("Logarithmic Relationships Between Distance and Number of Interactions")
```

Now we can create a naive linear model for it. 
```{r}
lm(data = commute, formula = interaction ~ residents + jobs + distance) %>% summary()
```
From the summary above, we can see that the R-square value is only around 0.5. That is really low. It seems that the linear model does not fit the data well. Perhaps we need to look at the distribution of commuters in order to choose a better model. 

```{r}
ggplot(commute, aes(interaction)) + 
  geom_histogram() + 
  ggtitle("Histogram of Number of Commuters Over Number of Interactions") + 
  xlab("Number of Interactions Between Two Municipality") + 
  ylab("Count")
```

As we can see from the above plot, the distribution of commuters is very right skewed, seemingly following a poisson distribution. We should try using a poison regression model instead. 

```{r}
model <- glm(data = commute, formula = interaction ~ log(residents) + log(jobs) + log(distance), family = poisson())
r2 <- function(empirical, fitted) {
  return(cor(empirical, fitted)^2)
}
r2(commute$interaction, fitted(model))
```

Looking at the R-squared value of our new model, it is fitting really well at 0.864. 

```{r}
tidy(model)
```

From the coefficients of our new model, we can see, an one percent increase in number of residents, the number of interactions would increase by 0.859/100. Similarly, an one percent increase in number of jobs would result in a 0.984/100 more interactions. On the other hand, an one percent increase in distance would result in a decrease of 1.5/100 interactions. 

Now, let's plot our fitted model and see how it looks. 
```{r}
commute <- commute %>% 
  mutate(fitted = fitted(model)) %>%
  mutate(residual = interaction - fitted) %>%
  mutate(residualSign = sign(residual))

commute <- commute %>% 
  mutate(lineWidth = fitted/ max(fitted) * 10)
  
ggplot(munis) + 
  geom_sf() +
  geom_sf(data = commute, aes(size = lineWidth)) +
  scale_size_identity() + 
  theme_void() + 
  ggtitle("Predicted Travel Patterns in Utrecht")
```
Comparing this with our earlier plot of the actual values, we can see that the model fits really well. The two plots look really similar. 

Now let's look at the residual to see the difference between the fitted model and the actual data.

```{r}
commute <- commute %>% mutate(lineWidth = abs(residual) / max(residual) * 10)
ggplot(munis) +
  geom_sf() + 
  geom_sf(data = commute, aes(size = lineWidth, color = factor(residualSign))) +
  scale_size_identity() + 
  theme_void() +
  guides(color=guide_legend(title="Sign of Residual")) +
  ggtitle("Residual of Model")
  
```

## Assignment

Now, let's extend our analysis to the entire of Netherlands. First, we read in the data again. 

```{r}
munis <- st_read("gem_2016.geojson", crs = 28992)
munis <- munis %>% 
  filter(WATER == 'NEE') %>%
  select(GM_CODE, GM_NAAM)

commute <- read_csv("commuting.csv") %>%
  select(source, sink, weight) %>%
  rename(interaction = weight)
```
Next, we filter and clean the data. 

```{r}
munis <- munis %>%
  mutate(id = as.numeric(str_replace(GM_CODE, "GM", "")))

commute <- commute %>% 
  filter(source %in% munis$id & sink %in% munis$id)
```

Now, we join the commute data to the municipalities, and create lines between every source, sink pair. While we are at it, let's filter the data to show only the more significant interactions. 

```{r}
munis.centroid <- st_centroid(munis) %>%
  select(id)

commute <- commute %>%
  left_join(munis.centroid, by = c("source" = "id")) %>%
  left_join(munis.centroid, by = c("sink" = "id")) %>%
  rowwise() %>%
  mutate(geometry = st_combine(c(geometry.x, geometry.y)) %>%
         st_cast("MULTILINESTRING")) %>%
  select(-geometry.x, -geometry.y) %>%
  st_as_sf(crs = 28992)

commute <- commute %>%
  filter(sink != source) %>%
  filter(interaction > 750) %>%
  mutate(lineWidth = interaction/max(interaction)*10)
```

Now, let's plot the data and see how it looks like. 

```{r}
ggplot(munis) + 
  geom_sf() + 
  geom_sf(data = commute, aes(size = lineWidth)) +
  scale_size_identity() + 
  theme_void() + 
  ggtitle("Travel Patterns in Netherlands")
```

The plot looks right. Right now, we could create a summary table to see which municipalities have the most number of people travelling there. 

```{r}
commute.summary <- commute %>% 
  group_by(sink) %>% 
  summarise(incoming = sum(interaction))  %>%
  as.data.frame() %>%
  select(c(sink, incoming)) %>%
  arrange(desc(incoming))
  
munis.name <- munis %>%
  as.data.frame() %>%
  select(c(GM_NAAM, id))

merge(commute.summary, munis.name, by.x = "sink", by.y = "id", sort = FALSE)
```

Wow~ The top three most popular destinations are Amsterdam, Rotterdam and Utrecht! This makes sense. Amsterdam is the capital of Netherlands, Rotterdam is Europe's largest port and Utrecht is the fourth largest city in Netherlands. 

Now, let's try running a spatial interaction model for Netherlands. Again, we need the data for distance between municipalities, number of jobs and number of residents. 

```{r}
dist <- st_distance(x = munis.centroid, y = munis.centroid)

rownames(dist) <- munis.centroid$id
colnames(dist) <- munis.centroid$id
dist <- list(
  source = rownames(dist)[row(dist)] %||% row(dist),
  sink = colnames(dist)[col(dist)] %||% col(dist),
  distance = dist
) %>%
  map_dfc(as.vector) %>%
  mutate(source = as.numeric(source)) %>%
  mutate(sink = as.numeric(sink))

commute <- commute %>%
  left_join(dist) %>%
  left_join(residents, by = c('source' = 'id')) %>%
  rename(residents = weight) %>%
  left_join(jobs, by = c('sink' = 'id')) %>%
  rename(jobs = weight)

```

With all the data we need, we can run the model. 

```{r}
model <- glm(data = commute, formula = interaction ~ log(residents) + log(jobs) + log(distance), family = poisson())
r2 <- function(empirical, fitted) {
  return(cor(empirical, fitted)^2)
}
r2(commute$interaction, fitted(model))

```

The R square value is really low. The model looks bad. 

```{r}
tidy(model)
```
Looking at the coefficients of this model, it suggests that, an one percent increase in number of residents, should cause a 0.49/100 increase in number of interactions. Similarly, an one percent increase in number of jobs would result in 0.644/100 more interactions. On the other hand, an one percent increase in distance would result in a decrease of 1.12/100 interactions. 

It is hard to explain why the numbers are so different compared to the model for Utrecht by looking at the numbers alone. Perhaps, there are other significant factors that are in play here, which we are missing out in our model. For instance, it could be more difficult to move between cities as compared to moving within the same city. Factors like these could affect the accuracy of our model

Perhaps we could look at the residuals of the model and try and see what's going on. 
```{r}
commute <- commute %>% 
  mutate(fitted = fitted(model)) %>%
  mutate(residual = interaction - fitted) %>%
  mutate(residualSign = sign(residual))

commute <- commute %>% 
  mutate(lineWidth = abs(residual)/ max(residual) * 10)

ggplot(munis) +
  geom_sf() + 
  geom_sf(data = commute, aes(size = lineWidth, color = factor(residualSign))) +
  scale_size_identity() + 
  theme_void() +
  guides(color=guide_legend(title="Sign of Residual")) +
  ggtitle("Residual of Model")
```

Looking at the residual plot, it seems that our model is really far from actual empirical data. Another possible reason why our model failed, could be because of how we are representing our data. In our model, we only take into account the distance between municipalities from their centroids. This suggests that all the people and jobs are located in the center of each municipalities. Across larger land masses, the discrepancies due to this estimation, become a lot more significant. In reality, jobs and people might be more spreaded out, or located closer to the edge of a municipality. 


