Jump to Lab Quesions
Data Reading
In this lab we are looking a spatial interaction data.
We will start by reading in a spatial dataset for all municipalities in the Netherlands in geojson format.
munis <- st_read("gem_2016.geojson",crs = 28992)
We do not need parts of the municipalities that are water or sea, so we can filter them out here
munis <- munis %>%
filter(WATER == 'NEE') %>%
select(GM_CODE, GM_NAAM)
There are a lot of municipalities in the Netherlands. In this lab, we will be focusing on the province of Utrecht. To do so, the goal is to load in a dataset of the provinces and select only those municipalities that fall within this province.
Here, we transform the provinces.geojson into the proper coordinate reference system (CRS), which in this case is the Dutch national CRS (28992)
Then we filter out the provinces dataset such that we only have the data belonging to the province of Utrecht
Lastly, we filter out the information from the munis dataset such that we only have data relating to the Utrecht province
* This is done by using st_centroid to get the geometric center of the munis geometries
* Then st_intersects is used to find out which points in munis intersects with the utrecht provices
* We filter those information out and store in back in the munis dataframe.
provinces <- st_read("provinces.geojson") %>%
st_transform(crs = 28992)
utrecht <- provinces %>%
filter(name == 'Utrecht')
munis <- munis %>%
filter(st_intersects(utrecht,st_centroid(munis),sparse =F))
Now we have a nice, clean dataset for municipalities.
Next, we read in a table with information on the number of commuters in between each municipality
commute <- read_csv("commuting.csv") %>%
select(source,sink,weight) %>%
rename(interaction = weight)
Then, we filter the table to match the municipalities in Utrecht
To do this we first need to make a new column and extract out the munis id from the GM_Code and then do the filter
munis <- munis %>%
mutate(id = as.numeric(str_replace(GM_CODE,"GM","")))
commute <- commute %>%
filter(source %in% munis$id & sink %in% munis$id)
From Points to Lines
We now have a clean table of commuting relationshsips.
To create a simple plot that draws a line for each of these relations between the municipalities, we need to do a few things:
1. Find the centroid for each municipality
2. Join both the source and the sink centroid to our table
3. Create a line feature based on these two points/centroids
munis.centroid <- st_centroid(munis) %>%
select(id)
commute <- commute %>%
left_join(munis.centroid, by = c("source" = "id")) %>%
left_join(munis.centroid, by = c("sink" = "id")) %>%
rowwise() %>%
mutate(geometry = st_combine(c(geometry.x,geometry.y)) %>% st_cast("LINESTRING")) %>%
select(-geometry.x,-geometry.y) %>% st_as_sf(crs = 28992)
Now that we have a table with commuting relations that is enhance with a spatial line string for each relation, we can use this to plot them!
ggplot(commute) + geom_sf() + labs(title = "Plot of Spacial Relations of commutes \n")

We also know the number of people commuting over each line so that would be a useful variable to bind to the width of the line.
We will also filter out some of the smaller connections and commuters that never leave their own municipality.
commute <- commute %>%
filter(sink != source) %>%
filter(interaction >20) %>%
mutate(lineWidth = interaction / max(interaction) * 10)
ggplot(commute) + geom_sf(aes(size = lineWidth)) + scale_size_identity() + labs(title = "Plot of Spacial Relations of commutes", subtitle="With interaction corresponding to line width")

The pattern of the lines show that there are lots of people travelling between certain regions. These regions are most likely the major municipalities within Utrecht. It seems that there are multiple large employment centres in the area.
Modelling Spatial Relations
Now that we can plot the actual number of commuters, let us see if we can model the commuting relationship based on some other information.
In this case, we care going to use the distance between municipalities, the number of residents, and the number of jobs in each municipality.
residents <- read_csv("residents.csv") %>%
select(id, weight)
jobs <- read_csv("jobs.csv") %>%
select(id, weight)
We can create a distance matrix from the centroids we created earlier.
dist <- st_distance(x = munis.centroid, y = munis.centroid)
We need to reformat the resulting distance matrix to a table with three columns (from, to and the distance between those points)
rownames(dist) <- munis.centroid$id
colnames(dist) <- munis.centroid$id
dist <- list(
source = rownames(dist)[row(dist)] %||% row(dist),
sink = colnames(dist)[col(dist)] %||% col(dist),
distance = dist
) %>%
map_dfc(as.vector) %>%
mutate(source = as.numeric(source)) %>%
mutate(sink = as.numeric(sink))
Now we join them together
commute <- commute %>%
left_join(dist) %>%
left_join(residents, by = c('source' = 'id')) %>%
rename(residents = weight) %>%
left_join(jobs,by = c('sink'='id')) %>%
rename(jobs=weight)
Let’s do a visual inspection to see if everything went well.
Comparing the number of commuters and the distance:
ggplot(commute,aes(interaction,distance))+
geom_point()+
labs(title="Visual comparision of number of commuters vs distance\n")

It looks like the relationship is not linear, perhaps a log-log plot would help:
ggplot(commute,aes(log(interaction),log(distance)))+
geom_point()+
labs(title="Log-log plot of number of commuters vs distance\n")

A naive linear model based on the available information can be estimated using an OLS regression
lm(data=commute, formula = interaction ~ residents+jobs+distance) %>% summary()
Call:
lm(formula = interaction ~ residents + jobs + distance, data = commute)
Residuals:
Min 1Q Median 3Q Max
-15770 -1150 -309 922 40734
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.553e+03 5.241e+02 4.872 1.47e-06 ***
residents 3.234e-02 3.423e-03 9.449 < 2e-16 ***
jobs 9.666e-02 4.960e-03 19.486 < 2e-16 ***
distance -1.965e-01 2.255e-02 -8.713 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4632 on 515 degrees of freedom
Multiple R-squared: 0.5032, Adjusted R-squared: 0.5003
F-statistic: 173.9 on 3 and 515 DF, p-value: < 2.2e-16
The summary above shows that the model does not really fit very well. Not only does if have a low adjusted \(R^2\) of 0.5003, the p-values for all parameters and the regression itself is very small.
Lets look at the distribution of the commuters
ggplot(commute,aes(interaction))+
geom_histogram()+
labs(title="Distribution of the interactions of commuters\n")

It looks like it follows the Poisson distribution quite closely.
From the monday’s lecture, in spatial interaction modelling we generally thinkg of spatial interactions as:
\(T_{ij} = k\frac{V^\mu_i W^\alpha_j}{d_{ij}^\beta}\)
or rewritten:
\(T_{ij}= kV^\mu_iW^\alpha_jd^{-\beta}_{ij}\)
we can rewrite this again to estimate it with linear estimators:
\(ln T_{ij} = k + \mu ln V_i + \alpha ln W_j - \beta ln d_{ij}\)
model <- glm(data = commute,formula = interaction ~log(residents) + log(jobs) + log (distance), family = poisson())
GLM doesn’t provide \(r^2\) by default so we make our own function
r2 <- function(empirical, fitted){
return(cor(empirical, fitted)^2)
}
r2(commute$interaction,fitted(model))
[1] 0.8639754
The \(R^2\) value is pretty close to 1 now, this means that the model is a much better fit now.
Let’s see if we can interpret the rest of the model information.
tidy(model)
All coefficients have high test-statistic values and a p-value of 0. This means that they are statistically significant. Our model is:
\(Interaction = \beta_0 + \beta_1 log(residents) + \beta_2 log(jobs) + \beta_3 log(distance)\)
\(= Interaction = 3.357 + 0.8585 log(residents) + 0.9836 log(jobs) - 1.5120 log(distance)\)
Adding the fitted values, as well as their residuals back to the original data table so we can (visually) inspect them.
commute <- commute %>%
mutate(fitted = fitted(model)) %>%
mutate(residual = interaction - fitted) %>%
mutate(residualSign = sign(residual))
commute %>%
mutate(lineWidth = fitted / max(fitted) * 10) %>%
ggplot() + geom_sf(aes(size = lineWidth))+
scale_size_identity()+
labs(title = "Plot of Model of Spacial Relations of Commutes", subtitle="With model values corresponding to line width")

Comparing this with the earlier plot of the actual values, they are actually pretty accurate. Given some lineWidths are different in absolute size but the difference in the lineWidths all correspond to the actual values.

For residuals, they can be positive or negative, but since line widths can only be positive, we will use the color aesthetic to indicate whether the residual is positive or negative:
commute %>%
mutate(lineWidth = abs(residual) / max(residual) * 10) %>%
ggplot() +geom_sf(aes(size=lineWidth,color = factor(residualSign))) +
scale_size_identity()+
labs(title = "Plot of residuals of models", subtitle="With residual values corresponding to line width",color="Residual Sign")

Back to top
Lab Questions
Create a plot visualizing all commuting relations in the Netherlands
(use a decent filter for this, for example only showing lines with more than 750 commuters)

Create a summary table for the destinations of commuting in the Netherlands
(Tip: use the group_by() and summarize() functions. Which municipalities have the top 3 of most incoming commuters?)
(Tip: the names of the municipalities are contained in the ‘munis’ object – you can join them to your summary table for easier reading.)
Summary table for the destinations of commuting in the Netherlands
nethertable <- nether_commute%>%
filter(sink != source) %>%
group_by(sink) %>%
summarise(inc_commuters = sum(interaction)) %>%
rename(destination = sink)
nethertable
Simple feature collection with 390 features and 2 fields
geometry type: MULTILINESTRING
dimension: XY
bbox: xmin: 24537.79 ymin: 309545.2 xmax: 271400.9 ymax: 612030
epsg (SRID): 28992
proj4string: +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +towgs84=565.2369,50.0087,465.658,-0.406857,0.350733,-1.87035,4.0812 +units=m +no_defs
adding names by joining municipalities and sorting by incoming commuters
nether_munis= data.frame(id = netherlands$id,name=netherlands$GM_NAAM)
nether_commute_name <- nethertable %>%
left_join(nether_munis,by = c('destination'='id'))
nether_commute_name[order(-nether_commute_name$inc_commuters),]
Simple feature collection with 390 features and 3 fields
geometry type: MULTILINESTRING
dimension: XY
bbox: xmin: 24537.79 ymin: 309545.2 xmax: 271400.9 ymax: 612030
epsg (SRID): 28992
proj4string: +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +towgs84=565.2369,50.0087,465.658,-0.406857,0.350733,-1.87035,4.0812 +units=m +no_defs
The municipalities which have the top 3 of most incoming commuters are Amsterdam,Rotterdam and Utrecht.
Run a spatial interaction model for the data for the entire country.
What’s the model fit (R2)?
#create a distance matrix from the centroids
dist_all <- st_distance(x = netherlands.centroid, y = netherlands.centroid)
#reformat the resulting distance matrix to a table with three columns
rownames(dist_all) <- netherlands.centroid$id
colnames(dist_all) <- netherlands.centroid$id
dist_all <- list(
source = rownames(dist_all)[row(dist_all)] %||% row(dist_all),
sink = colnames(dist_all)[col(dist_all)] %||% col(dist_all),
distance = dist_all
) %>%
map_dfc(as.vector) %>%
mutate(source = as.numeric(source)) %>%
mutate(sink = as.numeric(sink))
#join
nether_commute_all <- nether_commute %>%
left_join(dist_all) %>%
left_join(residents, by = c('source' = 'id')) %>%
rename(residents = weight) %>%
left_join(jobs,by = c('sink'='id')) %>%
rename(jobs=weight)
nether_commute_all <- nether_commute_all%>%
filter(sink != source) %>%
filter(interaction >0)
model2 <- glm(data = nether_commute_all,formula = interaction ~log(residents) + log(jobs) + log (distance), family = poisson())
summary(model2)
Call:
glm(formula = interaction ~ log(residents) + log(jobs) + log(distance),
family = poisson(), data = nether_commute_all)
Deviance Residuals:
Min 1Q Median 3Q Max
-1521.30 -15.76 -10.89 -5.03 439.49
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.0928051 0.0032170 2516 <2e-16 ***
log(residents) 0.6382339 0.0001860 3432 <2e-16 ***
log(jobs) 0.8232155 0.0001391 5917 <2e-16 ***
log(distance) -1.6026525 0.0001715 -9346 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 111112446 on 43191 degrees of freedom
Residual deviance: 24988156 on 43188 degrees of freedom
AIC: 25239767
Number of Fisher Scoring iterations: 5
r2(nether_commute_all$interaction,fitted(model2))
[1] 0.09636535
The model fit for R2 is Interactions = 8.0928051 + 0.6382339log(residents) + 0.8232155log(jobs) -1.6026525log(distance).
With an \(R^2\) value of 0.09636535.
Can you explain the different fit compared to the model for Utrecht?
Include the model parameters in your notebook and explain what the estimates of the coefficients mean.
This model is a weaker fit with an \(R^2\) value of 0.09636535 (vs Utrecht’s 0.8639754). This is possibly due to having such a large dataset which could contain large number of outliers, influencing the model. Other factors which are not accounted for could also have influenced the interactions when looking at it from a nationwide perspective versus within a municipality.
Look at the residuals and describe where you see the largest residuals.
Another way to look at this is to go back to your table of municipalities with most incoming commuters and compare the top 3 with the top 3 based on the fitted values.
#combining the model values with the table
nether_commute_all <- nether_commute_all %>%
mutate(fitted = fitted(model2)) %>%
mutate(residual = interaction - fitted) %>%
mutate(residualSign = sign(residual))
#Summary table for the destinations
#By Residuals
nethertable2 <- nether_commute_all%>%
filter(sink != source) %>%
group_by(sink) %>%
summarise(residual = sum(residual)) %>%
rename(destination = sink)
nether_commute_name2 <- nethertable2 %>%
left_join(nether_munis,by = c('destination'='id'))
nether_commute_name2[order(-nether_commute_name2$residual),]
Simple feature collection with 390 features and 3 fields
geometry type: MULTILINESTRING
dimension: XY
bbox: xmin: 24537.79 ymin: 309545.2 xmax: 271400.9 ymax: 612030
epsg (SRID): 28992
proj4string: +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +towgs84=565.2369,50.0087,465.658,-0.406857,0.350733,-1.87035,4.0812 +units=m +no_defs
The largest residuals are Amsterdam,Groningen, and Utrecht
Which municipality has risen to the top?
Groningen has replaced Rotterdam as 2nd place. Amsterdan and Utrecht still take 1st and 3rd respectively. In fact, Rotterndam is in last(599th) place with the largest negative residual. Meaning that the model severely underestimated the commutes to Rotterdam. Converse
Can you think of an explanation?
Groningen, though the capital city of the Groningen Province, does not actually have much commutes to the area for its given jobs, distance and residents as it is actually a university city, with close to a fourth of its population being students. Hence, the model overestimates the interaction by a large extent.
Rotterdam on the other hand, is a sea port much like Singapore and is actually the largest port in Europe (source: wikipedia). This means that there is another factor which accounts for commutes to the area which is not accounted for by our model - the large number of ships arriving at port in Rotterdam!
---
title: "Lab 10: Spatial Interaction Modeling in R"
author: "Justinian Siah"
date: "5 April 2018"
output: html_notebook
---
[Jump to Lab Quesions](#LabQ){#top}

#Data Reading
```{r warning=FALSE,message=FALSE,echo=FALSE}
#Load libraries
library(sf)
library(devtools)
library(tidyverse)
library(broom)
library(ggplot2)
library(cowplot)
```
  
In this lab we are looking a spatial interaction data.  
We will start by reading in a spatial dataset for all municipalities in the Netherlands in geojson format.   
```{r cache=TRUE,results='hide',warning=FALSE,message=FALSE}
munis <- st_read("gem_2016.geojson",crs = 28992)
```
  
We do not need parts of the municipalities that are water or sea, so we can filter them out here  
```{r results='hide',warning=FALSE}
munis <- munis %>%
  filter(WATER == 'NEE') %>%
  select(GM_CODE, GM_NAAM)
```
  
There are a lot of municipalities in the Netherlands. In this lab, we will be focusing on the province of Utrecht. To do so, the goal is to load in a dataset of the provinces and select only those municipalities that fall within this province.  
Here, we transform the provinces.geojson into the proper coordinate reference system (CRS), which in this case is the Dutch national CRS (28992)  
Then we filter out the provinces dataset such that we only have the data belonging to the province of Utrecht  
Lastly, we filter out the information from the munis dataset such that we only have data relating to the Utrecht province    
    * This is done by using st_centroid to get the geometric center of the munis geometries   
    * Then st_intersects is used to find out which points in munis intersects with the utrecht provices  
    * We filter those information out and store in back in the munis dataframe.    
      
```{r results='hide',warning=FALSE,message=FALSE}
provinces <- st_read("provinces.geojson") %>% 
  st_transform(crs = 28992)

utrecht <- provinces %>% 
  filter(name == 'Utrecht')

munis <- munis %>% 
  filter(st_intersects(utrecht,st_centroid(munis),sparse =F))
```
  
Now we have a nice, clean dataset for municipalities.  
Next, we read in a table with information on the number of commuters in between each municipality  
```{r results='hide',warning=FALSE,message=FALSE}
commute <-  read_csv("commuting.csv") %>% 
  select(source,sink,weight) %>% 
  rename(interaction = weight)
```
  
Then, we filter the table to match the municipalities in Utrecht  
To do this we first need to make a new column and extract out the munis id from the GM_Code and then do the filter  
```{r results='hide',warning=FALSE,message=FALSE}
munis <- munis %>% 
  mutate(id = as.numeric(str_replace(GM_CODE,"GM","")))

commute <- commute %>% 
  filter(source %in% munis$id & sink %in% munis$id)
```

***  

#From Points to Lines  
We now have a clean table of commuting relationshsips.  
To create a simple plot that draws a line for each of these relations between the municipalities, we need to do a few things:  
1. Find the centroid for each municipality  
2. Join both the source and the sink centroid to our table  
3. Create a line feature based on these two points/centroids  

```{r results='hide',warning=FALSE,message=FALSE}
munis.centroid <-  st_centroid(munis) %>% 
  select(id)

commute <- commute %>% 
  left_join(munis.centroid, by = c("source" = "id")) %>% 
  left_join(munis.centroid, by = c("sink" = "id")) %>% 
  rowwise() %>% 
  mutate(geometry = st_combine(c(geometry.x,geometry.y)) %>%  st_cast("LINESTRING")) %>% 
  select(-geometry.x,-geometry.y) %>% st_as_sf(crs = 28992)
```
  
  Now that we have a table with commuting relations that is enhance with a spatial line string for each relation, we can use this to plot them!

```{r results='hide',warning=FALSE,message=FALSE}
ggplot(commute) + geom_sf() + labs(title = "Plot of Spacial Relations of commutes \n")
```
  
We also know the number of people commuting over each line so that would be a useful variable to bind to the width of the line.  
We will also filter out some of the smaller connections and commuters that never leave their own municipality.  
    
```{r results='hide',warning=FALSE,message=FALSE}
commute <- commute %>% 
  filter(sink != source) %>% 
  filter(interaction >20) %>% 
  mutate(lineWidth = interaction / max(interaction) * 10)

ggplot(commute) + geom_sf(aes(size = lineWidth)) + scale_size_identity() + labs(title = "Plot of Spacial Relations of commutes", subtitle="With interaction corresponding to line width")
```
```{r include=FALSE}
p1 <- ggplot(commute) + geom_sf(aes(size = lineWidth,color=lineWidth)) + scale_size_identity() + labs(subtitle="  Actual Values") + theme(axis.text = element_blank(),legend.position="none") + scale_color_continuous(trans='reverse')
```


The pattern of the lines show that there are lots of people travelling between certain regions. These regions are most likely the major municipalities within Utrecht. It seems that there are multiple large employment centres in the area.  

***  

#Modelling Spatial Relations  
Now  that we can plot the actual number of commuters, let us see if we can model the commuting relationship based on some other information.  
In this case, we care going to use the distance between municipalities, the number of residents, and the number of jobs in each municipality.  
```{r results='hide',warning=FALSE,message=FALSE}
residents <-  read_csv("residents.csv") %>% 
  select(id, weight)

jobs <- read_csv("jobs.csv") %>% 
  select(id, weight)
```

We can create a distance matrix from the centroids we created earlier.  
```{r results='hide',warning=FALSE,message=FALSE}
dist <- st_distance(x = munis.centroid, y = munis.centroid)
```

We need to reformat the resulting distance matrix to a table with three columns (from, to and the distance between those points)
```{r results='hide',warning=FALSE,message=FALSE}
rownames(dist) <- munis.centroid$id
colnames(dist) <- munis.centroid$id
dist <- list(
  source = rownames(dist)[row(dist)] %||% row(dist),
  sink = colnames(dist)[col(dist)] %||% col(dist),
  distance = dist
) %>% 
  map_dfc(as.vector) %>% 
  mutate(source = as.numeric(source)) %>% 
  mutate(sink = as.numeric(sink))
```

Now we join them together  

```{r results='hide',warning=FALSE,message=FALSE}
commute <- commute %>% 
  left_join(dist) %>% 
  left_join(residents, by = c('source' = 'id')) %>% 
  rename(residents = weight) %>% 
  left_join(jobs,by = c('sink'='id')) %>% 
  rename(jobs=weight)
```

Let's do a visual inspection to see if everything went well.  
Comparing the number of commuters and the distance:  

```{r results='hide',warning=FALSE,message=FALSE}
ggplot(commute,aes(interaction,distance))+ 
  geom_point()+
  labs(title="Visual comparision of number of commuters vs distance\n")
```

It looks like the relationship is not linear, perhaps a log-log plot would help:  
```{r results='hide',warning=FALSE,message=FALSE}
ggplot(commute,aes(log(interaction),log(distance)))+ 
  geom_point()+
  labs(title="Log-log plot of number of commuters vs distance\n")
```

A naive linear model based on the available information can be estimated using an OLS regression
```{r}
lm(data=commute, formula = interaction ~ residents+jobs+distance) %>% summary()
```
The summary above shows that the model does not really fit very well. Not only does if have a low adjusted $R^2$ of 0.5003, the p-values for all parameters and the regression itself is very small.  
Lets look at the distribution of the commuters
```{r message=FALSE, warning=FALSE}
ggplot(commute,aes(interaction))+
  geom_histogram()+
  labs(title="Distribution of the interactions of commuters\n")
```
It looks like it follows the Poisson distribution quite closely.  
From the monday's lecture, in spatial interaction modelling we generally thinkg of spatial interactions as:  
$T_{ij} = k\frac{V^\mu_i W^\alpha_j}{d_{ij}^\beta}$  
or rewritten:  
$T_{ij}= kV^\mu_iW^\alpha_jd^{-\beta}_{ij}$  
we can rewrite this again to estimate it with linear estimators:  
$ln T_{ij} = k + \mu ln V_i + \alpha ln W_j - \beta ln d_{ij}$  
```{r}
model <- glm(data = commute,formula = interaction ~log(residents) + log(jobs) + log (distance), family = poisson())
```

GLM doesn't provide $r^2$ by default so we make our own function
```{r}
r2 <- function(empirical, fitted){
  return(cor(empirical, fitted)^2)
}

r2(commute$interaction,fitted(model))
```
The $R^2$ value is pretty close to 1 now, this means that the model is a much better fit now.  
Let's see if we can interpret the rest of the model information.  
```{r}
tidy(model)
```
All coefficients have high test-statistic values and a p-value of 0. This means that they are statistically significant. Our model is:  
&nbsp;  
$Interaction = \beta_0 + \beta_1 log(residents) + \beta_2 log(jobs) + \beta_3 log(distance)$  
&nbsp;  
$= Interaction = 3.357 + 0.8585 log(residents) + 0.9836 log(jobs) - 1.5120 log(distance)$  

<br />  
  
Adding the fitted values, as well as their residuals back to the original data table so we can (visually) inspect them.
```{r}
commute <- commute %>% 
  mutate(fitted = fitted(model)) %>% 
  mutate(residual = interaction - fitted) %>% 
  mutate(residualSign = sign(residual))

commute %>% 
  mutate(lineWidth = fitted / max(fitted) * 10) %>% 
  ggplot() + geom_sf(aes(size = lineWidth))+
  scale_size_identity()+
  labs(title = "Plot of Model of Spacial Relations of Commutes", subtitle="With model values corresponding to line width")
```
<br />
Comparing this with the earlier plot of the actual values, they are actually pretty accurate. Given some lineWidths are different in absolute size but the difference in the lineWidths all correspond to the actual values.  
```{r echo=FALSE}
p2 <- commute %>% 
  mutate(lineWidth = fitted / max(fitted) * 10) %>% 
  ggplot() + geom_sf(aes(size = lineWidth,color=lineWidth))+
  scale_size_identity()+
  labs(subtitle="Model Values")+
  theme(axis.text = element_blank(),legend.position="none")+
  scale_color_continuous(trans='reverse')

plot_grid(p1,p2,align = "h", ncol=2, scale = 1.1) 
```

<br />

For residuals, they can be positive or negative, but since line widths can only be positive, we will use the color aesthetic to indicate whether the residual is positive or negative:
```{r}
commute %>% 
  mutate(lineWidth = abs(residual) / max(residual) * 10) %>% 
  ggplot() +geom_sf(aes(size=lineWidth,color = factor(residualSign))) +
  scale_size_identity()+
  labs(title = "Plot of Residuals of Models", subtitle="With residual values corresponding to line width",color="Residual Sign")
```
[Back to top](#top)

***

#Lab Questions {#LabQ}
Create a plot visualizing all commuting relations in the Netherlands  
_(use a decent filter for this, for example only showing lines with more than 750 commuters)_

```{r echo=TRUE, message=FALSE, warning=FALSE, results="hide"}
knitr::opts_chunk$set(cache=TRUE)
#Read in information
netherlands <- st_read("gem_2016.geojson",crs = 28992)
#Remove water and sea
netherlands <- netherlands %>%
  filter(WATER == 'NEE') %>%
  select(GM_CODE, GM_NAAM)

#read in commute information
nether_commute <-  read_csv("commuting.csv") %>% 
  select(source,sink,weight) %>% 
  rename(interaction = weight)

#filtering table
netherlands <- netherlands %>% 
  mutate(id = as.numeric(str_replace(GM_CODE,"GM","")))
nether_commute <- nether_commute %>% 
  filter(source %in% netherlands$id & sink %in% netherlands$id)

#use centroid to join source and sink to table plus create line
netherlands.centroid <-  st_centroid(netherlands) %>% 
  select(id)
nether_commute <- nether_commute %>% 
  left_join(netherlands.centroid, by = c("source" = "id")) %>% 
  left_join(netherlands.centroid, by = c("sink" = "id")) %>% 
  rowwise() %>% 
  mutate(geometry = st_combine(c(geometry.x,geometry.y)) %>%  st_cast("LINESTRING")) %>% 
  select(-geometry.x,-geometry.y) %>% st_as_sf(crs = 28992)

#trim commutes (keep only lines with more than 750 commuters)
nether_commute_trim <- nether_commute%>% 
  filter(sink != source) %>% 
  filter(interaction >2000) %>% 
  mutate(lineWidth = interaction / max(interaction) * 10)
```
```{r echo=FALSE}
knitr::opts_chunk$set(cache=TRUE)
ggplot(nether_commute_trim) + geom_sf(aes(size = lineWidth))+
  scale_size_identity()+
  labs(title = "Plot of all commuting relations in the Netherlands",subtitle="with > 2000 commuters")+
  theme_grey()
  
```


<br />

Create a summary table for the destinations of commuting in the Netherlands  
_(Tip: use the group_by() and summarize() functions. Which municipalities have the top 3 of most incoming commuters?)  
(Tip: the names of the municipalities are contained in the ‘munis’ object – you can join them to your summary table for easier reading.)_  

<br />

*Summary table for the destinations of commuting in the Netherlands*
```{r}
nethertable <- nether_commute%>% 
  filter(sink != source) %>%
  group_by(sink) %>% 
  summarise(inc_commuters = sum(interaction)) %>% 
  rename(destination = sink)

nethertable
```
  
*adding names by joining municipalities and sorting by incoming commuters*
```{r}
nether_munis= data.frame(id = netherlands$id,name=netherlands$GM_NAAM)
nether_commute_name <- nethertable %>% 
  left_join(nether_munis,by = c('destination'='id'))
  
nether_commute_name[order(-nether_commute_name$inc_commuters),]
```
The municipalities which have the top 3 of most incoming commuters are *Amsterdam*,*Rotterdam* and *Utrecht*.

<br />

#####Run a spatial interaction model for the data for the entire country.  
What’s the model fit (R2)?  
```{r message=FALSE, warning=FALSE}
#create a distance matrix from the centroids
dist_all <- st_distance(x = netherlands.centroid, y = netherlands.centroid)

#reformat the resulting distance matrix to a table with three columns
rownames(dist_all) <- netherlands.centroid$id
colnames(dist_all) <- netherlands.centroid$id
dist_all <- list(
  source = rownames(dist_all)[row(dist_all)] %||% row(dist_all),
  sink = colnames(dist_all)[col(dist_all)] %||% col(dist_all),
  distance = dist_all
) %>% 
  map_dfc(as.vector) %>% 
  mutate(source = as.numeric(source)) %>% 
  mutate(sink = as.numeric(sink))

#join
nether_commute_all <- nether_commute %>% 
  left_join(dist_all) %>% 
  left_join(residents, by = c('source' = 'id')) %>% 
  rename(residents = weight) %>% 
  left_join(jobs,by = c('sink'='id')) %>% 
  rename(jobs=weight)

nether_commute_all <- nether_commute_all%>% 
  filter(sink != source) %>% 
  filter(interaction >0)

model2 <- glm(data = nether_commute_all,formula = interaction ~log(residents) + log(jobs) + log (distance), family = poisson())

summary(model2)

r2(nether_commute_all$interaction,fitted(model2))

```
The model fit for R2 is Interactions = 8.0928051 + 0.6382339log(residents) + 0.8232155log(jobs) -1.6026525log(distance).  
With an $R^2$ value of 0.09636535.

<br />

Can you explain the different fit compared to the model for Utrecht?  
_Include the model parameters in your notebook and explain what the estimates of the coefficients mean._  

This model is a weaker fit with an $R^2$ value of 0.09636535 (vs Utrecht's 0.8639754). This is possibly due to having such a large dataset which could contain large number of outliers, influencing the model. Other factors which are not accounted for could also have influenced the interactions when looking at it from a nationwide perspective versus within a municipality.  

<br />

Look at the residuals and describe where you see the largest residuals.  
_Another way to look at this is to go back to your table of municipalities with most incoming commuters and compare the top 3 with the top 3 based on the fitted values._  
```{r}
#combining the model values with the table
nether_commute_all <- nether_commute_all %>% 
  mutate(fitted = fitted(model2)) %>% 
  mutate(residual = interaction - fitted) %>% 
  mutate(residualSign = sign(residual))

#Summary table for the destinations
#By Residuals
nethertable2 <- nether_commute_all%>% 
  filter(sink != source) %>%
  group_by(sink) %>% 
  summarise(residual = sum(residual)) %>% 
  rename(destination = sink)

nether_commute_name2 <- nethertable2 %>% 
  left_join(nether_munis,by = c('destination'='id'))
  
nether_commute_name2[order(-nether_commute_name2$residual),]
```
```{r eval=FALSE, include=FALSE}
#By fitted values
nethertable3 <- nether_commute_all%>% 
  filter(sink != source) %>%
  group_by(sink) %>% 
  summarise(inc_commuters_fitted = sum(fitted)) %>% 
  rename(destination = sink)

nether_commute_name3 <- nethertable3 %>% 
  left_join(nether_munis,by = c('destination'='id'))

nether_commute_name3[order(-nether_commute_name3$inc_commuters_fitted),]
```


The largest residuals are *Amsterdam*,*Groningen*, and *Utrecht*  

<br /> 

Which municipality has risen to the top?  
Groningen has replaced Rotterdam as 2nd place. Amsterdan and Utrecht still take 1st and 3rd respectively. In fact, Rotterndam is in last(599th) place with the largest negative residual. Meaning that the model severely underestimated the commutes to Rotterdam. Converse

Can you think of an explanation?  
Groningen, though the capital city of the Groningen Province, does not actually have much commutes to the area for its given jobs, distance and residents as it is actually a university city, with close to a fourth of its population being students. Hence, the model overestimates the interaction by a large extent.  
Rotterdam on the other hand, is a sea port much like Singapore and is actually the largest port in Europe (source: wikipedia). This means that there is another factor which accounts for commutes to the area which is not accounted for by our model - the large number of ships arriving at port in Rotterdam!
