This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.

install.packages(c("sf","devtools","tidyverse","broom"))
devtools::install_github("tidyverse/ggplot2")
lapply(c("sf","devtools","tidyverse","broom","ggplot2"),library,character.only=TRUE)
setwd("/Users/blossomtang/Documents/40.hass Making Maps/02221-Lab10")
munis <- st_read("gem_2016.geojson",crs = 28992)
munis <- munis %>%
  filter(WATER == 'NEE') %>%
  select(GM_CODE, GM_NAAM)

provinces <- st_read("provinces.geojson") %>%
  st_transform(crs = 28992)
utrecht <- provinces %>%
  filter(name=='Utrecht')
munis <- munis %>%
  filter(st_intersects(utrecht,st_centroid(munis),sparse=F))

The st_intersects function identifies whether the area of utrecht and the centroid of the selected city in munis intersect. Setting sparse=F returns a list of whether each row intersects or not, while setting it to T would only return the indexes of the rows that do intersect.

commute <- read_csv("commuting.csv") %>%
  select(source,sink,weight) %>%
  rename(interaction=weight)
Parsed with column specification:
cols(
  source = col_integer(),
  sink = col_integer(),
  count = col_integer(),
  weight = col_integer()
)

Now the %in% can match values as they both exist.

commute <- commute %>%
  left_join(munis.centroid, by=c("source"="id")) %>%
  left_join(munis.centroid, by=c("sink"="id")) %>%
  rowwise() %>%
  mutate(geometry=st_combine(c(geometry.x,geometry.y)) %>% st_cast("LINESTRING")) %>%
  select(-geometry.x, -geometry.y) %>%
  st_as_sf(crs=28992)
Error in mutate_impl(.data, dots) : 
  Evaluation error: no applicable method for 'st_geometry' applied to an object of class "NULL".

This chunk matches the coordinates to the entries in commute, groups them by rows, combines the coordinates into a geometry column, then converts the data frame to a simple features (sf) object, which is a collection of points.

The region where many thick lines intersect could be a big city where people travel to every day for work. Alternatively, it could be a very dense housing area where many people live, and offices are for some reason in the outskirts of Utrecht.

residents <- read_csv("residents.csv") %>%
  select(id,weight) 
jobs <- read_csv("jobs.csv") %>%
  select(id,weight)
#couldn't get st_distance to work on my computer, so I got a distance matrix from a classmate
#dist <- st_distance(x=munis.centroid, y=munis.centroid)

#rownames(dist) <- munis.centroid$id
#colnames(dist) <- munis.centroid$id
#dist <- list(
#   source = rownames(dist)[row(dist)] %||% row(dist),
#   sink = colnames(dist)[col(dist)] %||% col(dist),
#   distance = dist
# ) 
# dist <- dist %>%
#   map_dfc(as.vector) %>%
#   mutate(source=as.numeric(source)) %>%
#   mutate(sink=as.numeric(sink))
dist <- read.csv("dist.csv")

commute <- commute %>%
  left_join(dist) %>%
  left_join(residents, by = c('source'='id')) %>%
  rename(residents=weight) %>%
  left_join(jobs, by=c('sink'='id')) %>%
  rename(jobs=weight)

ggplot(commute, aes(interaction, distance)) + geom_point()
ggplot(commute, aes(log(interaction),log(distance))) + geom_point()
lm(data = commute, formula = interaction ~ residents + jobs + distance) %>% summary()

Call:
lm(formula = interaction ~ residents + jobs + distance, data = commute)

Residuals:
   Min     1Q Median     3Q    Max 
-15770  -1150   -309    922  40734 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.553e+03  5.241e+02   4.872 1.47e-06 ***
residents    3.234e-02  3.423e-03   9.449  < 2e-16 ***
jobs         9.666e-02  4.960e-03  19.486  < 2e-16 ***
distance    -1.965e-01  2.255e-02  -8.713  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4632 on 515 degrees of freedom
Multiple R-squared:  0.5032,    Adjusted R-squared:  0.5003 
F-statistic: 173.9 on 3 and 515 DF,  p-value: < 2.2e-16

The R squared value is 0.5003, which means the linear regression only explains about half the variation in data. But each factor has a very small p-value, indicating that they are each important to the model.

summary(model)

Call:
glm(formula = interaction ~ log(residents) + log(jobs) + log(distance), 
    family = poisson(), data = commute)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-94.452  -13.305   -7.521    2.780  119.174  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)     3.3571141  0.0211541   158.7   <2e-16 ***
log(residents)  0.8585323  0.0009390   914.3   <2e-16 ***
log(jobs)       0.9836158  0.0006917  1422.1   <2e-16 ***
log(distance)  -1.5120231  0.0017421  -867.9   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 3474337  on 518  degrees of freedom
Residual deviance:  251357  on 515  degrees of freedom
AIC: 255608

Number of Fisher Scoring iterations: 5

The R^2 value is 0.864, better than the previous linear model. From the summary of the model, we see that each of the factors are still significant as their p-values are very small. Here we are doing a level-log regression, so the coefficients mean that, say, if we increase number of residents by 1%, the interaction would increase by 0.859/100 units.

commute <- commute %>%
  mutate(fitted = fitted(model)) %>%
  mutate(residual = interaction - fitted) %>%
  mutate(residualSign = sign(residual))
commute %>%
  mutate(lineWidth = fitted / max(fitted)*10) %>%
  ggplot() + geom_sf(aes(size=lineWidth)) +
  scale_size_identity()

The line widths are more balanced now, with previously invisible connections becoming more visible.

Assignment

Visualizing all commuting relations

Summary of commuting destinations

commute2 %>%
  arrange(desc(count))
Simple feature collection with 390 features and 2 fields
geometry type:  MULTILINESTRING
dimension:      XY
bbox:           xmin: 24537.79 ymin: 309545.2 xmax: 271400.9 ymax: 612030
epsg (SRID):    28992
proj4string:    +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +towgs84=565.4171,50.3319,465.5524,-0.398957,0.343988,-1.87740,4.0725 +units=m +no_defs

The municipalities with the most incoming number of commuters are Amsterdam, Rotterdam, and ’s-Gravenhage.

Spatial interaction

#add residents and jobs
commute <- commute %>%
  left_join(dist.w) %>%
  left_join(residents, by = c('source'='id')) %>%
  rename(residents=weight) %>%
  left_join(jobs, by=c('sink'='id')) %>%
  rename(jobs=weight)
Error in UseMethod("left_join") : 
  no applicable method for 'left_join' applied to an object of class "c('gg', 'ggplot')"

The R^2 value is 0.183, much lower than the fit for just Utrecht. This could be because there are factors other than residents, jobs, and distance that affect interaction when looking at the scale of a whole country instead of a city. It would take more for people to travel to a different city, maybe to visit family or there could be patterns of tourism or logistics aside from traveling for work. The three factors all have smaller magnitude coefficients than before, explaining less of the interaction variation although they are still significant. However, their signs remain the same, which makes sense - there is more interaction if there are more residents and jobs, but less between places far from each other.

The largest residuals are at a couple of cities on the West side of the Netherlands. There are also a lot of negative residuals around North Holland and the places with high interaction.

#finding top 3 based on fitted values
commute %>%
  mutate(fitted = fitted(model)) %>%
  group_by(sink) %>%
  summarize(sum = sum(fitted)) %>%
  left_join(as.data.frame(munis.w.centroid3), by=c("sink"="id")) %>%
  arrange(desc(sum))
Simple feature collection with 388 features and 3 fields
Active geometry column: geometry.x
geometry type:  GEOMETRY
dimension:      XY
bbox:           xmin: 24537.79 ymin: 309545.2 xmax: 271400.9 ymax: 607189.8
epsg (SRID):    28992
proj4string:    +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +towgs84=565.4171,50.3319,465.5524,-0.398957,0.343988,-1.87740,4.0725 +units=m +no_defs

This is a table of the top three destinations with the highest fitted interaction. When ordered by fitted interaction instead of actual counts, the top three destinations are Rotterdam, Amsterdam, and Utrecht. Maybe the fitting over-estimates the interaction for Rotterdam because it is quite central or has a large population/number of jobs without people actually traveling there.

---
title: "R Notebook"
output: html_notebook
---

This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. When you execute code within the notebook, the results appear beneath the code. 

Try executing this chunk by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Cmd+Shift+Enter*. 

```{r}
install.packages(c("sf","devtools","tidyverse","broom"))
devtools::install_github("tidyverse/ggplot2")
lapply(c("sf","devtools","tidyverse","broom","ggplot2"),library,character.only=TRUE)

```
```{r}
setwd("/Users/blossomtang/Documents/40.hass Making Maps/02221-Lab10")
munis <- st_read("gem_2016.geojson",crs = 28992)
munis <- munis %>%
  filter(WATER == 'NEE') %>%
  select(GM_CODE, GM_NAAM)

provinces <- st_read("provinces.geojson") %>%
  st_transform(crs = 28992)
utrecht <- provinces %>%
  filter(name=='Utrecht')
munis <- munis %>%
  filter(st_intersects(utrecht,st_centroid(munis),sparse=F))

```

The st_intersects function identifies whether the area of utrecht and the centroid of the selected city in munis intersect. Setting sparse=F returns a list of whether each row intersects or not, while setting it to T would only return the indexes of the rows that do intersect.

```{r}
commute <- read_csv("commuting.csv") %>%
  select(source,sink,weight) %>%
  rename(interaction=weight)

#commute <- commute %>%
#  filter(source %in% munis$id & sink %in% munis$id)
#commute has 0 rows now because munis does not have an id column

munis <- munis %>%
  mutate(id=as.numeric(str_replace(GM_CODE, "GM","")))
commute <- commute %>%
 filter(source %in% munis$id & sink %in% munis$id)
```
Now the %in% can match values as they both exist.
```{r}
munis.centroid <- st_centroid(munis) %>%
  select(id)

commute <- commute %>%
  left_join(munis.centroid, by=c("source"="id")) %>%
  left_join(munis.centroid, by=c("sink"="id")) %>%
  rowwise() %>%
  mutate(geometry=st_combine(c(geometry.x,geometry.y)) %>% st_cast("LINESTRING")) %>%
  select(-geometry.x, -geometry.y) %>%
  st_as_sf(crs=28992)
```
This chunk matches the coordinates to the entries in commute, groups them by rows, combines the coordinates into a geometry column, then converts the data frame to a simple features (sf) object, which is a collection of points.

```{r}
ggplot(commute) + geom_sf()
ggplot(munis) + geom_sf() + geom_sf(data=commute)
commute <- commute %>%
  filter(sink != source) %>%
  filter(interaction > 20) %>%
  mutate(lineWidth=as.numeric(interaction)/max(as.numeric(interaction))*10)
ggplot(commute) + geom_sf(aes(size=lineWidth)) + scale_size_identity()
```
The region where many thick lines intersect could be a big city where people travel to every day for work. Alternatively, it could be a very dense housing area where many people live, and offices are for some reason in the outskirts of Utrecht.

```{r}
residents <- read_csv("residents.csv") %>%
  select(id,weight) 
jobs <- read_csv("jobs.csv") %>%
  select(id,weight)
#couldn't get st_distance to work on my computer, so I got a distance matrix from a classmate
#dist <- st_distance(x=munis.centroid, y=munis.centroid)

#rownames(dist) <- munis.centroid$id
#colnames(dist) <- munis.centroid$id
#dist <- list(
#   source = rownames(dist)[row(dist)] %||% row(dist),
#   sink = colnames(dist)[col(dist)] %||% col(dist),
#   distance = dist
# ) 
# dist <- dist %>%
#   map_dfc(as.vector) %>%
#   mutate(source=as.numeric(source)) %>%
#   mutate(sink=as.numeric(sink))
dist <- read.csv("dist.csv")

commute <- commute %>%
  left_join(dist) %>%
  left_join(residents, by = c('source'='id')) %>%
  rename(residents=weight) %>%
  left_join(jobs, by=c('sink'='id')) %>%
  rename(jobs=weight)

ggplot(commute, aes(interaction, distance)) + geom_point()
ggplot(commute, aes(log(interaction),log(distance))) + geom_point()

```
```{r}
lm(data = commute, formula = interaction ~ residents + jobs + distance) %>% summary()
```
The R squared value is 0.5003, which means the linear regression only explains about half the variation in data. But each factor has a very small p-value, indicating that they are each important to the model.

```{r}
ggplot(commute,aes(interaction)) + geom_histogram()

model <- glm(data = commute, formula = interaction ~ log(residents) + log(jobs) + log(distance), family = poisson())

r2 <- function(empirical, fitted) {
  return(cor(empirical, fitted)^2)
}
r2(commute$interaction, fitted(model))
summary(model)

```
The R^2 value is 0.864, better than the previous linear model. From the summary of the model, we see that each of the factors are still significant as their p-values are very small. Here we are doing a level-log regression, so the coefficients mean that, say, if we increase number of residents by 1%, the interaction would increase by 0.859/100 units.

```{r}
commute <- commute %>%
  mutate(fitted = fitted(model)) %>%
  mutate(residual = interaction - fitted) %>%
  mutate(residualSign = sign(residual))

commute %>%
  mutate(lineWidth = fitted / max(fitted)*10) %>%
  ggplot() + geom_sf(aes(size=lineWidth)) +
  scale_size_identity()
```
The line widths are more balanced now, with previously invisible connections becoming more visible.
```{r}
commute %>%
  mutate(lineWidth = abs(residual)/max(residual)*10) %>%
  ggplot() + geom_sf(aes(color = factor(residualSign),size=lineWidth)) + scale_size_identity()
```

<b>Assignment</b>

Visualizing all commuting relations
```{r}
munis.w <- st_read("gem_2016.geojson",crs = 28992)
munis.w <- munis.w %>%
  filter(WATER == 'NEE') %>%
  select(GM_CODE, GM_NAAM)

commute <- read_csv("commuting.csv") %>%
  select(source,sink,weight,count) %>%
  rename(interaction=weight) %>%
  filter(count > 750)

munis.w <- munis.w %>%
  mutate(id=as.numeric(str_replace(GM_CODE, "GM","")))
commute <- commute %>%
 filter(source %in% munis.w$id & sink %in% munis.w$id)

munis.w.centroid <- st_centroid(munis.w) %>%
  select(id)

commute <- commute %>%
  left_join(munis.w.centroid, by=c("source"="id")) %>%
  left_join(munis.w.centroid, by=c("sink"="id")) %>%
  rowwise() %>%
  mutate(geometry=st_combine(c(geometry.x,geometry.y)) %>% st_cast("LINESTRING")) %>%
  select(-geometry.x, -geometry.y) %>%
  st_as_sf(crs=28992)

commute <- commute %>%
  filter(sink != source) %>%
  filter(interaction > 20) %>%
  mutate(lineWidth=as.numeric(interaction)/max(as.numeric(interaction))*10) 
ggplot(munis.w) + geom_sf() + geom_sf(data=commute,aes(size=lineWidth)) + scale_size_identity()

```

Summary of commuting destinations
```{r}

munis.w <- st_read("gem_2016.geojson",crs = 28992)
munis.w <- munis.w %>%
  filter(WATER == 'NEE') %>%
  select(GM_CODE, GM_NAAM)

commute2 <- read_csv("commuting.csv") %>%
  select(source,sink,weight,count) %>%
  rename(interaction=weight)

munis.w <- munis.w %>%
  mutate(id=as.numeric(str_replace(GM_CODE, "GM","")))
commute <- commute %>%
 filter(source %in% munis.w$id & sink %in% munis.w$id)

munis.w.centroid2 <- st_centroid(munis.w) %>%
  select(id,GM_NAAM)

commute2 <- commute2 %>%
  left_join(munis.w.centroid2, by=c("source"="id")) %>%
  left_join(munis.w.centroid2, by=c("sink"="id")) %>%
  rowwise() %>%
  mutate(geometry=st_combine(c(geometry.x,geometry.y)) %>% st_cast("LINESTRING")) %>%
  select(-geometry.x, -geometry.y) %>%
  st_as_sf(crs=28992)

commute2 <- commute2 %>%
  group_by(GM_NAAM.y) %>%
  summarize(count = sum(count))
install.packages("dplyr")
library(dplyr)
commute2 %>%
  arrange(desc(count))




```
The municipalities with the most incoming number of commuters are Amsterdam, Rotterdam, and 's-Gravenhage.

Spatial interaction
```{r}
#dist <- st_distance(x=munis.centroid.w, y=munis.centroid.w)

#distance matrix
# dist2 <- read.csv("distorig.csv")
# 
# rownames(dist2) <- munis.w.centroid$id
# colnames(dist2) <- munis.w.centroid$id
# dist2 <- list(
#   source = rownames(dist2)[row(dist2)] %||% row(dist2),
#   sink = colnames(dist2)[col(dist2)] %||% col(dist2),
#   distance = dist2
# ) 
# dist2 %>%
#   map_dfc(as.vector) %>%
#   mutate(source=as.numeric(source)) %>%
#   mutate(sink=as.numeric(sink))
dist.w <- read.csv("distcool.csv")


#add residents and jobs
commute <- commute %>%
  left_join(dist.w) %>%
  left_join(residents, by = c('source'='id')) %>%
  rename(residents=weight) %>%
  left_join(jobs, by=c('sink'='id')) %>%
  rename(jobs=weight)

model <- glm(data = commute, formula = interaction ~ log(residents) + log(jobs) + log(distance), family = poisson())

r2(commute$interaction, fitted(model))
summary(model)
```
The R^2 value is 0.183, much lower than the fit for just Utrecht. This could be because there are factors other than residents, jobs, and distance that affect interaction when looking at the scale of a whole country instead of a city. It would take more for people to travel to a different city, maybe to visit family or there could be patterns of tourism or logistics aside from traveling for work. The three factors all have smaller magnitude coefficients than before, explaining less of the interaction variation although they are still significant. However, their signs remain the same, which makes sense - there is more interaction if there are more residents and jobs, but less between places far from each other.

```{r}
commute %>%
  mutate(fitted = fitted(model)) %>%
  mutate(residual = interaction - fitted) %>%
  mutate(residualSign = sign(residual)) %>%
  mutate(lineWidth = abs(residual)/max(residual)*5) %>%  
  ggplot() + geom_sf(aes(color = factor(residualSign),size=lineWidth)) +
  scale_size_identity()

```

The largest residuals are at a couple of cities on the West side of the Netherlands. There are also a lot of negative residuals around North Holland and the places with high interaction.

```{r}
munis.w.centroid3 <- st_centroid(munis.w) %>%
  select(id,GM_NAAM)
#finding top 3 based on fitted values
commute %>%
  mutate(fitted = fitted(model)) %>%
  group_by(sink) %>%
  summarize(sum = sum(fitted)) %>%
  left_join(as.data.frame(munis.w.centroid3), by=c("sink"="id")) %>%
  arrange(desc(sum))

```

This is a table of the top three destinations with the highest fitted interaction. When ordered by fitted interaction instead of actual counts, the top three destinations are Rotterdam, Amsterdam, and Utrecht. Maybe the fitting over-estimates the interaction for Rotterdam because it is quite central or has a large population/number of jobs without people actually traveling there.

