This is the four notebook which Geomatica Basica students have to write to get started with R & RStudio. It aims at learning how to join non-spatial attributes to geospatial data. We will illustrate the topic by joining a spatial feature representing municipalities in Boyacá (i.e. a shapefile provided by DANE, 2017) and a non-spatial data frame with agriculture statistics (i.e a csv provided by the Ministerio de Agricultura y Desarrollo Rural, 2020).
Let’s start loading several libraries, in particular dplyr:
# install packages before anything
library(tidyverse)
library(dplyr)
library(ggplot2)
library(sf)
We start by reading a file with EVA statistics for Boyaca.
eva_boyaca <- read_csv(file = "../../datos/EVA_Boyaca_2019.csv")
eva_boyaca
We will use the shapefile corresponding to Marco Geoestadistico Departamental which is available at DANE Geoportal.
Let’s start reading the data using the sf library:
mun_boyaca <- sf::st_read("../../datos/shapefiles/15_BOYACA/ADMINISTRATIVO/MGN_MPIO_POLITICO.shp")
## Reading layer `MGN_MPIO_POLITICO' from data source `/Users/ivanlizarazo/Documents/ivan/GB/datos/shapefiles/15_BOYACA/ADMINISTRATIVO/MGN_MPIO_POLITICO.shp' using driver `ESRI Shapefile'
## Simple feature collection with 123 features and 2 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: -74.66496 ymin: 4.655196 xmax: -71.94885 ymax: 7.055557
## geographic CRS: WGS 84
What is mun_boyaca?
mun_boyaca
Note that mun_boyaca is a simple feature collection. Make sure to review this link for understanding what is a simple feature.
Note also that the data uses the WGS1984 geographic coordinate reference system (i.e. 4326 epsg code).
Let’s calculate area of each municipality:
mun_boyaca$KM2 <- st_area(st_transform(mun_boyaca, 3116))/1E6
mun_boyaca$KM2 <- as.numeric(mun_boyaca$KM2)
mun_boyaca$KM2 <- round(mun_boyaca$KM2,3)
min(mun_boyaca$KM2)
## [1] 25.353
max(mun_boyaca$KM2)
## [1] 1513.601
Let’s make a map of municipalities:
library(leaflet)
bins <- c(0, 150, 300, 450, 600, 750, 900, 1200, 1600)
pal <- colorBin("RdYlGn", domain = mun_boyaca$KM2, bins = bins)
mapa <- leaflet(data = mun_boyaca) %>%
addTiles() %>%
addPolygons(label = ~KM2,
popup = ~MPIO_CNMBR,
fillColor = ~pal(KM2),
color = "#444444",
weight = 1,
smoothFactor = 0.5,
opacity = 1.0,
fillOpacity = 0.5,
highlightOptions = highlightOptions(color = "white", weight = 2, bringToFront = TRUE)
) %>%
addProviderTiles(providers$OpenStreetMap) %>%
addLegend("bottomright", pal = pal, values = ~KM2,
title = "Municipalities extent [Km2] (DANE, 2018)",
opacity = 1
)
mapa
We can use the left_join function, provided by the dplyr package, to join the municipalities and the agricultural statistics.
Let’s review how left_join works:
Left joins are a type of mutating join, since they simply add columns to the first table. To perform a left join with dplyr, call left_join(), passing two tibbles and a character vector of columns to join on.
# do not uncomment
# this is just an explanation of the function syntax
#left_join(a_tibble, another_tibble, by = c("id_col1", "id_col2"))
We need a common attribute (or shared variable) to base the left join on. The best attribute is an ID. In mun_boyaca, the MPIO_CCDGO attribute seems fine (it reads 15001 for Tunja). In eva_boyaca, the corresponding attribute is COD_MUN (it reads 15001 for Tunja).
However, the MPIO_CCDGO attribute is a string:
class(mun_boyaca$MPIO_CCDGO)
## [1] "character"
And the COD_MUN attribute is numeric:
class(eva_boyaca$COD_MUN)
## [1] "numeric"
To be able to do the join, we need to change both data type and contents for the code which identifies every municipality. For this task, it is a good idea to create a copy of the original statistical data. Using this approach, any false movement will not spoil the original data.
Let’s proceed step by step:
mun_boyaca$COD_MUN <- as.double(mun_boyaca$MPIO_CCDGO)
class(mun_boyaca$COD_MUN)
## [1] "numeric"
We saw in a previous notebook several functions for “editing” a tibble. Let’s filter the EVA tibble by crop type (e.g. an important crop as identified in the third notebook). In addition, let’s select several columns.
papa_boyaca <- eva_boyaca %>% filter(CULTIVO == "PAPA") %>% dplyr::select(MUNICIPIO, COD_MUN, YEAR, PERIODO, TON_PROD, RENDIM)
papa_boyaca
head(papa_boyaca)
tail(papa_boyaca)
unique(papa_boyaca$YEAR)
## [1] 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018
unique(papa_boyaca$PERIODO)
## [1] "2006B" "2007A" "2007B" "2008A" "2008B" "2009A" "2009B" "2010A" "2010B"
## [10] "2011A" "2011B" "2012A" "2012B" "2013A" "2013B" "2014A" "2014B" "2015A"
## [19] "2015B" "2016A" "2016B" "2017A" "2017B" "2018A"
papa_boyaca %>%
gather("YEAR", "PERIODO", "TON_PROD", "RENDIM" , key = variable, value = number)
head(papa_boyaca)
tail(papa_boyaca)
summary(papa_boyaca)
## MUNICIPIO COD_MUN YEAR PERIODO
## Length:2228 Min. :15001 Min. :2006 Length:2228
## Class :character 1st Qu.:15226 1st Qu.:2009 Class :character
## Mode :character Median :15494 Median :2012 Mode :character
## Mean :15475 Mean :2012
## 3rd Qu.:15757 3rd Qu.:2015
## Max. :15879 Max. :2018
## TON_PROD RENDIM
## Min. : 2 Min. : 0.50
## 1st Qu.: 180 1st Qu.:10.00
## Median : 900 Median :13.00
## Mean : 4298 Mean :14.14
## 3rd Qu.: 3721 3rd Qu.:17.68
## Max. :95450 Max. :44.55
Before going further, let’s review several functions provided by the tidyr package: - gather() makes “wide” data longer - spread() makes “long” data wider - separate() splits a single column into multiple columns - unite() combines multiple columns into a single column
papa_boyaca %>% replace(is.na(.), 0) -> papa_boyaca2
papa_boyaca %>% group_by(MUNICIPIO, COD_MUN, YEAR) %>%
summarize(TON_PROD=sum(TON_PROD)) -> papa_boyaca2
## `summarise()` has grouped output by 'MUNICIPIO', 'COD_MUN'. You can override using the `.groups` argument.
head(papa_boyaca2)
tail(papa_boyaca2)
The following chunk a code seems hard to understand. I would suggest that you study the documentation of the functions group, gather, unite, and pivot to gain an appropriate understanding of the chunk:
papa_boyaca2 %>%
group_by(COD_MUN) %>%
gather("TON_PROD", key = variable, value = number) %>%
unite(combi, variable, YEAR) %>%
pivot_wider(names_from = combi, values_from = number, values_fill = 0) -> papa_boyaca3
Let’s check the head of the output:
head(papa_boyaca3)
Let’s check the tail of the output:
tail(papa_boyaca3)
Now, we can join the non-spatial attributes to the spatial simple feature object:
mun_boyaca_papa = left_join(mun_boyaca, papa_boyaca3, by="COD_MUN")
summary(mun_boyaca_papa)
## MPIO_CCDGO MPIO_CNMBR KM2 COD_MUN
## Length:123 Length:123 Min. : 25.35 Min. :15001
## Class :character Class :character 1st Qu.: 64.26 1st Qu.:15237
## Mode :character Mode :character Median : 122.70 Median :15500
## Mean : 188.11 Mean :15482
## 3rd Qu.: 203.16 3rd Qu.:15722
## Max. :1513.60 Max. :15897
##
## MUNICIPIO TON_PROD_2007 TON_PROD_2008 TON_PROD_2009
## Length:123 Min. : 0 Min. : 0 Min. : 0.0
## Class :character 1st Qu.: 359 1st Qu.: 168 1st Qu.: 236.5
## Mode :character Median : 2155 Median : 1320 Median : 1715.0
## Mean : 7036 Mean : 7660 Mean : 9085.0
## 3rd Qu.: 7238 3rd Qu.: 7428 3rd Qu.: 7802.5
## Max. :63000 Max. :64000 Max. :181575.0
## NA's :31 NA's :31 NA's :31
## TON_PROD_2010 TON_PROD_2011 TON_PROD_2012 TON_PROD_2013
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 274.5 1st Qu.: 251.2 1st Qu.: 264.2 1st Qu.: 265.5
## Median : 1887.5 Median : 1443.0 Median : 2365.5 Median : 2184.0
## Mean : 9073.2 Mean : 8069.9 Mean : 7812.2 Mean : 9058.9
## 3rd Qu.: 6232.5 3rd Qu.: 6232.5 3rd Qu.: 7312.5 3rd Qu.: 8655.0
## Max. :162990.0 Max. :116510.0 Max. :98160.0 Max. :96000.0
## NA's :31 NA's :31 NA's :31 NA's :31
## TON_PROD_2014 TON_PROD_2015 TON_PROD_2016 TON_PROD_2017
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0
## 1st Qu.: 228.2 1st Qu.: 279.5 1st Qu.: 238.5 1st Qu.: 255
## Median : 1965.0 Median : 1833.5 Median : 1500.0 Median : 1645
## Mean : 9408.5 Mean : 7889.8 Mean : 10407.7 Mean : 9347
## 3rd Qu.: 9987.5 3rd Qu.: 6852.0 3rd Qu.: 7762.5 3rd Qu.: 7200
## Max. :101140.0 Max. :86000.0 Max. :115500.0 Max. :99100
## NA's :31 NA's :31 NA's :31 NA's :31
## TON_PROD_2018 TON_PROD_2006 geometry
## Min. : 0.0 Min. : 0.0 POLYGON :123
## 1st Qu.: 223.5 1st Qu.: 193.2 epsg:4326 : 0
## Median : 949.0 Median : 1150.0 +proj=long...: 0
## Mean : 5997.5 Mean : 3233.8
## 3rd Qu.: 4050.0 3rd Qu.: 3516.2
## Max. :66750.0 Max. :30000.0
## NA's :31 NA's :31
head(mun_boyaca_papa[,1:10])
tail(mun_boyaca_papa[,1:10])
We can see that CUBARA and BERBEO municipalities do not have values for the TON_PROD and REND attributes for the PAPA crop.
Why this happened? It is because the papa statistics for these municipalities are not available (NA).
papa_cubara <- papa_boyaca2 %>% filter(COD_MUN==15223)
papa_cubara
Let’s load the libraries needed to make the map:
#install.packages("RColorBrewer")
library(RColorBrewer)
library(leaflet)
Let’s make a map of total production of potato in 2017:
bins <- c(0, 500, 1000, 5000, 10000, 15000, 25000, 50000, 75000)
pal <- colorBin("YlOrRd", domain = mun_boyaca_papa$TON_PROD_2017, bins = bins)
mapa <- leaflet(data = mun_boyaca_papa) %>%
addTiles() %>%
addPolygons(label = ~TON_PROD_2017,
popup = ~MPIO_CNMBR,
fillColor = ~pal(TON_PROD_2017),
color = "#444444",
weight = 1,
smoothFactor = 0.5,
opacity = 1.0,
fillOpacity = 0.5,
highlightOptions = highlightOptions(color = "white", weight = 2, bringToFront = TRUE)
) %>%
addProviderTiles(providers$OpenStreetMap) %>%
addLegend("bottomright", pal = pal, values = ~TON_PROD_2017,
title = "Potato crop production in Boyaca [Ton] (2017)",
opacity = 1
)
mapa
Now, we are going to repeat the process to make a different map, in this case the harvested area of potato crop in every municipality.
Let’s remind the EVA variables names:
summary(eva_boyaca)
## COD_DPTO DPTO COD_MUN MUNICIPIO
## Min. :15 Length:20576 Min. :15001 Length:20576
## 1st Qu.:15 Class :character 1st Qu.:15238 Class :character
## Median :15 Mode :character Median :15511 Mode :character
## Mean :15 Mean :15486
## 3rd Qu.:15 3rd Qu.:15753
## Max. :15 Max. :15897
##
## GRUPO SUBGRUPO CULTIVO DESAGREG
## Length:20576 Length:20576 Length:20576 Length:20576
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## YEAR PERIODO HA_SIEMBRA HA_COSECHA
## Min. :2006 Length:20576 Min. : 0.00 Min. : 0.00
## 1st Qu.:2009 Class :character 1st Qu.: 7.00 1st Qu.: 6.00
## Median :2012 Mode :character Median : 20.00 Median : 18.00
## Mean :2012 Mean : 94.11 Mean : 81.19
## 3rd Qu.:2015 3rd Qu.: 61.00 3rd Qu.: 55.00
## Max. :2018 Max. :8443.00 Max. :7528.00
##
## TON_PROD RENDIM CICLO
## Min. : 0.0 Min. : 0.07 Length:20576
## 1st Qu.: 17.0 1st Qu.: 1.29 Class :character
## Median : 66.0 Median : 4.80 Mode :character
## Mean : 959.8 Mean : 9.93
## 3rd Qu.: 315.0 3rd Qu.: 12.00
## Max. :95450.0 Max. :200.00
## NA's :193
Now, apply the filter & select functions to get the area cosechada en papa:
papa_boyaca <- eva_boyaca %>% filter(CULTIVO == "PAPA") %>% dplyr::select(MUNICIPIO, COD_MUN, YEAR, PERIODO, HA_COSECHA)
Then, let’s aggregate PERIODO values into YEAR:
papa_boyaca %>% group_by(MUNICIPIO, COD_MUN, YEAR) %>%
summarize(HA_COSECHA=sum(HA_COSECHA)) -> papa_boyaca2
Then, let’s reformat the statistics into wider format:
papa_boyaca2 %>%
group_by(COD_MUN) %>%
gather("HA_COSECHA", key = variable, value = number) %>%
unite(combi, variable, YEAR) %>%
pivot_wider(names_from = combi, values_from = number, values_fill = 0) -> papa_boyaca3
Now, it’s join time:
mun_boyaca_papa = left_join(mun_boyaca, papa_boyaca3, by="COD_MUN")
Let’s check the output:
mun_boyaca_papa
Finally, mapping time:
bins <- c(0, 100, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000)
pal <- colorBin("RdYlGn", domain = mun_boyaca_papa$HA_COSECHA_2017, bins = bins)
mapa <- leaflet(data = mun_boyaca_papa) %>%
addTiles() %>%
addPolygons(label = ~HA_COSECHA_2017,
popup = ~MPIO_CNMBR,
fillColor = ~pal(HA_COSECHA_2017),
color = "#444444",
weight = 1,
smoothFactor = 0.5,
opacity = 1.0,
fillOpacity = 0.5,
highlightOptions = highlightOptions(color = "white", weight = 2, bringToFront = TRUE)
) %>%
addProviderTiles(providers$OpenStreetMap) %>%
addLegend("bottomright", pal = pal, values = ~HA_COSECHA_2017,
title = "Potato harvested area in Boyaca [Ha] (2017)",
opacity = 1
)
The complete map can be displayed:
mapa
Feel free to explore more functionalities of the packages used here to examine and map agriculture trends in your department.
sessionInfo()
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] RColorBrewer_1.1-2 leaflet_2.0.4.1 sf_0.9-7 forcats_0.5.1
## [5] stringr_1.4.0 dplyr_1.0.4 purrr_0.3.4 readr_1.4.0
## [9] tidyr_1.1.2 tibble_3.0.6 ggplot2_3.3.3 tidyverse_1.3.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.0 xfun_0.21 bslib_0.2.4
## [4] haven_2.3.1 leaflet.providers_1.9.0 colorspace_2.0-0
## [7] vctrs_0.3.6 generics_0.1.0 htmltools_0.5.1.1
## [10] yaml_2.2.1 rlang_0.4.10 e1071_1.7-4
## [13] jquerylib_0.1.3 pillar_1.4.7 glue_1.4.2
## [16] withr_2.4.1 DBI_1.1.1 dbplyr_2.1.0
## [19] modelr_0.1.8 readxl_1.3.1 lifecycle_1.0.0
## [22] munsell_0.5.0 gtable_0.3.0 cellranger_1.1.0
## [25] rvest_0.3.6 htmlwidgets_1.5.3 evaluate_0.14
## [28] knitr_1.31 crosstalk_1.1.1 class_7.3-18
## [31] broom_0.7.5 Rcpp_1.0.6 KernSmooth_2.23-18
## [34] scales_1.1.1 backports_1.2.1 classInt_0.4-3
## [37] jsonlite_1.7.2 farver_2.0.3 fs_1.5.0
## [40] hms_1.0.0 digest_0.6.27 stringi_1.5.3
## [43] grid_4.0.4 cli_2.3.0 tools_4.0.4
## [46] magrittr_2.0.1 sass_0.3.1 crayon_1.4.1
## [49] pkgconfig_2.0.3 ellipsis_0.3.1 xml2_1.3.2
## [52] reprex_1.0.0 lubridate_1.7.9.2 assertthat_0.2.1
## [55] rmarkdown_2.7 httr_1.4.2 rstudioapi_0.13
## [58] R6_2.5.0 units_0.6-7 compiler_4.0.4