Our Fourth R Notebook: Joining geospatial data and non-spatial attributes

1. Introduction

This is our fourth R notebook :Joining geospatiaal data and non-spatial attributes, written using R Markdown Notebook. It aims at learning how to join non-spatial attributes to geospatial data. We will illustrate the topic by joining a spatial feature representing municipalities in Sucre (i.e. a shapefile provided by DANE, 2017) and a non-spatial data frame with agriculture statistics (i.e a csv provided by the Ministerio de Agricultura y Desarrollo Rural, 2020).

2. Loading the requiered libraries

library(RColorBrewer)
library(tidyverse)
library(dplyr)
library(ggplot2)
library(leaflet)
library(sf)

3. Reading a table with Sucre’s agriculture statistics

We start by reading a file with EVA statistics and filter for the Sucre departament.

EVA <- read.csv(file = "/Users/Acer/Documents/Materias U/Geomatica Basica/Evaluaciones_Agropecuarias_Municipales_EVA.csv", header=T, sep = ";") #Load the all file

eva_sucre <- filter(EVA, DEPARTAMENTO == "SUCRE");eva_sucre; #Show the file only with Sucre

4. Reading a shapefile with Boyaca’s municipalities

We will use the shapefile corresponding to Marco Geoestadistico Departamental which is available at DANE Geoportal.

Let’s start reading the data using the sf library:

#Leer el archivo .shp que contiene los poligonos de cada municipio del Sucre
mun_sucre <- read_sf("/Users/Acer/Documents/Materias U/Geomatica Basica/70_SUCRE/ADMINISTRATIVO/MGN_MPIO_POLITICO.shp"); select(mun_sucre, MPIO_CCDGO, MPIO_CNMBR,)

Note that mun_sucre is a simple feature collection. Make sure to review this link for understanding what is a simple feature.

Note also that the data uses the WGS1984 geographic coordinate reference system (i.e. 4326 epsg code).

Let’s calculate area of each municipality:

mun_sucre$KM2 <- st_area(st_transform(mun_sucre, 3116))/1E6
mun_sucre$KM2 <- as.numeric(mun_sucre$KM2)
mun_sucre$KM2 <- round(mun_sucre$KM2,3)
min(mun_sucre$KM2)

## [1] 56.555

max(mun_sucre$KM2)

## [1] 1494.163

Let’s make a map of municipalities:

bins <- c(56, 186.7, 317.4, 448.1, 578.8, 709.5, 840.2, 970.9, 1101.6,1232.3, 1495)
pal <- colorBin("RdBu", domain = mun_sucre$KM2, bins = bins)

  mapa <- leaflet(data = mun_sucre) %>%
  addTiles() %>%
  addPolygons(label = ~KM2,
              popup = ~MPIO_CNMBR,
              fillColor = ~pal(KM2),
              color = "#444444",
              weight = 1,
              smoothFactor = 0.5,
              opacity = 1.0,
              fillOpacity = 0.5,
              highlightOptions = highlightOptions(color = "white", weight = 2, bringToFront = TRUE)
              ) %>%
  addProviderTiles(providers$OpenStreetMap) %>%
  addLegend("bottomright", pal = pal, values = ~KM2,
    title = "Municipalities extent [Km2] (DANE, 2018)",
    opacity = 1
  );mapa

5. Joining attributes to the spatial object

5.1 The joining task

We can use the left_join function, provided by the dplyr package, to join the municipalities and the agricultural statistics. We need a common attribute (or shared variable) to base the left join on. The best attribute is an ID. In mun_sucre, the MPIO_CCDGO attribute seems fine (it reads 70001 for Sincelejo). In eva_sucre, the corresponding attribute is COD_MUN. (it reads 70001 for Sincelejo).

Now let’s see what type these attributes are

class(mun_sucre$MPIO_CCDGO); class(eva_sucre$COD_MUN.)

## [1] "character"

## [1] "integer"

To be able to do the join, we need to change both data type and contents for the code which identifies every municipality. For this task, it is a good idea to create a copy of the original statistical data. Using this approach, any false movement will not spoil the original data.

5.2 Editing tasks

Let’s proceed step by step:

mun_sucre$COD_MUN. <-  as.integer(mun_sucre$MPIO_CCDGO); class(mun_sucre$COD_MUN.); mun_sucre

## [1] "integer"

5.3 Selecting, filtering, and tyding statistics:

We saw in a previous notebook several functions for “editing” a tibble. Let’s filter the EVA tibble by crop type (e.g. an important crop as identified in the third notebook). In addition, let’s select several columns.

arroz_sucre <- eva_sucre %>%  filter(CULTIVO == "ARROZ")  %>%  dplyr::select(MUNICIPIO, COD_MUN., YEAR, PERIODO, TON_PRODUCCION, RENDIMIENTO); arroz_sucre

unique(arroz_sucre$YEAR)

##  [1] 2010 2006 2007 2008 2009 2011 2012 2013 2014 2015 2016 2017 2018

unique(arroz_sucre$PERIODO)

##  [1] "2010A" "2010B" "2006B" "2007A" "2007B" "2008A" "2008B" "2009A" "2009B"
## [10] "2011B" "2011A" "2012B" "2012A" "2013A" "2013B" "2014A" "2014B" "2015A"
## [19] "2015B" "2016A" "2016B" "2017A" "2017B" "2018A"

arroz_sucre%>%
  gather("YEAR","PERIODO","TON_PRODUCCION","RENDIMIENTO", key = variable, value = number)

summary(arroz_sucre)

##   MUNICIPIO            COD_MUN.          YEAR        PERIODO         
##  Length:634         Min.   :  704   Min.   :2006   Length:634        
##  Class :character   1st Qu.:70204   1st Qu.:2009   Class :character  
##  Mode  :character   Median :70265   Median :2012   Mode  :character  
##                     Mean   :58914   Mean   :2012                     
##                     3rd Qu.:70708   3rd Qu.:2015                     
##                     Max.   :70823   Max.   :2018                     
##                                                                      
##  TON_PRODUCCION      RENDIMIENTO   
##  Min.   :    0.00   Min.   :0.300  
##  1st Qu.:   25.25   1st Qu.:1.500  
##  Median :  145.00   Median :2.800  
##  Mean   : 1185.19   Mean   :2.752  
##  3rd Qu.:  566.00   3rd Qu.:3.890  
##  Max.   :47018.00   Max.   :6.500  
##                     NA's   :5

5.4 Converting from long format into wide format

Before going further, let’s review several functions provided by the tidyr package: - gather() makes “wide” data longer - spread() makes “long” data wider - separate() splits a single column into multiple columns - unite() combines multiple columns into a single column

arroz_sucre %>% replace(is.na(.), 0) -> arroz_sucre2

arroz_sucre %>% group_by(MUNICIPIO, COD_MUN., YEAR) %>%
   summarize(TON_PROD=sum(TON_PRODUCCION)) -> arroz_sucre2 ; head(arroz_sucre2); tail(arroz_sucre2)

## `summarise()` has grouped output by 'MUNICIPIO', 'COD_MUN.'. You can override using the `.groups` argument.

The following chunk a code seems hard to understand. I would suggest that you study the documentation of the functions group, gather, unite, and pivot to gain an appropriate understanding of the chunk:

arroz_sucre2 %>% 
  group_by(COD_MUN.) %>% 
  gather("TON_PROD", key = variable, value = number)   %>% 
  unite(combi, variable, YEAR) %>%
  pivot_wider(names_from = combi, values_from = number, values_fill = 0) -> arroz_sucre3 ; head(arroz_sucre3); tail(arroz_sucre3)

5.5 Join attributes to the spatial object

Now, we can join the non-spatial attributes to the spatial simple feature object:

mun_sucre$COD_MUN. <- as.integer(mun_sucre$MPIO_CCDGO)
mun_sucre_arroz = left_join(mun_sucre, arroz_sucre3, by="COD_MUN."); summary(mun_sucre_arroz); select(mun_sucre_arroz,MPIO_CCDGO,MPIO_CNMBR,KM2,COD_MUN.,MUNICIPIO,TON_PROD_2010,TON_PROD_2006,TON_PROD_2007,TON_PROD_2008,TON_PROD_2009,TON_PROD_2011,TON_PROD_2012,TON_PROD_2013,TON_PROD_2014,TON_PROD_2015,TON_PROD_2016,TON_PROD_2017,TON_PROD_2018)

##   DPTO_CCDGO         MPIO_CCDGO         MPIO_CNMBR         MPIO_CRSLC       
##  Length:26          Length:26          Length:26          Length:26         
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    MPIO_NAREA        MPIO_NANO     DPTO_CNMBR          Shape_Leng    
##  Min.   :  56.56   Min.   :2017   Length:26          Min.   :0.4269  
##  1st Qu.: 180.54   1st Qu.:2017   Class :character   1st Qu.:0.8558  
##  Median : 282.49   Median :2017   Mode  :character   Median :1.1494  
##  Mean   : 407.38   Mean   :2017                      Mean   :1.2339  
##  3rd Qu.: 417.35   3rd Qu.:2017                      3rd Qu.:1.4341  
##  Max.   :1494.16   Max.   :2017                      Max.   :2.9850  
##                                                                      
##    Shape_Area                geometry       KM2             COD_MUN.    
##  Min.   :0.004652   POLYGON      :26   Min.   :  56.55   Min.   :70001  
##  1st Qu.:0.014847   epsg:4326    : 0   1st Qu.: 180.54   1st Qu.:70231  
##  Median :0.023241   +proj=long...: 0   Median : 282.49   Median :70451  
##  Mean   :0.033493                      Mean   : 407.38   Mean   :70459  
##  3rd Qu.:0.034320                      3rd Qu.: 417.35   3rd Qu.:70707  
##  Max.   :0.122777                      Max.   :1494.16   Max.   :70823  
##                                                                         
##   MUNICIPIO         TON_PROD_2010     TON_PROD_2006     TON_PROD_2007    
##  Length:26          Min.   :   0.00   Min.   :    0.0   Min.   :    0.0  
##  Class :character   1st Qu.:  12.09   1st Qu.:   32.0   1st Qu.:   39.5  
##  Mode  :character   Median :  74.50   Median :  145.0   Median :  458.0  
##                     Mean   : 860.05   Mean   : 1527.3   Mean   : 1563.9  
##                     3rd Qu.: 845.25   3rd Qu.:  667.9   3rd Qu.: 1331.8  
##                     Max.   :5950.00   Max.   :16925.0   Max.   :12516.5  
##                     NA's   :8         NA's   :8         NA's   :8        
##  TON_PROD_2008     TON_PROD_2009     TON_PROD_2011   TON_PROD_2012     
##  Min.   :    0.0   Min.   :   0.00   Min.   :    0   Min.   :    0.00  
##  1st Qu.:   53.5   1st Qu.:  38.76   1st Qu.:   30   1st Qu.:   59.25  
##  Median :  242.5   Median : 103.00   Median :  195   Median :  173.50  
##  Mean   : 2594.7   Mean   : 783.01   Mean   : 1832   Mean   : 2246.82  
##  3rd Qu.: 1580.2   3rd Qu.: 789.00   3rd Qu.: 1902   3rd Qu.: 2349.97  
##  Max.   :21711.2   Max.   :4767.19   Max.   :10599   Max.   :16455.68  
##  NA's   :8         NA's   :8         NA's   :8       NA's   :8         
##  TON_PROD_2013     TON_PROD_2014     TON_PROD_2015     TON_PROD_2016  
##  Min.   :   0.00   Min.   :    0.0   Min.   :    0.0   Min.   :    0  
##  1st Qu.:  82.25   1st Qu.:  103.0   1st Qu.:  102.8   1st Qu.:   15  
##  Median : 454.00   Median :  455.5   Median :  482.5   Median :  422  
##  Mean   :1424.64   Mean   : 4122.9   Mean   : 6156.0   Mean   : 8237  
##  3rd Qu.:1872.31   3rd Qu.: 4106.2   3rd Qu.: 5203.4   3rd Qu.: 4314  
##  Max.   :6130.00   Max.   :26092.0   Max.   :36666.0   Max.   :62557  
##  NA's   :8         NA's   :8         NA's   :8         NA's   :8      
##  TON_PROD_2017      TON_PROD_2018  
##  Min.   :    0.00   Min.   :    0  
##  1st Qu.:   33.75   1st Qu.:    0  
##  Median :  650.40   Median :   31  
##  Mean   : 5469.56   Mean   : 3518  
##  3rd Qu.: 2753.75   3rd Qu.: 3248  
##  Max.   :43394.00   Max.   :21401  
##  NA's   :8          NA's   :8

select(mun_sucre_arroz,MPIO_CCDGO,MPIO_CNMBR,KM2,COD_MUN.,MUNICIPIO,TON_PROD_2010,TON_PROD_2006,TON_PROD_2007,TON_PROD_2008,TON_PROD_2009,TON_PROD_2011,TON_PROD_2012,TON_PROD_2013,TON_PROD_2014,TON_PROD_2015,TON_PROD_2016,TON_PROD_2017,TON_PROD_2018)

We can see that BUENAVISTA and CHALAN municipalities do not have values for the TON_PROD and REND attributes for the ARROZ crop.

Why this happened? It is because the ARROZ statistics for these municipalities are not available (NA).

arroz_buenavista <-  arroz_sucre2 %>% filter(COD_MUN.==70110); arroz_buenavista

6. Map rice production in Sucre’s municipalities

Let’s make a map of total production of rice in 2017:

bins <- c(5000, 10000, 15000, 25000, 50000, 75000)
pal <- colorBin("YlOrRd", domain = mun_sucre_arroz$TON_PROD_2017, bins = bins)


  mapa <- leaflet(data = mun_sucre_arroz) %>%
  addTiles() %>%
  addPolygons(label = ~TON_PROD_2017,
              popup = ~MPIO_CNMBR,
              fillColor = ~pal(TON_PROD_2017),
              color = "#444444",
              weight = 1,
              smoothFactor = 0.5,
              opacity = 1.0,
              fillOpacity = 0.5,
              highlightOptions = highlightOptions(color = "white", weight = 2, bringToFront = TRUE)
              ) %>%
  addProviderTiles(providers$OpenStreetMap) %>%
  addLegend("bottomright", pal = pal, values = ~TON_PROD_2017,
    title = "rice crop production in Sucre [Ton] (2017)",
    opacity = 1
  ); mapa

## Warning in pal(TON_PROD_2017): Some values were outside the color scale and will
## be treated as NA

7. Map rice harvested area in Sucre’s municipalities

Now, we are going to repeat the process to make a different map, in this case the harvested area of rice crop in every municipality.

Let’s remind the EVA variables names:

summary(eva_sucre)

##   ï..COD_DEP. DEPARTAMENTO          COD_MUN.      MUNICIPIO        
##  Min.   :70   Length:4503        Min.   :  704   Length:4503       
##  1st Qu.:70   Class :character   1st Qu.:70124   Class :character  
##  Median :70   Mode  :character   Median :70265   Mode  :character  
##  Mean   :70                      Mean   :58496                     
##  3rd Qu.:70                      3rd Qu.:70678                     
##  Max.   :70                      Max.   :70823                     
##                                                                    
##     GRUPO             SUBGRUPO           CULTIVO         
##  Length:4503        Length:4503        Length:4503       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  DESAGREGACION_REGIONAL_Y.O_SISTEMA_PRODUCTIVO      YEAR     
##  Length:4503                                   Min.   :2006  
##  Class :character                              1st Qu.:2009  
##  Mode  :character                              Median :2012  
##                                                Mean   :2012  
##                                                3rd Qu.:2015  
##                                                Max.   :2018  
##                                                              
##    PERIODO           HA_SEMBRADA    HA._COSECHADA   TON_PRODUCCION   
##  Length:4503        Min.   :  1.0   Min.   :  0.0   Min.   :    0.0  
##  Class :character   1st Qu.: 10.0   1st Qu.:  8.0   1st Qu.:   12.0  
##  Mode  :character   Median : 50.0   Median : 40.0   Median :   78.0  
##                     Mean   :132.1   Mean   :120.3   Mean   :  591.6  
##                     3rd Qu.:160.0   3rd Qu.:140.0   3rd Qu.:  349.0  
##                     Max.   :990.0   Max.   :995.0   Max.   :47018.0  
##                                                                      
##   RENDIMIENTO     ESTADO_FISICO_PRODUCCION NOMBRE._CIENTIFICO
##  Min.   : 0.100   Length:4503              Length:4503       
##  1st Qu.: 1.500   Class :character         Class :character  
##  Median : 3.200   Mode  :character         Mode  :character  
##  Mean   : 5.038                                              
##  3rd Qu.: 7.000                                              
##  Max.   :70.000                                              
##  NA's   :129                                                 
##     CICLO          
##  Length:4503       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

Now, apply the filter & select functions to get the area cosechada en arroz:

arroz_sucre <- eva_sucre %>%  filter(CULTIVO == "ARROZ")  %>%  dplyr::select(MUNICIPIO, COD_MUN., YEAR, PERIODO, HA._COSECHADA)

Then, let’s aggregate PERIODO values into YEAR:

arroz_sucre %>% group_by(MUNICIPIO, COD_MUN., YEAR) %>%
   summarize(HA._COSECHADA=sum(HA._COSECHADA)) -> arroz_sucre2

## `summarise()` has grouped output by 'MUNICIPIO', 'COD_MUN.'. You can override using the `.groups` argument.

Then, let’s reformat the statistics into wider format:

arroz_sucre2 %>% 
  group_by(COD_MUN.) %>% 
  gather("HA._COSECHADA", key = variable, value = number)   %>% 
  unite(combi, variable, YEAR) %>%
  pivot_wider(names_from = combi, values_from = number, values_fill = 0) -> arroz_sucre3
mun_sucre$COD_MUN. <- as.integer(mun_sucre$MPIO_CCDGO)

Now, it’s join time:

mun_sucre_arroz = left_join(mun_sucre, arroz_sucre3, by="COD_MUN."); mun_sucre_arroz

Finally, mapping time:

bins <- c(0, 100, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000)
pal <- colorBin("RdYlGn", domain = mun_sucre_arroz$HA._COSECHADA_2017, bins = bins)

  mapa <- leaflet(data = mun_sucre_arroz) %>%
  addTiles() %>%
  addPolygons(label = ~HA._COSECHADA_2017,
              popup = ~MPIO_CNMBR,
              fillColor = ~pal(HA._COSECHADA_2017),
              color = "#444444",
              weight = 1,
              smoothFactor = 0.5,
              opacity = 1.0,
              fillOpacity = 0.5,
              highlightOptions = highlightOptions(color = "white", weight = 2, bringToFront = TRUE)
              ) %>%
  addProviderTiles(providers$OpenStreetMap) %>%
  addLegend("bottomright", pal = pal, values = ~HA._COSECHADA_2017,
    title = "rice harvested area in Sucre [Ha] (2017)",
    opacity = 1
  ); mapa

9. Reproducibility

sessionInfo()

## R version 4.0.4 (2021-02-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Spanish_Colombia.1252  LC_CTYPE=Spanish_Colombia.1252   
## [3] LC_MONETARY=Spanish_Colombia.1252 LC_NUMERIC=C                     
## [5] LC_TIME=Spanish_Colombia.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] sf_0.9-7           leaflet_2.0.4.1    forcats_0.5.1      stringr_1.4.0     
##  [5] dplyr_1.0.5        purrr_0.3.4        readr_1.4.0        tidyr_1.1.3       
##  [9] tibble_3.1.0       ggplot2_3.3.3      tidyverse_1.3.0    RColorBrewer_1.1-2
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.6              lubridate_1.7.10        leaflet.providers_1.9.0
##  [4] class_7.3-18            assertthat_0.2.1        digest_0.6.27          
##  [7] utf8_1.1.4              R6_2.5.0                cellranger_1.1.0       
## [10] backports_1.2.1         reprex_1.0.0            evaluate_0.14          
## [13] e1071_1.7-4             httr_1.4.2              pillar_1.5.1           
## [16] rlang_0.4.10            readxl_1.3.1            rstudioapi_0.13        
## [19] rmarkdown_2.7           htmlwidgets_1.5.3       munsell_0.5.0          
## [22] broom_0.7.5             compiler_4.0.4          modelr_0.1.8           
## [25] xfun_0.21               pkgconfig_2.0.3         htmltools_0.5.1.1      
## [28] tidyselect_1.1.0        fansi_0.4.2             crayon_1.4.1           
## [31] dbplyr_2.1.0            withr_2.4.1             grid_4.0.4             
## [34] jsonlite_1.7.2          gtable_0.3.0            lifecycle_1.0.0        
## [37] DBI_1.1.1               magrittr_2.0.1          units_0.7-0            
## [40] scales_1.1.1            KernSmooth_2.23-18      cli_2.3.1              
## [43] stringi_1.5.3           farver_2.1.0            fs_1.5.0               
## [46] xml2_1.3.2              ellipsis_0.3.1          generics_0.1.0         
## [49] vctrs_0.3.6             tools_4.0.4             glue_1.4.2             
## [52] hms_1.0.0               crosstalk_1.1.1         yaml_2.2.1             
## [55] colorspace_2.0-0        classInt_0.4-3          rvest_1.0.0            
## [58] knitr_1.31              haven_2.3.1