My first R Notebook: Select crops with largest production

1.Introduction

This is the first R Markdown Notebook for the Geomatica Basica 2022 course. It aims to know what are the agricultural crops with largest production in a given department. We will use as main source an excel file containing the Evaluaciones Agropecuarias Municipales (EVA) produced by UPRA for 2020.

For writing the notebook follow these directions:

Each chunk is executed by clicking the Run button within the chunk or by placing your cursor inside it and pressing Ctrl+Shift+Enter.
A new chunk is added by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I.
The notebook should be saved with the .Rmd extension in the same directory where the UPRA EVA file was stored.

You may use Preview to get a rendered HTML copy of the contents of the editor. However, if you want to run all R code chunks, you need to use Knit to HTML which is available under the Preview option.

2. Setup

In this step, we will install and load the required R libraries.

#uncomment for first-time execution
#install.packages('tidyverse') 
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.3     ✓ dplyr   1.0.7
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1

## Warning: package 'tidyr' was built under R version 4.1.2

## Warning: package 'readr' was built under R version 4.1.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(readxl)

3. Read an excel file with estadisticas agropecuarias municipales

To know names of the excel files existing in our workspace type:

(archivos = list.files(pattern='xls'))

## [1] "~$GB_IL_lista.xlsx"        "12-Anexos-municipales.xls"
## [3] "BaseSIPRA2020.xlsx"        "CNPV-2018-NBI.xlsx"       
## [5] "divipola_editado.xls"      "divipola.xls"             
## [7] "GB_IL_lista.xls"           "GB_IL_lista.xlsx"         
## [9] "NBI.xlsx"

To know names of the excel sheets existing in the EVA file:

(hojas = readxl::excel_sheets("BaseSIPRA2020.xlsx"))

## [1] "Agrícola_SIPRA_AGRONET"       "InventarioBovino"            
## [3] "InventarioPorcino"            "InvBufalosCaprinoOvinoEquino"
## [5] "InvAves"

To read the excel sheet we are interested:

eva2020 = readxl::read_excel("BaseSIPRA2020.xlsx", sheet = "Agrícola_SIPRA_AGRONET")

To know what object is eva2020:

eva2020

4. Subset data relevant to our department

As we are interested in just a department we need to filter eva2020:

stder2020 = dplyr::filter(eva2020, Departamento == "Santander")

To display what the filtered object:

stder2020

5. Clean the filtered data

Note that there a few colummns that are not relevant (e.g. Codigo del departamento and Departamento). Let’s keep the relevant colummns in another object:

stder2020.tmp <-  stder2020 %>% select('Código del Municipio':'Ciclo del cultivo')

To display the result:

stder2020.tmp

Note that there are several columns with strange names (i.e. blank spaces or accent marks). We need to rename those columns:

stder2020.tmp %>% rename(Cod_Mun = 'Código del Municipio', 
                         Grupo = 'Grupo cultivo según especie', 
                         Subgrupo = 'Subgrupo cultivo según especie',
                         AreaSiembra = 'Area Sembrada (ha)',
                         AreaCosecha = 'Area Cosechada (ha)',
                         Produccion = 'Producción (t)',                              
                         Rendimiento = 'Rendimiento (t/ha)', Ciclo='Ciclo del cultivo') ->                                  nstder2020

To display the result:

nstder2020

Note also that several columns which represent numeric values (e.g. Produccion or Rendimiento) are defined as character data types. We need to change the type of those columns:

nstder2020 %>% mutate(AreaSiembra = as.numeric(AreaSiembra),
                       AreaCosecha = as.numeric(AreaCosecha),
                       Produccion = as.numeric(Produccion),
                       Rendimiento = as.numeric(Rendimiento)) -> nstder2020

To display the result:

nstder2020

6. Data processing & analysis

Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results.

dplyr makes this very easy through the use of the group_by() function, which splits the data into groups. When the data is grouped in this way summarize() can be used to collapse each group into a single-row summary. summarize() does this by applying an aggregating or summary function to each group.

For example, if we wanted to group by crop and find the total production (Ton/ha) in our department’s municipalities, we type:

nstder2020 %>%
  filter(Produccion > 0) %>%
  group_by(Cultivo) %>%
  summarize(total_produccion = sum(Produccion)) %>% 
  arrange(desc(total_produccion))

We found the most important crops in our department.

Now we want to find the municipalities with higher production for every crop:

nstder2020 %>%
  group_by(Cultivo, Municipio) %>%
  summarize(max_prod = max(Produccion, na.rm = TRUE)) %>%
  slice(which.max(max_prod))  %>%
  arrange(desc(max_prod))

## `summarise()` has grouped output by 'Cultivo'. You can override using the `.groups` argument.

nstder2020 %>%
  group_by(Grupo,Municipio) %>%
  summarize(max_prod = max(Produccion, na.rm = TRUE)) %>%
  slice(which.max(max_prod))  %>%
  arrange(desc(max_prod))

## `summarise()` has grouped output by 'Grupo'. You can override using the `.groups` argument.

Now, let’s select a group of crops and export the data for future reuse:

nstder2020 %>%
  group_by(Cod_Mun, Municipio, Grupo) %>%
  filter(Grupo=='Frutales') %>% 
  summarize(max_prod = max(Produccion, na.rm = TRUE)) %>%
  arrange(desc(max_prod)) -> frutales2020

## `summarise()` has grouped output by 'Cod_Mun', 'Municipio'. You can override using the `.groups` argument.

frutales2020

nstder2020 %>%
  group_by(Cod_Mun, Municipio, Grupo) %>%
  filter(Grupo=='Leguminosas Y Oleaginosas') %>% 
  summarize(max_prod = max(Produccion, na.rm = TRUE)) %>%
  arrange(desc(max_prod)) -> oleaginosas2020

## `summarise()` has grouped output by 'Cod_Mun', 'Municipio'. You can override using the `.groups` argument.

7. Write selected data to file

#write_csv(frutales2020, "./stder_frutales_2020.csv")
write_csv(oleaginosas2020, "./stder_oleag_2020.csv")

8. Reproducibility

Cite this work as Lizarazo, I. Reading and processing municipal agricultural statistics for 2020. Available at: https://rpubs.com/ials2un/readingEVAv1.

sessionInfo()

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] readxl_1.3.1    forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7    
##  [5] purrr_0.3.4     readr_2.1.2     tidyr_1.2.0     tibble_3.1.3   
##  [9] ggplot2_3.3.5   tidyverse_1.3.1
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.1  xfun_0.24         bslib_0.2.5.1     haven_2.4.3      
##  [5] colorspace_2.0-2  vctrs_0.3.8       generics_0.1.0    htmltools_0.5.1.1
##  [9] yaml_2.2.1        utf8_1.2.2        rlang_0.4.11      jquerylib_0.1.4  
## [13] pillar_1.6.2      glue_1.4.2        withr_2.4.2       DBI_1.1.1        
## [17] bit64_4.0.5       dbplyr_2.1.1      modelr_0.1.8      lifecycle_1.0.0  
## [21] munsell_0.5.0     gtable_0.3.0      cellranger_1.1.0  rvest_1.0.1      
## [25] evaluate_0.14     knitr_1.33        tzdb_0.2.0        parallel_4.1.0   
## [29] fansi_0.5.0       broom_0.7.12      Rcpp_1.0.7        backports_1.4.1  
## [33] scales_1.1.1      vroom_1.5.7       jsonlite_1.7.2    bit_4.0.4        
## [37] fs_1.5.0          hms_1.1.1         digest_0.6.27     stringi_1.7.3    
## [41] grid_4.1.0        cli_3.0.1         tools_4.1.0       magrittr_2.0.1   
## [45] sass_0.4.0        crayon_1.4.1      pkgconfig_2.0.3   ellipsis_0.3.2   
## [49] xml2_1.3.2        reprex_2.0.1      lubridate_1.8.0   rstudioapi_0.13  
## [53] assertthat_0.2.1  rmarkdown_2.9     httr_1.4.2        R6_2.5.0         
## [57] compiler_4.1.0

My first R Notebook: Select crops with largest production

Ivan Lizarazo

23.03.2022

1.Introduction

2. Setup

3. Read an excel file with estadisticas agropecuarias municipales

4. Subset data relevant to our department

5. Clean the filtered data

6. Data processing & analysis

7. Write selected data to file

8. Reproducibility