How to read and subset .xlsx agricultural statistics data

Introduction

This is a very simple notebook which illustrates how to subset EVA 2019-2022 agricultural statistics data provided by UPRA at https://upra.gov.co/es-co/Evas_Documentos/BaseEVA_Agr%C3%ADcola20192022.xlsx.

It aims to help Geomatica Basica students at UNAL to gain digital skills in R.

Prerequisites

It is assumed that you have already downloaded the xlsx file above mentioned and saved it into the same directory where this notebook lives.

When opened in Excel, the first rows of the BaseEVA_Agricola20192022.xlsx file look like this:

Then, you need to prune the original file in order to remove banners as well as to rename the columns to be able to read it from R. There should not be blank spaces nor strangeous characters on any column name.

Setup

Clean the temporal memory:

rm(list=ls())

Load the R libraries:

library(readxl)
library(readr)
library(dplyr)

Read the EVA 2019-2022 dataset

Let’s check names of files which live in the same directory than this notebook:

list.files()

##  [1] "~$BaseEVA_Agrícola20192022.xlsx" "BaseEVA_Agrícola20192022.xlsx"  
##  [3] "EVA_19_22.xlsx"                  "figs"                           
##  [5] "Produccion2018.nb.html"          "Produccion2022.html"            
##  [7] "Produccion2022.nb.html"          "Produccion2022.Rmd"             
##  [9] "rsconnect"                       "stder_frutales_2022.csv"

In this case, the fixed file is EVA_19_22.xlsx

As this is an excel file, it is worth to know how many sheets are there:

hojas <- excel_sheets("EVA_19_22.xlsx")

Let’s see:

hojas

## [1] "Datos"

There is a single sheet there. Let’s read it:

EVA19_22 <- read_excel("EVA_19_22.xlsx")

EVA19_22

Note that the COD_MUN column is a character. This is the common data type for identifiers.

Subset data for your department

As we are interested in just a department we need to filter EVA19_22:

(stder19_22 = dplyr::filter(EVA19_22, DPTO == "Santander"))

Now, we will select only data for 2022:

(stder_22 = dplyr::filter(stder19_22, YEAR == 2022))

Subset the group of crops you are analyzing

Now, let’s select only the group of crops we are interested in, and summarize the production per municipality:

stder_22 %>%
  group_by(COD_MUN, MUNICIPIO, GRUPO) %>%
  filter(GRUPO=='Frutales') %>% 
  summarize(max_prod = max(PRODUCCION, na.rm = TRUE)) %>%
  arrange(desc(max_prod)) -> frutales2022

## `summarise()` has grouped output by 'COD_MUN', 'MUNICIPIO'. You can override
## using the `.groups` argument.

frutales2022

Write the data

We will use the readr library to save our data in csv format:

write_csv(frutales2022, "stder_frutales_2022.csv")

Let’s check that the file has been created:

list.files()

##  [1] "~$BaseEVA_Agrícola20192022.xlsx" "BaseEVA_Agrícola20192022.xlsx"  
##  [3] "EVA_19_22.xlsx"                  "figs"                           
##  [5] "Produccion2018.nb.html"          "Produccion2022.html"            
##  [7] "Produccion2022.nb.html"          "Produccion2022.Rmd"             
##  [9] "rsconnect"                       "stder_frutales_2022.csv"

Let’s read the file to check that the data are correct:

los_datos22 <- read_csv("stder_frutales_2022.csv")

## Rows: 86 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): MUNICIPIO, GRUPO
## dbl (2): COD_MUN, max_prod
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

los_datos22

Note that the COD_MUN column is now a double (i.e. numeric). This is not the common data type for identifiers.

In case we need to change such data type, we can use mutate:

los_datos22 %>% mutate(COD_MUN = as.character(COD_MUN)) -> nuevos_datos22

head(nuevos_datos22)

Well done! Now, you can proceed to create the notebook where we will join the 2022 production data to the polygons representing municipalities in our department.

Reference

If you use code from this notebook in your project please cite it as follows:

Lizarazo, I. 2023. How to read and subset .xlsx agricultural statistics data. Available at https://rpubs.com/otra_vez_eva

Reproducibility

sessionInfo()

## R version 4.3.0 (2023-04-21)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.4.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Bogota
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_1.1.2  readr_2.1.4  readxl_1.4.2
## 
## loaded via a namespace (and not attached):
##  [1] crayon_1.5.2      vctrs_0.6.3       cli_3.6.1         knitr_1.43       
##  [5] rlang_1.1.1       xfun_0.39         generics_0.1.3    jsonlite_1.8.7   
##  [9] bit_4.0.5         glue_1.6.2        htmltools_0.5.5   sass_0.4.6       
## [13] hms_1.1.3         fansi_1.0.4       rmarkdown_2.22    cellranger_1.1.0 
## [17] evaluate_0.21     jquerylib_0.1.4   tibble_3.2.1      tzdb_0.4.0       
## [21] fastmap_1.1.1     yaml_2.3.7        lifecycle_1.0.3   compiler_4.3.0   
## [25] pkgconfig_2.0.3   rstudioapi_0.15.0 digest_0.6.31     R6_2.5.1         
## [29] tidyselect_1.2.0  utf8_1.2.3        parallel_4.3.0    vroom_1.6.3      
## [33] pillar_1.9.0      magrittr_2.0.3    bslib_0.4.2       bit64_4.0.5      
## [37] tools_4.3.0       cachem_1.0.8