My third R Notebook: Crop production dynamics

1.Introduction

This is the third R Markdown Notebook for the Geomatica Basica 2022 course. It illustrates how to obtain multi-year stats for a given group of crops in any department. We will use as main data source the Evaluaciones Agropecuarias Municipales (EVA), a 2007-2018 agricultural data set provided by the Ministerio de Agricultura y Desarrollo Rural.

2. Setup

In this step, we will install and load the required R libraries.

#run the following lines from the command line:
#install.packages('dplyr')
#install.packages('readxl')
#install.packages('sf)

library(tidyverse)
library(dplyr)
library(readr)
library(ggplot2)

3. Download the multi-year EVA dataset for your department

Go to https://www.datos.gov.co/Agricultura-y-Desarrollo-Rural/Evaluaciones-Agropecuarias-Municipales-EVA/2pnw-mmge/data to find the dataset.

In the linked website (i.e datos.gov.co), visualize the data and apply a filter to get only the data corresponding to your department. Then, export the data in csv format.

Now, go to the downloads directory to find out the Evaluaciones_Agropecuarias_Municipales_EVA.csv file. Move it to your working directory.

3. Read the EVA dataset

list.files("./datos", pattern=c('csv'))

## [1] "co.csv"                                        
## [2] "Evaluaciones_Agropecuarias_Municipales_EVA.csv"
## [3] "stder_frutales_2020.csv"                       
## [4] "stder_oleag_2020.csv"

Now, let’s proceed to read the 2007-2018 EVA dataset:

# adjust the filepath acording to your data
(eva = read_csv("./datos/Evaluaciones_Agropecuarias_Municipales_EVA.csv", col_names = TRUE,
                show_col_types = FALSE))

View the table and confirm that it contains agricultural stats for different years. Can you say what is the time period for these data?

It is very important to know what are the names the software “sees”. It may be different from what we see. Let’s find out it:

names(eva)

##  [1] "CÓD. \nDEP."                                  
##  [2] "DEPARTAMENTO"                                 
##  [3] "CÓD. MUN."                                    
##  [4] "MUNICIPIO"                                    
##  [5] "GRUPO \nDE CULTIVO"                           
##  [6] "SUBGRUPO \nDE CULTIVO"                        
##  [7] "CULTIVO"                                      
##  [8] "DESAGREGACIÓN REGIONAL Y/O SISTEMA PRODUCTIVO"
##  [9] "AÑO"                                          
## [10] "PERIODO"                                      
## [11] "Área Sembrada\n(ha)"                          
## [12] "Área Cosechada\n(ha)"                         
## [13] "Producción\n(t)"                              
## [14] "Rendimiento\n(t/ha)"                          
## [15] "ESTADO FISICO PRODUCCION"                     
## [16] "NOMBRE \nCIENTIFICO"                          
## [17] "CICLO DE CULTIVO"

4. Clean the EVA dataset

Let’s select a few attributes to clean the eva object:

# check the objet output in the last chunk and
# change attribute names according to your own data
eva %>% dplyr::select('CÓD. MUN.':'ESTADO FISICO PRODUCCION') -> eva.tmp

eva.tmp

Check the output and verify that only the relevant columms were selected.

Now, change names for several columns which contains empty or “noisy” characters:

# make sure to use the column names that are in your eva.tmp object
eva.tmp %>%  dplyr::rename('Cod_Mun' = 'CÓD. MUN.', 
                         'Grupo' = 'GRUPO \nDE CULTIVO',
                         'Subgrupo' = 'SUBGRUPO \nDE CULTIVO', 
                         'Year' = 'AÑO',
                         'AreaSembrada' = 'Área Sembrada\n(ha)',
                         'AreaCosechada' = 'Área Sembrada\n(ha)',
                         'Produccion' = 'Producción\n(t)',                                                                 'Rendimiento' =  'Rendimiento\n(t/ha)',   
                         'Sistema' = 'DESAGREGACIÓN REGIONAL Y/O SISTEMA PRODUCTIVO',
                         'Estado' = 'ESTADO FISICO PRODUCCION') -> new_eva

Let’s check the output:

new_eva

Check the table above and verify that quantitative attributes are stored as numeric data types (not as string data types).

5. Data analysis

Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results.

The dplyr library makes this very easy through the use of the group_by() function, which splits the data into groups. When the data is grouped in this way, summarize() can be used to collapse each group into a single-row summary. summarize() does this by applying an aggregating or summary function to each group.

5.1 The most importance crops between 2007 and 2018

For example, if we want to find out the total production per group of crops, we type:

new_eva %>%
  ##filter(Produccion > 0) %>%
  group_by(Grupo) %>%
  summarize(total_produccion = sum(Produccion)) %>% 
  arrange(desc(total_produccion))

Note that the total production contains the sum of production for every group of crops between 2007 and 2018. In Santander, the two crops with higher production were “Frutales” and “Tubérculos y Plátanos”.

To save the total produccion in an object:

new_eva %>%
  group_by(Grupo) %>%
  summarize(total_produccion = sum(Produccion)) -> PT

To filter the most important crops:

PT %>% 
  filter(total_produccion > 1000000) -> main.groups

To know the total production for the main groups of crops:

(value = sum(main.groups$total_produccion))

## [1] 18075086

To add a new attribute with percent of total production:

main.groups$percent = main.groups$total_produccion/value

To create a pie chart of total production of main.groups:

library(ggplot2)
# Barplot
bp<- ggplot(main.groups, aes(x="", y=percent, fill=Grupo))+
geom_bar(width = 1, stat = "identity")
# Piechart
pie <- bp + coord_polar("y", start=0)
pie

#### 5.2 Municipalities with higher production for every group of crops:

To find out the municipalities leading the production for every crop from 2007 to 2018:

new_eva %>%
  group_by(Grupo, MUNICIPIO) %>%
  summarize(total_prod = sum(Produccion, na.rm = TRUE)) %>%
  slice(which.max(total_prod))  %>%
  arrange(desc(total_prod))

To save the object:

new_eva %>%
  group_by(Grupo, MUNICIPIO) %>%
  summarize(total_prod = sum(Produccion, na.rm = TRUE)) %>%
  slice(which.max(total_prod))  -> leaders

## `summarise()` has grouped output by 'Grupo'. You can override using the `.groups` argument.

leaders

To filter the most important municipalities from the agricultural point of view:

leaders %>% 
  filter(total_prod > 50000) -> main.leaders

Let’s plot the filtered leaders:

# Basic barplot
p<-ggplot(data=main.leaders, aes(x=MUNICIPIO, y=total_prod)) +
  geom_bar(stat="identity")
p

5.3 Dynamics of one important crop between 2007 and 2018

new_eva %>% 
  filter(MUNICIPIO=="LEBRIJA" & CULTIVO=="PIÑA") %>% 
  group_by(Year, CULTIVO) %>%
  select(MUNICIPIO, CULTIVO, Produccion, Year) ->  lebrija_pineapple

Let’s check the output:

lebrija_pineapple

Let’s do a quick plot of pineapple production in Lebrija for the whole time period covered by the EVA dataset:

g <- ggplot(aes(x=Year, y=Produccion/1000), data = lebrija_pineapple) + geom_bar(stat='identity') + labs(y='Produccion de Piña [Ton x 1000]')

Add a title and visualize:

g + ggtitle("Evolution of Pineapple Crop Production in Lebrija from 2007 to 2018") + labs(caption= "Based on EVA data (Minagricultura, 2020)")

6. Further analysis and graphics

I have illustrated a few examples of analysis. You could analyze other variables than produccion (e.g. rendimiento, areacosechada, areasembrada).

More examples of analysis are available in this notebook.

7. Bibliography

In your notebook, cite this work as Lizarazo, I., 2022. Understanding dynamic productivity of crops. Available at https://rpubs.com/ials2un/production_dyn_v1.

sessionInfo()

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7     purrr_0.3.4    
## [5] readr_2.1.2     tidyr_1.2.0     tibble_3.1.3    ggplot2_3.3.5  
## [9] tidyverse_1.3.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.7        lubridate_1.8.0   assertthat_0.2.1  digest_0.6.27    
##  [5] utf8_1.2.2        R6_2.5.0          cellranger_1.1.0  backports_1.4.1  
##  [9] reprex_2.0.1      evaluate_0.14     httr_1.4.2        highr_0.9        
## [13] pillar_1.6.2      rlang_0.4.11      readxl_1.3.1      rstudioapi_0.13  
## [17] jquerylib_0.1.4   rmarkdown_2.9     labeling_0.4.2    bit_4.0.4        
## [21] munsell_0.5.0     broom_0.7.12      compiler_4.1.0    modelr_0.1.8     
## [25] xfun_0.24         pkgconfig_2.0.3   htmltools_0.5.1.1 tidyselect_1.1.1 
## [29] fansi_0.5.0       crayon_1.4.1      tzdb_0.2.0        dbplyr_2.1.1     
## [33] withr_2.4.2       grid_4.1.0        jsonlite_1.7.2    gtable_0.3.0     
## [37] lifecycle_1.0.0   DBI_1.1.1         magrittr_2.0.1    scales_1.1.1     
## [41] cli_3.0.1         stringi_1.7.3     vroom_1.5.7       farver_2.1.0     
## [45] fs_1.5.0          xml2_1.3.2        bslib_0.2.5.1     ellipsis_0.3.2   
## [49] generics_0.1.0    vctrs_0.3.8       tools_4.1.0       bit64_4.0.5      
## [53] glue_1.4.2        hms_1.1.1         parallel_4.1.0    yaml_2.2.1       
## [57] colorspace_2.0-2  rvest_1.0.1       knitr_1.33        haven_2.4.3      
## [61] sass_0.4.0