This is the third R Markdown Notebook for the Geomatica Basica 2022 course. It illustrates how to obtain multi-year stats for a given group of crops in any department. We will use as main data source the Evaluaciones Agropecuarias Municipales (EVA), a 2007-2018 agricultural data set provided by the Ministerio de Agricultura y Desarrollo Rural.
In this step, we will install and load the required R libraries.
#run the following lines from the command line:
#install.packages('dplyr')
#install.packages('readxl')
#install.packages('sf)
library(tidyverse)
library(dplyr)
library(readr)
library(ggplot2)
Go to https://www.datos.gov.co/Agricultura-y-Desarrollo-Rural/Evaluaciones-Agropecuarias-Municipales-EVA/2pnw-mmge/data to find the dataset.
In the linked website (i.e datos.gov.co), visualize the data and apply a filter to get only the data corresponding to your department. Then, export the data in csv format.
Now, go to the downloads directory to find out the Evaluaciones_Agropecuarias_Municipales_EVA.csv file. Move it to your working directory.
list.files("./datos", pattern=c('csv'))
## [1] "co.csv"
## [2] "Evaluaciones_Agropecuarias_Municipales_EVA.csv"
## [3] "stder_frutales_2020.csv"
## [4] "stder_oleag_2020.csv"
Now, let’s proceed to read the 2007-2018 EVA dataset:
# adjust the filepath acording to your data
(eva = read_csv("./datos/Evaluaciones_Agropecuarias_Municipales_EVA.csv", col_names = TRUE,
show_col_types = FALSE))
View the table and confirm that it contains agricultural stats for different years. Can you say what is the time period for these data?
It is very important to know what are the names the software “sees”. It may be different from what we see. Let’s find out it:
names(eva)
## [1] "CÓD. \nDEP."
## [2] "DEPARTAMENTO"
## [3] "CÓD. MUN."
## [4] "MUNICIPIO"
## [5] "GRUPO \nDE CULTIVO"
## [6] "SUBGRUPO \nDE CULTIVO"
## [7] "CULTIVO"
## [8] "DESAGREGACIÓN REGIONAL Y/O SISTEMA PRODUCTIVO"
## [9] "AÑO"
## [10] "PERIODO"
## [11] "Área Sembrada\n(ha)"
## [12] "Área Cosechada\n(ha)"
## [13] "Producción\n(t)"
## [14] "Rendimiento\n(t/ha)"
## [15] "ESTADO FISICO PRODUCCION"
## [16] "NOMBRE \nCIENTIFICO"
## [17] "CICLO DE CULTIVO"
Let’s select a few attributes to clean the eva object:
# check the objet output in the last chunk and
# change attribute names according to your own data
eva %>% dplyr::select('CÓD. MUN.':'ESTADO FISICO PRODUCCION') -> eva.tmp
eva.tmp
Check the output and verify that only the relevant columms were selected.
Now, change names for several columns which contains empty or “noisy” characters:
# make sure to use the column names that are in your eva.tmp object
eva.tmp %>% dplyr::rename('Cod_Mun' = 'CÓD. MUN.',
'Grupo' = 'GRUPO \nDE CULTIVO',
'Subgrupo' = 'SUBGRUPO \nDE CULTIVO',
'Year' = 'AÑO',
'AreaSembrada' = 'Área Sembrada\n(ha)',
'AreaCosechada' = 'Área Sembrada\n(ha)',
'Produccion' = 'Producción\n(t)', 'Rendimiento' = 'Rendimiento\n(t/ha)',
'Sistema' = 'DESAGREGACIÓN REGIONAL Y/O SISTEMA PRODUCTIVO',
'Estado' = 'ESTADO FISICO PRODUCCION') -> new_eva
Let’s check the output:
new_eva
Check the table above and verify that quantitative attributes are stored as numeric data types (not as string data types).
Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results.
The dplyr library makes this very easy through the use of the group_by() function, which splits the data into groups. When the data is grouped in this way, summarize() can be used to collapse each group into a single-row summary. summarize() does this by applying an aggregating or summary function to each group.
For example, if we want to find out the total production per group of crops, we type:
new_eva %>%
##filter(Produccion > 0) %>%
group_by(Grupo) %>%
summarize(total_produccion = sum(Produccion)) %>%
arrange(desc(total_produccion))
Note that the total production contains the sum of production for every group of crops between 2007 and 2018. In Santander, the two crops with higher production were “Frutales” and “Tubérculos y Plátanos”.
To save the total produccion in an object:
new_eva %>%
group_by(Grupo) %>%
summarize(total_produccion = sum(Produccion)) -> PT
To filter the most important crops:
PT %>%
filter(total_produccion > 1000000) -> main.groups
To know the total production for the main groups of crops:
(value = sum(main.groups$total_produccion))
## [1] 18075086
To add a new attribute with percent of total production:
main.groups$percent = main.groups$total_produccion/value
To create a pie chart of total production of main.groups:
library(ggplot2)
# Barplot
bp<- ggplot(main.groups, aes(x="", y=percent, fill=Grupo))+
geom_bar(width = 1, stat = "identity")
# Piechart
pie <- bp + coord_polar("y", start=0)
pie
#### 5.2 Municipalities with higher production for every group of crops:
To find out the municipalities leading the production for every crop from 2007 to 2018:
new_eva %>%
group_by(Grupo, MUNICIPIO) %>%
summarize(total_prod = sum(Produccion, na.rm = TRUE)) %>%
slice(which.max(total_prod)) %>%
arrange(desc(total_prod))
To save the object:
new_eva %>%
group_by(Grupo, MUNICIPIO) %>%
summarize(total_prod = sum(Produccion, na.rm = TRUE)) %>%
slice(which.max(total_prod)) -> leaders
## `summarise()` has grouped output by 'Grupo'. You can override using the `.groups` argument.
leaders
To filter the most important municipalities from the agricultural point of view:
leaders %>%
filter(total_prod > 50000) -> main.leaders
Let’s plot the filtered leaders:
# Basic barplot
p<-ggplot(data=main.leaders, aes(x=MUNICIPIO, y=total_prod)) +
geom_bar(stat="identity")
p
new_eva %>%
filter(MUNICIPIO=="LEBRIJA" & CULTIVO=="PIÑA") %>%
group_by(Year, CULTIVO) %>%
select(MUNICIPIO, CULTIVO, Produccion, Year) -> lebrija_pineapple
Let’s check the output:
lebrija_pineapple
Let’s do a quick plot of pineapple production in Lebrija for the whole time period covered by the EVA dataset:
g <- ggplot(aes(x=Year, y=Produccion/1000), data = lebrija_pineapple) + geom_bar(stat='identity') + labs(y='Produccion de Piña [Ton x 1000]')
Add a title and visualize:
g + ggtitle("Evolution of Pineapple Crop Production in Lebrija from 2007 to 2018") + labs(caption= "Based on EVA data (Minagricultura, 2020)")
I have illustrated a few examples of analysis. You could analyze other variables than produccion (e.g. rendimiento, areacosechada, areasembrada).
More examples of analysis are available in this notebook.
In your notebook, cite this work as Lizarazo, I., 2022. Understanding dynamic productivity of crops. Available at https://rpubs.com/ials2un/production_dyn_v1.
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4
## [5] readr_2.1.2 tidyr_1.2.0 tibble_3.1.3 ggplot2_3.3.5
## [9] tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.7 lubridate_1.8.0 assertthat_0.2.1 digest_0.6.27
## [5] utf8_1.2.2 R6_2.5.0 cellranger_1.1.0 backports_1.4.1
## [9] reprex_2.0.1 evaluate_0.14 httr_1.4.2 highr_0.9
## [13] pillar_1.6.2 rlang_0.4.11 readxl_1.3.1 rstudioapi_0.13
## [17] jquerylib_0.1.4 rmarkdown_2.9 labeling_0.4.2 bit_4.0.4
## [21] munsell_0.5.0 broom_0.7.12 compiler_4.1.0 modelr_0.1.8
## [25] xfun_0.24 pkgconfig_2.0.3 htmltools_0.5.1.1 tidyselect_1.1.1
## [29] fansi_0.5.0 crayon_1.4.1 tzdb_0.2.0 dbplyr_2.1.1
## [33] withr_2.4.2 grid_4.1.0 jsonlite_1.7.2 gtable_0.3.0
## [37] lifecycle_1.0.0 DBI_1.1.1 magrittr_2.0.1 scales_1.1.1
## [41] cli_3.0.1 stringi_1.7.3 vroom_1.5.7 farver_2.1.0
## [45] fs_1.5.0 xml2_1.3.2 bslib_0.2.5.1 ellipsis_0.3.2
## [49] generics_0.1.0 vctrs_0.3.8 tools_4.1.0 bit64_4.0.5
## [53] glue_1.4.2 hms_1.1.1 parallel_4.1.0 yaml_2.2.1
## [57] colorspace_2.0-2 rvest_1.0.1 knitr_1.33 haven_2.4.3
## [61] sass_0.4.0