## First, install the packages "FAOSTAT"
library(FAOSTAT) # For data imports from FAOSTAT
library(tidyr) # For data cleaning
library(dplyr) # For data manipulation
library(rstatix) # For data analysis
library(ggplot2) # For beautiful graphic creation
library(patchwork) # For graph combination
Import Datasets from FAOSTAT Using R
Hin Lyhour (Ph.D.)
Senoir lecturer and researcher
Faculty of Agricultural Biosystems Engineering
Royal University of Agriculture, Cambodia
Date: June 07, 2025
This lesson is designed to assist researchers, or students, who want to extract datasets directly from FAOSTAT, available from <https://www.fao.org/faostat/en/#country>. Firstly, we need to undestand basic R. Then, when we want to import any datasets from a specific country, we should install FAOSTAT
packages using the code instal.packages("FAOSTAT")
, which is done only one time. Then, when we want to recall the packages for use, we just type library(FAOSTAT)
, which is needed to be recalled whenever the R project
is reopened for operation.
After the FAOSTAT
library is recalled for use, we can use the code search_dataset()
to explore all the data available there. For easy data manipulation, we should give a name to the search, as shown in the code below.
1.1 Entire Data Search in FAOSTAT
# Search datasets available on FAOSTAT
<- search_dataset() datasets
# Check the column names available in the datasets
names(datasets)
[1] "code" "label" "date_update" "note_update"
[5] "release_current" "state_current" "year_current" "release_next"
[9] "state_next" "year_next"
# Check the first three rows of the datasets
head(datasets, 3)
code label
1 PA Producer Prices (old series)
2 HS Indicators from Household Surveys (gender, area, socioeconomics)
3 FA Food Aid Shipments (WFP)
date_update note_update release_current state_current year_current
1 1991-12-31 1991-12-31 final 1990
2 2014-07-31 2014-07-31 final 2014
3 2016-12-22 2016-12-22 preliminary 2016
release_next state_next year_next
1
2
3
The result shows that datasets have 8 columns, including code
, label
, and so on. Column 1 code
refers to the list of acronyms, while column 2 label
refers to the full names of the acronyms specified in column 1. To easily identify the areas we may be interested in, we check the label
column only, as shown below.
# Check the areas in the `label` column only
$code datasets
[1] "PA" "HS" "FA" "AF" "AE" "RA" "CBH" "RM" "RY" "FT"
[11] "FBSH" "PD" "MK" "LC" "RFM" "FBS" "SCL" "CB" "RL" "BE"
[21] "RT" "RP" "EMN" "EK" "OEA" "GT" "EM" "GCE" "GN" "GF"
[31] "GI" "GV" "GPP" "EA" "GFDI" "ESB" "SUA" "HCES" "FDIQ" "MDDW"
[41] "TCL" "TM" "TI" "TCLI" "CAHD" "OA" "OER" "PP" "QI" "QV"
[51] "CP" "FDI" "IG" "EI" "GLE" "QCL" "RFN" "RFB" "FO" "FOP"
[61] "FS" "ET" "IC" "PE" "WCAD" "SDGB" "CS" "CISP" "RN" "RI"
[71] "RS" "GCI" "OE" "GEI" "GEM" "GEL" "GEP"
The result indicates 77 areas of specific data, which covers a range of information that can be used for specific studies.
1.2 Selection of Food Security in Cambodia
1.2.1 Selection of Food Security
In order to extract specific data for use, we can use the code get_faostat_bulk("")
, while the acronym to be used inside the quotation marks, we can use the acronym we may see in the previous result. For example, we use the acronym FS
, which stands for Food Security
, in the quotation marks. We also need to give a name to the code, so that the name will be used subsequently for the analysis.
# Import the Food Security dataset (FS) only
<- get_faostat_bulk("FS") fs_data
After extacting the Food Security
dataset, we can check the new data to understand more about the information inside.
# Check the colnames
colnames(fs_data)
[1] "area_code" "area_code__m49_" "area" "item_code"
[5] "item" "element_code" "element" "year_code"
[9] "year" "unit" "value" "flag"
[13] "note"
# Check the first three rows of the dataset
head(fs_data, 3)
area_code area_code__m49_ area item_code
1 2 '004 Afghanistan 21010
2 2 '004 Afghanistan 21010
3 2 '004 Afghanistan 21010
item
1 Average dietary energy supply adequacy (percent) (3-year average)
2 Average dietary energy supply adequacy (percent) (3-year average)
3 Average dietary energy supply adequacy (percent) (3-year average)
element_code element year_code year unit value flag note
1 6121 value 20002002 2000-2002 % 87 E
2 6121 value 20012003 2001-2003 % 88 E
3 6121 value 20022004 2002-2004 % 91 E
We can see that the new dataset contains 13 columns
–one of which is column area
that contains country names. In case, we want to extract only Cambodia
data, we can further use the code filter()
from the dplyr
packages.
1.2.2 Selection of Cambodia
# Extract only the Cambodia data
<- fs_data |> filter(area =="Cambodia")
Camb
# Double-check if only the Cambodia data is extracted
|> freq_table(area) Camb
# A tibble: 1 × 3
area n prop
<chr> <int> <dbl>
1 Cambodia 1098 100
Based on the codes above, we select only Cambodia
data. Then, we can check the final data as follows:
# Check the dataset
colnames(Camb)
[1] "area_code" "area_code__m49_" "area" "item_code"
[5] "item" "element_code" "element" "year_code"
[9] "year" "unit" "value" "flag"
[13] "note"
# Check the first 5 rows
head(Camb, 5)
area_code area_code__m49_ area item_code
1 115 '116 Cambodia 21010
2 115 '116 Cambodia 21010
3 115 '116 Cambodia 21010
4 115 '116 Cambodia 21010
5 115 '116 Cambodia 21010
item
1 Average dietary energy supply adequacy (percent) (3-year average)
2 Average dietary energy supply adequacy (percent) (3-year average)
3 Average dietary energy supply adequacy (percent) (3-year average)
4 Average dietary energy supply adequacy (percent) (3-year average)
5 Average dietary energy supply adequacy (percent) (3-year average)
element_code element year_code year unit value flag note
1 6121 value 20002002 2000-2002 % 100 E
2 6121 value 20012003 2001-2003 % 102 E
3 6121 value 20022004 2002-2004 % 103 E
4 6121 value 20032005 2003-2005 % 103 E
5 6121 value 20042006 2004-2006 % 104 E
# Check type of data
str(Camb)
'data.frame': 1098 obs. of 13 variables:
$ area_code : int 115 115 115 115 115 115 115 115 115 115 ...
$ area_code__m49_: chr "'116" "'116" "'116" "'116" ...
$ area : chr "Cambodia" "Cambodia" "Cambodia" "Cambodia" ...
$ item_code : chr "21010" "21010" "21010" "21010" ...
$ item : chr "Average dietary energy supply adequacy (percent) (3-year average)" "Average dietary energy supply adequacy (percent) (3-year average)" "Average dietary energy supply adequacy (percent) (3-year average)" "Average dietary energy supply adequacy (percent) (3-year average)" ...
$ element_code : int 6121 6121 6121 6121 6121 6121 6121 6121 6121 6121 ...
$ element : chr "value" "value" "value" "value" ...
$ year_code : int 20002002 20012003 20022004 20032005 20042006 20052007 20062008 20072009 20082010 20092011 ...
$ year : chr "2000-2002" "2001-2003" "2002-2004" "2003-2005" ...
$ unit : chr "%" "%" "%" "%" ...
$ value : chr "100" "102" "103" "103" ...
$ flag : chr "E" "E" "E" "E" ...
$ note : chr "" "" "" "" ...
After checking the dataset, supposing we want to study year
and value
, which represent Food Security
and is expressed as %
. Therefore, we can check the number of years as follows.
# Check the year_code
|> freq_table(year_code) Camb
# A tibble: 46 × 3
year_code n prop
<int> <int> <dbl>
1 2000 25 2.3
2 2001 25 2.3
3 2002 26 2.4
4 2003 26 2.4
5 2004 26 2.4
6 2005 25 2.3
7 2006 25 2.3
8 2007 25 2.3
9 2008 25 2.3
10 2009 25 2.3
# ℹ 36 more rows
The year_code
column contains both individual years and year ranges. For easy data manipulation, we use the individual years only.
1.2.3 Data cleaning
# Select only individual year,
<- Camb |> filter(grepl("^\\d{4}$", year_code))
Camb1
# Check the data again
|> freq_table(year_code) Camb1
# A tibble: 24 × 3
year_code n prop
<int> <int> <dbl>
1 2000 25 4.3
2 2001 25 4.3
3 2002 26 4.5
4 2003 26 4.5
5 2004 26 4.5
6 2005 25 4.3
7 2006 25 4.3
8 2007 25 4.3
9 2008 25 4.3
10 2009 25 4.3
# ℹ 14 more rows
The result indicates that the data can be multiple in individual years, so we can calculate means in each year, using the code below:
# Calculate means
## First, turn value into numeric
<- Camb1 |> mutate(value=
Camb1 as.numeric(value))
## Second, do the calculation
<- Camb1 |> group_by(year_code) |>
Food summarise(Mean=mean(value, na.rm=T),
SD = sd(value, na.rm=T),
SE = SD/sqrt(n()))
## Show the result
Food
# A tibble: 24 × 4
year_code Mean SD SE
<int> <dbl> <dbl> <dbl>
1 2000 330. 726. 145.
2 2001 401. 800. 160.
3 2002 392. 804. 158.
4 2003 399. 815. 160.
5 2004 407. 834. 164.
6 2005 387. 831. 166.
7 2006 455. 910. 182.
8 2007 469. 940. 188.
9 2008 418. 908. 182.
10 2009 477. 963. 193.
# ℹ 14 more rows
Turning the value
column into numeric is important because this column is not considered numeric, as shown in the str()
code, while it must be numeric. After conversion, we can use the second code, within which na.rm=T
must be mentioned to avoid the situation when missing data can happen. The code details can be studied from the code above.
II. Plotting Graph
Graphs can be plotted using ggplot2
packages, available from https://posit.co/wp-content/uploads/2022/10/data-visualization-1.pdf. Below is the basic code, and modifications can be made by using further commands.
# Plot the graph
#| label: fig-FS1
#| fig-cap: "Food security over time"
|> ggplot(aes(year_code, Mean)) +
Food geom_point()
2.1 Graphic Modifications
The graph can be modified for clearer visualization. We can add theme(text = element_text(size=16))
to increase the fond size. In this example, the font size increases to 16. We can use labs()
to change the name of X & Y axes
. Additionally, we can give names to each graph codes for shorter commands subsequently.
# Step 1: Name the code
<- Food |> ggplot(aes(year_code, Mean)) +
p geom_point()
p
# Step 2: Increase the size
<- p + theme(text = element_text(size=16))
p1
p1
# step 3: Rename the axis titles and give the title
<- p1 + labs(x="Year", y="Food security (%)")
p2
p2
# step 4: Add the prediction curve (smooth)
<- p2 + geom_smooth()
p3 p3
The results above show that food security is no longer a matter for Cambodia
because we have achieved 100%
since 2000
, according to FAOSTAT
data. Until now, the index is at least six-fold
, compared to the base year. Therefore, what should be considered is food safety.
2.2 Graphic Combination
Graphs can be combined easily using the following codes based on the given graph names. Remember that we use import all the libraries as shown at the beginning to enable these codes.
# Combine all the graph
+ p1 p
# Three graphs in one
+ p1 + p2 p
# Three graphs in one
+ p1/p2 p
# Four graphs in one
+ p1)/(p2+p3) (p
For further R learn, please give your comment, or contact me at hlyhour@rua.edu.kh.