Import Datasets from FAOSTAT Using R

Hin Lyhour (Ph.D.)

Senoir lecturer and researcher
Faculty of Agricultural Biosystems Engineering
Royal University of Agriculture, Cambodia
Date: June 07, 2025

This lesson is designed to assist researchers, or students, who want to extract datasets directly from FAOSTAT, available from <https://www.fao.org/faostat/en/#country>. Firstly, we need to undestand basic R. Then, when we want to import any datasets from a specific country, we should install FAOSTAT packages using the code instal.packages("FAOSTAT"), which is done only one time. Then, when we want to recall the packages for use, we just type library(FAOSTAT), which is needed to be recalled whenever the R project is reopened for operation.

## First, install the packages "FAOSTAT"
library(FAOSTAT) # For data imports from FAOSTAT
library(tidyr) # For data cleaning
library(dplyr) # For data manipulation
library(rstatix) # For data analysis
library(ggplot2) # For beautiful graphic creation
library(patchwork) # For graph combination

After the FAOSTAT library is recalled for use, we can use the code search_dataset() to explore all the data available there. For easy data manipulation, we should give a name to the search, as shown in the code below.

1.1 Entire Data Search in FAOSTAT

# Search datasets available on FAOSTAT
datasets <- search_dataset()

# Check the column names available in the datasets
names(datasets)

 [1] "code"            "label"           "date_update"     "note_update"    
 [5] "release_current" "state_current"   "year_current"    "release_next"   
 [9] "state_next"      "year_next"

# Check the first three rows of the datasets
head(datasets, 3)

  code                                                            label
1   PA                                     Producer Prices (old series)
2   HS Indicators from Household Surveys (gender, area, socioeconomics)
3   FA                                         Food Aid Shipments (WFP)
  date_update note_update release_current state_current year_current
1  1991-12-31                  1991-12-31         final         1990
2  2014-07-31                  2014-07-31         final         2014
3  2016-12-22                  2016-12-22   preliminary         2016
  release_next state_next year_next
1                                  
2                                  
3

The result shows that datasets have 8 columns, including code, label, and so on. Column 1 code refers to the list of acronyms, while column 2 label refers to the full names of the acronyms specified in column 1. To easily identify the areas we may be interested in, we check the label column only, as shown below.

# Check the areas in the `label` column only
datasets$code

 [1] "PA"   "HS"   "FA"   "AF"   "AE"   "RA"   "CBH"  "RM"   "RY"   "FT"  
[11] "FBSH" "PD"   "MK"   "LC"   "RFM"  "FBS"  "SCL"  "CB"   "RL"   "BE"  
[21] "RT"   "RP"   "EMN"  "EK"   "OEA"  "GT"   "EM"   "GCE"  "GN"   "GF"  
[31] "GI"   "GV"   "GPP"  "EA"   "GFDI" "ESB"  "SUA"  "HCES" "FDIQ" "MDDW"
[41] "TCL"  "TM"   "TI"   "TCLI" "CAHD" "OA"   "OER"  "PP"   "QI"   "QV"  
[51] "CP"   "FDI"  "IG"   "EI"   "GLE"  "QCL"  "RFN"  "RFB"  "FO"   "FOP" 
[61] "FS"   "ET"   "IC"   "PE"   "WCAD" "SDGB" "CS"   "CISP" "RN"   "RI"  
[71] "RS"   "GCI"  "OE"   "GEI"  "GEM"  "GEL"  "GEP"

The result indicates 77 areas of specific data, which covers a range of information that can be used for specific studies.

1.2 Selection of Food Security in Cambodia

1.2.1 Selection of Food Security

In order to extract specific data for use, we can use the code get_faostat_bulk(""), while the acronym to be used inside the quotation marks, we can use the acronym we may see in the previous result. For example, we use the acronym FS, which stands for Food Security, in the quotation marks. We also need to give a name to the code, so that the name will be used subsequently for the analysis.

# Import the Food Security dataset (FS) only
fs_data <- get_faostat_bulk("FS")

After extacting the Food Security dataset, we can check the new data to understand more about the information inside.

# Check the colnames
colnames(fs_data)

 [1] "area_code"       "area_code__m49_" "area"            "item_code"      
 [5] "item"            "element_code"    "element"         "year_code"      
 [9] "year"            "unit"            "value"           "flag"           
[13] "note"

# Check the first three rows of the dataset
head(fs_data, 3)

  area_code area_code__m49_        area item_code
1         2            '004 Afghanistan     21010
2         2            '004 Afghanistan     21010
3         2            '004 Afghanistan     21010
                                                               item
1 Average dietary energy supply adequacy (percent) (3-year average)
2 Average dietary energy supply adequacy (percent) (3-year average)
3 Average dietary energy supply adequacy (percent) (3-year average)
  element_code element year_code      year unit value flag note
1         6121   value  20002002 2000-2002    %    87    E     
2         6121   value  20012003 2001-2003    %    88    E     
3         6121   value  20022004 2002-2004    %    91    E

We can see that the new dataset contains 13 columns–one of which is column area that contains country names. In case, we want to extract only Cambodia data, we can further use the code filter() from the dplyr packages.

1.2.2 Selection of Cambodia

# Extract only the Cambodia data
Camb <- fs_data |> filter(area =="Cambodia")

# Double-check if only the Cambodia data is extracted
Camb |> freq_table(area)

# A tibble: 1 × 3
  area         n  prop
  <chr>    <int> <dbl>
1 Cambodia  1098   100

Based on the codes above, we select only Cambodia data. Then, we can check the final data as follows:

# Check the dataset
colnames(Camb)

 [1] "area_code"       "area_code__m49_" "area"            "item_code"      
 [5] "item"            "element_code"    "element"         "year_code"      
 [9] "year"            "unit"            "value"           "flag"           
[13] "note"

# Check the first 5 rows
head(Camb, 5)

  area_code area_code__m49_     area item_code
1       115            '116 Cambodia     21010
2       115            '116 Cambodia     21010
3       115            '116 Cambodia     21010
4       115            '116 Cambodia     21010
5       115            '116 Cambodia     21010
                                                               item
1 Average dietary energy supply adequacy (percent) (3-year average)
2 Average dietary energy supply adequacy (percent) (3-year average)
3 Average dietary energy supply adequacy (percent) (3-year average)
4 Average dietary energy supply adequacy (percent) (3-year average)
5 Average dietary energy supply adequacy (percent) (3-year average)
  element_code element year_code      year unit value flag note
1         6121   value  20002002 2000-2002    %   100    E     
2         6121   value  20012003 2001-2003    %   102    E     
3         6121   value  20022004 2002-2004    %   103    E     
4         6121   value  20032005 2003-2005    %   103    E     
5         6121   value  20042006 2004-2006    %   104    E

# Check type of data
str(Camb)

'data.frame':   1098 obs. of  13 variables:
 $ area_code      : int  115 115 115 115 115 115 115 115 115 115 ...
 $ area_code__m49_: chr  "'116" "'116" "'116" "'116" ...
 $ area           : chr  "Cambodia" "Cambodia" "Cambodia" "Cambodia" ...
 $ item_code      : chr  "21010" "21010" "21010" "21010" ...
 $ item           : chr  "Average dietary energy supply adequacy (percent) (3-year average)" "Average dietary energy supply adequacy (percent) (3-year average)" "Average dietary energy supply adequacy (percent) (3-year average)" "Average dietary energy supply adequacy (percent) (3-year average)" ...
 $ element_code   : int  6121 6121 6121 6121 6121 6121 6121 6121 6121 6121 ...
 $ element        : chr  "value" "value" "value" "value" ...
 $ year_code      : int  20002002 20012003 20022004 20032005 20042006 20052007 20062008 20072009 20082010 20092011 ...
 $ year           : chr  "2000-2002" "2001-2003" "2002-2004" "2003-2005" ...
 $ unit           : chr  "%" "%" "%" "%" ...
 $ value          : chr  "100" "102" "103" "103" ...
 $ flag           : chr  "E" "E" "E" "E" ...
 $ note           : chr  "" "" "" "" ...

After checking the dataset, supposing we want to study year and value, which represent Food Security and is expressed as %. Therefore, we can check the number of years as follows.

# Check the year_code
Camb |> freq_table(year_code)

# A tibble: 46 × 3
   year_code     n  prop
       <int> <int> <dbl>
 1      2000    25   2.3
 2      2001    25   2.3
 3      2002    26   2.4
 4      2003    26   2.4
 5      2004    26   2.4
 6      2005    25   2.3
 7      2006    25   2.3
 8      2007    25   2.3
 9      2008    25   2.3
10      2009    25   2.3
# ℹ 36 more rows

The year_code column contains both individual years and year ranges. For easy data manipulation, we use the individual years only.

1.2.3 Data cleaning

# Select only individual year,
Camb1 <- Camb |> filter(grepl("^\\d{4}$", year_code))

# Check the data again
Camb1 |> freq_table(year_code)

# A tibble: 24 × 3
   year_code     n  prop
       <int> <int> <dbl>
 1      2000    25   4.3
 2      2001    25   4.3
 3      2002    26   4.5
 4      2003    26   4.5
 5      2004    26   4.5
 6      2005    25   4.3
 7      2006    25   4.3
 8      2007    25   4.3
 9      2008    25   4.3
10      2009    25   4.3
# ℹ 14 more rows

The result indicates that the data can be multiple in individual years, so we can calculate means in each year, using the code below:

# Calculate means
## First, turn value into numeric
Camb1 <- Camb1 |> mutate(value=
                           as.numeric(value))

## Second, do the calculation
Food <- Camb1 |> group_by(year_code) |>
  summarise(Mean=mean(value, na.rm=T),
            SD = sd(value, na.rm=T),
            SE = SD/sqrt(n()))

## Show the result
Food

# A tibble: 24 × 4
   year_code  Mean    SD    SE
       <int> <dbl> <dbl> <dbl>
 1      2000  330.  726.  145.
 2      2001  401.  800.  160.
 3      2002  392.  804.  158.
 4      2003  399.  815.  160.
 5      2004  407.  834.  164.
 6      2005  387.  831.  166.
 7      2006  455.  910.  182.
 8      2007  469.  940.  188.
 9      2008  418.  908.  182.
10      2009  477.  963.  193.
# ℹ 14 more rows

Turning the value column into numeric is important because this column is not considered numeric, as shown in the str() code, while it must be numeric. After conversion, we can use the second code, within which na.rm=T must be mentioned to avoid the situation when missing data can happen. The code details can be studied from the code above.

II. Plotting Graph

Graphs can be plotted using ggplot2 packages, available from https://posit.co/wp-content/uploads/2022/10/data-visualization-1.pdf. Below is the basic code, and modifications can be made by using further commands.

# Plot the graph
#| label: fig-FS1
#| fig-cap: "Food security over time"
Food |> ggplot(aes(year_code, Mean)) +
  geom_point()

2.1 Graphic Modifications

The graph can be modified for clearer visualization. We can add theme(text = element_text(size=16)) to increase the fond size. In this example, the font size increases to 16. We can use labs() to change the name of X & Y axes. Additionally, we can give names to each graph codes for shorter commands subsequently.

# Step 1: Name the code
p <- Food |> ggplot(aes(year_code, Mean)) +
  geom_point()
p

# Step 2: Increase the size
p1 <- p + theme(text = element_text(size=16))
p1

# step 3: Rename the axis titles and give the title
p2 <- p1 + labs(x="Year", y="Food security (%)")
p2

# step 4: Add the prediction curve (smooth)
p3 <- p2 + geom_smooth()
p3

The results above show that food security is no longer a matter for Cambodia because we have achieved 100% since 2000, according to FAOSTAT data. Until now, the index is at least six-fold, compared to the base year. Therefore, what should be considered is food safety.

2.2 Graphic Combination

Graphs can be combined easily using the following codes based on the given graph names. Remember that we use import all the libraries as shown at the beginning to enable these codes.

# Combine all the graph
p + p1

Figure 5: Two graphs combined horizontally

# Three graphs in one
p + p1 + p2

Figure 6: Three graphs combined horizontally

# Three graphs in one
p + p1/p2

Figure 7: Three graphs combined: the other two vertically arranged

# Four graphs in one
(p + p1)/(p2+p3)

Figure 8: Two graphs above and two more below

For further R learn, please give your comment, or contact me at hlyhour@rua.edu.kh.