US Food imports 1999-2021

The more diverse we get, the more diverse our food becomes. Consumers require variety, quality, and convenience in the foods they consume. The US department of agriculture (USDA) has published the following dataset (loaded below).

It provides import values of edible products entering U.S. and their origin of shipment. Data are from the U.S. Department of Commerce, U.S. Census Bureau. Food import values are compiled by calendar year into food groups. Twenty-two years of annual data are included, enabling us to track long-term growth patterns.

Looking at the past 22 years, how did the prices change? The focus will be on vegetable imports as wars and climate change affect fresh produce the most.

The original data was compiled in an excel woksheet. However, since my focus was on vegetables only, I saved that portion of the data in a CSV file. I uploaded both.

library(readxl)
## Warning: package 'readxl' was built under R version 4.1.3
setwd("C:/Users/mayss/Desktop/Data Science/Datasets/Project 1")

#import Excel file into Rstudio
Allfoodimports <- read_excel('US_Food_Imports.xlsx')
## New names:
## * `` -> `...2`
## * `` -> `...3`
## * `` -> `...4`
## * `` -> `...5`
## * `` -> `...6`
## * `` -> `...7`
## * `` -> `...8`
## * `` -> `...9`
## * `` -> `...10`
## * `` -> `...11`
## * `` -> `...12`
## * `` -> `...13`
## * `` -> `...14`
## * `` -> `...15`
## * `` -> `...16`
## * `` -> `...17`
## * `` -> `...18`
## * `` -> `...19`
## * `` -> `...20`
## * `` -> `...21`
## * `` -> `...22`
## * `` -> `...23`
## * `` -> `...24`
## * `` -> `...25`
## * `` -> `...26`
## * `` -> `...27`

The vegetable portion using a CSV file.

Load the libraries and load the data

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.6     v purrr   0.3.4
## v tibble  3.1.7     v dplyr   1.0.9
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.3
## Warning: package 'tibble' was built under R version 4.1.3
## Warning: package 'tidyr' was built under R version 4.1.3
## Warning: package 'readr' was built under R version 4.1.3
## Warning: package 'purrr' was built under R version 4.1.3
## Warning: package 'dplyr' was built under R version 4.1.3
## Warning: package 'stringr' was built under R version 4.1.3
## Warning: package 'forcats' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
setwd("C:/Users/mayss/Desktop/Data Science/Datasets/Project 1")

#import the csv file into Rstudio
VegetableImports <- read_csv("Vegetable_Imports.csv")
## Rows: 28 Columns: 25
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (2): Type, Country
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Second, clean the data: Three steps

1- Fix the spaces and the variables’ names.

names(VegetableImports) <- tolower(names(VegetableImports))
names(VegetableImports) <- gsub("/", "_", names(VegetableImports))

2- Convert data from wide to long:

As you can see, we have a “wide” data. To tidy a dataset like this, we need to pivot the columns into a new pair of variables.

In the wide data, each year was a variable, its observations were how much the US paid in millions to import the products. In Rstudio, our data needs to be “long”, thus, a conversion from wide to long is needed. In the long data, the year itself is a variable and the numeric values are the observations. It is compiled in a column called ‘year’. The amount paid for each product was compiled in a new column, called” million”.

VegetableImports_long <- VegetableImports %>%
  pivot_longer(  #converting columns 3-25, into one column called 'year'.  
    cols = 3:25, 
    names_to = c("year"),
    values_to = "million" #converting the amount paid for each product, each year to a single column, called'million'.
  )
# Calling the head() function
head(VegetableImports_long)
## # A tibble: 6 x 4
##   type                              country year  million
##   <chr>                             <chr>   <chr>   <dbl>
## 1 Total vegetables and preparations Mexico  2021    7792.
## 2 Total vegetables and preparations Mexico  2020    8168.
## 3 Total vegetables and preparations Mexico  2019    7244.
## 4 Total vegetables and preparations Mexico  2018    6773.
## 5 Total vegetables and preparations Mexico  2017    6308.
## 6 Total vegetables and preparations Mexico  2016    6471.

3- Fix the “year” variable. Looking at the converted data, the tibble describes the year variable as a character (categorical variable). This will create a problem when we start using it. We need to convert it to a numerical variable (dbl).

#convert column 'year' from character to numeric
VegetableImports_long$'year' <- as.numeric(VegetableImports_long$'year')

#view new data frame
VegetableImports_long
## # A tibble: 644 x 4
##    type                              country  year million
##    <chr>                             <chr>   <dbl>   <dbl>
##  1 Total vegetables and preparations Mexico   2021   7792.
##  2 Total vegetables and preparations Mexico   2020   8168.
##  3 Total vegetables and preparations Mexico   2019   7244.
##  4 Total vegetables and preparations Mexico   2018   6773.
##  5 Total vegetables and preparations Mexico   2017   6308.
##  6 Total vegetables and preparations Mexico   2016   6471.
##  7 Total vegetables and preparations Mexico   2015   5651.
##  8 Total vegetables and preparations Mexico   2014   5452.
##  9 Total vegetables and preparations Mexico   2013   5330.
## 10 Total vegetables and preparations Mexico   2012   4762.
## # ... with 634 more rows

At this point the dataset is ready to be analyzed.

The variables are:

1- type- The type of vegetables imported. This variable includes for sub-categories: dried vegetables, fresh vegetables, frozen vegetables, and total vegetables imported for each year. Bear in mind that the total includes all the other sub-categories. Therefore, to analyze the situation, I need to create two subsets: total, and sub-categories. Otherwise, the numbers will duplicate. This variable is categorical.

2- Country: another categorical variable. It lists the countries that these products were imported from.

3- Year: (quantitative variable). Ranges between 1999 and 2021.

4- Million: (quantitative variable). It represents the amount paid to import these vegetables. The prices are in millions of dollars.

The big picture: Which country do we pay the most to import vegetables?

Let us look at the overall vegetable imports. To do so, I need to create a subset that only included the total vegetables imported between 1999-2021.

library(dplyr)
total_vegetables <- VegetableImports_long
# subset the rows of dataframe with condition
total_vegetables = filter(VegetableImports_long,type == "Total vegetables and preparations")
total_vegetables
## # A tibble: 184 x 4
##    type                              country  year million
##    <chr>                             <chr>   <dbl>   <dbl>
##  1 Total vegetables and preparations Mexico   2021   7792.
##  2 Total vegetables and preparations Mexico   2020   8168.
##  3 Total vegetables and preparations Mexico   2019   7244.
##  4 Total vegetables and preparations Mexico   2018   6773.
##  5 Total vegetables and preparations Mexico   2017   6308.
##  6 Total vegetables and preparations Mexico   2016   6471.
##  7 Total vegetables and preparations Mexico   2015   5651.
##  8 Total vegetables and preparations Mexico   2014   5452.
##  9 Total vegetables and preparations Mexico   2013   5330.
## 10 Total vegetables and preparations Mexico   2012   4762.
## # ... with 174 more rows

Plot the data: Create the alluvial showing the ammount paid to import vegetable to the US between 1999 and 2021.

Load the needed libraries. Then create the code. Use colors to represent the countries we imported from over the 22 years.

# Load the needed libraries
library(alluvial)
## Warning: package 'alluvial' was built under R version 4.1.3
library(ggalluvial)
## Warning: package 'ggalluvial' was built under R version 4.1.3
total_vegetables <- total_vegetables

Create the alluvial graph

ggalluv <- ggplot(total_vegetables,
             aes(x = year, y = million, alluvium = country)) +
  theme_bw() +
  geom_alluvium(aes(fill = country), 
                color = "white",
                width = .1, 
                alpha = .8,
                decreasing = FALSE) +
  scale_fill_brewer(palette = "Spectral") + # Spectral has enough colors for all countries listed
  scale_x_continuous(lim = c(1999, 2021))+
  ggtitle("US Vegetable Imports in Millions (1999-2021)\n")+ # \n breaks the long title
  ylab("Value in Millions") +
  xlab("Year")+
  labs(caption = 'US Department of Agriculture') # Add caption 
ggalluv

Looking at the alluvial graph, it is safe to say that the U.S. pays the highest prices to import vegetables from Mexico. It is tempting to say that Mexico is where most of our vegetables are imported from. However, that would be misleading. The three variables analyzed in the graph are countries, the amount paid, and the yearly span. The volume of the food imported is not included in this subset. To study that angle, I need to create a new subset that includes the money paid per food volume ratio and cross-reference it with the countries listed. Unfortunately, the original data reported the total amount of vegetables imported but listed that under “world.” It did not specify the countries for that category.

Here is another way to represent the amount paid in millions by the U.S. to import vegetables across the past 22 years

time_series <- ggplot(total_vegetables, 
       aes(x = year, y = million, color = country)) +
    geom_line(lwd = 2)+ #adding more boldness to the lines
    theme_classic()+
    theme(
      axis.title = element_text(size = 13, face = 'bold'),
      axis.text = element_text(size = 10, face = 'bold')
    )+ 
    theme_light()+ #adding light grid in the background
    ggtitle("US Vegetable Imports in Millions (1999-2021)\n")+
    ylab('value in millions')+
    xlab('Year')+
    labs(caption = 'US Department of Agriculture') # Add caption 
time_series

While the alluvial graph is more appealing to the eye, reading the time-series graph is easier. Both graphs show the following:

1- We pay Mexico the most (most likely because we import the most, however, that is not proven by this data)

2- Importing from Mexico has the highest rate of increase.

3- The highest prices we paid overall on vegetable imports were in 2020, due to COVID.

Two sub-problems

I want to investigate two quick aspects: the overall change of costs per importing vegetable across the past 22 years and the three types of vegetables we imported last year, 2021.

First,the overall change of costs per importing vegetable across the past 22 years

# load the new subdata
setwd("C:/Users/mayss/Desktop/Data Science/Datasets/Project 1")
pricesperyear <- read_csv('subset.csv')
## Rows: 23 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (1): Year
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Add a new column that calculates the proportion of money paid in million per 1000 million tons
pricesperyear %>%
  mutate(pricesperyear, proportion = prices_in_millions / quantity_in_1000mt) -> ratio1

# select year and proportion columns  
  select(ratio1, Year, proportion) -> ratio2
  
#plot the scatter plot
 ratioplot <- ggplot(data = ratio2) + 
            geom_point(mapping = aes(x = Year, y = proportion))
 ratioplot

Looking at the scatter plot, we can see what we confirmed earlier, 2020 had the highest cost (about 1.3 million per every 1000 mt). Also, notice how steep the price rate growth is between 2004 and 2009; this could be due to the Iraq war.

Finally, let us look at the types of vegetables we imported last year, 2021.

The type of vegetable imported listed in the data are dried vegetable, fresh vegetables, and frozen vegetables.

# Compile the three categories in a new subdata using filter function.  

sub_vegetables <- VegetableImports_long
sub_vegetables = filter(VegetableImports_long,type != "Total vegetables and preparations")
sub_vegetables
## # A tibble: 460 x 4
##    type             country  year million
##    <chr>            <chr>   <dbl>   <dbl>
##  1 Fresh vegetables Mexico   2021   6677.
##  2 Fresh vegetables Mexico   2020   7108.
##  3 Fresh vegetables Mexico   2019   6294.
##  4 Fresh vegetables Mexico   2018   5868.
##  5 Fresh vegetables Mexico   2017   5476.
##  6 Fresh vegetables Mexico   2016   5606.
##  7 Fresh vegetables Mexico   2015   4840.
##  8 Fresh vegetables Mexico   2014   4675.
##  9 Fresh vegetables Mexico   2013   4591.
## 10 Fresh vegetables Mexico   2012   4056.
## # ... with 450 more rows
# focus on 2021 only by creating another subset 

sub_vegetables1 <- sub_vegetables
# subset the rows of dataframe with condition
sub_vegetables1 = filter(sub_vegetables1,year == 2021)
sub_vegetables1
## # A tibble: 20 x 4
##    type              country           year million
##    <chr>             <chr>            <dbl>   <dbl>
##  1 Fresh vegetables  Mexico            2021  6677. 
##  2 Fresh vegetables  Canada            2021  1788. 
##  3 Fresh vegetables  Peru              2021   360. 
##  4 Fresh vegetables  Guatemala         2021   190. 
##  5 Fresh vegetables  China (Mainland)  2021   119  
##  6 Fresh vegetables  Costa Rica        2021   109. 
##  7 Fresh vegetables  Rest of world     2021   456. 
##  8 Frozen vegetables Mexico            2021   530. 
##  9 Frozen vegetables Canada            2021   130. 
## 10 Frozen vegetables Guatemala         2021    78.6
## 11 Frozen vegetables China (Mainland)  2021    62  
## 12 Frozen vegetables Belgium           2021    56.3
## 13 Frozen vegetables Ecuador           2021    54.3
## 14 Frozen vegetables Rest of world     2021   263  
## 15 Dried vegetables  Canada            2021   290. 
## 16 Dried vegetables  China (Mainland)  2021   131. 
## 17 Dried vegetables  Mexico            2021    56.6
## 18 Dried vegetables  India             2021    53.3
## 19 Dried vegetables  Turkey            2021    24  
## 20 Dried vegetables  Rest of world     2021   240.

Remove the ‘year’ column, since it is a constant component.

sub_vegetables2 <-sub_vegetables1 %>%
select('type', 'country', 'million')
head(sub_vegetables2)
## # A tibble: 6 x 3
##   type             country          million
##   <chr>            <chr>              <dbl>
## 1 Fresh vegetables Mexico             6677.
## 2 Fresh vegetables Canada             1788.
## 3 Fresh vegetables Peru                360.
## 4 Fresh vegetables Guatemala           190.
## 5 Fresh vegetables China (Mainland)    119 
## 6 Fresh vegetables Costa Rica          109.

Plot the dataset.

# create a bar graph
subbars <- ggplot(data = sub_vegetables2) +
  geom_bar(mapping = aes(x = type, y = million,  fill = country), stat = 'identity', position = 'dodge') + # the dodge aesthetic visualizes the different countries.  
  scale_fill_brewer(palette = 'BrBG') +
  coord_flip() +
  ggtitle('Cost of US Vegetble Imports for 2021\n')+
  xlab("Type of Imported Vegetables" ) +
  ylab("Values in Millions")+
  labs(caption = 'US Department of Agriculture') # Add caption 
subbars

While we knew that importing from Mexico had the highest cost, we can see that within the subcategories, importing fresh vegetables costs more than the other types.

Moving forward:

With the Ukrainian war and the inflation we see in the gas prices, it would be interesting (scary) to see how prices will look like in 2022. Moreover, looking at the data, I wish they had included the rate of prices per volume for each country; it would have given a clearer picture, especially when one starts to see the geographical component effects. For example, let’s go back to the time-series plot. We find that prices are almost the same between Guatemala and Italy, even though Italy is double the distance from the U.S. One can’t help to wonder: are the prices close because we did not import from Italy as much as we did from Guatemala? Or is it because the vegetable prices in Italy are lower?

Thank you!