US Food imports 1999-2021
The more diverse we get, the more diverse our food becomes.
Consumers require variety, quality, and convenience in the foods they
consume. The US department of agriculture (USDA) has published the
following dataset (loaded below).
It provides import values of edible products entering U.S. and their
origin of shipment. Data are from the U.S. Department of Commerce, U.S.
Census Bureau. Food import values are compiled by calendar year into
food groups. Twenty-two years of annual data are included, enabling us
to track long-term growth patterns.
Looking at the past 22 years, how did the prices change? The focus
will be on vegetable imports as wars and climate change affect fresh
produce the most.
The original data was compiled in an excel woksheet. However, since
my focus was on vegetables only, I saved that portion of the data in a
CSV file. I uploaded both.
library(readxl)
## Warning: package 'readxl' was built under R version 4.1.3
setwd("C:/Users/mayss/Desktop/Data Science/Datasets/Project 1")
#import Excel file into Rstudio
Allfoodimports <- read_excel('US_Food_Imports.xlsx')
## New names:
## * `` -> `...2`
## * `` -> `...3`
## * `` -> `...4`
## * `` -> `...5`
## * `` -> `...6`
## * `` -> `...7`
## * `` -> `...8`
## * `` -> `...9`
## * `` -> `...10`
## * `` -> `...11`
## * `` -> `...12`
## * `` -> `...13`
## * `` -> `...14`
## * `` -> `...15`
## * `` -> `...16`
## * `` -> `...17`
## * `` -> `...18`
## * `` -> `...19`
## * `` -> `...20`
## * `` -> `...21`
## * `` -> `...22`
## * `` -> `...23`
## * `` -> `...24`
## * `` -> `...25`
## * `` -> `...26`
## * `` -> `...27`
The vegetable portion using a CSV file.
Load the libraries and load the data
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.6 v purrr 0.3.4
## v tibble 3.1.7 v dplyr 1.0.9
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.3
## Warning: package 'tibble' was built under R version 4.1.3
## Warning: package 'tidyr' was built under R version 4.1.3
## Warning: package 'readr' was built under R version 4.1.3
## Warning: package 'purrr' was built under R version 4.1.3
## Warning: package 'dplyr' was built under R version 4.1.3
## Warning: package 'stringr' was built under R version 4.1.3
## Warning: package 'forcats' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
setwd("C:/Users/mayss/Desktop/Data Science/Datasets/Project 1")
#import the csv file into Rstudio
VegetableImports <- read_csv("Vegetable_Imports.csv")
## Rows: 28 Columns: 25
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (2): Type, Country
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Second, clean the data: Three steps
1- Fix the spaces and the variables’ names.
names(VegetableImports) <- tolower(names(VegetableImports))
names(VegetableImports) <- gsub("/", "_", names(VegetableImports))
2- Convert data from wide to long:
As you can see, we have a “wide” data. To tidy a dataset like this,
we need to pivot the columns into a new pair of variables.
In the wide data, each year was a variable, its observations were
how much the US paid in millions to import the products. In Rstudio, our
data needs to be “long”, thus, a conversion from wide to long is needed.
In the long data, the year itself is a variable and the numeric values
are the observations. It is compiled in a column called ‘year’. The
amount paid for each product was compiled in a new column, called”
million”.
VegetableImports_long <- VegetableImports %>%
pivot_longer( #converting columns 3-25, into one column called 'year'.
cols = 3:25,
names_to = c("year"),
values_to = "million" #converting the amount paid for each product, each year to a single column, called'million'.
)
# Calling the head() function
head(VegetableImports_long)
## # A tibble: 6 x 4
## type country year million
## <chr> <chr> <chr> <dbl>
## 1 Total vegetables and preparations Mexico 2021 7792.
## 2 Total vegetables and preparations Mexico 2020 8168.
## 3 Total vegetables and preparations Mexico 2019 7244.
## 4 Total vegetables and preparations Mexico 2018 6773.
## 5 Total vegetables and preparations Mexico 2017 6308.
## 6 Total vegetables and preparations Mexico 2016 6471.
3- Fix the “year” variable. Looking at the converted data, the
tibble describes the year variable as a
character (categorical variable). This will create a
problem when we start using it. We need to convert it to a numerical
variable (dbl).
#convert column 'year' from character to numeric
VegetableImports_long$'year' <- as.numeric(VegetableImports_long$'year')
#view new data frame
VegetableImports_long
## # A tibble: 644 x 4
## type country year million
## <chr> <chr> <dbl> <dbl>
## 1 Total vegetables and preparations Mexico 2021 7792.
## 2 Total vegetables and preparations Mexico 2020 8168.
## 3 Total vegetables and preparations Mexico 2019 7244.
## 4 Total vegetables and preparations Mexico 2018 6773.
## 5 Total vegetables and preparations Mexico 2017 6308.
## 6 Total vegetables and preparations Mexico 2016 6471.
## 7 Total vegetables and preparations Mexico 2015 5651.
## 8 Total vegetables and preparations Mexico 2014 5452.
## 9 Total vegetables and preparations Mexico 2013 5330.
## 10 Total vegetables and preparations Mexico 2012 4762.
## # ... with 634 more rows
At this point the dataset is ready to be
analyzed.
The variables are:
1- type- The type of vegetables imported. This variable includes for
sub-categories: dried vegetables, fresh vegetables, frozen vegetables,
and total vegetables imported for each year. Bear in mind that the total
includes all the other sub-categories. Therefore, to analyze the
situation, I need to create two subsets: total, and sub-categories.
Otherwise, the numbers will duplicate. This variable is
categorical.
2- Country: another categorical variable. It lists the countries
that these products were imported from.
3- Year: (quantitative variable). Ranges between 1999 and 2021.
4- Million: (quantitative variable). It represents the amount paid
to import these vegetables. The prices are in millions of dollars.
The big picture: Which country do we pay the most to import
vegetables?
Let us look at the overall vegetable imports. To do so, I need to
create a subset that only included the total vegetables imported between
1999-2021.
library(dplyr)
total_vegetables <- VegetableImports_long
# subset the rows of dataframe with condition
total_vegetables = filter(VegetableImports_long,type == "Total vegetables and preparations")
total_vegetables
## # A tibble: 184 x 4
## type country year million
## <chr> <chr> <dbl> <dbl>
## 1 Total vegetables and preparations Mexico 2021 7792.
## 2 Total vegetables and preparations Mexico 2020 8168.
## 3 Total vegetables and preparations Mexico 2019 7244.
## 4 Total vegetables and preparations Mexico 2018 6773.
## 5 Total vegetables and preparations Mexico 2017 6308.
## 6 Total vegetables and preparations Mexico 2016 6471.
## 7 Total vegetables and preparations Mexico 2015 5651.
## 8 Total vegetables and preparations Mexico 2014 5452.
## 9 Total vegetables and preparations Mexico 2013 5330.
## 10 Total vegetables and preparations Mexico 2012 4762.
## # ... with 174 more rows
Plot the data: Create the alluvial showing the ammount paid to
import vegetable to the US between 1999 and 2021.
Load the needed libraries. Then create the code. Use colors to
represent the countries we imported from over the 22 years.
# Load the needed libraries
library(alluvial)
## Warning: package 'alluvial' was built under R version 4.1.3
library(ggalluvial)
## Warning: package 'ggalluvial' was built under R version 4.1.3
total_vegetables <- total_vegetables
Create the alluvial graph
ggalluv <- ggplot(total_vegetables,
aes(x = year, y = million, alluvium = country)) +
theme_bw() +
geom_alluvium(aes(fill = country),
color = "white",
width = .1,
alpha = .8,
decreasing = FALSE) +
scale_fill_brewer(palette = "Spectral") + # Spectral has enough colors for all countries listed
scale_x_continuous(lim = c(1999, 2021))+
ggtitle("US Vegetable Imports in Millions (1999-2021)\n")+ # \n breaks the long title
ylab("Value in Millions") +
xlab("Year")+
labs(caption = 'US Department of Agriculture') # Add caption
ggalluv

Looking at the alluvial graph, it is safe to say that the U.S. pays
the highest prices to import vegetables from Mexico. It is tempting to
say that Mexico is where most of our vegetables are imported from.
However, that would be misleading. The three variables analyzed in the
graph are countries, the amount paid, and the yearly span. The volume of
the food imported is not included in this subset. To study that angle, I
need to create a new subset that includes the money paid per food volume
ratio and cross-reference it with the countries listed. Unfortunately,
the original data reported the total amount of vegetables imported but
listed that under “world.” It did not specify the countries for that
category.
Here is another way to represent the amount paid in millions by the
U.S. to import vegetables across the past 22 years
time_series <- ggplot(total_vegetables,
aes(x = year, y = million, color = country)) +
geom_line(lwd = 2)+ #adding more boldness to the lines
theme_classic()+
theme(
axis.title = element_text(size = 13, face = 'bold'),
axis.text = element_text(size = 10, face = 'bold')
)+
theme_light()+ #adding light grid in the background
ggtitle("US Vegetable Imports in Millions (1999-2021)\n")+
ylab('value in millions')+
xlab('Year')+
labs(caption = 'US Department of Agriculture') # Add caption
time_series

While the alluvial graph is more appealing to the eye, reading the
time-series graph is easier. Both graphs show the following:
1- We pay Mexico the most (most likely because we import the most,
however, that is not proven by this data)
2- Importing from Mexico has the highest rate of increase.
3- The highest prices we paid overall on vegetable imports were in
2020, due to COVID.
Two sub-problems
I want to investigate two quick aspects: the overall change of costs
per importing vegetable across the past 22 years and the three types of
vegetables we imported last year, 2021.
First,the overall change of costs per importing vegetable across the
past 22 years
# load the new subdata
setwd("C:/Users/mayss/Desktop/Data Science/Datasets/Project 1")
pricesperyear <- read_csv('subset.csv')
## Rows: 23 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (1): Year
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Add a new column that calculates the proportion of money paid in million per 1000 million tons
pricesperyear %>%
mutate(pricesperyear, proportion = prices_in_millions / quantity_in_1000mt) -> ratio1
# select year and proportion columns
select(ratio1, Year, proportion) -> ratio2
#plot the scatter plot
ratioplot <- ggplot(data = ratio2) +
geom_point(mapping = aes(x = Year, y = proportion))
ratioplot

Looking at the scatter plot, we can see what we confirmed earlier,
2020 had the highest cost (about 1.3 million per every 1000 mt). Also,
notice how steep the price rate growth is between 2004 and 2009; this
could be due to the Iraq war.
Finally, let us look at the types of vegetables we imported last
year, 2021.
The type of vegetable imported listed in the data are dried
vegetable, fresh vegetables, and frozen vegetables.
# Compile the three categories in a new subdata using filter function.
sub_vegetables <- VegetableImports_long
sub_vegetables = filter(VegetableImports_long,type != "Total vegetables and preparations")
sub_vegetables
## # A tibble: 460 x 4
## type country year million
## <chr> <chr> <dbl> <dbl>
## 1 Fresh vegetables Mexico 2021 6677.
## 2 Fresh vegetables Mexico 2020 7108.
## 3 Fresh vegetables Mexico 2019 6294.
## 4 Fresh vegetables Mexico 2018 5868.
## 5 Fresh vegetables Mexico 2017 5476.
## 6 Fresh vegetables Mexico 2016 5606.
## 7 Fresh vegetables Mexico 2015 4840.
## 8 Fresh vegetables Mexico 2014 4675.
## 9 Fresh vegetables Mexico 2013 4591.
## 10 Fresh vegetables Mexico 2012 4056.
## # ... with 450 more rows
# focus on 2021 only by creating another subset
sub_vegetables1 <- sub_vegetables
# subset the rows of dataframe with condition
sub_vegetables1 = filter(sub_vegetables1,year == 2021)
sub_vegetables1
## # A tibble: 20 x 4
## type country year million
## <chr> <chr> <dbl> <dbl>
## 1 Fresh vegetables Mexico 2021 6677.
## 2 Fresh vegetables Canada 2021 1788.
## 3 Fresh vegetables Peru 2021 360.
## 4 Fresh vegetables Guatemala 2021 190.
## 5 Fresh vegetables China (Mainland) 2021 119
## 6 Fresh vegetables Costa Rica 2021 109.
## 7 Fresh vegetables Rest of world 2021 456.
## 8 Frozen vegetables Mexico 2021 530.
## 9 Frozen vegetables Canada 2021 130.
## 10 Frozen vegetables Guatemala 2021 78.6
## 11 Frozen vegetables China (Mainland) 2021 62
## 12 Frozen vegetables Belgium 2021 56.3
## 13 Frozen vegetables Ecuador 2021 54.3
## 14 Frozen vegetables Rest of world 2021 263
## 15 Dried vegetables Canada 2021 290.
## 16 Dried vegetables China (Mainland) 2021 131.
## 17 Dried vegetables Mexico 2021 56.6
## 18 Dried vegetables India 2021 53.3
## 19 Dried vegetables Turkey 2021 24
## 20 Dried vegetables Rest of world 2021 240.
Remove the ‘year’ column, since it is a constant component.
sub_vegetables2 <-sub_vegetables1 %>%
select('type', 'country', 'million')
head(sub_vegetables2)
## # A tibble: 6 x 3
## type country million
## <chr> <chr> <dbl>
## 1 Fresh vegetables Mexico 6677.
## 2 Fresh vegetables Canada 1788.
## 3 Fresh vegetables Peru 360.
## 4 Fresh vegetables Guatemala 190.
## 5 Fresh vegetables China (Mainland) 119
## 6 Fresh vegetables Costa Rica 109.
Plot the dataset.
# create a bar graph
subbars <- ggplot(data = sub_vegetables2) +
geom_bar(mapping = aes(x = type, y = million, fill = country), stat = 'identity', position = 'dodge') + # the dodge aesthetic visualizes the different countries.
scale_fill_brewer(palette = 'BrBG') +
coord_flip() +
ggtitle('Cost of US Vegetble Imports for 2021\n')+
xlab("Type of Imported Vegetables" ) +
ylab("Values in Millions")+
labs(caption = 'US Department of Agriculture') # Add caption
subbars

While we knew that importing from Mexico had the highest cost, we
can see that within the subcategories, importing fresh vegetables costs
more than the other types.
Moving forward:
With the Ukrainian war and the inflation we see in the gas prices,
it would be interesting (scary) to see how prices will look like in
2022. Moreover, looking at the data, I wish they had included the rate
of prices per volume for each country; it would have given a clearer
picture, especially when one starts to see the geographical component
effects. For example, let’s go back to the time-series plot. We find
that prices are almost the same between Guatemala and Italy, even though
Italy is double the distance from the U.S. One can’t help to wonder: are
the prices close because we did not import from Italy as much as we did
from Guatemala? Or is it because the vegetable prices in Italy are
lower?
Thank you!