1. Introduction

Lead (Pb) is the second most toxic metal after arsenic (As), comprising 0.002% of the earth’s crust.

Lead, like other heavy metals, does not biodegrade or disappear from the environment over time. Although lead levels in the food supply declined dramatically between the 1970s and 1990s, low levels of lead continue to be detected in some foods due to the continued presence in the environment. Lead from the soil can be absorbed by plants grown to obtain fruit or vegetables or plants used as ingredients in food or animal feed.

Lead is poisonous to humans and can affect people of any age or health. Lead is especially harmful to vulnerable populations, including babies, young children, pregnant women and their fetuses and others with chronic health conditions. High levels of lead exposure can seriously damage the health and development of children, specifically the brain and nervous system. The neurological effects of high levels of lead exposure during early childhood include learning disabilities, behavioral difficulties and reduced IQ. Because lead can accumulate in the body, even low-level chronic exposure can be dangerous over time.

It is estimated that exposure to lead was responsible, in 2004, for 143 thousand deaths and 0.6% of the global disease burden, taking into account mild mental under development and cardiovascular outcomes

1.1 Purpose

This work focus on:

1- investigate the content of Lead in foods using the R language as a technique for filtering, analyzing and modeling distributions according to the varieties of food consumed, countries, analytical conditions and dates.

2- Find out where and which foods are most contaminated and create possible hypotheses of cause.

2. Initial data and database ingestion:

The information was taken from the “GEMSFOODS” database of the World Health Organization (WHO) available at:

https://extranet.who.int/gemsfood/

All calculations and graphs were made in ** Rstudio version 1.3.959 ** and ** R base 3.6.3 ** and using the following libraries:

library(readr)
library(ggplot2)
library(tidyverse)
library(knitr)
library(RColorBrewer)
library(ggalluvial)
library(treemapify)
library(gganimate)
library(ggrepel)
library(DescTools)
library(ggthemes)

The codes used to perform that work will be shown. All processing is based on more modern standards in reproducibility [ref] in research where individual data is not manipulated and the entire process (from database recognition to the final report) is done in an automated way and can be audited by the code used.

We started by consolidating the database, the files were downloaded in groups to facilitate the transfering:

GEMS2000 <- read.csv("/mnt/DADOS/Data Science/GEMFoods/Gemsfood(1980-2000).csv", dec=",")
GEMS2010 <- read.csv("/mnt/DADOS/Data Science/GEMFoods/Gemsfood(2000-2010).csv", dec=",")
GEMS2014 <- read.csv("/mnt/DADOS/Data Science/GEMFoods/Gemsfood(2010-2014).csv", dec=",")
GEMS2016 <- read.csv("/mnt/DADOS/Data Science/GEMFoods/Gemsfood(2014-2016).csv", dec=",")
GEMS2018 <- read.csv("/mnt/DADOS/Data Science/GEMFoods/Gemsfood(2016-2018).csv", dec=",")
GEMS2020 <- read.csv("/mnt/DADOS/Data Science/GEMFoods/Gemsfood(2018-2020).csv", dec=",")
GEMS <- rbind(GEMS2000,GEMS2010,GEMS2014,GEMS2016,GEMS2018,GEMS2020)

3. Main variables:

Checking the information available:

names(GEMS)
 [1] "RecordType"              "RegionName"              "CountryName"            
 [4] "ContaminantName"         "FoodCategory"            "FoodName"               
 [7] "FoodCode"                "LocalFoodName"           "FoodStateName"          
[10] "ResultValue"             "ResultText"              "UnitName"               
[13] "LOD"                     "LOQ"                     "LODMin"                 
[16] "LODMax"                  "LOQMin"                  "LOQMax"                 
[19] "Mean"                    "MeanLower"               "MeanUpper"              
[22] "Median"                  "NinetiethPercentile"     "StandardDeviation"      
[25] "SamplingPeriodStart"     "SamplingPeriodStartText" "SamplingPeriodEnd"      
[28] "SamplingPeriodEndText"   "RepresentativenessName"  "LabCount"               
[31] "LabNumber"               "FoodOriginName"          "AnalyticalQAName"       
[34] "SampleCount"             "SamplesBelowLOQ"         "RangeMin"               
[37] "RangeMax"                "ResultBasisName"         "PortionTypeName"        
[40] "SerialNumber"            "RowNum"                 

The data used will be: “CountryName” the country of origin of the analytical data. “FoodCategory” The type of food, that is, the analytical matrix such as “Vegetables” or “Animal Feed”. “FoodOriginName”, which describes the origin as Domestic or imported. “LOQ” the limit of quantification. “ResultValue”, the value obtained in the analysis when quantified. “AnalyticalQAName”, which describes the conditions of the analysis, if made in an accredited laboratory or with use of internal quality control only. “FoodName” which is the exact description of the food as “Potato” or “Rice”.

Before, the evaluation of the units of the Lead content should be made:

summary(GEMS$UnitName)
 mg/kg  ug/kg 
123819 240956 

When quantified, the result was reported in two different units, mg/kg and ug/kg.

All units will be changed to mg/kg and the values adjusted for the same unit, where 1ug/kg is equivalent to 0.001mg/kg:

Sys.time()
GEMSx <-GEMS
  GEMS1_14 <- GEMS[,14]
  GEMS1_10 <- GEMS[,10]
  GEMS1_12 <- GEMS[,12]
 for(i in 1:nrow(GEMS)){
  if(GEMS1_12[i]=="ug/kg"){
    GEMS1_14[i] <- GEMS1_14[i]/1000
    GEMS1_10[i] <- GEMS1_10[i]/1000
    GEMS1_12[i] <- "mg/kg"
  } else {
NULL
  }
 }
GEMSx[,14]<- GEMS1_14
GEMSx[,10]<- GEMS1_10
GEMSx[,12]<- GEMS1_12

Sys.time()

NOTE: Due to long processing time, the code chunk above was previously executed and saved as a RDS file. Which will be read and stored in a new object(GEMS1).

GEMS1 <- readRDS("/mnt/DADOS/Data Science/GEMFoods/GEMS1.rds")

Confirming the change:

summary(GEMS1$UnitName)
 mg/kg  ug/kg 
364775      0 

The Limit of Quantification has also been changed (Column 14)

4. Exploratory analysis and search for the biggest contaminations:

4.1 Distribution of the data source

Let’s see in a practical way which countries contributed the most to the WHO database using the CountryName variable and how many times a particular country appears:

GEMS1 %>% count(CountryName,sort = TRUE) %>% top_n(20) %>% ggplot()+geom_col(aes(n,reorder(CountryName,n,max),fill=CountryName),show.legend = FALSE)+theme_economist()+xlab("Country Name")+ylab("Number of samples")
Selecting by n

ggplot(GEMS1, aes(x=reorder(CountryName,CountryName,function(x)+length(x)), fill=CountryName, ))+geom_bar(show.legend= FALSE)+ coord_flip()+xlab("Country Name")+ylab("Number of samples")+theme_economist()

Figure 1. Number of samples per database contributor in the period from 1995 to 2020.

“WHO European Region”, “China”, “Canada”, “United States of America”, “Brazil” and “Thailand” are the main contributors. WHO/Europe have 54% of contributions and Canada in second with 13%.

From now on, a sample will be used containing only the 6 countries with the most contributions (GEMS2)

To look for any possible change in the quality of the tests according to the time, we will convert all dates to year.

GEMS2$SamplingPeriodEnd <- as.POSIXct(GEMS2$SamplingPeriodEnd, format = "%m/%d/%Y %H:%M:%S")
GEMS2$SamplingPeriodEnd <- format(GEMS2$SamplingPeriodEnd, format="%Y")

4.2 Analytical Quality of Tests

Evaluating the available information on which tools used for the quality of the tests, the variable “AnalyticalQAName” will be used:

ggplot(GEMS)+geom_bar(aes(reorder(AnalyticalQAName,AnalyticalQAName,function(x)+length(x))
                        ,fill=AnalyticalQAName), show.legend= FALSE, stat = "count")+
  coord_flip()+
  xlab("AnalyticalQAName")+xlab("Quality control Tool")+ylab("Number of samples")

Figure 2. Type of analytical quality control tool used in relation to the number of samples in the database.

We see that the equivalent of approximately 48% (175,133) of the tests were carried out in accredited laboratories and 16% (51139) had some internal quality control or proficiency test.

We now we can investigate whether the origin of the analytical data (whether or not to use an accredited laboratory) interferes with the mean/spread within the data for each country. First, we calculate the average among all samples but taking into account all quantified results, first a summary of the total data:

total average:

GEMS2 %>% group_by(CountryName) %>%  summarise(mean=mean(ResultValue,na.rm=TRUE), n=n()) %>% kable()
CountryName mean n
Canada 0.0184340 48909
China 0.1594830 26633
United States of America 0.0039717 18378
Brazil 0.0487711 16166
Thailand 0.0977092 10624
WHO European Region 0.2078260 198592

Plotting:

GEMS2 %>% 
  group_by(CountryName) %>%  
  summarise(mean=mean(ResultValue,na.rm=TRUE), n=n())%>%
  arrange(desc(n),na.rm=TRUE)%>% 
  ggplot(aes(mean,CountryName ,fill=CountryName))+
  geom_col(show.legend = FALSE)+
  geom_text(aes(label = format(round(mean, 3), nsmall = 2)), vjust = 0.5,hjust=0)+xlim(c(0,0.25))+ylab("Country Name")+xlab("Mean Value of Pb (mg/Kg)")

Figure 3. Average Lead value (mg/kg) in samples according to the country of origin.

Average using only quantified samples (we adopt values only above 0.001mg/Kg):

GEMS2 %>% group_by(CountryName) %>%  summarise(mean=mean(ResultValue>=0.001,na.rm=TRUE), n=n())%>%arrange(desc(n),na.rm=TRUE)%>% kable()
CountryName mean n
WHO European Region 0.4118924 198592
Canada 0.5294337 48909
China 0.6740885 26633
United States of America 0.2584834 18378
Brazil 0.2769097 16166
Thailand 0.3493351 10624

In the form of a bar graph now:

GEMS2 %>% 
  group_by(CountryName) %>%  
  summarise(mean=mean(ResultValue>=0.001,na.rm=TRUE), n=n())%>%
  arrange(desc(n),na.rm=TRUE)%>% 
  ggplot(aes(mean,CountryName ,fill=CountryName))+
  geom_col(show.legend = FALSE)+
  geom_text(aes(label = format(round(mean, 3), nsmall = 2)), vjust = 0.5,hjust=0)+xlim(c(0,0.70))+ylab("Country Name")+xlab("Mean Value of Pb (mg/Kg)")

Figure 4. Average lead value (mg/kg) in quantified samples (above 0.001 mg/kg) according to the country of origin.

Initially, European values appear to be higher, specifically 4x greater than Brazilians, 10x greater than Canadians and 50x greater than those of the USA. However, when only the quantified values are taken into account, this average decreases considerably. One theory would be that countries such as the USA and Canada would have a larger number of undetected samples and thus the average would be reduced.

To check if the source of the difference may be the use of non-accredited laboratories, we will exclude this data in the “GEMS3” dataframe

GEMS3=subset(GEMS2,GEMS2$AnalyticalQAName=="Officially accredited")

Now repeating the test to see if there was a change in the averages:

GEMS3 %>% group_by(CountryName) %>%  summarise(mean=mean(ResultValue>=0.001,na.rm=TRUE), n=n())%>%arrange(desc(n),na.rm=TRUE)%>% ggplot(aes(mean,CountryName ,fill=CountryName))+geom_col(show.legend = FALSE)+geom_text(aes(label = format(round(mean, 3), nsmall = 2)), vjust = 0.5,hjust=0)+xlim(c(0,0.60))+ylab("Country Name")+xlab("Mean Value of Pb (mg/Kg)")

Figure 4. Average Lead value (mg/kg) in quantified samples (above 0.001 mg/kg) and that used accredited laboratories, separated by the country of origin.

Yes, most of the average values have decreased and in particular China, which showed an over-quantification of Lead in non-accredited laboratories. While European laboratories reported much lower figures.

When evaluating data only from quantified samples and accredited laboratories, Europe and Canada have very similar averages.

4.3 Distribution according to matrix (FoodCategory)

We will now analyze the contents of each matrix (Variable “FoodCategory” in GEMS). Below are all the results quantified (considering the 6 main contributing countries) in Lead per mg/kg (x-axis) for each type of food (y-axis):

ggplot(GEMS2, aes(x=FoodCategory,y=ResultValue))+
    geom_boxplot(show.legend = FALSE)+geom_hline(yintercept=0.1, linetype="solid", color = "red")+
    scale_y_log10()+
    scale_x_discrete(breaks=unique(GEMS$FoodCategory),labels = function(x) str_wrap(str_replace_all(x, "gjgj" , " "), width = 80))+
    coord_flip()+theme(axis.text.x = element_text(angle = 90,size =10))+xlab("Food Category")+ylab("Pb (mg/Kg)")

Figure 5: Boxplot of Lead content in mg/Kg in food according to the matrix. The boxes represent 50% of the data, the horizontal bars the other 50%. Dots beyond the bars represent outliers.

The red line represents the limit of 0.1 mg mg/kg of Pb adopted as a standard in several legislation [2]. Only 4 matrices (food) have their median at or above the limit: Herbs and spices, stimulating drinks, animal feed and products for special use. NOTE: Scale of the Y axis is in log10 to facilitate the visualization of the data.

Figure 6: Same as Figure 4 but showing the passage of time between the years 1995 and 2020

During the evolution of the years, a change in the averages is likely to be caused by seasonality.

We will now visualize the distribution in Boxplot form of each food (categorical data “FoodCategory”) separated in the six main contributing countries:

ggplot(GEMS2, aes(ResultValue,FoodCategory,  color=CountryName))+
      geom_boxplot(show.legend = FALSE)+scale_x_log10()+theme(axis.text.y = element_blank())+geom_vline(xintercept=0.1, linetype="solid", color = "red")+facet_wrap(~CountryName)+ylab("Food Category")+xlab("Pb (mg/Kg)")

Figure 7: Boxplot of Lead content in mg/Kg in food according to the matrix (Y axis).

The red line represents the limit of 0.1mg/kg, routinely adopted as the limit for Lead in foods internationally [4].

There is a much larger number of outliers in the European Union than in the second largest contributor (Canada). These outliers include very high concentrations (even 100mg / Kg!) Causing a lever and “pulling” the total averages upwards.

4.4. LOQ behaviour

The LOQ is a important tool to assess the analysis

summary(GEMS2$LOQ)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00    0.01    0.02    0.06    0.05   66.67   45661 

So, from all 319302 samples, 25% have a LOQ below 0.01 mg/Kg, the median is on 0.02 mg/kg and 75% of samples have a LOQ bellow 0.05mg/kg. We found that 14% (45661) of samples didn’t report the LOQ.

Let’s graphically show this information using the ResulValue of Pb (mg/Kg) as the X variable.

ggplot(GEMS2,aes(x=ResultValue, fill=LOQ<=0.01))+
    geom_boxplot()+scale_x_log10()+geom_vline(xintercept=0.1, linetype="dashed", color = "red")+
  geom_vline(xintercept=1.21, linetype="dashed", color = "blue")

Boxplot from quantified samples according to LOQ, a value of 0.01mg/Kg were chosen since is 1/10 of the limit of 0.1mg/Kg. As seen, Lower detection limits have a lower mean value of quantified samples. Higher LOQ than 0.1mg/Kg of Pb have a mean quantified sample higher. Samples that didn’t report the LOQ have a mean quantified value between the two. This can be seen even more clearly on histograms:

ggplot(GEMS2,aes(x=ResultValue, fill=LOQ<=0.01))+
    geom_histogram()+scale_x_log10()+geom_vline(xintercept=0.1, linetype="dashed", color = "red")+
  geom_vline(xintercept=1.21, linetype="dashed", color = "blue")

It’s important to notice that only quantified samples appear on the graph above, and 56% of samples are not quantified.

prop.table(table(GEMS2$ResultValue==0.00000))

    FALSE      TRUE 
0.4366943 0.5633057 

But, are the quantified results with unreported LOQ equivalent to the ones with reported results?

First, the confirmation if the data is normal:

GEMS2res<-sample_n(GEMS2,5000)

shapiro.test(subset(GEMS2res$ResultValue,GEMS2res$LOQ>=0.01))

    Shapiro-Wilk normality test

data:  subset(GEMS2res$ResultValue, GEMS2res$LOQ >= 0.01)
W = 0.045942, p-value < 2.2e-16
shapiro.test(subset(GEMS2res$ResultValue,GEMS2res$LOQ<=0.01))

    Shapiro-Wilk normality test

data:  subset(GEMS2res$ResultValue, GEMS2res$LOQ <= 0.01)
W = 0.097989, p-value < 2.2e-16
shapiro.test(subset(GEMS2res$ResultValue,is.na(GEMS2res$LOQ)))

    Shapiro-Wilk normality test

data:  subset(GEMS2res$ResultValue, is.na(GEMS2res$LOQ))
W = 0.13748, p-value < 2.2e-16
GEMS2res<-GEMS2

All three sets of the ResultValue variable with LOQ lower than 0.01mg/Kg, higher than 0.01mg/Kg and non reported(NA) are normal.

Now, grouping each data according to the LOQ:

GEMS2res1 <- subset(GEMS2res,GEMS2res$LOQ>=0.01)
GEMS2res1$group <- "Higher than 0.01"
GEMS2res2 <- subset(GEMS2res,GEMS2res$LOQ<=0.01)
GEMS2res2$group <- "Lower than 0.01"
GEMS2res3 <- subset(GEMS2res,is.na(GEMS2res$LOQ))
GEMS2res3$group <- "NA"
GEMS2res_1<-rbind(GEMS2res1,GEMS2res2,GEMS2res3)

And performing ANOVA.

anova_GEMS2res_1 <- aov(ResultValue ~ group, GEMS2res_1)
summary(anova_GEMS2res_1)
PostHocTest(anova_GEMS2res_1, method = "hsd")

  Posthoc multiple comparisons of means : Tukey HSD 
    95% family-wise confidence level

$group
                                        diff      lwr.ci    upr.ci   pval    
Lower than 0.01-Higher than 0.01 -0.02801841 -0.16578416 0.1097473 0.8823    
NA-Higher than 0.01               0.26261832  0.08765773 0.4375789 0.0013 ** 
NA-Lower than 0.01                0.29063673  0.09536069 0.4859128 0.0014 ** 

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can note that there’s a different mean between the quantified samples with below and above the LOQ of 0.01mg/Kg of Pb but NOT between these groups and the unreported results.

Now, evaluating according to year of analysis:

ggplot(GEMS2res_1,aes(x=SamplingPeriodEnd,y=length(RowNum), fill=group))+
  geom_bar(position="stack", stat="identity")+ theme(axis.text.x = element_text(angle = 90),)+ggtitle("Number of samples analysed according to LOQ group")


GEMS2res_1 %>% group_by(SamplingPeriodEnd) %>% count(group,sort=TRUE) %>% ggplot()+geom_col(aes(SamplingPeriodEnd,n, fill=group))+ theme(axis.text.x = element_text(angle = 90),)+ggtitle("Number of samples analysed according to LOQ group")

After plotting the LOQ groups according to the Year we note that the unreported data probably come from years we this information wasn’t required.

We decided to keep the quantified data even if the LOQ wasn’t reported.

5. A deeper study of Lead levels in the European Union region

Since the WHO / Europe region has a large amount of data and many outliers/values above what is allowed, it was chosen to evaluate the matrices and their distribution over the years in more detail.

We will now examine the quantified samples (above 0.001mg/kg) in the “WHO European Region” in the form of a heatplot:

 subset(GEMS2, GEMS2$CountryName=="WHO European Region" & GEMS2$ResultValue) %>% 
  ggplot(aes(SamplingPeriodEndText,y=reorder(FoodCategory, ResultValue,function(x)+mean(x)),  fill = ResultValue))+   
  geom_tile(color = "white") +
  scale_fill_gradient(limits=c(0.001, 2),low = "blue", high ="red", trans="log10")+ 
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 90),)+ylab("FoodCategory")+
  scale_y_discrete(labels = function(x) str_wrap(str_replace_all(x, "gjgj" , " "), width = 60))+
  xlab("Year")+ylab("Food Category")

Figure 8. Heatplot from the 15-year period of analysis of Lead content in food at WHO/Europe. Red represents the content of 1mg / kg, blue 0.001mg/kg. It is noted that over the years samples of Herbs and spices, stimulating drinks and animal feed, continuously show higher results than other matrices. As already seen in the boxplot in figure 7.

The upper part of the Heatplot contains the foods with the highest average values over the years and the lowest, with the smallest.

We can now define the frequency of the highest 1%, 2% and 10% (in Pb concentration) in the total database:

quantile(subset(GEMS2$ResultValue,GEMS2$CountryName=="WHO European Region"),probs=(0.95),na.rm=TRUE)
  95% 
0.201 

95% of the quantified samples are below 0.2mg/Kg.

quantile(subset(GEMS2$ResultValue,GEMS2$CountryName=="WHO European Region"),probs=(0.98),na.rm=TRUE)
 98% 
0.58 

98% of the quantified samples are below 0.58mg/Kg.

quantile(subset(GEMS2$ResultValue,GEMS2$CountryName=="WHO European Region"),probs=(0.99),na.rm=TRUE)
 99% 
1.21 

99% of the quantified samples are below 1.21mg/Kg.

Observing these limits in the frequency histogram:

subset(GEMS3, GEMS3$CountryName=="WHO European Region" & GEMS3$ResultValue) %>%
  ggplot(aes(ResultValue, fill=FoodCategory))+
  geom_histogram( show.legend= FALSE)+
  geom_vline(xintercept=0.1, linetype="dashed", color = "red")+
  geom_vline(xintercept=1.21, linetype="dashed", color = "blue")+
  scale_x_log10()+xlab("Pb (mg/kg)")+ylab("Number of samples")

Figure 9: Frequency histogram of data referring to the European Union, x-axis on the logarithmic scale. The red line represents the value of 0.1mg/kg a very common limit allowed in the legislation and blue line, the range that covers 99% of the quantified samples (1.21mg/Kg).

Checking for normality.

shapiro.test(sample(subset(GEMS3$ResultValue, GEMS3$CountryName=="WHO European Region"),5000))

    Shapiro-Wilk normality test

data:  sample(subset(GEMS3$ResultValue, GEMS3$CountryName == "WHO European Region"),     5000)
W = 0.060034, p-value < 2.2e-16
t.test(subset(GEMS3$ResultValue, GEMS3$CountryName=="WHO European Region"))

    One Sample t-test

data:  subset(GEMS3$ResultValue, GEMS3$CountryName == "WHO European Region")
t = 4.2531, df = 97261, p-value = 2.11e-05
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.08983044 0.24338904
sample estimates:
mean of x 
0.1666097 

Although figure 9 shows the distribution of data with the logarithmic transformation, a normality test and a t test for a sample show that the set is normal and that all European samples have an average of 0.16mg / kg ± 0.08 mg / kg, 95 % confidence and p-value = 2.11e-05.

5.1 Dynamic relationship between analysis quality and sample origin

We now know that the vast majority of samples (60%) are below the limit of quantification and for simplicity is reported as 0.000. A minimum number is quantified and is below the limit of the legislation (0.1 mg/kg). The rest (40%) is quantified and 99% of the time it is below 1.22mg / Kg.

We will now try to visualize how this information relates to each other using the Sankey graph. This graph is similar to that of stacked bars, however, with lines that show the relationships between variables.

Let’s list the Origin of the samples (Domestic or imported), the quality control tool and the quantified lead content:

GEMS2res_1 %>%  group_by(group,AnalyticalQAName,FoodOriginName, ResultValue=ResultValue>=0.10) %>%  summarise(Samples=length(group)) -> Sankey4
ggplot(as.data.frame(Sankey4), aes(y = Samples, axis1 = group, axis2 = AnalyticalQAName, axis3 = FoodOriginName, axis4=ResultValue)) +
    geom_alluvium(aes(fill = group=="NA"), width = 1/6) +
  geom_stratum(width = 1/5, fill = "white", color = "grey50") +
  scale_x_continuous(breaks = 1:4, labels = c("LOQ", "Quality Control", "Sample source", "Above 0.10\ mg/Kg")) +
  geom_label(stat = "stratum", aes(label = after_stat(stratum))) +  theme(legend.position = "none")+ylab("Number of samples")

Figure 10: Sankey chart showing the relationship between the origin, the quality control technique used and the result in mg / Kg. It can be noted that either by origin, quality control or result there is no trend. It is also noted that between domestic and imported samples, they present the same ratio between quantified above 0.1 mg / kg and below the limit of quantification.

5.2 In search of matrices with higher levels

We will investigate only samples with higher levels of Lead: Herbs and spices, animal feed, stimulating drinks and Products for special nutritional use. First, how they relate to the time and amount of samples tested and quantified:

GEMS4=subset(GEMS2, GEMS2$CountryName=="WHO European Region" & GEMS2$FoodCategory=="ANIMAL FEED" | GEMS2$FoodCategory=="Stimulant beverages, dried and diluted excluding cocoa products" | GEMS2$FoodCategory=="Herbs, spices and condiments"  | GEMS2$FoodCategory=="Products for special nutritional use")
GEMS5=subset(GEMS2, GEMS2$CountryName=="WHO European Region")

GEMS4 %>% 
  group_by(SamplingPeriodEnd,FoodCategory) %>%
  count(more_than_1ppb=ResultValue<=0.001,sort=TRUE) %>%
  ggplot()+geom_col(aes(SamplingPeriodEnd,n,fill=more_than_1ppb))+ theme(axis.text.x = element_text(angle = 90),)+
  facet_wrap(~str_wrap(FoodCategory,width = 40))+ggtitle("Number of samples analysed according to matrix")



ggplot(GEMS4,aes(x=SamplingPeriodEnd,y=length(RowNum), fill=ResultValue<=0.001))+
  geom_bar(position="stack", stat="identity")+ theme(axis.text.x = element_text(angle = 90),)+
  facet_wrap(~str_wrap(FoodCategory,width = 40))+ggtitle("Number of samples analysed according to matrix")

Figure 11. Bar graphs grouping the 4 main classes with the highest average values of Lead in the period of 25 years. The number of quantified samples is show by the color difference (Quantified Red, Blue below the LOQ).


GEMS4 %>% 
  group_by(SamplingPeriodEnd,FoodCategory) %>%
  count(more_than_1ppb=ResultValue<=0.001,sort=TRUE) %>%
  ggplot()+geom_col(aes(SamplingPeriodEnd,n,fill=more_than_1ppb),position="fill")+ theme(axis.text.x = element_text(angle = 90),)+
  facet_wrap(~str_wrap(FoodCategory,width = 40))+ggtitle("Number of samples analysed according to matrix")


ggplot(GEMS4,aes(x=SamplingPeriodEnd,y=(RowNum), fill=ResultValue<=0.001))+
  geom_bar(position="fill", stat="identity")+ theme(axis.text.x = element_text(angle = 90),)+
  facet_wrap(~str_wrap(FoodCategory,width = 40))+ylab("ResultValue")+ggtitle("Ratio of quantified samples according to matrix")

Figure 12. Bar graphs grouping the 4 main classes with the highest average values of Lead in the period of 25 years. The values were standardized and each column corresponds to 100% of the values and the ratio between quantified and unquantified samples is demonstrated by the color difference (Quantified Red, Blue below the LOQ).

It is noted that samples of Herbs and spices are routinely quantified for a period of more than 10 years and even with variation in the number of samples, the quantified ratio is very similar along the years.

We know so far that these four food groups are often quantified. However, it is not clear whether some samples have very high levels and “drag” the average to higher mean values.

For this we will evaluate a boxplot with 99% of the data (to avoid outliers) and see where the quartiles are located:

ggplot(GEMS5, aes(x=FoodCategory,y=ResultValue))+
    geom_boxplot(show.legend = FALSE,alpha=1/10)+geom_hline(yintercept=0.1, linetype="solid", color = "red")+
        scale_x_discrete(breaks=unique(GEMS$FoodCategory),labels = function(x) str_wrap(str_replace_all(x, "gjgj" , " "), width = 80))+ylim(c(0,1.22))+
    coord_flip()+theme(axis.text.x = element_text(angle = 90,size =10))+xlab("Food Category")+ylab("Pb (mg/Kg)")+annotate("rect",ymin = 0,ymax=0.88, xmin = "Starchy roots and tubers",xmax="Sugar and confectionary (including cocoa products)", alpha=0.3, color="blue", fill="blue")

Figure 13. Boxplot with linear axis covering 99% of the samples in the range of 0.001 to 1.22mg/kg of Pb.

We can see that the worst case is that from stimulating drinks that include coffees and teas. The median is well within the limit of 0.1mg/kg and the third quartile (50% to 75%) stretches to more than 0.25mg/kg.

Through the cumulative distribution, we see that the problem is not outliers but a large amount of samples in different concentrations that is quantified in this class:

ggplot(GEMS5, aes(x=ResultValue, color=FoodCategory, label=FoodCategory),show.legend = FALSE)+
    stat_ecdf(geom = "line", aes(ResultValue),show.legend = FALSE) +xlab("Pb (mg/Kg)")+geom_vline(xintercept=0.1, linetype="solid", color = "red")+geom_vline(xintercept=1.22, linetype="solid", color = "blue")+xlim(c(0,1.22))+annotate("text", x = 0.52, y = 0.6, 
label = c("Stimulant beverages") , color="purple",size=5 )+annotate("segment", x = 0.30, xend = 0.25, y = 0.6, yend = 0.65, colour = "purple", size=1, alpha=0.9, arrow=arrow())+ylab("Cumulative distribution")

Figure 14. Graph with cumulative distribution of Lead content in food samples in the European Union. Red line represents the concentration of 0.1mg/kg of Pb and blue line 1.22mg/kg which is the content that comprises 99% of the samples.

We “investigated” whether within the class of stimulant drinks, which would be the worst matrix.

ggplot(subset(GEMS5,GEMS5$FoodCategory=="Stimulant beverages, dried and diluted excluding cocoa products") , aes(x=ResultValue, color=FoodName, label=FoodName))+
    stat_ecdf(geom = "line", aes(ResultValue)) +xlab("Pb (mg/Kg)")+geom_vline(xintercept=0.1, linetype="solid", color = "red")+geom_vline(xintercept=1.22, linetype="solid", color = "blue")+xlim(c(0,1.22))+ylab("Cumulative distribution")+ theme(legend.position = c(0.7, 0.3))

  
  #theme(  plot.background = element_rect(fill = "lightcyan2"), panel.background= element_rect(fill = "lightcyan2"))

Figure 15. Graph with cumulative distribution of Lead content in food samples in the European Union within the category “Stimulant drinks”. Red line represents the concentration of 0.1mg/kg of Pb and blue line 1.22mg/kg which is the content that includes 99% of the samples.

The worst case is the tea infusions, where more than 50% of the samples had levels above 0.25mg/kg of Lead or 2.5x more than the limit.

When filtering for “Tea (infusion)” only we can see it have the highest distribution of all.

95% of samples are below 2.48mg/kg. Ten times higher than the mean values for all other samples in EU Union.

5.3 Looking back to the worst case

Since we found the TEA as the sample with higher constant level food contaminated with Lead. Let’s visualize how this sample behaves with these 4 variable selected: Result, Quality tool, Origin and LOQ:

GEMS2res_1 %>% filter(CountryName=="WHO European Region") %>%
  filter(FoodName=="Tea (Infusion)") %>%
  group_by(ResultValue=ResultValue>=0.1,AnalyticalQAName,FoodOriginName,group) %>%
  summarise(Samples=length(AnalyticalQAName)) -> Sankey4
ggplot(as.data.frame(Sankey4), aes(y = Samples, axis1 = ResultValue, axis2 = AnalyticalQAName, axis3 = FoodOriginName, axis4=group)) +
  geom_alluvium(aes(fill = ResultValue), width = 1/6)+
  scale_x_discrete(expand = c(0.1,0.1))+
  geom_stratum(width = 1/5, fill = "white", color = "grey50") +
  scale_x_continuous(breaks = 1:4, labels = c( "Above 0.10\ mg/Kg?", "Quality Control", "Sample source","LOQ"))+
  geom_label(stat = "stratum", aes(label = after_stat(stratum))) +  theme(legend.position = "none")+ylab("Number of samples") 

Figure 15. Alluvial plot from Tea Samples in EU region. The vast majority of samples (more than 75%) are higher than 0.10mg/Kg and didn’t report the LOQ. Since most of non reported LOQ data are from old data, it’s interesting check if it all comes from previous years.

GEMS2res_1 %>% filter(CountryName=="WHO European Region") %>%
  filter(FoodName=="Tea (Infusion)") %>%
  group_by(ResultValue=ResultValue>=0.1,AnalyticalQAName,FoodOriginName, group, SamplingPeriodEnd) %>%
  summarise(Samples=length(group)) -> Sankey4
ggplot(as.data.frame(Sankey4), aes(y = Samples, axis1 = ResultValue, axis2 = AnalyticalQAName, axis3 = FoodOriginName, axis4=group, axis5=SamplingPeriodEnd)) +
  geom_alluvium(aes(fill = ResultValue), width = 1/5) +scale_x_discrete(expand = c(.1, .1))+
  geom_stratum(width = 1/5, fill = "white", color = "grey50") +
  scale_x_continuous(breaks = 1:5, labels = c("Above 0.10\ mg/Kg?", "Quality Control", "Sample source","LOQ" , "Year")) +
  geom_label(stat = "stratum", aes(label = after_stat(stratum))) +  theme(legend.position = "none")+ylab("Number of samples") 

Figure 16. Same as figure 15 except with year included(as SamplingPeriodEnd). No, we can not say that “Old samples (pre-2011) used to have more Lead”. To confirm we can normalize the content of ResultValue variable and plot it along the years.

GEMS2res_1 %>% filter(CountryName=="WHO European Region") %>%
  filter(FoodName=="Tea (Infusion)") %>%
ggplot(aes(x=SamplingPeriodEnd,y=(RowNum), fill=ResultValue>=0.1))+
  geom_bar(position="fill", stat="identity")+ theme(axis.text.x = element_text(angle = 90),)+
  facet_wrap(~str_wrap(FoodName,width = 40))+ylab("ResultValue")+ggtitle("Ratio of quantified samples according to matrix")

Figure 17. Stacked bar graph showing normalized number of samples above 0.1mg/Kg of Lead per year. No, the proportion of samples above 0.1 are not decreasing with time. To be sure, lets analyze using boxplot and linear regression, with the median of Result as the central line in each box.

GEMS2res_1 %>% filter(CountryName=="WHO European Region") %>% group_by(SamplingPeriodEnd) %>%
  filter(FoodName=="Tea (Infusion)") %>%
ggplot(aes(x=SamplingPeriodEnd,y=ResultValue))+scale_y_log10()+
  geom_boxplot(na.rm=TRUE)+ theme(axis.text.x = element_text(angle = 90),)+ylab("ResultValue")+geom_smooth(method = "lm", se=FALSE, color="blue", aes(SamplingPeriodEnd, ResultValue))+geom_smooth(method = "lm", color="blue",aes(group=1), na.rm=TRUE)

Figure 18. Boxplot from Tea samples in Eu Region by Year. No trend, mean content of Lead in food is stable for the period of 10 years.

6. Conclusion

Most foods with high lead values have a very limited intake: “ANIMAL FEED” (obviously), “Stimulant beverages” (occasional use), “Herbs, spices and condiments” (only as ingredients and in small quantities). Only “Products for special nutritional use” are critical as they may result in a higher intake for longer periods.

There are no indications that imported samples (to the European Union) are quantified more frequently than domestic ones, indicating contamination at source.

In the European Union (the region with more data available), 91% of all samples are below the 0,1mg/kg international limit and 99% are below 1.21mg/kg.

Stimulant Beverages, specially tea, present a high continuous source of Lead in the diet. Just to address how much, 50% of tea infusion samples are higher than 0,2mg/kg while 95% of other samples fall on this range.

In the 10 years with most data from European Union and the the most studies sample (Tea (Infusion)), no difference was noted in the mean value of contamination.

Some technical questions remain: were the tea samples (Stimulant beverages) digested from the dry mass? Were the samples identified as “infusion” analyzed as “ready for consumption”, ie direct reading? What is the real possibility of misplacement of units in the results such as ug/kg reported as mg/kg? Should they be treated and investigated as outliers or as transcription errors?

7. References:

https://www.fda.gov/food/metals-and-your-food/lead-food-foodwares-and-dietary-supplements

Other…

Tea investigation

How many tea samples we have under Tea (Infusion)

GEMS5 %>% filter(FoodCategory=="Stimulant beverages, dried and diluted excluding cocoa products") %>% group_by(LocalFoodName) %>% summarise(total=length(LocalFoodName)) %>% ggplot()+geom_col(aes(total,reorder(LocalFoodName,total,max),fill=LocalFoodName,),show.legend = FALSE)
`summarise()` ungrouping output (override with `.groups` argument)
Warning messages:
1: In doTryCatch(return(expr), name, parentenv, handler) :
  display list redraw incomplete
2: In doTryCatch(return(expr), name, parentenv, handler) :
  invalid graphics state
3: In doTryCatch(return(expr), name, parentenv, handler) :
  invalid graphics state

Diferença médias: concentração Doméstico e Importada

Primeiro, vamos separar os dados de chá analizados na União Europáia:

GEMS_EU<-subset(GEMS2, GEMS2$CountryName=="WHO European Region" & FoodCategory=="Stimulant beverages, dried and diluted excluding cocoa products")

GEMS_EU %>% group_by(FoodOriginName, Quantificadas=ResultValue>=0.001) %>% summarise(Numero=length(ResultValue)) %>% ggplot() +geom_col(aes(FoodOriginName,Numero, fill=Quantificadas))
`summarise()` regrouping output by 'FoodOriginName' (override with `.groups` argument)

Temos a proporções de quantificadas entre amostras domesticas mais alta seguida pelas importadas e desconhecidas.

Abaixo podemos ver esses valores

GEMS_EU %>% group_by(FoodOriginName, Quantificadas=ResultValue>=0.001) %>% summarise(Numero=length(ResultValue))
`summarise()` regrouping output by 'FoodOriginName' (override with `.groups` argument)

Apesar de uma maior frequencia de quantificação, os valores médios são consideravelmente próximos, mediana e o quartil de 95% levemente acima para amostras domesticas:


GEMS_EU %>% group_by(FoodOriginName) %>% summarise(ResultValue=mean(ResultValue))

GEMS_EU %>% group_by(FoodOriginName) %>% summarise(ResultValue=median(ResultValue))

GEMS_EU %>% group_by(FoodOriginName) %>% summarise(ResultValue=quantile(ResultValue,0.95))

GEMS_EU %>% filter(ResultValue>0.0001) %>%  group_by(FoodOriginName) %>% ggplot()+ geom_boxplot(aes(FoodOriginName,ResultValue))+ylim(c(-0.001,10))+scale_y_log10()

GEMS_EU %>% filter(ResultValue>0.0001) %>%  group_by(FoodOriginName) %>% summarise(ResultValue=mean(ResultValue))

Não há diferença entre as infusões e o sólido?

GEMS_EU %>% filter(FoodCategory=="Stimulant beverages, dried and diluted excluding cocoa products") %>% ggplot()+ geom_boxplot(aes(ResultValue,FoodName))+scale_x_log10()

Média ResultValue, separando por país:

GEMS1 %>% filter(FoodName=="TEAS AND HERBAL TEAS (SOLID)") %>% group_by(CountryName) %>% summarise(ResultValue=mean(ResultValue))
`summarise()` ungrouping output (override with `.groups` argument)
options(scipen=10000)
GEMS1 %>% filter(FoodCategory=="Stimulant beverages, dried and diluted excluding cocoa products") %>% group_by(CountryName)%>%ggplot()+geom_boxplot(aes(ResultValue,CountryName))+scale_x_log10()+facet_wrap(~FoodName,scales="free")

The infusions appear to be expressed in dry mass, or at least converted after extraction on some countries like Singapure and EU. Data on USA, France, Hong Kong, Australia and New Zealand cluster close together on a range that is 100 times lower.

Let’s now investigate what would be the single most high level of lead in the GEMS database.

Worst Food Name (mean value):

  GEMS1 %>%
  group_by(FoodName) %>%
  summarise(ResultValue= mean(ResultValue,na.rm=TRUE)) %>% 
  arrange(desc(ResultValue)) %>% 
  top_n(15) %>% 
  ggplot()+ geom_col(aes(ResultValue,reorder(FoodName,ResultValue,max),fill=FoodName),show.legend = FALSE)
`summarise()` ungrouping output (override with `.groups` argument)
Selecting by ResultValue

Worst Local Food Name:

  GEMS1 %>%
  group_by(LocalFoodName) %>%
  summarise(ResultValue= mean(ResultValue,na.rm=TRUE)) %>% 
  arrange(desc(ResultValue)) %>% 
  top_n(25) %>% 
    ggplot()+ geom_col(aes(ResultValue,reorder(LocalFoodName,ResultValue,mean),fill=LocalFoodName),show.legend=FALSE)+scale_x_log10()
`summarise()` ungrouping output (override with `.groups` argument)
Selecting by ResultValue

2 Teas presentations appear as higher levels on FoodName, however data can be skewed due to lower samples and more presence of outliers.

When considering the final form on LocalFoodName, Tea forms appear 8 times.

Filtering samples with more than 5000 analyses performed. Summarizing with mean, sd, n, RSD, 95% percentil:

GEMS1 %>%
  group_by(FoodName) %>%
  summarise(mean= mean(ResultValue,na.rm=TRUE),
            median= median(ResultValue,na.rm=TRUE),
            n=n(),
            sd=sd(ResultValue,na.rm=TRUE),
            RSD=sd(ResultValue,na.rm=TRUE)*100/mean(ResultValue,na.rm=TRUE),
            percentil_95=quantile(ResultValue,0.95,na.rm=TRUE),
            ratio_ND=prop.table(table(ResultText=="ND"))[2]) %>%
  filter(n>=100)%>%
  arrange(desc(median))-> worst_case
`summarise()` ungrouping output (override with `.groups` argument)
kable(worst_case)
FoodName mean median n sd RSD percentil_95 ratio_ND
Tea, green, black (black, fermented and dried) 0.6850643 0.3300 5553 4.5257447 660.6307 2.200000 0.0628489
TEAS AND HERBAL TEAS (SOLID) 1.0469433 0.3100 5354 15.6918662 1498.8267 2.190000 0.1016063
Feed 1.2395738 0.0700 9349 22.3002025 1799.0218 2.186000 0.3076265
Mushrooms 0.0963046 0.0130 10005 0.3352460 348.1101 0.420000 0.3828086
Wine 0.0110209 0.0060 11079 0.0191751 173.9891 0.035000 0.2176189
Meat and meat products NES 0.7130358 0.0038 9764 37.5743209 5269.6262 0.270000 0.4909873
Cattle meat 0.0252862 0.0000 6686 1.2131144 4797.5382 0.034000 0.8232127
Cattle milk 0.0058318 0.0000 7861 0.0171696 294.4150 0.032650 0.7013103
Cereals and cereal-based products NES 0.0121510 0.0000 5193 0.0305157 251.1373 0.060000 0.5875217
FISHES 0.0115102 0.0000 10260 0.0533181 463.2267 0.060000 0.7422027
Fruit juice 0.0074039 0.0000 5543 0.0270615 365.5047 0.030000 0.5933610
Pig liver 0.0303456 0.0000 7610 0.1361239 448.5790 0.145000 0.5328515
Pig meat 0.0037852 0.0000 7018 0.0175946 464.8234 0.022000 0.8412653
Rice 0.0339760 0.0000 5162 0.1038827 305.7531 0.160850 0.5951182
Pig kidney 0.0274744 0.0000 9910 0.1514991 551.4194 0.120808 0.5821393

Worst food boxplot:

  GEMS1 %>%
  filter(FoodName %in% worst_case$FoodName)%>%
  group_by(FoodName) %>% 
   ggplot()+
    geom_boxplot(aes(ResultValue,FoodName))+
    geom_point(data=worst_case,aes(percentil_95,FoodName),color="red")+
    geom_point(data=worst_case,aes(mean,FoodName),color="blue")+xlim(-0.1,1)+scale_x_log10()
Scale for 'x' is already present. Adding another scale for 'x', which will
replace the existing scale.

Same situation, although Feed also appear to show high lead contamination and have the same 95% quantile (above 2mg/kg). The median is much low due to higher presence of ND samples (about 30%)

---
title: '(IN PROGRESS!!)25 years of Lead in food:'
author: "Fabiano Silva (drhtcl@gmail.com)"
output:
  html_notebook:
    toc: yes
  pdf_document:
    toc: yes
  html_document:
    toc_float: yes
    df_print: paged
subtitle: A study from the world distribution on Lead contamination on food using
  the GEMSFOODS(WHO) database during 1995-2020
---
```{css include=FALSE}
.blueCode {
background-color: #e6f0ff
}
```

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, collapse=TRUE, fig.align='center', class.source="bg-success", error = TRUE)
```



# 1. Introduction

Lead (Pb) is the second most toxic metal after arsenic (As), comprising 0.002% of the earth's crust.

Lead, like other heavy metals, does not biodegrade or disappear from the environment over time. Although lead levels in the food supply declined dramatically between the 1970s and 1990s, low levels of lead continue to be detected in some foods due to the continued presence in the environment. Lead from the soil can be absorbed by plants grown to obtain fruit or vegetables or plants used as ingredients in food or animal feed.

Lead is poisonous to humans and can affect people of any age or health. Lead is especially harmful to vulnerable populations, including babies, young children, pregnant women and their fetuses and others with chronic health conditions. High levels of lead exposure can seriously damage the health and development of children, specifically the brain and nervous system. The neurological effects of high levels of lead exposure during early childhood include learning disabilities, behavioral difficulties and reduced IQ. Because lead can accumulate in the body, even low-level chronic exposure can be dangerous over time.

It is estimated that exposure to lead was responsible, in 2004, for 143 thousand deaths
and 0.6% of the global disease burden, taking into account mild mental under development and cardiovascular outcomes

# 1.1 Purpose

This work focus on:

1- investigate the content of Lead in foods using the R language as a technique for filtering, analyzing and modeling distributions according to the varieties of food consumed, countries, analytical conditions and dates.

2- Find out where and which foods are most contaminated and create possible hypotheses of cause.

# 2. Initial data and database ingestion:

The information was taken from the "GEMSFOODS" database of the World Health Organization (WHO) available at:

https://extranet.who.int/gemsfood/


All calculations and graphs were made in ** Rstudio version 1.3.959 ** and ** R base 3.6.3 ** and using the following libraries:

```{r message=FALSE, warning=FALSE,class.source="bg-success"}
library(readr)
library(ggplot2)
library(tidyverse)
library(knitr)
library(RColorBrewer)
library(ggalluvial)
library(treemapify)
library(gganimate)
library(ggrepel)
library(DescTools)
library(ggthemes)
```

The codes used to perform that work will be shown. All processing is based on more modern standards in reproducibility [ref] in research where individual data is not manipulated and the entire process (from database recognition to the final report) is done in an automated way and can be audited by the code used.

We started by consolidating the database, the files were downloaded in groups to facilitate the transfering:

```{r  message=FALSE, class.source="bg-success"}
setwd("/mnt/DADOS/Data Science/GEMFoods")
GEMS2000 <- read.csv("/mnt/DADOS/Data Science/GEMFoods/Gemsfood(1980-2000).csv", dec=",")
GEMS2010 <- read.csv("/mnt/DADOS/Data Science/GEMFoods/Gemsfood(2000-2010).csv", dec=",")
GEMS2014 <- read.csv("/mnt/DADOS/Data Science/GEMFoods/Gemsfood(2010-2014).csv", dec=",")
GEMS2016 <- read.csv("/mnt/DADOS/Data Science/GEMFoods/Gemsfood(2014-2016).csv", dec=",")
GEMS2018 <- read.csv("/mnt/DADOS/Data Science/GEMFoods/Gemsfood(2016-2018).csv", dec=",")
GEMS2020 <- read.csv("/mnt/DADOS/Data Science/GEMFoods/Gemsfood(2018-2020).csv", dec=",")
GEMS <- rbind(GEMS2000,GEMS2010,GEMS2014,GEMS2016,GEMS2018,GEMS2020)
```


## 3. Main variables:

Checking the information available:

```{r, collapse=TRUE}
names(GEMS)
```

The data used will be: 
"CountryName" the country of origin of the analytical data.
"FoodCategory" The type of food, that is, the analytical matrix such as "Vegetables" or "Animal Feed".
"FoodOriginName", which describes the origin as Domestic or imported.
"LOQ" the limit of quantification.
"ResultValue", the value obtained in the analysis when quantified.
"AnalyticalQAName", which describes the conditions of the analysis, if made in an accredited laboratory or with use of internal quality control only.
"FoodName" which is the exact description of the food as "Potato" or "Rice".

Before, the evaluation of the units of the Lead content should be made:


```{r, collapse=TRUE, }
summary(GEMS$UnitName)
```


When quantified, the result was reported in two different units, mg/kg and ug/kg.

All units will be changed to mg/kg and the values adjusted for the same unit, where 1ug/kg is equivalent to 0.001mg/kg:

```{r collapse=TRUE, eval=FALSE}
Sys.time()
GEMSx <-GEMS
  GEMS1_14 <- GEMS[,14]
  GEMS1_10 <- GEMS[,10]
  GEMS1_12 <- GEMS[,12]
 for(i in 1:nrow(GEMS)){
  if(GEMS1_12[i]=="ug/kg"){
    GEMS1_14[i] <- GEMS1_14[i]/1000
    GEMS1_10[i] <- GEMS1_10[i]/1000
    GEMS1_12[i] <- "mg/kg"
  } else {
NULL
  }
 }
GEMSx[,14]<- GEMS1_14
GEMSx[,10]<- GEMS1_10
GEMSx[,12]<- GEMS1_12

Sys.time()
```

NOTE: Due to long processing time, the code chunk above was previously executed and saved as a RDS file. Which will be read and stored in a new object(GEMS1). 

```{r}
GEMS1 <- readRDS("/mnt/DADOS/Data Science/GEMFoods/GEMS1.rds")
```

Confirming the change:

```{r, collapse=TRUE}
summary(GEMS1$UnitName)
```



The Limit of Quantification has also been changed (Column 14)



# 4. Exploratory analysis and search for the biggest contaminations:

## 4.1 Distribution of the data source

Let's see in a practical way which countries contributed the most to the WHO database using the CountryName variable and how many times a particular country appears:


```{r fig.align='center', out.width='130%'}
GEMS1 %>% count(CountryName,sort = TRUE) %>% top_n(20) %>% ggplot()+geom_col(aes(n,reorder(CountryName,n,max),fill=CountryName),show.legend = FALSE)+theme_economist()+xlab("Country Name")+ylab("Number of samples")

```
Figure 1. Number of samples per database contributor in the period from 1995 to 2020.
 
"WHO European Region", "China", "Canada", "United States of America", "Brazil" and "Thailand" are the main contributors. WHO/Europe have 54% of contributions and Canada in second with 13%.

From now on, a sample will be used containing only the 6 countries with the most contributions (GEMS2)

```{r include=FALSE}
GEMS2=subset(GEMS1,GEMS1$CountryName=="WHO European Region"| GEMS1$CountryName=="Canada"|GEMS1$CountryName=="China" |GEMS1$CountryName=="United States of America" | GEMS1$CountryName=="Brazil"| GEMS1$CountryName=="Thailand")
```

To look for any possible change in the quality of the tests according to the time, we will convert all dates to year.

```{r,collapse=TRUE}
GEMS2$SamplingPeriodEnd <- as.POSIXct(GEMS2$SamplingPeriodEnd, format = "%m/%d/%Y %H:%M:%S")
GEMS2$SamplingPeriodEnd <- format(GEMS2$SamplingPeriodEnd, format="%Y")
```

## 4.2 Analytical Quality of Tests

Evaluating the available information on which tools used for the quality of the tests, the variable "AnalyticalQAName" will be used:


```{r, collapse=TRUE,fig.align='center'}
ggplot(GEMS)+geom_bar(aes(reorder(AnalyticalQAName,AnalyticalQAName,function(x)+length(x))
                        ,fill=AnalyticalQAName), show.legend= FALSE, stat = "count")+
  coord_flip()+
  xlab("AnalyticalQAName")+xlab("Quality control Tool")+ylab("Number of samples")
```
Figure 2. Type of analytical quality control tool used in relation to the number of samples in the database.

We see that the equivalent of approximately 48% (175,133) of the tests were carried out in accredited laboratories and 16% (51139) had some internal quality control or proficiency test.


We now we can investigate whether the origin of the analytical data (whether or not to use an accredited laboratory) interferes with the mean/spread within the data for each country. First, we calculate the average among all samples but taking into account all quantified results, first a summary of the total data:

total average:

```{r message=FALSE, warning=FALSE, collapse=TRUE, EVAL=TRUE}
GEMS2 %>% group_by(CountryName) %>%  summarise(mean=mean(ResultValue,na.rm=TRUE), n=n()) %>% kable()
```
Plotting:

```{r fig.align='center', message=FALSE, warning=FALSE, collapse=TRUE, EVAL=TRUE}
GEMS2 %>% 
  group_by(CountryName) %>%  
  summarise(mean=mean(ResultValue,na.rm=TRUE), n=n())%>%
  arrange(desc(n),na.rm=TRUE)%>% 
  ggplot(aes(mean,CountryName ,fill=CountryName))+
  geom_col(show.legend = FALSE)+
  geom_text(aes(label = format(round(mean, 3), nsmall = 2)), vjust = 0.5,hjust=0)+xlim(c(0,0.25))+ylab("Country Name")+xlab("Mean Value of Pb (mg/Kg)")
```

Figure 3. Average Lead value (mg/kg) in samples according to the country of origin.

Average using only quantified samples (we adopt values only above 0.001mg/Kg):

```{r EVAL=FALSE, message=FALSE, warning=FALSE, collapse=TRUE}
GEMS2 %>% group_by(CountryName) %>%  summarise(mean=mean(ResultValue>=0.001,na.rm=TRUE), n=n())%>%arrange(desc(n),na.rm=TRUE)
```

In the form of a bar graph now:


```{r EVAL=TRUE, message=FALSE, warning=FALSE, collapse=TRUE, fig.align='center'}
GEMS2 %>% 
  group_by(CountryName) %>%  
  summarise(mean=mean(ResultValue>=0.001,na.rm=TRUE), n=n())%>%
  arrange(desc(n),na.rm=TRUE)%>% 
  ggplot(aes(mean,CountryName ,fill=CountryName))+
  geom_col(show.legend = FALSE)+
  geom_text(aes(label = format(round(mean, 3), nsmall = 2)), vjust = 0.5,hjust=0)+xlim(c(0,0.70))+ylab("Country Name")+xlab("Mean Value of Pb (mg/Kg)")
```

Figure 4. Average lead value (mg/kg) in quantified samples (above 0.001 mg/kg) according to the country of origin.

Initially, European values appear to be higher, specifically 4x greater than Brazilians, 10x greater than Canadians and 50x greater than those of the USA. However, when only the quantified values are taken into account, this average decreases considerably. One theory would be that countries such as the USA and Canada would have a larger number of undetected samples and thus the average would be reduced.

To check if the source of the difference may be the use of non-accredited laboratories, we will exclude this data in the "GEMS3" dataframe

```{r, collapse=TRUE}
GEMS3=subset(GEMS2,GEMS2$AnalyticalQAName=="Officially accredited")
```

Now repeating the test to see if there was a change in the averages:


```{r message=FALSE, warning=FALSE, collapse=TRUE, EVAL=TRUE,fig.align='center'}
GEMS3 %>% group_by(CountryName) %>%  summarise(mean=mean(ResultValue>=0.001,na.rm=TRUE), n=n())%>%arrange(desc(n),na.rm=TRUE)%>% ggplot(aes(mean,CountryName ,fill=CountryName))+geom_col(show.legend = FALSE)+geom_text(aes(label = format(round(mean, 3), nsmall = 2)), vjust = 0.5,hjust=0)+xlim(c(0,0.60))+ylab("Country Name")+xlab("Mean Value of Pb (mg/Kg)")
```

Figure 4. Average Lead value (mg/kg) in quantified samples (above 0.001 mg/kg) and that used accredited laboratories, separated by the country of origin.

Yes, most of the average values have decreased and in particular China, which showed an over-quantification of Lead in non-accredited laboratories. While European laboratories reported much lower figures.

When evaluating data only from quantified samples and accredited laboratories, Europe and Canada have very similar averages.


## 4.3 Distribution according to matrix (FoodCategory)

We will now analyze the contents of each matrix (Variable "FoodCategory" in GEMS).
Below are all the results quantified (considering the 6 main contributing countries) in Lead per mg/kg (x-axis) for each type of food (y-axis):


```{r fig.align='center', message=FALSE, warning=FALSE, collapse=TRUE}
ggplot(GEMS2, aes(x=FoodCategory,y=ResultValue))+
    geom_boxplot(show.legend = FALSE)+geom_hline(yintercept=0.1, linetype="solid", color = "red")+
    scale_y_log10()+
    scale_x_discrete(breaks=unique(GEMS$FoodCategory),labels = function(x) str_wrap(str_replace_all(x, "gjgj" , " "), width = 80))+
    coord_flip()+theme(axis.text.x = element_text(angle = 90,size =10))+xlab("Food Category")+ylab("Pb (mg/Kg)")
```



Figure 5: Boxplot of Lead content in mg/Kg in food according to the matrix. The boxes represent 50% of the data, the horizontal bars the other 50%. Dots beyond the bars represent outliers.

The red line represents the limit of 0.1 mg mg/kg of Pb adopted as a standard in several legislation [2]. Only 4 matrices (food) have their median at or above the limit: Herbs and spices, stimulating drinks, animal feed and products for special use. NOTE: Scale of the Y axis is in log10 to facilitate the visualization of the data.

![Figure 6: Same as Figure 4 but showing the passage of time between the years 1995 and 2020](/mnt/DADOS/Data Science/boxplot.gif){fig.align='center'}


During the evolution of the years, a change in the averages is likely to be caused by seasonality.


We will now visualize the distribution in Boxplot form of each food (categorical data "FoodCategory") separated in the six main contributing countries:


```{r message=FALSE, warning=FALSE, collapse=TRUE,fig.align='center'}
ggplot(GEMS2, aes(ResultValue,FoodCategory,  color=CountryName))+
      geom_boxplot(show.legend = FALSE)+scale_x_log10()+theme(axis.text.y = element_blank())+geom_vline(xintercept=0.1, linetype="solid", color = "red")+facet_wrap(~CountryName)+ylab("Food Category")+xlab("Pb (mg/Kg)")
```
Figure 7: Boxplot of Lead content in mg/Kg in food according to the matrix (Y axis).

The red line represents the limit of 0.1mg/kg, routinely adopted as the limit for Lead in foods internationally [4].

There is a much larger number of outliers in the European Union than in the second largest contributor (Canada). These outliers include very high concentrations (even 100mg / Kg!) Causing a lever and "pulling" the total averages upwards.

# 4.4. LOQ behaviour

The LOQ is a important tool to assess the analysis 

```{r EVAL=FALSE, message=FALSE, warning=FALSE, collapse=TRUE}
summary(GEMS2$LOQ)
```
So, from all 319302 samples, 25%  have a LOQ below 0.01 mg/Kg, the median is on 0.02 mg/kg and 75% of samples have a LOQ bellow 0.05mg/kg. We found that 14% (45661) of samples didn't report the LOQ.

Let's graphically show this information using the ResulValue of Pb (mg/Kg) as the X variable.

```{r fig.align='center', message=FALSE, warning=FALSE, collapse=TRUE}
ggplot(GEMS2,aes(x=ResultValue, fill=LOQ<=0.01))+
    geom_boxplot()+scale_x_log10()+geom_vline(xintercept=0.1, linetype="dashed", color = "red")+
  geom_vline(xintercept=1.21, linetype="dashed", color = "blue")
```
Boxplot from quantified samples according to LOQ, a value of 0.01mg/Kg were chosen since is 1/10 of the limit of 0.1mg/Kg.
As seen, Lower detection limits have a lower mean value of quantified samples. Higher LOQ than 0.1mg/Kg of Pb have a mean quantified sample higher. Samples that didn't report the LOQ have a mean quantified value between the two. This can be seen even more clearly on histograms:

```{r fig.align='center', message=FALSE, warning=FALSE, collapse=TRUE}
ggplot(GEMS2,aes(x=ResultValue, fill=LOQ<=0.01))+
    geom_histogram()+scale_x_log10()+geom_vline(xintercept=0.1, linetype="dashed", color = "red")+
  geom_vline(xintercept=1.21, linetype="dashed", color = "blue")
```

It's important to notice that only quantified samples appear on the graph above, and 56% of samples are not quantified.
```{r EVAL=FALSE, message=FALSE, warning=FALSE, collapse=TRUE}
prop.table(table(GEMS2$ResultValue==0.00000))
```

But, are the quantified results with unreported LOQ equivalent to the ones with reported results?

First, the confirmation if the data is normal:

```{r, collapse=TRUE, class.source="bg-success"}
GEMS2res<-sample_n(GEMS2,5000)

shapiro.test(subset(GEMS2res$ResultValue,GEMS2res$LOQ>=0.01))
shapiro.test(subset(GEMS2res$ResultValue,GEMS2res$LOQ<=0.01))
shapiro.test(subset(GEMS2res$ResultValue,is.na(GEMS2res$LOQ)))
GEMS2res<-GEMS2

```
All three sets of the ResultValue variable with LOQ lower than 0.01mg/Kg, higher than 0.01mg/Kg and non reported(NA) are normal. 

Now, grouping each data according to the LOQ:

```{r, , collapse=TRUE, class.source="bg-success"}
GEMS2res1 <- subset(GEMS2res,GEMS2res$LOQ>=0.01)
GEMS2res1$group <- "Higher than 0.01"
GEMS2res2 <- subset(GEMS2res,GEMS2res$LOQ<=0.01)
GEMS2res2$group <- "Lower than 0.01"
GEMS2res3 <- subset(GEMS2res,is.na(GEMS2res$LOQ))
GEMS2res3$group <- "NA"
GEMS2res_1<-rbind(GEMS2res1,GEMS2res2,GEMS2res3)
```

And performing ANOVA.

```{r, , collapse=TRUE}
anova_GEMS2res_1 <- aov(ResultValue ~ group, GEMS2res_1)
summary(anova_GEMS2res_1)
```

```{r, , collapse=TRUE}
PostHocTest(anova_GEMS2res_1, method = "hsd")
```
We can note that there's a different mean between the quantified samples with below and above the LOQ of 0.01mg/Kg of Pb but NOT between these groups and the unreported results.

Now, evaluating according to year of analysis:

```{r,fig.align='center'}
ggplot(GEMS2res_1,aes(x=SamplingPeriodEnd,y=length(RowNum), fill=group))+
  geom_bar(position="stack", stat="identity")+ theme(axis.text.x = element_text(angle = 90),)+ggtitle("Number of samples analysed according to LOQ group")

GEMS2res_1 %>% group_by(SamplingPeriodEnd) %>% count(group,sort=TRUE) %>% ggplot()+geom_col(aes(SamplingPeriodEnd,n, fill=group))+ theme(axis.text.x = element_text(angle = 90),)+ggtitle("Number of samples analysed according to LOQ group")

```

After plotting the LOQ groups according to the Year we note that the unreported data probably come from years we this information wasn't required. 

We decided to keep the quantified data even if the LOQ wasn't reported.

# 5. A deeper study of Lead levels in the European Union region


Since the WHO / Europe region has a large amount of data and many outliers/values above what is allowed, it was chosen to evaluate the matrices and their distribution over the years in more detail.

We will now examine the quantified samples (above 0.001mg/kg) in the "WHO European Region" in the form of a heatplot:


```{r fig.align='center', message=FALSE, warning=FALSE}
 subset(GEMS2, GEMS2$CountryName=="WHO European Region" & GEMS2$ResultValue) %>% 
  ggplot(aes(SamplingPeriodEndText,y=reorder(FoodCategory, ResultValue,function(x)+mean(x)),  fill = ResultValue))+   
  geom_tile(color = "white") +
  scale_fill_gradient(limits=c(0.001, 2),low = "blue", high ="red", trans="log10")+ 
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 90),)+ylab("FoodCategory")+
  scale_y_discrete(labels = function(x) str_wrap(str_replace_all(x, "gjgj" , " "), width = 60))+
  xlab("Year")+ylab("Food Category")
```

Figure 8. Heatplot from the 15-year period of analysis of Lead content in food at WHO/Europe. Red represents the content of 1mg / kg, blue 0.001mg/kg. It is noted that over the years samples of Herbs and spices, stimulating drinks and animal feed, continuously show higher results than other matrices. As already seen in the boxplot in figure 7.

The upper part of the Heatplot contains the foods with the highest average values over the years and the lowest, with the smallest.

We can now define the frequency of the highest 1%, 2% and 10% (in Pb concentration) in the total database:

```{r,collapse=TRUE}
quantile(subset(GEMS2$ResultValue,GEMS2$CountryName=="WHO European Region"),probs=(0.95),na.rm=TRUE)

```

   
95% of the quantified samples are below 0.2mg/Kg.

```{r, collapse=TRUE}
quantile(subset(GEMS2$ResultValue,GEMS2$CountryName=="WHO European Region"),probs=(0.98),na.rm=TRUE)
```

98% of the quantified samples are below 0.58mg/Kg.


```{r,collapse=TRUE }
quantile(subset(GEMS2$ResultValue,GEMS2$CountryName=="WHO European Region"),probs=(0.99),na.rm=TRUE)
```

99% of the quantified samples are below 1.21mg/Kg.


Observing these limits in the frequency histogram:


```{r message=FALSE, warning=FALSE,fig.align='center'}
subset(GEMS3, GEMS3$CountryName=="WHO European Region" & GEMS3$ResultValue) %>%
  ggplot(aes(ResultValue, fill=FoodCategory))+
  geom_histogram( show.legend= FALSE)+
  geom_vline(xintercept=0.1, linetype="dashed", color = "red")+
  geom_vline(xintercept=1.21, linetype="dashed", color = "blue")+
  scale_x_log10()+xlab("Pb (mg/kg)")+ylab("Number of samples")
```

Figure 9: Frequency histogram of data referring to the European Union, x-axis on the logarithmic scale. The red line represents the value of 0.1mg/kg a very common limit allowed in the legislation and blue line, the range that covers 99% of the quantified samples (1.21mg/Kg).

Checking for normality.

```{r, collapse=TRUE}
shapiro.test(sample(subset(GEMS3$ResultValue, GEMS3$CountryName=="WHO European Region"),5000))
t.test(subset(GEMS3$ResultValue, GEMS3$CountryName=="WHO European Region"))
```

Although figure 9 shows the distribution of data with the logarithmic transformation, a normality test and a t test for a sample show that the set is normal and that all European samples have an average of 0.16mg / kg ± 0.08 mg / kg, 95 % confidence and p-value = 2.11e-05.



## 5.1 Dynamic relationship between analysis quality and sample origin

We now know that the vast majority of samples (60%) are below the limit of quantification and for simplicity is reported as 0.000. A minimum number is quantified and is below the limit of the legislation (0.1 mg/kg). The rest (40%) is quantified and 99% of the time it is below 1.22mg / Kg.

We will now try to visualize how this information relates to each other using the Sankey graph. This graph is similar to that of stacked bars, however, with lines that show the relationships between variables.

Let's list the Origin of the samples (Domestic or imported), the quality control tool and the quantified lead content:

```{r message=FALSE, warning=FALSE}
GEMS2res_1 %>%  group_by(group,AnalyticalQAName,FoodOriginName, ResultValue=ResultValue>=0.10) %>%  summarise(Samples=length(group)) -> Sankey4
ggplot(as.data.frame(Sankey4), aes(y = Samples, axis1 = group, axis2 = AnalyticalQAName, axis3 = FoodOriginName, axis4=ResultValue)) +
    geom_alluvium(aes(fill = group=="NA"), width = 1/6) +
  geom_stratum(width = 1/5, fill = "white", color = "grey50") +
  scale_x_continuous(breaks = 1:4, labels = c("LOQ", "Quality Control", "Sample source", "Above 0.10\ mg/Kg")) +
  geom_label(stat = "stratum", aes(label = after_stat(stratum))) +  theme(legend.position = "none")+ylab("Number of samples")
```

Figure 10: Sankey chart showing the relationship between the origin, the quality control technique used and the result in mg / Kg.
It can be noted that either by origin, quality control or result there is no trend.
It is also noted that between domestic and imported samples, they present the same ratio between quantified above 0.1 mg / kg and below the limit of quantification.


## 5.2 In search of matrices with higher levels

We will investigate only samples with higher levels of Lead: Herbs and spices, animal feed, stimulating drinks and Products for special nutritional use. First, how they relate to the time and amount of samples tested and quantified:

```{r}
GEMS4=subset(GEMS2, GEMS2$CountryName=="WHO European Region" & GEMS2$FoodCategory=="ANIMAL FEED" | GEMS2$FoodCategory=="Stimulant beverages, dried and diluted excluding cocoa products" | GEMS2$FoodCategory=="Herbs, spices and condiments"  | GEMS2$FoodCategory=="Products for special nutritional use")
```


```{r}
GEMS5=subset(GEMS2, GEMS2$CountryName=="WHO European Region")
```


```{r,fig.align='center'}

GEMS4 %>% 
  group_by(SamplingPeriodEnd,FoodCategory) %>%
  count(more_than_1ppb=ResultValue<=0.001,sort=TRUE) %>%
  ggplot()+geom_col(aes(SamplingPeriodEnd,n,fill=more_than_1ppb))+ theme(axis.text.x = element_text(angle = 90),)+
  facet_wrap(~str_wrap(FoodCategory,width = 40))+ggtitle("Number of samples analysed according to matrix")


ggplot(GEMS4,aes(x=SamplingPeriodEnd,y=length(RowNum), fill=ResultValue<=0.001))+
  geom_bar(position="stack", stat="identity")+ theme(axis.text.x = element_text(angle = 90),)+
  facet_wrap(~str_wrap(FoodCategory,width = 40))+ggtitle("Number of samples analysed according to matrix")
```


Figure 11. Bar graphs grouping the 4 main classes with the highest average values of Lead in the period of 25 years. The number of quantified samples is show by the color difference (Quantified Red, Blue below the LOQ).

```{r,fig.align='center'}

GEMS4 %>% 
  group_by(SamplingPeriodEnd,FoodCategory) %>%
  count(more_than_1ppb=ResultValue<=0.001,sort=TRUE) %>%
  ggplot()+geom_col(aes(SamplingPeriodEnd,n,fill=more_than_1ppb),position="fill")+ theme(axis.text.x = element_text(angle = 90),)+
  facet_wrap(~str_wrap(FoodCategory,width = 40))+ggtitle("Number of samples analysed according to matrix")

ggplot(GEMS4,aes(x=SamplingPeriodEnd,y=(RowNum), fill=ResultValue<=0.001))+
  geom_bar(position="fill", stat="identity")+ theme(axis.text.x = element_text(angle = 90),)+
  facet_wrap(~str_wrap(FoodCategory,width = 40))+ylab("ResultValue")+ggtitle("Ratio of quantified samples according to matrix")

```

Figure 12. Bar graphs grouping the 4 main classes with the highest average values of Lead in the period of 25 years. The values were standardized and each column corresponds to 100% of the values and the ratio between quantified and unquantified samples is demonstrated by the color difference (Quantified Red, Blue below the LOQ).

It is noted that samples of Herbs and spices are routinely quantified for a period of more than 10 years and even with variation in the number of samples, the quantified ratio is very similar along the years. 


We know so far that these four food groups are often quantified. However, it is not clear whether some samples have very high levels and "drag" the average to higher mean values.

For this we will evaluate a boxplot with 99% of the data (to avoid outliers) and see where the quartiles are located:

```{r message=FALSE, warning=FALSE, collapse=TRUE,fig.align='center'}
ggplot(GEMS5, aes(x=FoodCategory,y=ResultValue))+
    geom_boxplot(show.legend = FALSE,alpha=1/10)+geom_hline(yintercept=0.1, linetype="solid", color = "red")+
        scale_x_discrete(breaks=unique(GEMS$FoodCategory),labels = function(x) str_wrap(str_replace_all(x, "gjgj" , " "), width = 80))+ylim(c(0,1.22))+
    coord_flip()+theme(axis.text.x = element_text(angle = 90,size =10))+xlab("Food Category")+ylab("Pb (mg/Kg)")+annotate("rect",ymin = 0,ymax=0.88, xmin = "Starchy roots and tubers",xmax="Sugar and confectionary (including cocoa products)", alpha=0.3, color="blue", fill="blue")

```

Figure 13. Boxplot with linear axis covering 99% of the samples in the range of 0.001 to 1.22mg/kg of Pb.

We can see that the worst case is that from stimulating drinks that include coffees and teas. The median is well within the limit of 0.1mg/kg and the third quartile (50% to 75%) stretches to more than 0.25mg/kg.

Through the cumulative distribution, we see that the problem is not outliers but a large amount of samples in different concentrations that is quantified in this class:


```{r message=FALSE, warning=FALSE}
ggplot(GEMS5, aes(x=ResultValue, color=FoodCategory, label=FoodCategory),show.legend = FALSE)+
    stat_ecdf(geom = "line", aes(ResultValue),show.legend = FALSE) +xlab("Pb (mg/Kg)")+geom_vline(xintercept=0.1, linetype="solid", color = "red")+geom_vline(xintercept=1.22, linetype="solid", color = "blue")+xlim(c(0,1.22))+annotate("text", x = 0.52, y = 0.6, 
label = c("Stimulant beverages") , color="purple",size=5 )+annotate("segment", x = 0.30, xend = 0.25, y = 0.6, yend = 0.65, colour = "purple", size=1, alpha=0.9, arrow=arrow())+ylab("Cumulative distribution")

```

Figure 14. Graph with cumulative distribution of Lead content in food samples in the European Union. Red line represents the concentration of 0.1mg/kg of Pb and blue line 1.22mg/kg which is the content that comprises 99% of the samples.

We "investigated" whether within the class of stimulant drinks, which would be the worst matrix.


```{r message=FALSE, warning=FALSE}
ggplot(subset(GEMS5,GEMS5$FoodCategory=="Stimulant beverages, dried and diluted excluding cocoa products") , aes(x=ResultValue, color=FoodName, label=FoodName))+
    stat_ecdf(geom = "line", aes(ResultValue)) +xlab("Pb (mg/Kg)")+geom_vline(xintercept=0.1, linetype="solid", color = "red")+geom_vline(xintercept=1.22, linetype="solid", color = "blue")+xlim(c(0,1.22))+ylab("Cumulative distribution")+ theme(legend.position = c(0.7, 0.3))
  
  #theme(  plot.background = element_rect(fill = "lightcyan2"), panel.background= element_rect(fill = "lightcyan2"))

```


Figure 15. Graph with cumulative distribution of Lead content in food samples in the European Union within the category "Stimulant drinks". Red line represents the concentration of 0.1mg/kg of Pb and blue line 1.22mg/kg which is the content that includes 99% of the samples.

The worst case is the tea infusions, where more than 50% of the samples had levels above 0.25mg/kg of Lead or 2.5x more than the limit.


```{r message=FALSE, warning=FALSE}
ggplot(subset(GEMS5,GEMS5$FoodCategory=="Stimulant beverages, dried and diluted excluding cocoa products" & GEMS5$FoodName=="Tea (Infusion)" ) , aes(x=ResultValue, color=LocalFoodName))+
    stat_ecdf(geom = "line", aes(ResultValue)) +xlab("Pb (mg/Kg)")+geom_vline(xintercept=0.1, linetype="solid", color = "red")+geom_vline(xintercept=1.22, linetype="solid", color = "blue")+xlim(c(0,1.22))+ylab("Cumulative distribution")+ theme(legend.position = c(0.7, 0.3))
  
  #theme(  plot.background = element_rect(fill = "lightcyan2"), panel.background= element_rect(fill = "lightcyan2"))

```

When filtering for "Tea (infusion)" only we can see it have the highest distribution of all.

```{r}

GEMS_EU %>% filter(GEMS_EU$FoodCategory=="Stimulant beverages, dried and diluted excluding cocoa products" & GEMS_EU$FoodName=="Tea (Infusion)" ) %>% summarize(quantile(ResultValue,0.95)) 
```
95% of samples are below 2.48mg/kg. Ten times higher than the mean values for all other samples in EU  Union.


## 5.3 Looking back to the worst case

Since we found the TEA as the sample with higher constant level food contaminated with Lead.
Let's visualize how this sample behaves with these 4 variable selected: Result, Quality tool, Origin and LOQ: 


```{r message=FALSE, warning=FALSE}
GEMS2res_1 %>% filter(CountryName=="WHO European Region") %>%
  filter(FoodName=="Tea (Infusion)") %>%
  group_by(ResultValue=ResultValue>=0.1,AnalyticalQAName,FoodOriginName,group) %>%
  summarise(Samples=length(AnalyticalQAName)) -> Sankey4
ggplot(as.data.frame(Sankey4), aes(y = Samples, axis1 = ResultValue, axis2 = AnalyticalQAName, axis3 = FoodOriginName, axis4=group)) +
  geom_alluvium(aes(fill = ResultValue), width = 1/6)+
  scale_x_discrete(expand = c(0.1,0.1))+
  geom_stratum(width = 1/5, fill = "white", color = "grey50") +
  scale_x_continuous(breaks = 1:4, labels = c( "Above 0.10\ mg/Kg?", "Quality Control", "Sample source","LOQ"))+
  geom_label(stat = "stratum", aes(label = after_stat(stratum))) +  theme(legend.position = "none")+ylab("Number of samples") 
```

Figure 15. Alluvial plot from Tea Samples in EU region. The vast majority of samples (more than 75%) are higher than 0.10mg/Kg and didn't report the LOQ. 
Since most of non reported LOQ data are from old data, it's interesting check if it all comes from previous years.

```{r message=FALSE, warning=FALSE}
GEMS2res_1 %>% 
  filter(CountryName=="WHO European Region") %>%
  filter(FoodName=="Tea (Infusion)") %>%
  group_by(ResultValue=ResultValue>=0.1,AnalyticalQAName,FoodOriginName, group, SamplingPeriodEnd) %>%
  summarise(Samples=length(group)) -> Sankey4
ggplot(as.data.frame(Sankey4), aes(y = Samples, axis1 = ResultValue, axis2 = AnalyticalQAName, axis3 = FoodOriginName, axis4=group, axis5=SamplingPeriodEnd)) +
  geom_alluvium(aes(fill = ResultValue), width = 1/5) +scale_x_discrete(expand = c(.1, .1))+
  geom_stratum(width = 1/5, fill = "white", color = "grey50") +
  scale_x_continuous(breaks = 1:5, labels = c("Above 0.10\ mg/Kg?", "Quality Control", "Sample source","LOQ" , "Year")) +
  geom_label(stat = "stratum", aes(label = after_stat(stratum))) +  theme(legend.position = "none")+ylab("Number of samples") 

```
Figure 16. Same as figure 15 except with year included(as SamplingPeriodEnd). No, we can not say that "Old samples (pre-2011) used to have more Lead". To confirm we can normalize the content of ResultValue variable and plot it along the years.
 
```{r}
GEMS2res_1 %>% filter(CountryName=="WHO European Region") %>%
  filter(FoodName=="Tea (Infusion)") %>%
ggplot(aes(x=SamplingPeriodEnd,y=(RowNum), fill=ResultValue>=0.1))+
  geom_bar(position="fill", stat="identity")+ theme(axis.text.x = element_text(angle = 90),)+
  facet_wrap(~str_wrap(FoodName,width = 40))+ylab("ResultValue")+ggtitle("Ratio of quantified samples according to matrix")
```
 
 
Figure 17. Stacked bar graph showing normalized number of samples above 0.1mg/Kg of Lead per year. 
No, the proportion of samples above 0.1 are not decreasing with time. To be sure, lets analyze using boxplot and linear regression, with the median of Result as the central line in each box. 


```{r message=FALSE, warning=FALSE}
GEMS2res_1 %>% filter(CountryName=="WHO European Region") %>% group_by(SamplingPeriodEnd) %>%
  filter(FoodName=="Tea (Infusion)") %>%
ggplot(aes(x=SamplingPeriodEnd,y=ResultValue))+scale_y_log10()+
  geom_boxplot(na.rm=TRUE)+ theme(axis.text.x = element_text(angle = 90),)+ylab("ResultValue")+geom_smooth(method = "lm", se=FALSE, color="blue", aes(SamplingPeriodEnd, ResultValue))+geom_smooth(method = "lm", color="blue",aes(group=1), na.rm=TRUE)
```
Figure 18. Boxplot from Tea samples in Eu Region by Year. No trend, mean content of Lead in food is stable for the period of 10 years.




# 6. Conclusion

Most foods with high lead values have a very limited intake:
"ANIMAL FEED" (obviously), "Stimulant beverages" (occasional use), "Herbs, spices and condiments" (only as ingredients and in small quantities). Only "Products for special nutritional use" are critical as they may result in a higher intake for longer periods.

There are no indications that imported samples (to the European Union) are quantified more frequently than domestic ones, indicating contamination at source.

In the European Union (the region with more data available), 91% of all samples are below the 0,1mg/kg international limit and 99% are below 1.21mg/kg.

Stimulant Beverages, specially tea, present a high continuous source of Lead in the diet. Just to address how much, 50% of tea infusion samples are higher than 0,2mg/kg while 95% of other samples fall on this range.

In the 10 years with most data from European Union and the the most studies sample (Tea (Infusion)), no difference was noted in the mean value of contamination.

Some technical questions remain: were the tea samples (Stimulant beverages) digested from the dry mass? Were the samples identified as "infusion" analyzed as "ready for consumption", ie direct reading?
What is the real possibility of misplacement of units in the results such as ug/kg reported as mg/kg? Should they be treated and investigated as outliers or as transcription errors?



# 7. References:

https://www.fda.gov/food/metals-and-your-food/lead-food-foodwares-and-dietary-supplements



# Other...

## Tea investigation

How many tea samples we have under Tea (Infusion)
```{r}
GEMS5 %>% filter(FoodCategory=="Stimulant beverages, dried and diluted excluding cocoa products") %>% group_by(LocalFoodName) %>% summarise(total=length(LocalFoodName)) %>% ggplot()+geom_col(aes(total,reorder(LocalFoodName,total,max),fill=LocalFoodName,),show.legend = FALSE)
```
 
## Diferença médias: concentração Doméstico e Importada

Primeiro, vamos separar os dados de chá analizados na União Europáia:

```{r}
GEMS_EU<-subset(GEMS2, GEMS2$CountryName=="WHO European Region" & FoodCategory=="Stimulant beverages, dried and diluted excluding cocoa products")

GEMS_EU %>% group_by(FoodOriginName, Quantificadas=ResultValue>=0.001) %>% summarise(Numero=length(ResultValue)) %>% ggplot() +geom_col(aes(FoodOriginName,Numero, fill=Quantificadas))

```
Temos a proporções de quantificadas entre amostras domesticas mais alta seguida pelas importadas e desconhecidas.

Abaixo podemos ver esses valores


```{r}
GEMS_EU %>% group_by(FoodOriginName, Quantificadas=ResultValue>=0.001) %>% summarise(Numero=length(ResultValue))
```
Apesar de uma maior frequencia de quantificação, os valores médios são consideravelmente próximos, mediana e o quartil de 95%  levemente acima para amostras domesticas:


```{r message=FALSE, warning=FALSE}

GEMS_EU %>% group_by(FoodOriginName) %>% summarise(mean=mean(ResultValue))

GEMS_EU %>% group_by(FoodOriginName) %>% summarise(median=median(ResultValue))

GEMS_EU %>% group_by(FoodOriginName) %>% summarise(quantile=quantile(ResultValue,0.95))

GEMS_EU %>% filter(ResultValue>0.0001) %>%  group_by(FoodOriginName) %>% ggplot()+ geom_boxplot(aes(FoodOriginName,ResultValue))+ylim(c(-0.001,10))+scale_y_log10()

```



```{r message=FALSE, warning=FALSE}
GEMS_EU %>% filter(ResultValue>0.001) %>%  group_by(FoodOriginName) %>% summarise(ResultValue=mean(ResultValue))
```


Não há diferença entre as infusões e o sólido?
```{r}
GEMS_EU %>% filter(FoodCategory=="Stimulant beverages, dried and diluted excluding cocoa products") %>% ggplot()+ geom_boxplot(aes(ResultValue,FoodName))+scale_x_log10()
```




Média ResultValue, separando por país:

```{r}
GEMS1 %>% filter(FoodName=="TEAS AND HERBAL TEAS (SOLID)") %>% group_by(CountryName) %>% summarise(ResultValue=mean(ResultValue))
``` 

```{r}
options(scipen=10000)
GEMS1 %>% filter(FoodCategory=="Stimulant beverages, dried and diluted excluding cocoa products") %>% group_by(CountryName)%>%ggplot()+geom_boxplot(aes(ResultValue,CountryName))+scale_x_log10()+facet_wrap(~FoodName,scales="free")
``` 
The infusions appear to be expressed in dry mass, or at least converted after extraction on some countries like Singapure and EU. Data on USA, France, Hong Kong, Australia and New Zealand cluster close together on a range that is 100 times lower.

Let's now investigate what would be the single most high level of lead in the GEMS database.

# Worst Food Name (mean value):


```{r}
  GEMS1 %>%
  group_by(FoodName) %>%
  summarise(ResultValue= mean(ResultValue,na.rm=TRUE)) %>% 
  arrange(desc(ResultValue)) %>% 
  top_n(15) %>% 
  ggplot()+ geom_col(aes(ResultValue,reorder(FoodName,ResultValue,max),fill=FoodName),show.legend = FALSE)
```




Worst Local Food Name:

```{r}
  GEMS1 %>%
  group_by(LocalFoodName) %>%
  summarise(ResultValue= mean(ResultValue,na.rm=TRUE)) %>% 
  arrange(desc(ResultValue)) %>% 
  top_n(25) %>% 
    ggplot()+ geom_col(aes(ResultValue,reorder(LocalFoodName,ResultValue,mean),fill=LocalFoodName),show.legend=FALSE)+scale_x_log10()
```

2 Teas presentations appear as higher levels on FoodName, however data can be skewed due to lower samples and more presence of outliers.

When considering the final form on LocalFoodName, Tea forms appear 8 times.

Filtering samples with more than 5000 analyses performed. Summarizing with mean, sd, n, RSD, 95% percentil:


```{r}
GEMS1 %>%
  group_by(FoodName) %>%
  summarise(mean= mean(ResultValue,na.rm=TRUE),
            median= median(ResultValue,na.rm=TRUE),
            n=n(),
            sd=sd(ResultValue,na.rm=TRUE),
            RSD=sd(ResultValue,na.rm=TRUE)*100/mean(ResultValue,na.rm=TRUE),
            percentil_95=quantile(ResultValue,0.95,na.rm=TRUE),
            ratio_ND=prop.table(table(ResultText=="ND"))[2]) %>%
  filter(n>=100)%>%
  arrange(desc(median))-> worst_case


%>%
  top_n(15,median)
```
```{r}
kable(worst_case)
```



# Worst food boxplot:

```{r}
  GEMS1 %>%
  filter(FoodName %in% worst_case$FoodName)%>%
  group_by(FoodName) %>% 
   ggplot()+
    geom_boxplot(aes(ResultValue,FoodName))+
    geom_point(data=worst_case,aes(percentil_95,FoodName),color="red")+
    geom_point(data=worst_case,aes(mean,FoodName),color="blue")+xlim(-0.1,1)+scale_x_log10()


```


```{r}
  GEMS1 %>%
  filter(FoodName %in% worst_case$FoodName)%>%
  group_by(FoodName) %>% 
   ggplot()+
    geom_boxplot(aes(ResultValue,FoodName),alpha=0.1)+
    geom_point(data=worst_case,aes(percentil_95,FoodName),color="red")+
    geom_point(data=worst_case,aes(mean,FoodName),color="blue")+xlim(-0.1,3)

```

Same situation, although Feed also appear to show high lead contamination and have the same 95% quantile (above 2mg/kg). The median is much low due to higher presence of ND samples (about 30%)

