Open Food Data Basic EDA

This is document going over some basic functions in R to perform Exploratory Data Analysis (EDA). The dataset we will use is a subset of the Open Food Data prepared by Kaggle. Let’s load the sample and compute the number of rows and columns:

df = read.csv("~/Documents/kaggle/food/data/sample10K.csv", header = TRUE, stringsAsFactors = FALSE)
dim(df)

## [1] 10000     9

So it looks like there are 10000 rows (observations) and 9 columns (variables). Let’s look at the first few rows to get a feel for what we have here

head(df)

##           code                                       product_name
## 1 3.421557e+12            Terres et Céréales bio Couscous Complet
## 2 8.480018e+12 Tableta de chocolate negro con avellanas 55% cacao
## 3 5.019124e+12                      Butter ghee (beurre clarifié)
## 4 3.555081e+12                   nonettes aux miel et aux amandes
## 5 3.564700e+12             P'tit-beurre tablette chocolat au lait
## 6 3.250392e+12                                 Caviar d'aubergine
##   nutrition_grade_fr countries_en proteins_100g carbohydrates_100g
## 1                          France            NA                 NA
## 2                  e        Spain           7.5                 40
## 3                        Portugal            NA                 NA
## 4                  d       France           3.8                 73
## 5                  e       France           6.5                 65
## 6                          France            NA                 NA
##   sugars_100g trans_fat_100g carbon_footprint_100g
## 1          NA             NA                    NA
## 2          37             NA                    NA
## 3          NA             NA                    NA
## 4          45             NA                    NA
## 5          39             NA                    NA
## 6          NA             NA                    NA

We need to verify assumptions for each variable, but forutnately from first glance it appears that the names for the variables are sensible with the values we observe for observations (in addition to pesky blanks and NA’s: that is, missing values). It looks like the code variable is a unique ID for each observation, product_name is probably the name of the product, nutrition_grade_fr is some sort of grade (like you get on an exam), countries_en is the name of the country, and then the rest are numeric variables corresponding to nutrition (with the last variable being the carbon footprint of the food item).

Let’s now more rigorously verify some of the assumptions we have made. Let’s compute some summary statistics on the numeric data types that we have. For a start, let’s see what the average protein is:

mean(df$proteins_100g)

## [1] NA

We already see from the warning that one of our assumptions that the proteins_100g is a numeric data type is not true. It looks like this is because of the way the missing values were encoded for this variable (as <NA>) that R treated the entire column as a character variable

class(df$proteins_100g)

## [1] "numeric"

An easy fix for this is to convert these to variables explicitly to numeric types (in general, R does a good job of recognizing to convert numeric values)

# there is probably a more elegant way to do this, but for demonstration this is fine
numeric.vars = c("proteins_100g", "carbohydrates_100g", "sugars_100g","trans_fat_100g","carbon_footprint_100g") 
df[,which(names(df) %in% numeric.vars)] = lapply(df[,which(names(df) %in% numeric.vars)], as.numeric)

The warning message is telling us that NA’s are being introduced (most likeley just replacing the <NA> values that we were seeing before), but that’s OK since that’s exactly what we want anyway. Now that we are more confident that the numeric variables are indeed numeric types, let’s use the handy summary function tht R provides to get some summary statistics.

summary(df)

##       code          product_name       nutrition_grade_fr
##  Min.   :7.00e+00   Length:10000       Length:10000      
##  1st Qu.:3.12e+12   Class :character   Class :character  
##  Median :3.35e+12   Mode  :character   Mode  :character  
##  Mean   :4.57e+12                                        
##  3rd Qu.:5.01e+12                                        
##  Max.   :5.41e+15                                        
##                                                          
##  countries_en       proteins_100g    carbohydrates_100g  sugars_100g    
##  Length:10000       Min.   : 0.000   Min.   :  0.00     Min.   :  0.00  
##  Class :character   1st Qu.: 1.600   1st Qu.:  4.20     1st Qu.:  1.10  
##  Mode  :character   Median : 5.800   Median : 13.80     Median :  4.10  
##                     Mean   : 7.436   Mean   : 27.58     Mean   : 12.77  
##                     3rd Qu.:10.100   3rd Qu.: 53.50     3rd Qu.: 15.40  
##                     Max.   :86.000   Max.   :100.80     Max.   :100.80  
##                     NA's   :4483     NA's   :4480       NA's   :5027    
##  trans_fat_100g  carbon_footprint_100g
##  Min.   :0.000   Min.   :   0.0       
##  1st Qu.:0.000   1st Qu.: 121.5       
##  Median :0.000   Median : 185.0       
##  Mean   :0.113   Mean   : 336.5       
##  3rd Qu.:0.000   3rd Qu.: 364.0       
##  Max.   :7.140   Max.   :2380.0       
##  NA's   :9803    NA's   :9965

Seems like we got some reasonable summary stats for the numeric variables! Next we want to know if there are any relationships between these numeric variables. For this, we compute the correlation matrix:

options(width = 400)
cor(df[,which(names(df) %in% numeric.vars)], use = "pairwise.complete.obs")

##                       proteins_100g carbohydrates_100g sugars_100g trans_fat_100g carbon_footprint_100g
## proteins_100g            1.00000000        -0.08166830 -0.21932678    -0.08442803             0.5048197
## carbohydrates_100g      -0.08166830         1.00000000  0.62929096    -0.03701688            -0.2130507
## sugars_100g             -0.21932678         0.62929096  1.00000000     0.02142456            -0.1617343
## trans_fat_100g          -0.08442803        -0.03701688  0.02142456     1.00000000                    NA
## carbon_footprint_100g    0.50481968        -0.21305073 -0.16173433             NA             1.0000000

Now we have to see if we can understand the non-numeric variables. Let’s look at the number of unique elements:

num.unique = sapply(df[,-which(names(df) %in% numeric.vars)], FUN = function(x) length(unique(x)))
num.unique

##               code       product_name nutrition_grade_fr       countries_en 
##              10000               8709                  6                164

We see that the number of unique values for code is the same as the number of rows (so indeed it is a unique ID), and the product_name, nutrition_grade_fr, countries_en are not unique. Let’s count the number of unique values for nutrition_grade_fr and countries_en:

table(df$nutrition_grade_fr)

## 
##         a    b    c    d    e 
## 5216  892  792 1063 1242  795

Looking at the nutrition_grade_fr first, we see that over half the observations are blank (that is, missing). The distribution between the grades seems to be bent towards grade d the most. Next we look at nutrition_grade_fr:

options(width=40)
table(df$countries_en)

## 
##                                                                              
##                                                                           34 
##                                                                Albania,Italy 
##                                                                            2 
##                                                                      Algeria 
##                                                                            8 
##                            Algeria,Belgium,Canada,France,Morocco,Switzerland 
##                                                                            1 
##                                                                      Andorra 
##                                                                            1 
##                                                                    Argentina 
##                                                                            3 
##                                                                        Aruba 
##                                                                            1 
##                                                                    Australia 
##                                                                          162 
##                                                            Australia,Germany 
##                                                                            3 
##                                                Australia,Germany,Netherlands 
##                                                                            1 
##                                                           Australia,Malaysia 
##                                                                            1 
##                                     Australia,New Zealand,Singapore,Thailand 
##                                                                            1 
##                                                        Australia,Philippines 
##                                                                            1 
##                                            Australia,Thailand,United Kingdom 
##                                                                            1 
##                                                     Australia,United Kingdom 
##                                                                            1 
##                                       Australia,United Kingdom,United States 
##                                                                            1 
##                                                      Australia,United States 
##                                                                            2 
##                                                                      Austria 
##                                                                           28 
##                                     Austria,France,Germany,Italy,Switzerland 
##                                                                            1 
##                                                  Austria,Germany,Netherlands 
##                                                                            5 
##                                                                      Bahrain 
##                                                                            1 
##                                                                      Belgium 
##                                                                          215 
##                                                              Belgium,Denmark 
##                                                                            1 
##                                               Belgium,Denmark,France,Germany 
##                                                                            1 
##                                           Belgium,Finland,France,Italy,Spain 
##                                                                            1 
##                                                               Belgium,France 
##                                                                           21 
##        Belgium,France,Germany,Hungary,Italy,Netherlands,Spain,United Kingdom 
##                                                                            1 
##                                     Belgium,France,Germany,Italy,Netherlands 
##                                                                            1 
##                                           Belgium,France,Germany,Switzerland 
##                                                                            1 
##                                    Belgium,France,Netherlands,United Kingdom 
##                                                                            1 
##                                                              Belgium,Germany 
##                                                                            1 
##                                                        Belgium,United States 
##                                                                            1 
##                                                                       Brazil 
##                                                                           44 
##                                                                     Bulgaria 
##                                                                            3 
##                                                              Bulgaria,France 
##                                                                            2 
##                                                                 Burkina Faso 
##                                                                            1 
##                                                                      Burundi 
##                                                                            1 
##                                                                     Cambodia 
##                                                                            4 
##                                                                       Canada 
##                                                                           66 
##                                                             Canada,fr:Quebec 
##                                                                            2 
##                                                              Canada,Portugal 
##                                                                            1 
##                                                         Canada,United States 
##                                                                            2 
##                                                                        China 
##                                                                           21 
##                                                                     Colombia 
##                                                                            1 
##                                                                   Costa Rica 
##                                                                            1 
##                                                                Côte d'Ivoire 
##                                                                            2 
##                                                                         Cuba 
##                                                                            1 
##                                                                       Cyprus 
##                                                                            1 
##                                                               Czech Republic 
##                                                                           20 
##                                                        Czech Republic,France 
##                                                                            1 
##                                                                      Denmark 
##                                                                           36 
##                                                               Denmark,France 
##                                                                           14 
##                                          Denmark,France,Portugal,Switzerland 
##                                                                            1 
##                                                Denmark,France,United Kingdom 
##                                                                            2 
##                                                              Denmark,Germany 
##                                                                            1 
##                                                               European Union 
##                                                                            1 
##                                                               Finland,France 
##                                                                            1 
##                                         Finland,France,Sweden,United Kingdom 
##                                                                            1 
##                                                                    fr:Quebec 
##                                                                            5 
##                                                                      fr:日本 
##                                                                            1 
##                                                                       France 
##                                                                         6582 
##                                                             France,fr:Europe 
##                                                                           41 
##                                                         France,French Guiana 
##                                                                            1 
##                                        France,French Polynesia,New Caledonia 
##                                                                            1 
##                                                               France,Germany 
##                                                                           16 
##                                                         France,Germany,Spain 
##                                                                            1 
##                                                                France,Greece 
##                                                                            1 
##                                                            France,Guadeloupe 
##                                                                            3 
## France,Guadeloupe,Italy,Netherlands,Norway,Sweden,Switzerland,United Kingdom 
##                                                                            1 
##                                                France,Ireland,United Kingdom 
##                                                                            1 
##                                                                 France,Italy 
##                                                                            1 
##                                                           France,Italy,Spain 
##                                                                            1 
##                                                  France,Italy,United Kingdom 
##                                                                            1 
##                                                                 France,Japan 
##                                                                            1 
##                                                                France,Kuwait 
##                                                                            1 
##                                                            France,Luxembourg 
##                                                                            1 
##                                                            France,Martinique 
##                                                                            1 
##                                                             France,Mauritius 
##                                                                            2 
##                                                               France,Morocco 
##                                                                            3 
##                                                           France,Netherlands 
##                                                                            1 
##                                                         France,New Caledonia 
##                                                                            1 
##                                                               France,Réunion 
##                                                                            3 
##                                                               France,Romania 
##                                                                            1 
##                                                                France,Russia 
##                                                                            4 
##                                                                 France,Spain 
##                                                                           12 
##                                                  France,Spain,United Kingdom 
##                                                                            2 
##                                                                France,Sweden 
##                                                                            2 
##                                                           France,Switzerland 
##                                                                            2 
##                                                        France,United Kingdom 
##                                                                           17 
##                                                         France,United States 
##                                                                            1 
##                                                                French Guiana 
##                                                                           11 
##                                                             French Polynesia 
##                                                                            2 
##                                                                      Germany 
##                                                                          432 
##                                                                Germany,Spain 
##                                                                            7 
##                                                          Germany,Switzerland 
##                                                                            4 
##                                                       Germany,United Kingdom 
##                                                                            1 
##                                         Germany,United Kingdom,United States 
##                                                                            3 
##                                                                       Greece 
##                                                                            4 
##                                                                   Guadeloupe 
##                                                                           14 
##                                                        Guadeloupe,Martinique 
##                                                                            1 
##                                                                    Hong Kong 
##                                                                            8 
##                                                                      Hungary 
##                                                                            1 
##                                                                      Iceland 
##                                                                            6 
##                                                                        India 
##                                                                            2 
##                                                                    Indonesia 
##                                                                            6 
##                                                                         Iran 
##                                                                            1 
##                                                                      Ireland 
##                                                                           12 
##                                                                       Israel 
##                                                                            2 
##                                                                        Italy 
##                                                                           85 
##                                                                Italy,Romania 
##                                                                            1 
##                                                                        Japan 
##                                                                           19 
##                                                                   Kazakhstan 
##                                                                            2 
##                                                                       Kuwait 
##                                                                            1 
##                                                                       Latvia 
##                                                                            1 
##                                                                      Lebanon 
##                                                                            2 
##                                                             Lithuania,Sweden 
##                                                                            1 
##                                                                   Luxembourg 
##                                                                           15 
##                                                                        Malta 
##                                                                            1 
##                                                                   Martinique 
##                                                                            4 
##                                                                    Mauritius 
##                                                                            1 
##                                                                      Mayotte 
##                                                                            1 
##                                                                       Mexico 
##                                                                            5 
##                                                                      Moldova 
##                                                                            1 
##                                                                     Mongolia 
##                                                                            2 
##                                                                      Morocco 
##                                                                           10 
##                                                                   Mozambique 
##                                                                            1 
##                                                                  Netherlands 
##                                                                           19 
##                                                                New Caledonia 
##                                                                            5 
##                                                                  New Zealand 
##                                                                            6 
##                                                                         Peru 
##                                                                            1 
##                                                                  Philippines 
##                                                                            2 
##                                                                       Poland 
##                                                                            8 
##                                                        Poland,United Kingdom 
##                                                                            4 
##                                                                     Portugal 
##                                                                           85 
##                                                               Portugal,Spain 
##                                                                            1 
##                                                                        Qatar 
##                                                                            1 
##                                                                      Réunion 
##                                                                           13 
##                                                                      Romania 
##                                                                            8 
##                                                                       Russia 
##                                                                            4 
##                                                    Saint Pierre and Miquelon 
##                                                                            4 
##                                                                 Saudi Arabia 
##                                                                            5 
##                                                      Scotland,United Kingdom 
##                                                                            1 
##                                                                      Senegal 
##                                                                            3 
##                                                                       Serbia 
##                                                                            1 
##                                                                    Singapore 
##                                                                            1 
##                                                                     Slovenia 
##                                                                            1 
##                                                                 South Africa 
##                                                                            3 
##                                                                  South Korea 
##                                                                            1 
##                                                                        Spain 
##                                                                          529 
##                                                              Spain,es:Europe 
##                                                                            2 
##                                                                       Sweden 
##                                                                            5 
##                                                                  Switzerland 
##                                                                          222 
##                                                    Switzerland,United States 
##                                                                            1 
##                                                                       Taiwan 
##                                                                           10 
##                                                                     Thailand 
##                                                                           10 
##                                                      Thailand,United Kingdom 
##                                                                            1 
##                                                                         Togo 
##                                                                            1 
##                                                                      Tunisia 
##                                                                           15 
##                                                                       Turkey 
##                                                                            1 
##                                                         United Arab Emirates 
##                                                                            4 
##                                                               United Kingdom 
##                                                                          535 
##                                                 United Kingdom,United States 
##                                                                            2 
##                                                                United States 
##                                                                          366 
##                                                                      Vietnam 
##                                                                            2

Comprehending this output is a bit of a mess. For a start, we should probably not display the information in alphabetical order, but rather in terms of the frequency count. Still, even then, we will have a list of over 100 rows. This is where it would be much nicer to display this output as a bar plot.

countries.counts = as.data.frame(table(df$countries_en))
countries.counts = countries.counts[order(countries.counts$Freq, decreasing = TRUE),]
head(countries.counts)

##               Var1 Freq
## 61          France 6582
## 161 United Kingdom  535
## 149          Spain  529
## 93         Germany  432
## 163  United States  366
## 152    Switzerland  222

We see that most of the food products in this dataset originate from France. The bar plot will give us a nicer overview of this. We will use the excellent ggplot2 package for visualization since roughly 95% of the time the default settings produce very good results.

library(ggplot2)
library(plotly)
ggplot(countries.counts, aes(x = reorder(Var1, order(Freq, decreasing = TRUE)), y = Freq)) + geom_bar(stat = "identity") + xlab("")-> p
ggplotly(p)

Putting the bars on a log scale will show us more details:

ggplot(countries.counts, aes(x = reorder(Var1, order(Freq, decreasing = TRUE)), y = Freq)) + geom_bar(stat = "identity") + xlab("") + scale_y_log10() -> p
ggplotly(p)

We see from this plot that many of the countries here are actually multiple countries or are duplicates. These types of distributions are referred to as “Power Law” distributions and come up very often. Typically we use the “80-20” rule to focus on a subset of the data.

While the bar graph is used to undertand the distribution of categorical variables, histograms are typically used to understand the distribution of numeric variables. Here we take a look at how the carbohydrates in our data set is distributed:

ggplot(df, aes(carbohydrates_100g)) + geom_histogram(bins = 50) -> p
ggplotly(p)

Let’s now take a look at some of the numeric variables, but graphically instead of using the summary stats we used before.

ggplot(df, aes(x = nutrition_grade_fr, y = carbohydrates_100g)) + geom_boxplot() -> p
ggplotly(p)

This boxpot shows us the relationship between the nutrition grade and sugars, but what about the other variables? The scatter plot allows us to view more than one variable so we will use that. Since over half of the observations does not have a nutrition grade, we first clean up this data by removing all entries that do not have a grade:

df.grades = subset(df, nutrition_grade_fr != "")

Now that we have removed the food items that do not have a grade, lets look at scatter plot of sugars and carbohydrates to see if we find any relationships:

ggplot(df.grades, aes(x = sugars_100g, y = carbohydrates_100g, colour = nutrition_grade_fr)) + geom_point() -> p
ggplotly(p)

Open Food Data Basic EDA

Mehrdad Yazdani

February 22, 2016