This is document going over some basic functions in R to perform Exploratory Data Analysis (EDA). The dataset we will use is a subset of the Open Food Data prepared by Kaggle. Let’s load the sample and compute the number of rows and columns:
df = read.csv("~/Documents/kaggle/food/data/sample10K.csv", header = TRUE, stringsAsFactors = FALSE)
dim(df)
## [1] 10000 9
So it looks like there are 10000 rows (observations) and 9 columns (variables). Let’s look at the first few rows to get a feel for what we have here
head(df)
## code product_name
## 1 3.421557e+12 Terres et Céréales bio Couscous Complet
## 2 8.480018e+12 Tableta de chocolate negro con avellanas 55% cacao
## 3 5.019124e+12 Butter ghee (beurre clarifié)
## 4 3.555081e+12 nonettes aux miel et aux amandes
## 5 3.564700e+12 P'tit-beurre tablette chocolat au lait
## 6 3.250392e+12 Caviar d'aubergine
## nutrition_grade_fr countries_en proteins_100g carbohydrates_100g
## 1 France NA NA
## 2 e Spain 7.5 40
## 3 Portugal NA NA
## 4 d France 3.8 73
## 5 e France 6.5 65
## 6 France NA NA
## sugars_100g trans_fat_100g carbon_footprint_100g
## 1 NA NA NA
## 2 37 NA NA
## 3 NA NA NA
## 4 45 NA NA
## 5 39 NA NA
## 6 NA NA NA
We need to verify assumptions for each variable, but forutnately from first glance it appears that the names for the variables are sensible with the values we observe for observations (in addition to pesky blanks and NA’s: that is, missing values). It looks like the code variable is a unique ID for each observation, product_name is probably the name of the product, nutrition_grade_fr is some sort of grade (like you get on an exam), countries_en is the name of the country, and then the rest are numeric variables corresponding to nutrition (with the last variable being the carbon footprint of the food item).
Let’s now more rigorously verify some of the assumptions we have made. Let’s compute some summary statistics on the numeric data types that we have. For a start, let’s see what the average protein is:
mean(df$proteins_100g)
## [1] NA
We already see from the warning that one of our assumptions that the proteins_100g is a numeric data type is not true. It looks like this is because of the way the missing values were encoded for this variable (as <NA>) that R treated the entire column as a character variable
class(df$proteins_100g)
## [1] "numeric"
An easy fix for this is to convert these to variables explicitly to numeric types (in general, R does a good job of recognizing to convert numeric values)
# there is probably a more elegant way to do this, but for demonstration this is fine
numeric.vars = c("proteins_100g", "carbohydrates_100g", "sugars_100g","trans_fat_100g","carbon_footprint_100g")
df[,which(names(df) %in% numeric.vars)] = lapply(df[,which(names(df) %in% numeric.vars)], as.numeric)
The warning message is telling us that NA’s are being introduced (most likeley just replacing the <NA> values that we were seeing before), but that’s OK since that’s exactly what we want anyway. Now that we are more confident that the numeric variables are indeed numeric types, let’s use the handy summary function tht R provides to get some summary statistics.
summary(df)
## code product_name nutrition_grade_fr
## Min. :7.00e+00 Length:10000 Length:10000
## 1st Qu.:3.12e+12 Class :character Class :character
## Median :3.35e+12 Mode :character Mode :character
## Mean :4.57e+12
## 3rd Qu.:5.01e+12
## Max. :5.41e+15
##
## countries_en proteins_100g carbohydrates_100g sugars_100g
## Length:10000 Min. : 0.000 Min. : 0.00 Min. : 0.00
## Class :character 1st Qu.: 1.600 1st Qu.: 4.20 1st Qu.: 1.10
## Mode :character Median : 5.800 Median : 13.80 Median : 4.10
## Mean : 7.436 Mean : 27.58 Mean : 12.77
## 3rd Qu.:10.100 3rd Qu.: 53.50 3rd Qu.: 15.40
## Max. :86.000 Max. :100.80 Max. :100.80
## NA's :4483 NA's :4480 NA's :5027
## trans_fat_100g carbon_footprint_100g
## Min. :0.000 Min. : 0.0
## 1st Qu.:0.000 1st Qu.: 121.5
## Median :0.000 Median : 185.0
## Mean :0.113 Mean : 336.5
## 3rd Qu.:0.000 3rd Qu.: 364.0
## Max. :7.140 Max. :2380.0
## NA's :9803 NA's :9965
Seems like we got some reasonable summary stats for the numeric variables! Next we want to know if there are any relationships between these numeric variables. For this, we compute the correlation matrix:
options(width = 400)
cor(df[,which(names(df) %in% numeric.vars)], use = "pairwise.complete.obs")
## proteins_100g carbohydrates_100g sugars_100g trans_fat_100g carbon_footprint_100g
## proteins_100g 1.00000000 -0.08166830 -0.21932678 -0.08442803 0.5048197
## carbohydrates_100g -0.08166830 1.00000000 0.62929096 -0.03701688 -0.2130507
## sugars_100g -0.21932678 0.62929096 1.00000000 0.02142456 -0.1617343
## trans_fat_100g -0.08442803 -0.03701688 0.02142456 1.00000000 NA
## carbon_footprint_100g 0.50481968 -0.21305073 -0.16173433 NA 1.0000000
Now we have to see if we can understand the non-numeric variables. Let’s look at the number of unique elements:
num.unique = sapply(df[,-which(names(df) %in% numeric.vars)], FUN = function(x) length(unique(x)))
num.unique
## code product_name nutrition_grade_fr countries_en
## 10000 8709 6 164
We see that the number of unique values for code is the same as the number of rows (so indeed it is a unique ID), and the product_name, nutrition_grade_fr, countries_en are not unique. Let’s count the number of unique values for nutrition_grade_fr and countries_en:
table(df$nutrition_grade_fr)
##
## a b c d e
## 5216 892 792 1063 1242 795
Looking at the nutrition_grade_fr first, we see that over half the observations are blank (that is, missing). The distribution between the grades seems to be bent towards grade d the most. Next we look at nutrition_grade_fr:
options(width=40)
table(df$countries_en)
##
##
## 34
## Albania,Italy
## 2
## Algeria
## 8
## Algeria,Belgium,Canada,France,Morocco,Switzerland
## 1
## Andorra
## 1
## Argentina
## 3
## Aruba
## 1
## Australia
## 162
## Australia,Germany
## 3
## Australia,Germany,Netherlands
## 1
## Australia,Malaysia
## 1
## Australia,New Zealand,Singapore,Thailand
## 1
## Australia,Philippines
## 1
## Australia,Thailand,United Kingdom
## 1
## Australia,United Kingdom
## 1
## Australia,United Kingdom,United States
## 1
## Australia,United States
## 2
## Austria
## 28
## Austria,France,Germany,Italy,Switzerland
## 1
## Austria,Germany,Netherlands
## 5
## Bahrain
## 1
## Belgium
## 215
## Belgium,Denmark
## 1
## Belgium,Denmark,France,Germany
## 1
## Belgium,Finland,France,Italy,Spain
## 1
## Belgium,France
## 21
## Belgium,France,Germany,Hungary,Italy,Netherlands,Spain,United Kingdom
## 1
## Belgium,France,Germany,Italy,Netherlands
## 1
## Belgium,France,Germany,Switzerland
## 1
## Belgium,France,Netherlands,United Kingdom
## 1
## Belgium,Germany
## 1
## Belgium,United States
## 1
## Brazil
## 44
## Bulgaria
## 3
## Bulgaria,France
## 2
## Burkina Faso
## 1
## Burundi
## 1
## Cambodia
## 4
## Canada
## 66
## Canada,fr:Quebec
## 2
## Canada,Portugal
## 1
## Canada,United States
## 2
## China
## 21
## Colombia
## 1
## Costa Rica
## 1
## Côte d'Ivoire
## 2
## Cuba
## 1
## Cyprus
## 1
## Czech Republic
## 20
## Czech Republic,France
## 1
## Denmark
## 36
## Denmark,France
## 14
## Denmark,France,Portugal,Switzerland
## 1
## Denmark,France,United Kingdom
## 2
## Denmark,Germany
## 1
## European Union
## 1
## Finland,France
## 1
## Finland,France,Sweden,United Kingdom
## 1
## fr:Quebec
## 5
## fr:日本
## 1
## France
## 6582
## France,fr:Europe
## 41
## France,French Guiana
## 1
## France,French Polynesia,New Caledonia
## 1
## France,Germany
## 16
## France,Germany,Spain
## 1
## France,Greece
## 1
## France,Guadeloupe
## 3
## France,Guadeloupe,Italy,Netherlands,Norway,Sweden,Switzerland,United Kingdom
## 1
## France,Ireland,United Kingdom
## 1
## France,Italy
## 1
## France,Italy,Spain
## 1
## France,Italy,United Kingdom
## 1
## France,Japan
## 1
## France,Kuwait
## 1
## France,Luxembourg
## 1
## France,Martinique
## 1
## France,Mauritius
## 2
## France,Morocco
## 3
## France,Netherlands
## 1
## France,New Caledonia
## 1
## France,Réunion
## 3
## France,Romania
## 1
## France,Russia
## 4
## France,Spain
## 12
## France,Spain,United Kingdom
## 2
## France,Sweden
## 2
## France,Switzerland
## 2
## France,United Kingdom
## 17
## France,United States
## 1
## French Guiana
## 11
## French Polynesia
## 2
## Germany
## 432
## Germany,Spain
## 7
## Germany,Switzerland
## 4
## Germany,United Kingdom
## 1
## Germany,United Kingdom,United States
## 3
## Greece
## 4
## Guadeloupe
## 14
## Guadeloupe,Martinique
## 1
## Hong Kong
## 8
## Hungary
## 1
## Iceland
## 6
## India
## 2
## Indonesia
## 6
## Iran
## 1
## Ireland
## 12
## Israel
## 2
## Italy
## 85
## Italy,Romania
## 1
## Japan
## 19
## Kazakhstan
## 2
## Kuwait
## 1
## Latvia
## 1
## Lebanon
## 2
## Lithuania,Sweden
## 1
## Luxembourg
## 15
## Malta
## 1
## Martinique
## 4
## Mauritius
## 1
## Mayotte
## 1
## Mexico
## 5
## Moldova
## 1
## Mongolia
## 2
## Morocco
## 10
## Mozambique
## 1
## Netherlands
## 19
## New Caledonia
## 5
## New Zealand
## 6
## Peru
## 1
## Philippines
## 2
## Poland
## 8
## Poland,United Kingdom
## 4
## Portugal
## 85
## Portugal,Spain
## 1
## Qatar
## 1
## Réunion
## 13
## Romania
## 8
## Russia
## 4
## Saint Pierre and Miquelon
## 4
## Saudi Arabia
## 5
## Scotland,United Kingdom
## 1
## Senegal
## 3
## Serbia
## 1
## Singapore
## 1
## Slovenia
## 1
## South Africa
## 3
## South Korea
## 1
## Spain
## 529
## Spain,es:Europe
## 2
## Sweden
## 5
## Switzerland
## 222
## Switzerland,United States
## 1
## Taiwan
## 10
## Thailand
## 10
## Thailand,United Kingdom
## 1
## Togo
## 1
## Tunisia
## 15
## Turkey
## 1
## United Arab Emirates
## 4
## United Kingdom
## 535
## United Kingdom,United States
## 2
## United States
## 366
## Vietnam
## 2
Comprehending this output is a bit of a mess. For a start, we should probably not display the information in alphabetical order, but rather in terms of the frequency count. Still, even then, we will have a list of over 100 rows. This is where it would be much nicer to display this output as a bar plot.
countries.counts = as.data.frame(table(df$countries_en))
countries.counts = countries.counts[order(countries.counts$Freq, decreasing = TRUE),]
head(countries.counts)
## Var1 Freq
## 61 France 6582
## 161 United Kingdom 535
## 149 Spain 529
## 93 Germany 432
## 163 United States 366
## 152 Switzerland 222
We see that most of the food products in this dataset originate from France. The bar plot will give us a nicer overview of this. We will use the excellent ggplot2 package for visualization since roughly 95% of the time the default settings produce very good results.
library(ggplot2)
library(plotly)
ggplot(countries.counts, aes(x = reorder(Var1, order(Freq, decreasing = TRUE)), y = Freq)) + geom_bar(stat = "identity") + xlab("")-> p
ggplotly(p)
Putting the bars on a log scale will show us more details:
ggplot(countries.counts, aes(x = reorder(Var1, order(Freq, decreasing = TRUE)), y = Freq)) + geom_bar(stat = "identity") + xlab("") + scale_y_log10() -> p
ggplotly(p)
We see from this plot that many of the countries here are actually multiple countries or are duplicates. These types of distributions are referred to as “Power Law” distributions and come up very often. Typically we use the “80-20” rule to focus on a subset of the data.
While the bar graph is used to undertand the distribution of categorical variables, histograms are typically used to understand the distribution of numeric variables. Here we take a look at how the carbohydrates in our data set is distributed:
ggplot(df, aes(carbohydrates_100g)) + geom_histogram(bins = 50) -> p
ggplotly(p)
Let’s now take a look at some of the numeric variables, but graphically instead of using the summary stats we used before.
ggplot(df, aes(x = nutrition_grade_fr, y = carbohydrates_100g)) + geom_boxplot() -> p
ggplotly(p)
This boxpot shows us the relationship between the nutrition grade and sugars, but what about the other variables? The scatter plot allows us to view more than one variable so we will use that. Since over half of the observations does not have a nutrition grade, we first clean up this data by removing all entries that do not have a grade:
df.grades = subset(df, nutrition_grade_fr != "")
Now that we have removed the food items that do not have a grade, lets look at scatter plot of sugars and carbohydrates to see if we find any relationships:
ggplot(df.grades, aes(x = sugars_100g, y = carbohydrates_100g, colour = nutrition_grade_fr)) + geom_point() -> p
ggplotly(p)