I love cereal, but it’s hard to figure out what cereal is best for me to choose for my dietary needs. I hate looking at nutrition labels in the store so I’m hoping that I can find one metric that I can use to base my choices on.
Question for analysis: What is the most useful metric to look at for a healthy cereal?
I was originally answering this question with the data at https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/Stat2Data/Cereal.csv but that data only included information on calories, sugar, and fiber when fat content is a vital factor to consider. Because of that I found another data source (http://www.acaloriecounter.com/breakfast-cereal.php) that had more variables and more observations.
mydata <- read.csv("https://raw.githubusercontent.com/ajbentley/cuny_ms_ds/master/cereal_nutrition.csv", header = TRUE)
df <- data.frame(mydata)
head(df)
## Breakfast.Cereal Brand size Calories Total.Fat
## 1 Rice Chex General Mills 1.00 100 0.5
## 2 Rice Krispies Kellogg's 1.25 120 0.0
## 3 Rice Krispies Treats Cereal Kellogg's 0.75 120 1.5
## 4 Frosted Krispies Kellogg's 0.75 110 0.0
## 5 Trix General Mills 1.00 120 1.5
## 6 Lucky Charms General Mills 0.75 110 1.0
## Saturated.Fat Carbs Fiber Sugar Protein transfat hfcs
## 1 0 23 0 2 2 NO NO
## 2 0 29 0 3 2 NO YES
## 3 0 26 0 9 1 YES YES
## 4 0 27 0 12 1 NO YES
## 5 0 28 1 13 1 NO YES
## 6 0 22 1 11 2 NO NO
summary(df)
## Breakfast.Cereal Brand size
## All-Bran : 1 General Mills:22 Min. :0.1100
## Apple Jacks : 1 Kashi : 1 1st Qu.:0.7500
## Banana Nut Crunch : 1 Kellogg's :20 Median :0.7500
## Basic 4 : 1 Post : 9 Mean :0.7585
## Cheerios : 1 Quaker : 1 3rd Qu.:1.0000
## Cinnamon Toast Crunch: 1 Max. :1.2500
## (Other) :47
## Calories Total.Fat Saturated.Fat Carbs
## Min. : 60.0 Min. :0.000 Min. :0.0000 Min. :20.00
## 1st Qu.:110.0 1st Qu.:0.500 1st Qu.:0.0000 1st Qu.:23.00
## Median :120.0 Median :1.000 Median :0.0000 Median :25.00
## Mean :127.7 Mean :1.292 Mean :0.1415 Mean :28.02
## 3rd Qu.:120.0 3rd Qu.:1.500 3rd Qu.:0.0000 3rd Qu.:28.00
## Max. :240.0 Max. :6.000 Max. :1.0000 Max. :48.00
##
## Fiber Sugar Protein transfat hfcs
## Min. : 0.000 Min. : 0.000 Min. : 1.000 NO :41 NO :34
## 1st Qu.: 1.000 1st Qu.: 5.000 1st Qu.: 1.000 YES:12 YES:19
## Median : 1.000 Median :11.000 Median : 2.000
## Mean : 2.453 Mean : 9.075 Mean : 2.679
## 3rd Qu.: 3.000 3rd Qu.:12.000 3rd Qu.: 3.000
## Max. :14.000 Max. :20.000 Max. :13.000
##
require(ggplot2)
shist <- ggplot(df, aes(x = size)) + geom_histogram(color="darkblue", fill="lightblue") +
ggtitle("Serving Size") +
theme(plot.title = element_text(face="bold", hjust=0.5, vjust=1))
calhist <- ggplot(df, aes(x = Calories)) + geom_histogram(color="darkblue", fill="lightblue") + ggtitle("Calories") +
theme(plot.title = element_text(face="bold", hjust=0.5, vjust=1))
tfathist <- ggplot(df, aes(x = Total.Fat)) + geom_histogram(color="darkblue", fill="lightblue") +
ggtitle("Total Fat") +
theme(plot.title = element_text(face="bold", hjust=0.5, vjust=1))
sfathist <- ggplot(df, aes(x = Saturated.Fat)) + geom_histogram(color="darkblue", fill="lightblue") +
ggtitle("Saturated Fat") +
theme(plot.title = element_text(face="bold", hjust=0.5, vjust=1))
carbhist <- ggplot(df, aes(x = Carbs)) + geom_histogram(color="darkblue", fill="lightblue") +
ggtitle("Carbs") +
theme(plot.title = element_text(face="bold", hjust=0.5, vjust=1))
fibhist <- ggplot(df, aes(x = Fiber)) + geom_histogram(color="darkblue", fill="lightblue") +
ggtitle("Fiber") +
theme(plot.title = element_text(face="bold", hjust=0.5, vjust=1))
sughist <- ggplot(df, aes(x = Sugar)) + geom_histogram(color="darkblue", fill="lightblue") +
ggtitle("Sugar") +
theme(plot.title = element_text(face="bold", hjust=0.5, vjust=1))
prothist <- ggplot(df, aes(x = Protein)) + geom_histogram(color="darkblue", fill="lightblue") +
ggtitle("Protein") +
theme(plot.title = element_text(face="bold", hjust=0.5, vjust=1))
require(gridExtra)
grid.arrange(shist, calhist, tfathist, sfathist, carbhist, fibhist, sughist, prothist)
Because the serving sizes on the nutrition labels are so different, it’s important to normalize them so that we can really have an applejacks-to-applejacks comparison of the brands.
To do this I’m going to divide each value in that column by the mean of the metric, resulting in an index to the mean.
df$siindex <- "serving size index"
df$sindex = df$size / mean(df$size)
Now I can convert each of the other measurements to the normalized serving size by multiplying them by the serving size index
df$calnorm = df$Calories * df$sindex
df$fatnorm = df$Total.Fat * df$sindex
df$sfatnorm = df$Saturated.Fat * df$sindex
df$carbnorm = df$Carbs * df$sindex
df$fibernorm = df$Fiber * df$sindex
df$protnorm = df$Protein * df$sindex
head(df)
## Breakfast.Cereal Brand size Calories Total.Fat
## 1 Rice Chex General Mills 1.00 100 0.5
## 2 Rice Krispies Kellogg's 1.25 120 0.0
## 3 Rice Krispies Treats Cereal Kellogg's 0.75 120 1.5
## 4 Frosted Krispies Kellogg's 0.75 110 0.0
## 5 Trix General Mills 1.00 120 1.5
## 6 Lucky Charms General Mills 0.75 110 1.0
## Saturated.Fat Carbs Fiber Sugar Protein transfat hfcs siindex
## 1 0 23 0 2 2 NO NO serving size index
## 2 0 29 0 3 2 NO YES serving size index
## 3 0 26 0 9 1 YES YES serving size index
## 4 0 27 0 12 1 NO YES serving size index
## 5 0 28 1 13 1 NO YES serving size index
## 6 0 22 1 11 2 NO NO serving size index
## sindex calnorm fatnorm sfatnorm carbnorm fibernorm protnorm
## 1 1.318408 131.8408 0.659204 0 30.32338 0.000000 2.636816
## 2 1.648010 197.7612 0.000000 0 47.79229 0.000000 3.296020
## 3 0.988806 118.6567 1.483209 0 25.70896 0.000000 0.988806
## 4 0.988806 108.7687 0.000000 0 26.69776 0.000000 0.988806
## 5 1.318408 158.2090 1.977612 0 36.91542 1.318408 1.318408
## 6 0.988806 108.7687 0.988806 0 21.75373 0.988806 1.977612
Now I’ll convert the transfat and hfcs factors into numerics
class(df$transfat)
## [1] "factor"
df$tfatnum <- ifelse(df$transfat=="YES", 1, 0)
df$hfcsnum <- ifelse(df$hfcs=="YES", 1, 0)
head(df)
## Breakfast.Cereal Brand size Calories Total.Fat
## 1 Rice Chex General Mills 1.00 100 0.5
## 2 Rice Krispies Kellogg's 1.25 120 0.0
## 3 Rice Krispies Treats Cereal Kellogg's 0.75 120 1.5
## 4 Frosted Krispies Kellogg's 0.75 110 0.0
## 5 Trix General Mills 1.00 120 1.5
## 6 Lucky Charms General Mills 0.75 110 1.0
## Saturated.Fat Carbs Fiber Sugar Protein transfat hfcs siindex
## 1 0 23 0 2 2 NO NO serving size index
## 2 0 29 0 3 2 NO YES serving size index
## 3 0 26 0 9 1 YES YES serving size index
## 4 0 27 0 12 1 NO YES serving size index
## 5 0 28 1 13 1 NO YES serving size index
## 6 0 22 1 11 2 NO NO serving size index
## sindex calnorm fatnorm sfatnorm carbnorm fibernorm protnorm tfatnum
## 1 1.318408 131.8408 0.659204 0 30.32338 0.000000 2.636816 0
## 2 1.648010 197.7612 0.000000 0 47.79229 0.000000 3.296020 0
## 3 0.988806 118.6567 1.483209 0 25.70896 0.000000 0.988806 1
## 4 0.988806 108.7687 0.000000 0 26.69776 0.000000 0.988806 0
## 5 1.318408 158.2090 1.977612 0 36.91542 1.318408 1.318408 0
## 6 0.988806 108.7687 0.988806 0 21.75373 0.988806 1.977612 0
## hfcsnum
## 1 0
## 2 1
## 3 1
## 4 1
## 5 1
## 6 0
df2 <- data.frame(df[,c(1,2,14:22)])
and get the summary
summary(df2)
## Breakfast.Cereal Brand sindex
## All-Bran : 1 General Mills:22 Min. :0.1450
## Apple Jacks : 1 Kashi : 1 1st Qu.:0.9888
## Banana Nut Crunch : 1 Kellogg's :20 Median :0.9888
## Basic 4 : 1 Post : 9 Mean :1.0000
## Cheerios : 1 Quaker : 1 3rd Qu.:1.3184
## Cinnamon Toast Crunch: 1 Max. :1.6480
## (Other) :47
## calnorm fatnorm sfatnorm carbnorm
## Min. : 15.95 Min. :0.0000 Min. :0.0000 Min. : 3.191
## 1st Qu.: 98.88 1st Qu.:0.2900 1st Qu.:0.0000 1st Qu.:21.754
## Median :118.66 Median :0.9888 Median :0.0000 Median :26.368
## Mean :126.42 Mean :1.2036 Mean :0.1167 Mean :27.703
## 3rd Qu.:158.21 3rd Qu.:1.4832 3rd Qu.:0.0000 3rd Qu.:34.279
## Max. :379.04 Max. :6.5920 Max. :0.8240 Max. :75.808
##
## fibernorm protnorm tfatnum hfcsnum
## Min. : 0.0000 Min. : 0.145 Min. :0.0000 Min. :0.0000
## 1st Qu.: 0.5142 1st Qu.: 1.318 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 1.3184 Median : 1.978 Median :0.0000 Median :0.0000
## Mean : 2.1241 Mean : 2.706 Mean :0.2264 Mean :0.3585
## 3rd Qu.: 2.6368 3rd Qu.: 2.966 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :13.1841 Max. :17.139 Max. :1.0000 Max. :1.0000
##
I can use boxplots to show how the transformation impacted the data.
qplot(Brand, Calories, data = df, geom= "boxplot") + ggtitle("Calories by Brand Group (original)")
qplot(Brand, calnorm, data = df2, geom= "boxplot") + ggtitle("Calories by Brand Group (normalized)")
The transformed data is much tighter but because the data is so skewed overall in neither case does a box plot really provide any actionable information.
Before we do that, though, we should normalize all of the data
df2scale <- scale((df2[,c(4:11)]))
dfn <- data.frame(df2scale)
# class(dfn)
# scale
# scaled.dfn <- scale(dfn)
# scaled.dfn
calfatscat <- ggplot(dfn, aes(x=calnorm, y=fatnorm)) + geom_point() +
ggtitle("Calories vs. Fat Content") +
theme(plot.title = element_text(face="bold", hjust=0.5, vjust=1, size=10))
calsfatscat <- ggplot(dfn, aes(x=calnorm, y=sfatnorm)) + geom_point() +
ggtitle("Calories vs. Saturated Fat Content") +
theme(plot.title = element_text(face="bold", hjust=0.5, vjust=1, size=10))
calcarbscat <- ggplot(dfn, aes(x=calnorm, y=carbnorm)) + geom_point() +
ggtitle("Calories vs. Carb Content") +
theme(plot.title = element_text(face="bold", hjust=0.5, vjust=1, size=10))
calfiberscat <- ggplot(dfn, aes(x=calnorm, y=fibernorm)) + geom_point() +
ggtitle("Calories vs. Fiber Content") +
theme(plot.title = element_text(face="bold", hjust=0.5, vjust=1, size=10))
calprotscat <- ggplot(dfn, aes(x=calnorm, y=protnorm)) + geom_point() +
ggtitle("Calories vs. Protein Content") +
theme(plot.title = element_text(face="bold", hjust=0.5, vjust=1, size=10))
require(gridExtra)
grid.arrange(calfatscat, calsfatscat, calcarbscat, calfiberscat, calprotscat)
Clearly carbohydrate content is well-correlated with calories. Fat, fiber, and protein don’t look like they are particularly linked, though. Let’s take a look at the actual correlations.
To do that we’re first going to have to make a table that is purely numeric
require(ggcorrplot)
corr <- round(cor(dfn), 1)
corr
## calnorm fatnorm sfatnorm carbnorm fibernorm protnorm tfatnum
## calnorm 1.0 0.7 0.5 1.0 0.4 0.6 -0.1
## fatnorm 0.7 1.0 0.7 0.6 0.4 0.4 0.0
## sfatnorm 0.5 0.7 1.0 0.4 0.1 0.2 0.3
## carbnorm 1.0 0.6 0.4 1.0 0.4 0.6 -0.1
## fibernorm 0.4 0.4 0.1 0.4 1.0 0.7 -0.2
## protnorm 0.6 0.4 0.2 0.6 0.7 1.0 -0.2
## tfatnum -0.1 0.0 0.3 -0.1 -0.2 -0.2 1.0
## hfcsnum 0.3 0.1 0.2 0.3 0.0 0.1 0.1
## hfcsnum
## calnorm 0.3
## fatnorm 0.1
## sfatnorm 0.2
## carbnorm 0.3
## fibernorm 0.0
## protnorm 0.1
## tfatnum 0.1
## hfcsnum 1.0
ggcorrplot(corr)
with this grid I can see that there isn’t much correlation between any other factors, though protein and fiber content seem to be somewhat correlated. Let’s just take one last look at what those two items look like in a scatterplot
ggplot(dfn, aes(x=fatnorm, y=protnorm)) + geom_point() +
ggtitle("Fat vs. Protein Content") +
theme(plot.title = element_text(face="bold", hjust=0.5, vjust=1, size=10))
It seems like data science hasn’t helped my cereal buying decisions quite yet. Other than knowing that the more carbs a cereal has the more calories it has (not a groundbreaking revelation) and that cereals with higher fat content might have more protein it looks like I’m still going to have to be checking those boxes.