Picking the Right Cereal

I love cereal, but it’s hard to figure out what cereal is best for me to choose for my dietary needs. I hate looking at nutrition labels in the store so I’m hoping that I can find one metric that I can use to base my choices on.

Question for analysis: What is the most useful metric to look at for a healthy cereal?

I was originally answering this question with the data at https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/Stat2Data/Cereal.csv but that data only included information on calories, sugar, and fiber when fat content is a vital factor to consider. Because of that I found another data source (http://www.acaloriecounter.com/breakfast-cereal.php) that had more variables and more observations.


Read in csv file
mydata <- read.csv("https://raw.githubusercontent.com/ajbentley/cuny_ms_ds/master/cereal_nutrition.csv", header = TRUE)


Convert to dataframe
df <- data.frame(mydata)


Take a peek at what it looks like
head(df)
##              Breakfast.Cereal         Brand size Calories Total.Fat
## 1                   Rice Chex General Mills 1.00      100       0.5
## 2               Rice Krispies     Kellogg's 1.25      120       0.0
## 3 Rice Krispies Treats Cereal     Kellogg's 0.75      120       1.5
## 4            Frosted Krispies     Kellogg's 0.75      110       0.0
## 5                        Trix General Mills 1.00      120       1.5
## 6                Lucky Charms General Mills 0.75      110       1.0
##   Saturated.Fat Carbs Fiber Sugar Protein transfat hfcs
## 1             0    23     0     2       2       NO   NO
## 2             0    29     0     3       2       NO  YES
## 3             0    26     0     9       1      YES  YES
## 4             0    27     0    12       1       NO  YES
## 5             0    28     1    13       1       NO  YES
## 6             0    22     1    11       2       NO   NO


Get a summary of the data
summary(df)
##               Breakfast.Cereal           Brand         size       
##  All-Bran             : 1      General Mills:22   Min.   :0.1100  
##  Apple Jacks          : 1      Kashi        : 1   1st Qu.:0.7500  
##  Banana Nut Crunch    : 1      Kellogg's    :20   Median :0.7500  
##  Basic 4              : 1      Post         : 9   Mean   :0.7585  
##  Cheerios             : 1      Quaker       : 1   3rd Qu.:1.0000  
##  Cinnamon Toast Crunch: 1                         Max.   :1.2500  
##  (Other)              :47                                         
##     Calories       Total.Fat     Saturated.Fat        Carbs      
##  Min.   : 60.0   Min.   :0.000   Min.   :0.0000   Min.   :20.00  
##  1st Qu.:110.0   1st Qu.:0.500   1st Qu.:0.0000   1st Qu.:23.00  
##  Median :120.0   Median :1.000   Median :0.0000   Median :25.00  
##  Mean   :127.7   Mean   :1.292   Mean   :0.1415   Mean   :28.02  
##  3rd Qu.:120.0   3rd Qu.:1.500   3rd Qu.:0.0000   3rd Qu.:28.00  
##  Max.   :240.0   Max.   :6.000   Max.   :1.0000   Max.   :48.00  
##                                                                  
##      Fiber            Sugar           Protein       transfat  hfcs   
##  Min.   : 0.000   Min.   : 0.000   Min.   : 1.000   NO :41   NO :34  
##  1st Qu.: 1.000   1st Qu.: 5.000   1st Qu.: 1.000   YES:12   YES:19  
##  Median : 1.000   Median :11.000   Median : 2.000                    
##  Mean   : 2.453   Mean   : 9.075   Mean   : 2.679                    
##  3rd Qu.: 3.000   3rd Qu.:12.000   3rd Qu.: 3.000                    
##  Max.   :14.000   Max.   :20.000   Max.   :13.000                    
## 
Notes on the data
  • Breakfast.Cereal is a factor with each level representing a cereal brand
  • Brand is a factor with each level being a brand owner
  • GM and Kellogg’s have the most representation
  • Size is the number of cups per serving
    • there is a wide range, going from just over a tenth of a cup to 1.25 cups. This will have to be normalized for the analysis.
    • the mean and median are fairly close, suggesting that there might be a normal distribution to the data, but the 1st Quartile is the same as the median, suggesting it will be left-skewed.
  • Calories is the number of calories per serving.
    • Here the median is the same as the 3rd Quartile, suggesting a right-skew. Also suggesting this is the fact that the max is twice the 3rd quartile.
  • Total.Fat is the amount of fat (in grams) per serving of the cereal.
  • Saturated.Fat is the amount of saturated fat (in grams) per serving
    • It appears as though most cereals do not contain saturated fat
  • Carbs is the total carbohydrate count per serving.
  • Fiber is the grams of fiber per serving.
    • This metric is also likely right-skewing given the first quartile is equal to the median and the 3rd quartile is 3 times the median.
  • Sugar is the grams of sugar per serving.
    • Probably right skewing
  • Protein is the grams of protein per serving.
    • Probably right skewing
  • transfat is a factor with Yes and No (presence of transfat in the cereal)
  • hfcs is a factor with Yes and No (presence of high fructose corn syrup in the cereal)


Let’s run a few histograms to see if I was right about the skews
require(ggplot2)
shist <- ggplot(df, aes(x = size)) + geom_histogram(color="darkblue", fill="lightblue") +
  ggtitle("Serving Size") + 
  theme(plot.title = element_text(face="bold", hjust=0.5, vjust=1))


calhist <- ggplot(df, aes(x = Calories)) + geom_histogram(color="darkblue", fill="lightblue") + ggtitle("Calories") + 
  theme(plot.title = element_text(face="bold", hjust=0.5, vjust=1))


tfathist <- ggplot(df, aes(x = Total.Fat)) + geom_histogram(color="darkblue", fill="lightblue") +
  ggtitle("Total Fat") + 
  theme(plot.title = element_text(face="bold", hjust=0.5, vjust=1))


sfathist <- ggplot(df, aes(x = Saturated.Fat)) + geom_histogram(color="darkblue", fill="lightblue") +
  ggtitle("Saturated Fat") + 
  theme(plot.title = element_text(face="bold", hjust=0.5, vjust=1))


carbhist <- ggplot(df, aes(x = Carbs)) + geom_histogram(color="darkblue", fill="lightblue") +
  ggtitle("Carbs") + 
  theme(plot.title = element_text(face="bold", hjust=0.5, vjust=1))


fibhist <- ggplot(df, aes(x = Fiber)) + geom_histogram(color="darkblue", fill="lightblue") +
  ggtitle("Fiber") + 
  theme(plot.title = element_text(face="bold", hjust=0.5, vjust=1))


sughist <- ggplot(df, aes(x = Sugar)) + geom_histogram(color="darkblue", fill="lightblue") +
  ggtitle("Sugar") + 
  theme(plot.title = element_text(face="bold", hjust=0.5, vjust=1))


prothist <- ggplot(df, aes(x = Protein)) + geom_histogram(color="darkblue", fill="lightblue") +
  ggtitle("Protein") + 
  theme(plot.title = element_text(face="bold", hjust=0.5, vjust=1))




require(gridExtra)
grid.arrange(shist, calhist, tfathist, sfathist, carbhist, fibhist, sughist, prothist)


Normalize data

Because the serving sizes on the nutrition labels are so different, it’s important to normalize them so that we can really have an applejacks-to-applejacks comparison of the brands.

To do this I’m going to divide each value in that column by the mean of the metric, resulting in an index to the mean.

df$siindex <- "serving size index"

df$sindex = df$size / mean(df$size)


Now I can convert each of the other measurements to the normalized serving size by multiplying them by the serving size index

df$calnorm = df$Calories * df$sindex
df$fatnorm = df$Total.Fat * df$sindex
df$sfatnorm = df$Saturated.Fat * df$sindex
df$carbnorm = df$Carbs * df$sindex
df$fibernorm = df$Fiber * df$sindex
df$protnorm = df$Protein * df$sindex


head(df)
##              Breakfast.Cereal         Brand size Calories Total.Fat
## 1                   Rice Chex General Mills 1.00      100       0.5
## 2               Rice Krispies     Kellogg's 1.25      120       0.0
## 3 Rice Krispies Treats Cereal     Kellogg's 0.75      120       1.5
## 4            Frosted Krispies     Kellogg's 0.75      110       0.0
## 5                        Trix General Mills 1.00      120       1.5
## 6                Lucky Charms General Mills 0.75      110       1.0
##   Saturated.Fat Carbs Fiber Sugar Protein transfat hfcs            siindex
## 1             0    23     0     2       2       NO   NO serving size index
## 2             0    29     0     3       2       NO  YES serving size index
## 3             0    26     0     9       1      YES  YES serving size index
## 4             0    27     0    12       1       NO  YES serving size index
## 5             0    28     1    13       1       NO  YES serving size index
## 6             0    22     1    11       2       NO   NO serving size index
##     sindex  calnorm  fatnorm sfatnorm carbnorm fibernorm protnorm
## 1 1.318408 131.8408 0.659204        0 30.32338  0.000000 2.636816
## 2 1.648010 197.7612 0.000000        0 47.79229  0.000000 3.296020
## 3 0.988806 118.6567 1.483209        0 25.70896  0.000000 0.988806
## 4 0.988806 108.7687 0.000000        0 26.69776  0.000000 0.988806
## 5 1.318408 158.2090 1.977612        0 36.91542  1.318408 1.318408
## 6 0.988806 108.7687 0.988806        0 21.75373  0.988806 1.977612


Now I’ll convert the transfat and hfcs factors into numerics

class(df$transfat)
## [1] "factor"
df$tfatnum <- ifelse(df$transfat=="YES", 1, 0)
df$hfcsnum <- ifelse(df$hfcs=="YES", 1, 0)

head(df)
##              Breakfast.Cereal         Brand size Calories Total.Fat
## 1                   Rice Chex General Mills 1.00      100       0.5
## 2               Rice Krispies     Kellogg's 1.25      120       0.0
## 3 Rice Krispies Treats Cereal     Kellogg's 0.75      120       1.5
## 4            Frosted Krispies     Kellogg's 0.75      110       0.0
## 5                        Trix General Mills 1.00      120       1.5
## 6                Lucky Charms General Mills 0.75      110       1.0
##   Saturated.Fat Carbs Fiber Sugar Protein transfat hfcs            siindex
## 1             0    23     0     2       2       NO   NO serving size index
## 2             0    29     0     3       2       NO  YES serving size index
## 3             0    26     0     9       1      YES  YES serving size index
## 4             0    27     0    12       1       NO  YES serving size index
## 5             0    28     1    13       1       NO  YES serving size index
## 6             0    22     1    11       2       NO   NO serving size index
##     sindex  calnorm  fatnorm sfatnorm carbnorm fibernorm protnorm tfatnum
## 1 1.318408 131.8408 0.659204        0 30.32338  0.000000 2.636816       0
## 2 1.648010 197.7612 0.000000        0 47.79229  0.000000 3.296020       0
## 3 0.988806 118.6567 1.483209        0 25.70896  0.000000 0.988806       1
## 4 0.988806 108.7687 0.000000        0 26.69776  0.000000 0.988806       0
## 5 1.318408 158.2090 1.977612        0 36.91542  1.318408 1.318408       0
## 6 0.988806 108.7687 0.988806        0 21.75373  0.988806 1.977612       0
##   hfcsnum
## 1       0
## 2       1
## 3       1
## 4       1
## 5       1
## 6       0


Make a new dataframe with the transformed columns
df2 <- data.frame(df[,c(1,2,14:22)])

and get the summary

summary(df2)
##               Breakfast.Cereal           Brand        sindex      
##  All-Bran             : 1      General Mills:22   Min.   :0.1450  
##  Apple Jacks          : 1      Kashi        : 1   1st Qu.:0.9888  
##  Banana Nut Crunch    : 1      Kellogg's    :20   Median :0.9888  
##  Basic 4              : 1      Post         : 9   Mean   :1.0000  
##  Cheerios             : 1      Quaker       : 1   3rd Qu.:1.3184  
##  Cinnamon Toast Crunch: 1                         Max.   :1.6480  
##  (Other)              :47                                         
##     calnorm          fatnorm          sfatnorm         carbnorm     
##  Min.   : 15.95   Min.   :0.0000   Min.   :0.0000   Min.   : 3.191  
##  1st Qu.: 98.88   1st Qu.:0.2900   1st Qu.:0.0000   1st Qu.:21.754  
##  Median :118.66   Median :0.9888   Median :0.0000   Median :26.368  
##  Mean   :126.42   Mean   :1.2036   Mean   :0.1167   Mean   :27.703  
##  3rd Qu.:158.21   3rd Qu.:1.4832   3rd Qu.:0.0000   3rd Qu.:34.279  
##  Max.   :379.04   Max.   :6.5920   Max.   :0.8240   Max.   :75.808  
##                                                                     
##    fibernorm          protnorm         tfatnum          hfcsnum      
##  Min.   : 0.0000   Min.   : 0.145   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 0.5142   1st Qu.: 1.318   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 1.3184   Median : 1.978   Median :0.0000   Median :0.0000  
##  Mean   : 2.1241   Mean   : 2.706   Mean   :0.2264   Mean   :0.3585  
##  3rd Qu.: 2.6368   3rd Qu.: 2.966   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :13.1841   Max.   :17.139   Max.   :1.0000   Max.   :1.0000  
## 

I can use boxplots to show how the transformation impacted the data.

qplot(Brand, Calories, data = df, geom= "boxplot") + ggtitle("Calories by Brand Group (original)")

qplot(Brand, calnorm, data = df2, geom= "boxplot") + ggtitle("Calories by Brand Group (normalized)")

The transformed data is much tighter but because the data is so skewed overall in neither case does a box plot really provide any actionable information.

Calorie count is probably the most important variable to base decisions on so let’s take a look at how the other factors correlate to it using scatter plots.

Before we do that, though, we should normalize all of the data

df2scale <- scale((df2[,c(4:11)]))
dfn <- data.frame(df2scale)
# class(dfn)
# scale
# scaled.dfn <- scale(dfn)
# scaled.dfn
calfatscat <- ggplot(dfn, aes(x=calnorm, y=fatnorm)) + geom_point() + 
    ggtitle("Calories vs. Fat Content") + 
  theme(plot.title = element_text(face="bold", hjust=0.5, vjust=1, size=10)) 
  
calsfatscat <- ggplot(dfn, aes(x=calnorm, y=sfatnorm)) + geom_point() + 
    ggtitle("Calories vs. Saturated Fat Content") + 
  theme(plot.title = element_text(face="bold", hjust=0.5, vjust=1, size=10)) 

calcarbscat <- ggplot(dfn, aes(x=calnorm, y=carbnorm)) + geom_point() + 
    ggtitle("Calories vs. Carb Content") + 
  theme(plot.title = element_text(face="bold", hjust=0.5, vjust=1, size=10)) 

calfiberscat <- ggplot(dfn, aes(x=calnorm, y=fibernorm)) + geom_point() + 
    ggtitle("Calories vs. Fiber Content") + 
  theme(plot.title = element_text(face="bold", hjust=0.5, vjust=1, size=10)) 

calprotscat <- ggplot(dfn, aes(x=calnorm, y=protnorm)) + geom_point() + 
    ggtitle("Calories vs. Protein Content") + 
  theme(plot.title = element_text(face="bold", hjust=0.5, vjust=1, size=10)) 


require(gridExtra)
grid.arrange(calfatscat, calsfatscat, calcarbscat, calfiberscat, calprotscat)


Clearly carbohydrate content is well-correlated with calories. Fat, fiber, and protein don’t look like they are particularly linked, though. Let’s take a look at the actual correlations.

To do that we’re first going to have to make a table that is purely numeric

require(ggcorrplot)
corr <- round(cor(dfn), 1)

corr
##           calnorm fatnorm sfatnorm carbnorm fibernorm protnorm tfatnum
## calnorm       1.0     0.7      0.5      1.0       0.4      0.6    -0.1
## fatnorm       0.7     1.0      0.7      0.6       0.4      0.4     0.0
## sfatnorm      0.5     0.7      1.0      0.4       0.1      0.2     0.3
## carbnorm      1.0     0.6      0.4      1.0       0.4      0.6    -0.1
## fibernorm     0.4     0.4      0.1      0.4       1.0      0.7    -0.2
## protnorm      0.6     0.4      0.2      0.6       0.7      1.0    -0.2
## tfatnum      -0.1     0.0      0.3     -0.1      -0.2     -0.2     1.0
## hfcsnum       0.3     0.1      0.2      0.3       0.0      0.1     0.1
##           hfcsnum
## calnorm       0.3
## fatnorm       0.1
## sfatnorm      0.2
## carbnorm      0.3
## fibernorm     0.0
## protnorm      0.1
## tfatnum       0.1
## hfcsnum       1.0
ggcorrplot(corr)

with this grid I can see that there isn’t much correlation between any other factors, though protein and fiber content seem to be somewhat correlated. Let’s just take one last look at what those two items look like in a scatterplot

ggplot(dfn, aes(x=fatnorm, y=protnorm)) + geom_point() + 
    ggtitle("Fat vs. Protein Content") + 
  theme(plot.title = element_text(face="bold", hjust=0.5, vjust=1, size=10)) 

Conclusion

It seems like data science hasn’t helped my cereal buying decisions quite yet. Other than knowing that the more carbs a cereal has the more calories it has (not a groundbreaking revelation) and that cereals with higher fat content might have more protein it looks like I’m still going to have to be checking those boxes.