80 cereals dataset

Source:

I initially found this dataset on Kaggle: https://www.kaggle.com/datasets/crawford/80-cereals.

However, this dataset was gathered by Petra Isenberg, Pierre Dragicevic and Yvonne Jansen. The original source can be found here: https://perso.telecom-paristech.fr/eagan/class/igr204/datasets.

Why this Dataset?

I chose this dataset because I thought it would be fun to explore, analyze, and visualize. I also thought that we would all enjoy it as a class, since cereal is a pretty common food, and that we would benefit from learning more about our favorite cereals.

Context

This is a multivariate dataset describing seventy-seven commonly available breakfast cereals with their dietary characteristics.

Content

Variables in the dataset:

  • name: Name of cereal
  • mfr: Manufacturer of cereal
    • A = American Home Food Products
    • G = General Mills
    • K = Kelloggs
    • N = Nabisco
    • P = Post
    • Q = Quaker Oats
    • R = Ralston Purina
  • type: Type of cereal
    • C = cold
    • H = hot
  • calories: calories per serving
  • protein: grams of protein
  • fat: grams of fat
  • sodium: milligrams of sodium
  • fiber: grams of dietary fiber
  • carbo: grams of complex carbohydrates
  • sugars: grams of sugars
  • potass: milligrams of potassium
  • vitamins: vitamins and minerals (0, 25, or 100, indicating the typical percentage of the FDA-recommended amount)
  • shelf: display shelf (1, 2, or 3, counting from the floor)
  • weight: weight in ounces of one serving
  • cups: number of cups in one serving
  • rating: a rating of the cereals (Possibly from Consumer Reports?)

What I explored:

  • Calories by Manufacturer
  • Do people prefer hot or cold cereal?
  • Cereals that have more than 3 grams of protein
  • Cereals that have less than 3 grams of protein
  • Which cereals have the most sugar and sodium?
  • Highest and lowest-rated cereals
  • Cereals by Shelf Display

Loading the libraries

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
library(ggplot2)
library(readr)
cereal <- read_csv("cereal.csv")
## Rows: 77 Columns: 16
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (3): name, mfr, type
## dbl (13): calories, protein, fat, sodium, fiber, carbo, sugars, potass, vita...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(cereal)

Exploring and Cleaning…

str(cereal)
## spec_tbl_df [77 x 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ name    : chr [1:77] "100% Bran" "100% Natural Bran" "All-Bran" "All-Bran with Extra Fiber" ...
##  $ mfr     : chr [1:77] "N" "Q" "K" "K" ...
##  $ type    : chr [1:77] "C" "C" "C" "C" ...
##  $ calories: num [1:77] 70 120 70 50 110 110 110 130 90 90 ...
##  $ protein : num [1:77] 4 3 4 4 2 2 2 3 2 3 ...
##  $ fat     : num [1:77] 1 5 1 0 2 2 0 2 1 0 ...
##  $ sodium  : num [1:77] 130 15 260 140 200 180 125 210 200 210 ...
##  $ fiber   : num [1:77] 10 2 9 14 1 1.5 1 2 4 5 ...
##  $ carbo   : num [1:77] 5 8 7 8 14 10.5 11 18 15 13 ...
##  $ sugars  : num [1:77] 6 8 5 0 8 10 14 8 6 5 ...
##  $ potass  : num [1:77] 280 135 320 330 -1 70 30 100 125 190 ...
##  $ vitamins: num [1:77] 25 0 25 25 25 25 25 25 25 25 ...
##  $ shelf   : num [1:77] 3 3 3 3 3 1 2 3 1 3 ...
##  $ weight  : num [1:77] 1 1 1 1 1 1 1 1.33 1 1 ...
##  $ cups    : num [1:77] 0.33 1 0.33 0.5 0.75 0.75 1 0.75 0.67 0.67 ...
##  $ rating  : num [1:77] 68.4 34 59.4 93.7 34.4 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   name = col_character(),
##   ..   mfr = col_character(),
##   ..   type = col_character(),
##   ..   calories = col_double(),
##   ..   protein = col_double(),
##   ..   fat = col_double(),
##   ..   sodium = col_double(),
##   ..   fiber = col_double(),
##   ..   carbo = col_double(),
##   ..   sugars = col_double(),
##   ..   potass = col_double(),
##   ..   vitamins = col_double(),
##   ..   shelf = col_double(),
##   ..   weight = col_double(),
##   ..   cups = col_double(),
##   ..   rating = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(cereal)
##      name               mfr                type              calories    
##  Length:77          Length:77          Length:77          Min.   : 50.0  
##  Class :character   Class :character   Class :character   1st Qu.:100.0  
##  Mode  :character   Mode  :character   Mode  :character   Median :110.0  
##                                                           Mean   :106.9  
##                                                           3rd Qu.:110.0  
##                                                           Max.   :160.0  
##     protein           fat            sodium          fiber       
##  Min.   :1.000   Min.   :0.000   Min.   :  0.0   Min.   : 0.000  
##  1st Qu.:2.000   1st Qu.:0.000   1st Qu.:130.0   1st Qu.: 1.000  
##  Median :3.000   Median :1.000   Median :180.0   Median : 2.000  
##  Mean   :2.545   Mean   :1.013   Mean   :159.7   Mean   : 2.152  
##  3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:210.0   3rd Qu.: 3.000  
##  Max.   :6.000   Max.   :5.000   Max.   :320.0   Max.   :14.000  
##      carbo          sugars           potass          vitamins     
##  Min.   :-1.0   Min.   :-1.000   Min.   : -1.00   Min.   :  0.00  
##  1st Qu.:12.0   1st Qu.: 3.000   1st Qu.: 40.00   1st Qu.: 25.00  
##  Median :14.0   Median : 7.000   Median : 90.00   Median : 25.00  
##  Mean   :14.6   Mean   : 6.922   Mean   : 96.08   Mean   : 28.25  
##  3rd Qu.:17.0   3rd Qu.:11.000   3rd Qu.:120.00   3rd Qu.: 25.00  
##  Max.   :23.0   Max.   :15.000   Max.   :330.00   Max.   :100.00  
##      shelf           weight          cups           rating     
##  Min.   :1.000   Min.   :0.50   Min.   :0.250   Min.   :18.04  
##  1st Qu.:1.000   1st Qu.:1.00   1st Qu.:0.670   1st Qu.:33.17  
##  Median :2.000   Median :1.00   Median :0.750   Median :40.40  
##  Mean   :2.208   Mean   :1.03   Mean   :0.821   Mean   :42.67  
##  3rd Qu.:3.000   3rd Qu.:1.00   3rd Qu.:1.000   3rd Qu.:50.83  
##  Max.   :3.000   Max.   :1.50   Max.   :1.500   Max.   :93.70
glimpse(cereal)
## Rows: 77
## Columns: 16
## $ name     <chr> "100% Bran", "100% Natural Bran", "All-Bran", "All-Bran with ~
## $ mfr      <chr> "N", "Q", "K", "K", "R", "G", "K", "G", "R", "P", "Q", "G", "~
## $ type     <chr> "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "~
## $ calories <dbl> 70, 120, 70, 50, 110, 110, 110, 130, 90, 90, 120, 110, 120, 1~
## $ protein  <dbl> 4, 3, 4, 4, 2, 2, 2, 3, 2, 3, 1, 6, 1, 3, 1, 2, 2, 1, 1, 3, 3~
## $ fat      <dbl> 1, 5, 1, 0, 2, 2, 0, 2, 1, 0, 2, 2, 3, 2, 1, 0, 0, 0, 1, 3, 0~
## $ sodium   <dbl> 130, 15, 260, 140, 200, 180, 125, 210, 200, 210, 220, 290, 21~
## $ fiber    <dbl> 10.0, 2.0, 9.0, 14.0, 1.0, 1.5, 1.0, 2.0, 4.0, 5.0, 0.0, 2.0,~
## $ carbo    <dbl> 5.0, 8.0, 7.0, 8.0, 14.0, 10.5, 11.0, 18.0, 15.0, 13.0, 12.0,~
## $ sugars   <dbl> 6, 8, 5, 0, 8, 10, 14, 8, 6, 5, 12, 1, 9, 7, 13, 3, 2, 12, 13~
## $ potass   <dbl> 280, 135, 320, 330, -1, 70, 30, 100, 125, 190, 35, 105, 45, 1~
## $ vitamins <dbl> 25, 0, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25~
## $ shelf    <dbl> 3, 3, 3, 3, 3, 1, 2, 3, 1, 3, 2, 1, 2, 3, 2, 1, 1, 2, 2, 3, 2~
## $ weight   <dbl> 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.33, 1.00, 1.00, 1~
## $ cups     <dbl> 0.33, 1.00, 0.33, 0.50, 0.75, 0.75, 1.00, 0.75, 0.67, 0.67, 0~
## $ rating   <dbl> 68.40297, 33.98368, 59.42551, 93.70491, 34.38484, 29.50954, 3~
table(cereal$type) # To see how many cereals for each type. 
## 
##  C  H 
## 74  3

We can see that:

  • There are 77 observations of 16 variables.
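  • There are negative values (-1) for carbs, sugars, and potassium. Since a cereal cannot contain a negative amount of a nutrient, these are most likely placeholders for missing data.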
  • Most of the variables are numerical with a few columns made of characters.
  • There are 74 cold cereals and only 3 hot cereals.
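
One way to handle the negative values noted above is to convert them to NA before computing any averages. The following is only a minimal sketch, assuming that -1 really is a missing-value code (the dataset description does not say so explicitly). The toy data frame `toy` is a hypothetical stand-in that copies the first five rows of three columns from the str() output above.

```r
library(dplyr)

# Toy data: first five rows of three columns, copied from the str() output above.
toy <- tibble(
  carbo  = c(5, 8, 7, 8, 14),
  sugars = c(6, 8, 5, 0, 8),
  potass = c(280, 135, 320, 330, -1)
)

# Assumption: -1 marks a missing measurement, so convert it to NA.
toy_clean <- toy %>%
  mutate(across(c(carbo, sugars, potass), ~ na_if(.x, -1)))

# Count the missing values per column after the conversion.
colSums(is.na(toy_clean))
```

After this step, functions such as mean() need `na.rm = TRUE` to skip the missing entries.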

Analyzing Calories by Manufacturer (Boxplot)

  • mfr: Manufacturer of cereal
    • A = American Home Food Products
    • G = General Mills
    • K = Kelloggs
    • N = Nabisco
    • P = Post
    • Q = Quaker Oats
    • R = Ralston Purina
ggplot(cereal, aes(calories, mfr, fill = mfr))+
  geom_boxplot()+
  ggtitle("Calories by Manufacturer")+
  ylab("manufacturer")

From this boxplot, we can see that Ralston Purina's cereals tend to have the most calories, while Quaker Oats' cereals tend to have the fewest. The outliers show that calories per serving range from under 60 to 160, and that Kelloggs manufactures the cereals with both the lowest and the highest calorie counts.
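
The numbers behind a boxplot like this can also be tabulated directly. This is a rough sketch rather than part of the original analysis: the toy data frame `toy` holds only the first ten rows of `mfr` and `calories` from the str() output, so its results will not match the full 77-row plot.

```r
library(dplyr)

# Toy data: manufacturer codes and calories for the first ten rows of the str() output.
toy <- tibble(
  mfr      = c("N", "Q", "K", "K", "R", "G", "K", "G", "R", "P"),
  calories = c(70, 120, 70, 50, 110, 110, 110, 130, 90, 90)
)

# Minimum, median, and maximum calories per manufacturer,
# sorted so the manufacturer with the highest median comes first.
calories_by_mfr <- toy %>%
  group_by(mfr) %>%
  summarise(
    n          = n(),
    min_cal    = min(calories),
    median_cal = median(calories),
    max_cal    = max(calories)
  ) %>%
  arrange(desc(median_cal))

calories_by_mfr
```

Running the same pipeline on the full `cereal` data frame would give the exact medians and extremes that the boxplot only shows visually.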

More exploring…

I want to see which types of cereal people prefer and from which manufacturers:

ggplot(cereal, aes(mfr, rating, fill = type))+ 
  geom_col(position = 'dodge')+
  theme_classic()+
  ggtitle('Ratings for cereal type')

From this we can conclude that:

  • On average, most people appear to prefer cold cereals, although with only 3 hot cereals in the dataset this comparison is tentative.
  • American Home Food Products only manufactures hot cereals.
  • Kelloggs, Ralston Purina, Post, and General Mills only manufacture cold cereal.
  • Nabisco and Quaker Oats manufacture both cold and hot cereal.
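
As far as I understand, geom_col() draws one bar per row, so when a manufacturer has several cereals of the same type the bars overlap and only the tallest is visible. To check an "on average" statement, it may help to average the ratings first. The sketch below shows the mechanics only; the values in `toy` are hypothetical and are NOT taken from cereal.csv.

```r
library(dplyr)

# Hypothetical toy data (NOT values from cereal.csv), only to show the mechanics.
toy <- tibble(
  mfr    = c("N", "N", "N", "Q", "Q"),
  type   = c("C", "C", "H", "C", "H"),
  rating = c(60, 70, 65, 40, 50)
)

# Number of cereals and mean rating for each manufacturer/type combination.
rating_by_type <- toy %>%
  group_by(mfr, type) %>%
  summarise(n = n(), mean_rating = mean(rating), .groups = "drop")

rating_by_type
```

The resulting table could then be passed to ggplot() with geom_col(position = 'dodge') so that each bar represents a mean rather than a single cereal.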

Next, I want to see which cereals have the most grams of protein, as well as which have the least.

table(cereal$protein)
## 
##  1  2  3  4  5  6 
## 13 25 28  8  1  2

This table tells us that the cereals range from 1 to 6 grams of protein per serving, and that most of them have 2 or 3 grams.

Filtering only cereals that have more than 3 grams of protein.
proteinplot <- cereal %>%
  select(name, protein)%>%
  filter(protein > 3)

Plotting

ggplot(proteinplot, aes(protein, name,color = protein))+
  geom_point()+
  xlab("protein (grams)")+
  ylab("Cereal names")+
  ggtitle("Cereals that have more than 3 grams of protein")

We can see that Special K and Cheerios have the most grams of protein (6 grams). I was surprised to see Cheerios have the most grams of protein.

Now filtering those that have less than 3 grams of protein (cereals with exactly 3 grams fall into neither group):

proteinplot2 <- cereal %>%
  select(name, protein)%>%
  filter(protein < 3)

Plotting cereals that have less than 3 grams of protein

ggplot(proteinplot2, aes(protein, name,color = protein))+
  geom_point()+
  xlab("protein (grams)")+
  ylab("Cereal names")+
  ggtitle("Cereals that have less than 3 grams of protein")

We can see that some of the most popular cereals, such as Cinnamon Toast Crunch, Cap'n'Crunch, and Fruity Pebbles, have only 1 gram of protein.
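
The two filters above use `protein > 3` and `protein < 3`, which leaves out cereals with exactly 3 grams. A single pass can label all three groups at once. This is only a sketch: `toy` is a hypothetical stand-in holding the first ten protein values from the str() output, and the group labels are my own.

```r
library(dplyr)

# Toy data: protein values for the first ten rows of the str() output.
toy <- tibble(protein = c(4, 3, 4, 4, 2, 2, 2, 3, 2, 3))

# Label each cereal by protein group, then count the cereals in each group.
protein_groups <- toy %>%
  mutate(group = case_when(
    protein > 3 ~ "more than 3 g",
    protein < 3 ~ "less than 3 g",
    TRUE        ~ "exactly 3 g"
  )) %>%
  count(group)

protein_groups
```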

Sugar and Sodium Contents For Cereals using Tableau

Exploring what cereals have the highest contents of sugar and sodium:

Link: https://public.tableau.com/views/80Cereals_16503220510550/Sheet3?:language=en-US&:display_count=n&:origin=viz_share_link

From this visualization we can conclude that:

  • Golden Crisp and Smacks have the most grams of sugar (15 grams).
  • Product 19 has the highest sodium content. This cereal didn’t sound familiar to me, so I looked it up; according to https://www.atlasobscura.com/articles/product-19-cereal-discontinued, it was discontinued in 2016. The second-highest are Rice Krispies, Corn Flakes, and Cheerios (290 mg, or 0.29 g, each), which are all cereals we can still find today.
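
The same kind of ranking could also be produced in R instead of Tableau. The sketch below is an illustration and not the method used for the visualization: `toy` contains only the first four cereals from the str() output, so its "winners" differ from the full-data results listed above.

```r
library(dplyr)

# Toy data: the first four cereals, with values copied from the str() output.
toy <- tibble(
  name   = c("100% Bran", "100% Natural Bran", "All-Bran", "All-Bran with Extra Fiber"),
  sugars = c(6, 8, 5, 0),
  sodium = c(130, 15, 260, 140)
)

# slice_max() keeps the row(s) with the largest value; ties are kept by default.
top_sugar  <- toy %>% slice_max(sugars, n = 1) %>% select(name, sugars)
top_sodium <- toy %>% slice_max(sodium, n = 1) %>% select(name, sodium)

top_sugar
top_sodium
```

Because ties are kept, running this on the full data would return every cereal that shares the top value.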

Is this too much sugar?

According to the American Heart Association (AHA), the maximum amounts of added sugars and sodium you should eat in a day are:

  • Added sugars, men: 150 calories per day (37.5 grams or 9 teaspoons)
  • Added sugars, women: 100 calories per day (25 grams or 6 teaspoons)
  • Sodium: no more than 2,300 milligrams (mg) a day, moving toward an ideal limit of no more than 1,500 mg per day for most adults.

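At 15 grams, a single serving of these cereals supplies 60% of the daily added-sugar limit for women and 40% of the limit for men (assuming most of that sugar is added), which is a lot for one bowl.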
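
The comparison with the AHA limits can be worked out in a few lines of R. This is a small sketch with variable names of my own choosing; it assumes that all 15 grams count as added sugar, which the dataset does not actually state.

```r
# Sugar in one serving of the most sugary cereals (grams).
sugar_g <- 15

# AHA daily limits for added sugars (grams), as listed above.
limit_g <- c(men = 37.5, women = 25)

# Share of each daily limit used up by one serving, in percent.
share_pct <- round(100 * sugar_g / limit_g)

share_pct
```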

Sources: https://pubmed.ncbi.nlm.nih.gov/19704096/ and https://www.heart.org/en/healthy-living/healthy-eating/eat-smart/sodium/how-much-sodium-should-i-eat-per-day

Ratings for Cereals using Tableau

Exploring what cereals are rated the highest/lowest:

Link: https://public.tableau.com/views/RatingsforCereal/Sheet1?:language=en-US&:display_count=n&:origin=viz_share_link

Cereals by Shelf Display using Tableau

Exploring shelf displays for the cereals

Link: https://public.tableau.com/views/CerealShelfDisplay/Sheet2?:language=en-US&:display_count=n&:origin=viz_share_link

I took a nutrition class last semester, where I heard something interesting: less healthy cereals are mostly displayed on the middle shelves so that they are the first thing we see, while healthier options are usually placed on the top or bottom shelves. After some research, I found evidence of this:

  • According to this scientific article published in the National Library of Medicine, “A total of 19.8% of cereals were displayed on the bottom shelf, 52.9% were displayed on the middle shelves, 24.5% were on the top shelf and 2.9% were found on multiple shelves. Less healthy cereals were displayed at eye level, in the middle shelf, 2.9 times more frequently than healthier cereals.”

Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5684751/

From this visualization, we can confirm that, in this dataset, less healthy cereals are in fact displayed on the middle shelf (shelf 2). These have the highest contents of both sugars and sodium.
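
A shelf comparison like this can also be checked numerically by averaging sugars and sodium per shelf. The sketch below only illustrates the approach: `toy` is a hypothetical stand-in built from the first ten rows of the str() output, so its averages say nothing about the full dataset.

```r
library(dplyr)

# Toy data: shelf, sugars, and sodium for the first ten rows of the str() output.
toy <- tibble(
  shelf  = c(3, 3, 3, 3, 3, 1, 2, 3, 1, 3),
  sugars = c(6, 8, 5, 0, 8, 10, 14, 8, 6, 5),
  sodium = c(130, 15, 260, 140, 200, 180, 125, 210, 200, 210)
)

# Number of cereals and mean sugars/sodium on each shelf.
by_shelf <- toy %>%
  group_by(shelf) %>%
  summarise(
    n           = n(),
    mean_sugars = mean(sugars),
    mean_sodium = mean(sodium)
  )

by_shelf
```

On the full data, any -1 values in `sugars` would need to be handled first (for example by converting them to NA and using `na.rm = TRUE`), otherwise they would pull the averages down.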

This 80 cereals dataset revealed some interesting findings as well as confirmed other outside observations. I am pretty happy with my work and I have included everything that I wanted to explore in this dataset!