Introduction

Context

Variables in the dataset:

  • name: Name of cereal

  • mfr: Manufacturer of cereal

    A = American Home Food Products

    G = General Mills

    K = Kelloggs

    N = Nabisco

    P = Post

    Q = Quaker Oats

    R = Ralston Purina

  • type: type of cereal

    C = cold

    H = hot

  • calories: calories per serving

  • protein: grams of protein

  • fat: grams of fat

  • sodium: milligrams of sodium

  • fiber: grams of dietary fiber

  • carbo: grams of complex carbohydrates

  • sugars: grams of sugars

  • potass: milligrams of potassium

  • vitamins: vitamins and minerals - 0, 25, or 100, indicating the typical percentage of FDA recommended

  • shelf: display shelf (1, 2, or 3, counting from the floor)

  • weight: weight in ounces of one serving

  • cups: number of cups in one serving

  • rating: a rating of the cereals (Possibly from Consumer Reports?)

1. Explanation

The dataset is from Kaggle: https://www.kaggle.com/datasets/crawford/80-cereals. Despite the "80" in its name, the file contains information about 77 cereal products from 7 manufacturers: American Home Food Products, General Mills, Kelloggs, Nabisco, Post, Quaker Oats, and Ralston Purina. This analysis aims to give a better understanding of what goes into the cereal that we eat every day for breakfast.

2. Read data and initial inspection

Loading libraries

library(readr)
library(dplyr)
library(reshape)
library(ggplot2)

2.1. Reading data

Read data using read.csv (data is in the same folder)

cereal <- read.csv("cereal.csv")

2.2. Initial data inspection

Data dimension is 77 rows by 16 columns

dim(cereal)
## [1] 77 16

The columns of the dataset

names(cereal)
##  [1] "name"     "mfr"      "type"     "calories" "protein"  "fat"     
##  [7] "sodium"   "fiber"    "carbo"    "sugars"   "potass"   "vitamins"
## [13] "shelf"    "weight"   "cups"     "rating"
head(cereal, 10)
##                         name mfr type calories protein fat sodium fiber carbo
## 1                  100% Bran   N    C       70       4   1    130  10.0   5.0
## 2          100% Natural Bran   Q    C      120       3   5     15   2.0   8.0
## 3                   All-Bran   K    C       70       4   1    260   9.0   7.0
## 4  All-Bran with Extra Fiber   K    C       50       4   0    140  14.0   8.0
## 5             Almond Delight   R    C      110       2   2    200   1.0  14.0
## 6    Apple Cinnamon Cheerios   G    C      110       2   2    180   1.5  10.5
## 7                Apple Jacks   K    C      110       2   0    125   1.0  11.0
## 8                    Basic 4   G    C      130       3   2    210   2.0  18.0
## 9                  Bran Chex   R    C       90       2   1    200   4.0  15.0
## 10               Bran Flakes   P    C       90       3   0    210   5.0  13.0
##    sugars potass vitamins shelf weight cups   rating
## 1       6    280       25     3   1.00 0.33 68.40297
## 2       8    135        0     3   1.00 1.00 33.98368
## 3       5    320       25     3   1.00 0.33 59.42551
## 4       0    330       25     3   1.00 0.50 93.70491
## 5       8     -1       25     3   1.00 0.75 34.38484
## 6      10     70       25     1   1.00 0.75 29.50954
## 7      14     30       25     2   1.00 1.00 33.17409
## 8       8    100       25     3   1.33 0.75 37.03856
## 9       6    125       25     1   1.00 0.67 49.12025
## 10      5    190       25     3   1.00 0.67 53.31381

Data structure

str(cereal)
## 'data.frame':    77 obs. of  16 variables:
##  $ name    : chr  "100% Bran" "100% Natural Bran" "All-Bran" "All-Bran with Extra Fiber" ...
##  $ mfr     : chr  "N" "Q" "K" "K" ...
##  $ type    : chr  "C" "C" "C" "C" ...
##  $ calories: int  70 120 70 50 110 110 110 130 90 90 ...
##  $ protein : int  4 3 4 4 2 2 2 3 2 3 ...
##  $ fat     : int  1 5 1 0 2 2 0 2 1 0 ...
##  $ sodium  : int  130 15 260 140 200 180 125 210 200 210 ...
##  $ fiber   : num  10 2 9 14 1 1.5 1 2 4 5 ...
##  $ carbo   : num  5 8 7 8 14 10.5 11 18 15 13 ...
##  $ sugars  : int  6 8 5 0 8 10 14 8 6 5 ...
##  $ potass  : int  280 135 320 330 -1 70 30 100 125 190 ...
##  $ vitamins: int  25 0 25 25 25 25 25 25 25 25 ...
##  $ shelf   : int  3 3 3 3 3 1 2 3 1 3 ...
##  $ weight  : num  1 1 1 1 1 1 1 1.33 1 1 ...
##  $ cups    : num  0.33 1 0.33 0.5 0.75 0.75 1 0.75 0.67 0.67 ...
##  $ rating  : num  68.4 34 59.4 93.7 34.4 ...

Data summary

summary(cereal)
##      name               mfr                type              calories    
##  Length:77          Length:77          Length:77          Min.   : 50.0  
##  Class :character   Class :character   Class :character   1st Qu.:100.0  
##  Mode  :character   Mode  :character   Mode  :character   Median :110.0  
##                                                           Mean   :106.9  
##                                                           3rd Qu.:110.0  
##                                                           Max.   :160.0  
##     protein           fat            sodium          fiber       
##  Min.   :1.000   Min.   :0.000   Min.   :  0.0   Min.   : 0.000  
##  1st Qu.:2.000   1st Qu.:0.000   1st Qu.:130.0   1st Qu.: 1.000  
##  Median :3.000   Median :1.000   Median :180.0   Median : 2.000  
##  Mean   :2.545   Mean   :1.013   Mean   :159.7   Mean   : 2.152  
##  3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:210.0   3rd Qu.: 3.000  
##  Max.   :6.000   Max.   :5.000   Max.   :320.0   Max.   :14.000  
##      carbo          sugars           potass          vitamins     
##  Min.   :-1.0   Min.   :-1.000   Min.   : -1.00   Min.   :  0.00  
##  1st Qu.:12.0   1st Qu.: 3.000   1st Qu.: 40.00   1st Qu.: 25.00  
##  Median :14.0   Median : 7.000   Median : 90.00   Median : 25.00  
##  Mean   :14.6   Mean   : 6.922   Mean   : 96.08   Mean   : 28.25  
##  3rd Qu.:17.0   3rd Qu.:11.000   3rd Qu.:120.00   3rd Qu.: 25.00  
##  Max.   :23.0   Max.   :15.000   Max.   :330.00   Max.   :100.00  
##      shelf           weight          cups           rating     
##  Min.   :1.000   Min.   :0.50   Min.   :0.250   Min.   :18.04  
##  1st Qu.:1.000   1st Qu.:1.00   1st Qu.:0.670   1st Qu.:33.17  
##  Median :2.000   Median :1.00   Median :0.750   Median :40.40  
##  Mean   :2.208   Mean   :1.03   Mean   :0.821   Mean   :42.67  
##  3rd Qu.:3.000   3rd Qu.:1.00   3rd Qu.:1.000   3rd Qu.:50.83  
##  Max.   :3.000   Max.   :1.50   Max.   :1.500   Max.   :93.70

insight

There are negative values in carbo, sugars, and potass.

insight

There are a lot of “outliers” in calories and weight, which might be caused by different company size classifications (small, medium, large).
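One way to make the “outliers” remark checkable is to count the values that fall outside the boxplot whiskers. The following is a minimal sketch, assuming the usual 1.5 × IQR rule is what is meant by an outlier; the helper name `count_outliers` is hypothetical, and the ten calorie values are copied from the head(cereal, 10) output, so the result says nothing about the full dataset.

```r
# Ten calorie values copied from the head(cereal, 10) output
calories <- c(70, 120, 70, 50, 110, 110, 110, 130, 90, 90)

# Hypothetical helper: count values beyond 1.5 * IQR from the hinges,
# which is the rule boxplot.stats() uses by default
count_outliers <- function(x) {
  length(boxplot.stats(x)$out)
}

count_outliers(calories)

# On the full data the same helper could be applied to every numeric column:
# sapply(Filter(is.numeric, cereal), count_outliers)
```

Counting per column this way shows which variables actually drive the impression of many outliers.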

3. Data cleansing and transformation

3.1. Check missing data

Check if data has missing value

anyNA(cereal)
## [1] FALSE

insight

The data has no missing values. However, we need to replace the negative values with zero because none of the columns can take a value below zero.

cereal[cereal < 0] <- 0
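The replacement above compares every column, including the character ones, against zero. A sketch of a variant restricted to numeric columns is shown here, together with an alternative that marks negatives as missing instead of zero. It assumes the negative values are placeholders for unrecorded measurements, which the source does not state; the data frame `toy` is a three-row excerpt of the head(cereal, 10) output, not the full dataset.

```r
# Three rows copied from head(cereal, 10); potass holds the -1 seen there
toy <- data.frame(
  name   = c("Almond Delight", "Apple Jacks", "Basic 4"),
  sugars = c(8, 14, 8),
  potass = c(-1, 30, 100)
)

# Logical vector flagging which columns are numeric
num_cols <- sapply(toy, is.numeric)

# Number of negative values in each numeric column
neg_counts <- colSums(toy[num_cols] < 0)
neg_counts

# Alternative to zero-filling (assumption: negatives mean "not recorded")
toy_na <- toy
toy_na[num_cols] <- lapply(toy_na[num_cols], function(x) replace(x, x < 0, NA))

# Summaries then need na.rm = TRUE
mean(toy_na$potass, na.rm = TRUE)
```

Treating the values as NA keeps them out of means and correlations, whereas zero-filling pulls those statistics toward zero; which choice is right depends on what the negative values really represent.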

3.2. Transform data type

# Convert the coded columns to factors and replace codes with readable labels
cereal <- cereal %>%
  mutate(mfr = as.factor(mfr),
         type = as.factor(type),
         mfr = recode(mfr,
                      A = "American Home Food Products",
                      G = "General Mills",
                      K = "Kelloggs",
                      N = "Nabisco",
                      P = "Post",
                      Q = "Quaker Oats",
                      R = "Ralston Purina"),
         type = recode(type,
                       C = "Cold",
                       H = "Hot"))
levels(cereal$mfr)
## [1] "American Home Food Products" "General Mills"              
## [3] "Kelloggs"                    "Nabisco"                    
## [5] "Post"                        "Quaker Oats"                
## [7] "Ralston Purina"
levels(cereal$type)
## [1] "Cold" "Hot"

Manufacturer (mfr) of cereal:

  • A = American Home Food Products

  • G = General Mills

  • K = Kelloggs

  • N = Nabisco

  • P = Post

  • Q = Quaker Oats

  • R = Ralston Purina

Type of cereal:

  • C = Cold

  • H = Hot
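The code-to-label mapping listed above can also be kept in a single named vector, which makes label typos easier to spot and lets the script stop if an unknown code appears. This is only a sketch of that alternative in base R; the names `mfr_codes` and `mfr_labels` are illustrative, and the six codes are the first six mfr values from the head(cereal, 10) output.

```r
# First six mfr codes from the head(cereal, 10) output
mfr_codes <- c("N", "Q", "K", "K", "R", "G")

# One lookup vector: names are the codes, values are the labels
mfr_labels <- c(
  A = "American Home Food Products",
  G = "General Mills",
  K = "Kelloggs",
  N = "Nabisco",
  P = "Post",
  Q = "Quaker Oats",
  R = "Ralston Purina"
)

# Stop early if the data contains a code that has no label
stopifnot(all(mfr_codes %in% names(mfr_labels)))

# Build the factor with codes as levels and the full names as labels
mfr_factor <- factor(mfr_codes,
                     levels = names(mfr_labels),
                     labels = unname(mfr_labels))
as.character(mfr_factor)
```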

4. Further exploratory analysis

1. Correlation heatmap of numeric column values

cereal_is.numeric <- cereal %>% 
  select_if(is.numeric)

cereal_heatmap <- round(cor(cereal_is.numeric), 1)
cereal_heatmap <- melt(cereal_heatmap)

ggplot(cereal_heatmap, aes(x=X1, y=X2, fill=value)) +
  geom_tile()+
  scale_fill_gradient(low = "pink", high = "dark red")+
  labs(title="Correlation Heatmap",x="", y="")+
  geom_text(aes(x=X1, y=X2, label=value), color="white")+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1), plot.title = element_text(hjust=0.5))

insight

There seems to be a very strong correlation between the fiber and potassium content of cereal. Weight also correlates positively with calories, which makes sense because a heavier serving contains more food and therefore more energy. The correlation is not perfect, however, because cereals differ in how many calories they pack per ounce.
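Reading the strongest pairs off a heatmap by eye is error-prone, so it can help to list them. The sketch below ranks variable pairs by absolute correlation in base R. It uses only four columns and the ten rows shown in the head(cereal, 10) output (with the -1 in potass already set to 0), so the numbers it produces will differ from those of the full dataset; the data frame `toy` and the column names `var1`, `var2`, and `r` are illustrative.

```r
# Ten rows of four columns copied from head(cereal, 10);
# the -1 in potass is replaced by 0, as in section 3.1
toy <- data.frame(
  calories = c(70, 120, 70, 50, 110, 110, 110, 130, 90, 90),
  fiber    = c(10, 2, 9, 14, 1, 1.5, 1, 2, 4, 5),
  potass   = c(280, 135, 320, 330, 0, 70, 30, 100, 125, 190),
  weight   = c(1, 1, 1, 1, 1, 1, 1, 1.33, 1, 1)
)

corr <- cor(toy)

# Keep each pair once by blanking the diagonal and the lower triangle
corr[lower.tri(corr, diag = TRUE)] <- NA

# Reshape the matrix into one row per variable pair
pairs <- as.data.frame(as.table(corr), stringsAsFactors = FALSE)
names(pairs) <- c("var1", "var2", "r")
pairs <- pairs[!is.na(pairs$r), ]

# Sort by absolute correlation, strongest first
pairs <- pairs[order(-abs(pairs$r)), ]
head(pairs, 3)
```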

2. It may be more useful to look at the correlations with the rating column specifically

cereal_ratcorr <- data.frame(cor(cereal_is.numeric)) %>% select(rating) %>% filter(rating != 1)
cereal_ratcorr$id <- row.names(cereal_ratcorr)


ggplot(cereal_ratcorr, mapping=aes(x=id, y=rating, fill=rating)) +
  geom_col()+
  ylim(-1,1)+
  scale_fill_gradient2(low = "#3D2F38", mid="#715266", high = "#A3918C", limits=c(-1,1))+
  labs(title="Correlation with Rating",x="", y="Correlation Value")+
  coord_flip()+
  theme_minimal()+
  theme(plot.title = element_text(hjust=0.5))  # after theme_minimal() so the centered title is not reset

insight

There is no strong positive correlation between rating and any of the cereal’s ingredients; the largest positive correlation value is only 0.58. Meanwhile, two variables have a strong negative correlation with rating: sugars and calories, with correlation values of -0.76 and -0.69 respectively.

3. Calories and sugars have a fairly strong negative correlation with ratings

# Long format: one row per (rating, variable, value) for faceting
cereal_calcor <- melt(
  cereal %>% select(rating, calories, sugars),
  id.vars = "rating"
)

ggplot(cereal_calcor, mapping=aes(x=rating, y=value)) +
  facet_wrap(vars(variable), scales="free_y")+
  geom_point(aes(col=variable), show.legend=FALSE)+
  geom_smooth(method="lm", aes(col=variable), show.legend=FALSE)+
  labs(title="Scatter Plot Correlation of Both Calories and Sugars with Ratings",x="Rating", y="")+
  theme_minimal()+
  theme(plot.title = element_text(hjust=0.5))

insight

As the scatter plots show, a cereal's rating tends to go down as its calories and sugars go up.
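The downward trend can also be expressed as a number by fitting a straight line. The following is a minimal sketch, assuming a simple linear model with rating as the response and sugars as the predictor (the plot puts rating on the x-axis instead, so the slope is not directly comparable). It uses only the ten rows shown in the head(cereal, 10) output, so the estimates are illustrative and not those of the full dataset.

```r
# sugars and rating for the first ten cereals, copied from head(cereal, 10)
toy <- data.frame(
  sugars = c(6, 8, 5, 0, 8, 10, 14, 8, 6, 5),
  rating = c(68.40297, 33.98368, 59.42551, 93.70491, 34.38484,
             29.50954, 33.17409, 37.03856, 49.12025, 53.31381)
)

# Ordinary least squares fit of rating on sugars
fit <- lm(rating ~ sugars, data = toy)

# Estimated change in rating per additional gram of sugar
slope <- unname(coef(fit)["sugars"])

# Share of the variance in rating explained by sugars alone
r_squared <- summary(fit)$r.squared

c(slope = slope, r_squared = r_squared)
```

A negative slope matches the direction seen in the scatter plot; running the same call on the full `cereal` data frame would give the estimate for all 77 cereals.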

4. Manufacturers ranked by average sugars in their products

mfr_sugar <- cereal %>% 
  group_by(mfr) %>% 
  summarise(mean_sugars = mean(sugars)) %>% 
  ungroup() %>% 
  arrange(desc(mean_sugars))


ggplot(mfr_sugar, mapping=aes(x=reorder(mfr, mean_sugars), y=mean_sugars, fill=mean_sugars)) +
  geom_col(show.legend=FALSE)+
  scale_fill_gradient(low = "#5ADBFF", high = "#004BA8")+
  labs(
    title="Average Sugar Content in Each Manufacturer's Products",
    x="",
    y="Average Sugar Content (grams)"
  )+
  coord_flip()+
  theme_minimal()+
  theme(plot.title = element_text(hjust=0.5))  # after theme_minimal() so the centered title is not reset

insight

The difference between the manufacturers with the lowest and highest average sugar content in their products is astonishing. We should be really careful when buying cereals from Post, General Mills, and Kelloggs, as their products contain the most sugar.

5. Top 10 cereals with the highest sugar content

sugar_highest <- head(cereal %>% 
  arrange(desc(sugars)) %>% 
  select(c(name, sugars)), 10)

ggplot(sugar_highest, mapping=aes(x=reorder(name, sugars), y=sugars))+
  geom_col(aes(fill=sugars), show.legend = FALSE)+
  geom_hline(yintercept = mean(cereal$sugars), col="white", size=1)+
  labs(
    title="Top 10 Cereal with Most Sugar Content",
    x="",
    y="Sugar Content (grams)"
    )+  
  geom_text(aes(0, mean(cereal$sugars), label=round(mean(cereal$sugars), 1), vjust=0))+
  scale_fill_gradient(low="#C1A5A9", high="#694873")+
  theme_minimal()+
  theme(plot.title = element_text(hjust=0.5))+  # after theme_minimal() so the centered title is not reset
  coord_flip()

insight

These cereals contain far more sugar than the average cereal (6.9 grams). The highest value in the dataset, 15 grams, is more than double the average. This is a very surprising fact, and I think we should avoid eating these cereals. However, some cereals contain no sugar at all:

w.o_sugar <- cereal %>% 
  select(c(name, sugars)) %>% 
  filter(sugars == 0)

print(w.o_sugar$name)
## [1] "All-Bran with Extra Fiber" "Cream of Wheat (Quick)"   
## [3] "Puffed Rice"               "Puffed Wheat"             
## [5] "Quaker Oatmeal"            "Shredded Wheat"           
## [7] "Shredded Wheat 'n'Bran"    "Shredded Wheat spoon size"
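To put the gap between the sweetest cereal and the average into numbers, the short calculation below compares the largest sugars value reported by summary(cereal), 15 grams, with the 6.9-gram average quoted above. Both inputs are typed in by hand from the report rather than computed from the data.

```r
mean_sugars <- 6.9  # average sugar content quoted in the text (grams)
max_sugars  <- 15   # Max. of sugars in the summary(cereal) output (grams)

# Absolute and relative excess of the sweetest cereal over the average
excess_g   <- max_sugars - mean_sugars
excess_pct <- excess_g / mean_sugars * 100

round(c(grams = excess_g, percent = excess_pct), 1)
# grams = 8.1, percent = 117.4
```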

5. Conclusion

Generally, cereals that contain more sugar and calories have lower ratings, while more fiber and protein go along with higher ratings. Since the healthier cereals are rated higher, I assume that the rating reflects health value. There are also cereals that we should avoid eating regularly because their sugar content is just too high.