INTRODUCTION
This dataset, found on Kaggle, is titled “80 Cereals” but actually contains nutritional facts for 77 types of breakfast cereal. It was collected for a challenge that was part of a Statistical Graphics Exposition hosted by Carnegie Mellon University in 1993. To be sure, the last 28 years have seen major changes in the breakfast cereal market and in the nutritional composition of its products, but it is nonetheless interesting to review this snapshot of what Americans were eating for breakfast in the 1990s and how it might reflect ideas about health and nutrition at that time.
The dataset can be downloaded from Kaggle: https://www.kaggle.com/crawford/80-cereals
DATA
There are two data types in the set: chr (character) and num (numeric) values. Manufacturers are identified by a code letter (e.g. “N” for Nabisco), and replacing these codes, along with the C/H cereal type codes, with full names was the only cleaning operation performed. One of the manufacturers, AHFP, is represented by only one product, Maypo, a maple-flavored hot cereal, and as a result its representation in the plots looks out of place.
I made two new data frames, avg_sugar and avg_cal, which hold the mean product sugar and calorie content respectively for each of the seven manufacturers represented.
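One caveat worth checking before averaging any column: the structure listing further down shows a potass value of -1, which cannot be a real measurement. The sketch below assumes (the source does not confirm this) that -1 is a placeholder for a missing value, and it uses only the first ten potass values printed in that listing. The name potass_clean is introduced here purely for illustration.

```r
## First ten potassium values as printed by str(cereal)
potass <- c(280, 135, 320, 330, -1, 70, 30, 100, 125, 190)

## Assumption: -1 is a placeholder for "not recorded", not a measurement
potass_clean <- replace(potass, potass == -1, NA)

## Compare the mean with and without the placeholder
mean(potass)                      # 157.9, pulled down by the -1
mean(potass_clean, na.rm = TRUE)  # about 175.6, mean of the nine recorded values
```

The same check could be run on sugars and calories before computing the per-manufacturer means, for example by counting negative values with sum(cereal$sugars < 0).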
VISUALIZATIONS
The 1990s were the height of the low-fat dieting fad, which led food suppliers to increase the sugar content of their products to compensate for the loss of appeal that came with reduced fat. I wanted to investigate which manufacturers were adding sugar most aggressively.
PLOT 1, “Average Product Sugar Content by Manufacturer”, shows Post brands leading the way in sugar content, with General Mills and Kellogg's not far behind.
In PLOT 2, “Average Product Calories by Manufacturer”, we see that most of the manufacturers are at similar calorie levels, within the range of 90 to 115 calories per serving. Note that Nabisco, which had the lowest average sugar of all manufacturers, also has the lowest average calories, although not by much.
PLOT 3, “Sugar Content for All Products by Manufacturer”, uses one boxplot per manufacturer to show the broad range of sugar content across each product line. Nabisco has six entries in the data set, all of them healthy cereals geared towards adults (oatmeal, bran and four types of shredded wheat), and is again clearly the manufacturer with the lowest overall sugar content across its product range.
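As a rough sketch of what each box in PLOT 3 summarizes, base R's fivenum() returns the minimum, lower hinge, median, upper hinge and maximum of a set of values. The example uses only the first ten sugars values printed in the structure listing further down and ignores manufacturer, so it illustrates the summary itself and not any one manufacturer's box. The name five is introduced here for illustration.

```r
## First ten sugar values as printed by str(cereal)
sugars <- c(6, 8, 5, 0, 8, 10, 14, 8, 6, 5)

## Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(sugars)
five  # 0 5 7 8 14
```

geom_boxplot() draws a comparable summary (quartiles and median), with whiskers limited to 1.5 times the interquartile range and anything beyond that shown as individual points.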
Finally, PLOT 4, “Distribution of Sugar and Fat”, plots sugar against fat for all entries in the data set. I wish I had been able to show correlation (I see that is coming up in the next unit), but there appears to be a slight connection between low fat and high sugar, which makes sense for a time when people were prioritizing low fat over other dietary restrictions.
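The correlation mentioned above can be computed with base R's cor(). This is a minimal sketch that uses only the first ten fat and sugars values printed in the structure listing further down, so the result describes that small sample and says nothing reliable about the full data set. The name r is introduced here for illustration.

```r
## First ten rows of fat and sugars as printed by str(cereal)
fat    <- c(1, 5, 1, 0, 2, 2, 0, 2, 1, 0)
sugars <- c(6, 8, 5, 0, 8, 10, 14, 8, 6, 5)

## Pearson correlation coefficient: ranges from -1 (inverse) through 0 (none) to +1 (direct)
r <- cor(fat, sugars, method = "pearson")
r
```

With the data loaded, cor(cereal$fat, cereal$sugars) would give the coefficient for all 77 rows. A value near zero would suggest that the visual impression from the scatter plot is weak at best.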
##Import the data
getwd()
## [1] "/Users/h0age/Downloads"
setwd("/Users/h0age/Documents/data110/data_sets")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.0 ✓ dplyr 1.0.4
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dbplyr)
##
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
##
## ident, sql
cereal <- read_csv("cereal.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## name = col_character(),
## mfr = col_character(),
## type = col_character(),
## calories = col_double(),
## protein = col_double(),
## fat = col_double(),
## sodium = col_double(),
## fiber = col_double(),
## carbo = col_double(),
## sugars = col_double(),
## potass = col_double(),
## vitamins = col_double(),
## shelf = col_double(),
## weight = col_double(),
## cups = col_double(),
## rating = col_double()
## )
##Look at the structure of the data
str(cereal)
## spec_tbl_df [77 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ name : chr [1:77] "100% Bran" "100% Natural Bran" "All-Bran" "All-Bran with Extra Fiber" ...
## $ mfr : chr [1:77] "N" "Q" "K" "K" ...
## $ type : chr [1:77] "C" "C" "C" "C" ...
## $ calories: num [1:77] 70 120 70 50 110 110 110 130 90 90 ...
## $ protein : num [1:77] 4 3 4 4 2 2 2 3 2 3 ...
## $ fat : num [1:77] 1 5 1 0 2 2 0 2 1 0 ...
## $ sodium : num [1:77] 130 15 260 140 200 180 125 210 200 210 ...
## $ fiber : num [1:77] 10 2 9 14 1 1.5 1 2 4 5 ...
## $ carbo : num [1:77] 5 8 7 8 14 10.5 11 18 15 13 ...
## $ sugars : num [1:77] 6 8 5 0 8 10 14 8 6 5 ...
## $ potass : num [1:77] 280 135 320 330 -1 70 30 100 125 190 ...
## $ vitamins: num [1:77] 25 0 25 25 25 25 25 25 25 25 ...
## $ shelf : num [1:77] 3 3 3 3 3 1 2 3 1 3 ...
## $ weight : num [1:77] 1 1 1 1 1 1 1 1.33 1 1 ...
## $ cups : num [1:77] 0.33 1 0.33 0.5 0.75 0.75 1 0.75 0.67 0.67 ...
## $ rating : num [1:77] 68.4 34 59.4 93.7 34.4 ...
## - attr(*, "spec")=
## .. cols(
## .. name = col_character(),
## .. mfr = col_character(),
## .. type = col_character(),
## .. calories = col_double(),
## .. protein = col_double(),
## .. fat = col_double(),
## .. sodium = col_double(),
## .. fiber = col_double(),
## .. carbo = col_double(),
## .. sugars = col_double(),
## .. potass = col_double(),
## .. vitamins = col_double(),
## .. shelf = col_double(),
## .. weight = col_double(),
## .. cups = col_double(),
## .. rating = col_double()
## .. )
mean(cereal$calories) ## mean product calories for all Manufacturers
## [1] 106.8831
var(cereal$fat) ## fat variance for all Manufacturers
## [1] 1.012987
##Change Manufacturer code to full Manufacturer name
cereal$mfr[cereal$mfr == "N"]<- "Nabisco"
cereal$mfr[cereal$mfr == "P"]<- "Post"
cereal$mfr[cereal$mfr == "Q"]<- "Quaker Oats"
cereal$mfr[cereal$mfr == "R"]<- "Ralston Purina"
cereal$mfr[cereal$mfr == "K"]<- "Kelloggs"
cereal$mfr[cereal$mfr == "G"]<- "General Mills"
cereal$mfr[cereal$mfr == "A"]<- "AHFP"
cereal$type[cereal$type == "C"]<- "Cold"
cereal$type[cereal$type == "H"]<- "Hot"
##create new data frame to calculate mean sugar content by manufacturer
avg_sugar <- cereal %>% ## start from the cereal data frame
  group_by(mfr) %>% ## group data by mfr
  summarize(avg_sugar = mean(sugars)) %>% ## calculate mean sugars per manufacturer
  arrange(avg_sugar) ## sort manufacturers from lowest to highest mean sugar
##create new data frame to calculate mean calories by manufacturer
avg_cal <- cereal %>% ## start from the cereal data frame
  group_by(mfr) %>% ## group data by mfr
  summarize(avg_cal = mean(calories)) %>% ## calculate mean calories per manufacturer
  arrange(avg_cal) ## sort manufacturers from lowest to highest mean calories
##Plot 1 - Barplot - Mean Sugar by Manufacturer
p1 <- avg_sugar %>% ## using avg_sugar data frame
  ggplot() + ## using ggplot
  ggtitle("Average Product Sugar Content by Manufacturer") + ## top line title
  xlab("Manufacturer") + ## x axis label
  ylab("Grams of Sugar") + ## y axis label
  geom_bar(aes(x = mfr, y = avg_sugar, fill = mfr), stat = "identity") + ## bar heights taken directly from avg_sugar
  scale_fill_discrete(name = "Manufacturer") + ## labeling for legend
  theme(axis.text.x = element_text(angle = 30)) ## rotating x axis label text
p1
##Plot 2 - Barplot - Mean Calories by Manufacturer
p2 <- avg_cal %>% ## using avg_cal data frame
  ggplot() + ## using ggplot
  ggtitle("Average Product Calories by Manufacturer") + ## top line title
  xlab("Manufacturer") + ## x axis label
  ylab("Calories") + ## y axis label
  geom_bar(aes(x = mfr, y = avg_cal, fill = mfr), stat = "identity") + ## bar heights taken directly from avg_cal
  scale_fill_discrete(name = "Manufacturer") + ## labeling for legend
  theme(axis.text.x = element_text(angle = 30)) ## rotating x axis label text
p2
##Plot 3 - Boxplot - Sugar Content for all Products
p3 <- cereal %>% ## using original data frame
  ggplot(aes(mfr, sugars, fill = mfr)) + ## aesthetics x, y axis var + fill
  theme(axis.text.x = element_text(angle = 30)) + ## adjust x axis labeling
  ggtitle("Sugar Content for All Products by Manufacturer") + ## plot title
  xlab("Manufacturer") + ## x axis label
  ylab("Grams of Sugar") + ## y axis label
  geom_boxplot() + ## boxplot function
  scale_fill_discrete(name = "Manufacturer") ## label for the legend
p3
##Plot 4 - Scatter Plot - Sugar and Fat Content Distribution
ggplot(data = cereal) + ## using ggplot with cereal data frame
  ggtitle("Distribution of Sugar and Fat") + ## top line title
  xlab("Grams of Fat") + ## x axis label
  ylab("Grams of Sugar") + ## y axis label
  geom_point(mapping = aes(x = fat, y = sugars, color = mfr)) + ## scatter plot function
  scale_color_discrete(name = "Manufacturer") ## legend title for the color aesthetic (mfr)