wk6_project1

INTRODUCTION

This dataset, found on Kaggle under the title “80 Cereals,” actually contains nutritional facts on 77 different types of breakfast cereal. It was collected for a challenge that was part of a Statistical Graphics Exposition hosted by Carnegie Mellon University in 1993. The last 28 years have certainly seen major changes in the breakfast cereal market and in the nutritional composition of its products, but it is nonetheless interesting to review this snapshot of what Americans were eating for breakfast in the 1990s and how it might reflect ideas about health and nutrition at that time.

The dataset can be downloaded from Kaggle: https://www.kaggle.com/crawford/80-cereals

DATA

The set contains two data types, chr and num. Manufacturers are identified by a code letter (e.g. “N” for Nabisco), and replacing these codes, along with the one-letter cereal type codes, with full names was the only cleaning operation performed. One of the manufacturers, AHFP, is represented by only one product, Maypo, a maple-flavored hot cereal, and as a result its representation in the plots looks out of place.

I made two new data frames, avg_sugar and avg_cal, which hold the mean sugar and mean calorie content, respectively, of each of the seven manufacturers' products.

VISUALIZATIONS

The 1990s were the height of the low-fat dieting fad, which led food suppliers to increase the sugar content of their products to compensate for the loss of appeal that came with removing fat. I wanted to investigate which manufacturers were most aggressively adding sugars to their products.

PLOT 1 “Average Product Sugar Content by Manufacturer” shows Post brands leading the way in high sugar content with General Mills and Kelloggs not far behind.

In PLOT 2 “Average Product Calories by Manufacturer” we see that most of the manufacturers are at similar calorie levels, within the range of 90 to 115 calories per serving. Note that Nabisco, which had the lowest average sugar of all manufacturers, also has the lowest average calories, although not by much.

PLOT 3 “Sugar Content for All Products by Manufacturer” uses boxplots to show the broad range of sugar content within each manufacturer's products. Nabisco has six entries in the data set, all of which are healthy cereals geared towards adults (oatmeal, bran and four types of shredded wheat), and it is again clearly the manufacturer with the lowest overall sugar content across its product range.

Finally, PLOT 4 shows “Distribution of Sugar and Fat” for all entries in the data set. I wish I had been able to show correlation (I see that is coming up in the next unit), but there appears to be a slight connection between low fat and high sugar, which makes sense for a time when people were prioritizing low fat over other dietary restrictions.

##Import the data

getwd()

## [1] "/Users/h0age/Downloads"

setwd("/Users/h0age/Documents/data110/data_sets")
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.0     ✓ dplyr   1.0.4
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

## dplyr is already attached by library(tidyverse) above; dbplyr (a database backend) is not used in this analysis

cereal <- read_csv("cereal.csv")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   name = col_character(),
##   mfr = col_character(),
##   type = col_character(),
##   calories = col_double(),
##   protein = col_double(),
##   fat = col_double(),
##   sodium = col_double(),
##   fiber = col_double(),
##   carbo = col_double(),
##   sugars = col_double(),
##   potass = col_double(),
##   vitamins = col_double(),
##   shelf = col_double(),
##   weight = col_double(),
##   cups = col_double(),
##   rating = col_double()
## )

##Look at the structure of the data

str(cereal)

## spec_tbl_df [77 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ name    : chr [1:77] "100% Bran" "100% Natural Bran" "All-Bran" "All-Bran with Extra Fiber" ...
##  $ mfr     : chr [1:77] "N" "Q" "K" "K" ...
##  $ type    : chr [1:77] "C" "C" "C" "C" ...
##  $ calories: num [1:77] 70 120 70 50 110 110 110 130 90 90 ...
##  $ protein : num [1:77] 4 3 4 4 2 2 2 3 2 3 ...
##  $ fat     : num [1:77] 1 5 1 0 2 2 0 2 1 0 ...
##  $ sodium  : num [1:77] 130 15 260 140 200 180 125 210 200 210 ...
##  $ fiber   : num [1:77] 10 2 9 14 1 1.5 1 2 4 5 ...
##  $ carbo   : num [1:77] 5 8 7 8 14 10.5 11 18 15 13 ...
##  $ sugars  : num [1:77] 6 8 5 0 8 10 14 8 6 5 ...
##  $ potass  : num [1:77] 280 135 320 330 -1 70 30 100 125 190 ...
##  $ vitamins: num [1:77] 25 0 25 25 25 25 25 25 25 25 ...
##  $ shelf   : num [1:77] 3 3 3 3 3 1 2 3 1 3 ...
##  $ weight  : num [1:77] 1 1 1 1 1 1 1 1.33 1 1 ...
##  $ cups    : num [1:77] 0.33 1 0.33 0.5 0.75 0.75 1 0.75 0.67 0.67 ...
##  $ rating  : num [1:77] 68.4 34 59.4 93.7 34.4 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   name = col_character(),
##   ..   mfr = col_character(),
##   ..   type = col_character(),
##   ..   calories = col_double(),
##   ..   protein = col_double(),
##   ..   fat = col_double(),
##   ..   sodium = col_double(),
##   ..   fiber = col_double(),
##   ..   carbo = col_double(),
##   ..   sugars = col_double(),
##   ..   potass = col_double(),
##   ..   vitamins = col_double(),
##   ..   shelf = col_double(),
##   ..   weight = col_double(),
##   ..   cups = col_double(),
##   ..   rating = col_double()
##   .. )
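
One detail worth a second look in the str() output is the -1 in the potass column. A nutrient amount cannot be negative, so this is probably a placeholder for a missing value, although that is an assumption and should be checked against the dataset description. Below is a minimal sketch of recoding such values to NA, using only the ten potass values printed by str(). The names potass_sample and potass_clean are made up for illustration.

```r
## First ten potass values, copied from the str() output
potass_sample <- c(280, 135, 320, 330, -1, 70, 30, 100, 125, 190)

## Assumption: negative values mark missing data, so recode them as NA
potass_clean <- ifelse(potass_sample < 0, NA, potass_sample)

## Compare the mean with the placeholder left in and with it excluded
mean_with_placeholder <- mean(potass_sample)
mean_without_missing <- mean(potass_clean, na.rm = TRUE)

mean_with_placeholder
mean_without_missing
```

If other columns turn out to use the same placeholder, the same recoding would apply to them before computing any means.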

##Calculate summary statistics

mean(cereal$calories) ## mean product calories for all Manufacturers

## [1] 106.8831

var(cereal$fat)  ## fat variance for all Manufacturers

## [1] 1.012987

##Change manufacturer and type codes to full names

cereal$mfr[cereal$mfr == "N"] <- "Nabisco"
cereal$mfr[cereal$mfr == "P"] <- "Post"
cereal$mfr[cereal$mfr == "Q"] <- "Quaker Oats"
cereal$mfr[cereal$mfr == "R"] <- "Ralston Purina"
cereal$mfr[cereal$mfr == "K"] <- "Kelloggs"
cereal$mfr[cereal$mfr == "G"] <- "General Mills"
cereal$mfr[cereal$mfr == "A"] <- "AHFP"
cereal$type[cereal$type == "C"] <- "Cold"
cereal$type[cereal$type == "H"] <- "Hot"
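
The same recoding can also be written as a single lookup instead of one assignment per code. This is only a sketch of an alternative, not the approach used in this project. The names mfr_names, mfr_codes and mfr_full are made up for illustration, and the four codes are the ones printed by str().

```r
## Named vector mapping each one-letter code to the full name used in this report
mfr_names <- c(
  A = "AHFP", G = "General Mills", K = "Kelloggs", N = "Nabisco",
  P = "Post", Q = "Quaker Oats", R = "Ralston Purina"
)

## First four manufacturer codes, copied from the str() output
mfr_codes <- c("N", "Q", "K", "K")

## Index the lookup vector by code; unname() drops the code labels
mfr_full <- unname(mfr_names[mfr_codes])
mfr_full
```

On the full data this would be cereal$mfr <- unname(mfr_names[cereal$mfr]), run on the original one-letter codes. A code that is missing from the lookup would come back as NA, which makes typos easy to spot.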

##create new data frame to calculate mean sugar content by manufacturer

avg_sugar <- cereal %>% ## making new data frame for avg sugar
  group_by(mfr) %>% ## group data by mfr
  summarize(avg_sugar = mean(sugars)) %>% ## calculate avg sugars by mfr
  arrange(avg_sugar) ## arrange mfr from lowest to highest mean sugar

##create new data frame to calculate mean calories by manufacturer

avg_cal <- cereal %>% ## making new data frame for avg calories
  group_by(mfr) %>% ## group data by mfr
  summarize(avg_cal = mean(calories)) %>% ## calculate avg calories by mfr
  arrange(avg_cal) ## arrange mfr from lowest to highest mean calories
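
Because AHFP contributes only one product, it can help to keep the product count next to each mean. The sketch below computes both means and the count in one summarize() call. It uses only the first four rows shown by str(), so the numbers are not the real manufacturer averages. The names cereal_sample, mfr_summary and n_products are made up for illustration.

```r
library(dplyr)

## First four rows, copied from the str() output (codes already expanded)
cereal_sample <- data.frame(
  mfr = c("Nabisco", "Quaker Oats", "Kelloggs", "Kelloggs"),
  calories = c(70, 120, 70, 50),
  sugars = c(6, 8, 5, 0)
)

mfr_summary <- cereal_sample %>%
  group_by(mfr) %>% ## group data by mfr
  summarize(
    n_products = n(), ## number of products behind each mean
    avg_sugar = mean(sugars), ## mean sugar by mfr
    avg_cal = mean(calories) ## mean calories by mfr
  ) %>%
  arrange(avg_sugar) ## lowest to highest mean sugar

mfr_summary
```

A mean backed by a single product, as with AHFP in the full data, is just that product's value, which is why it can look out of place next to the others.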

##Plot 1 - Barplot - Mean Sugar by Manufacturer

p1 <- avg_sugar %>% ## using avg_sugar data frame
  ggplot() + ## using ggplot 
  ggtitle("Average Product Sugar Content by Manufacturer") + ## top line title
  xlab("Manufacturer") + ## x axis label
  ylab("Grams of Sugar") + ## y axis label
  geom_bar(aes(x = mfr, y = avg_sugar, fill = mfr), stat = "identity") + ## bar heights taken from avg_sugar
  scale_fill_discrete(name = "Manufacturer") + ## labeling for legend 
  theme(axis.text.x = element_text(angle = 30)) ## rotating x axis label text

p1
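
Note that ggplot2 places a character x variable in alphabetical order, so the arrange() step used to build avg_sugar does not change the order of the bars. If sorted bars are wanted, one option is reorder(), sketched below. The three values are means of the first four rows shown by str(), not the real manufacturer averages, and the names avg_sugar_demo and p1_sorted are made up for illustration.

```r
library(ggplot2)

## Illustrative values only: means of the first four rows shown by str(),
## not the real manufacturer averages
avg_sugar_demo <- data.frame(
  mfr = c("Kelloggs", "Nabisco", "Quaker Oats"),
  avg_sugar = c(2.5, 6, 8)
)

## reorder() turns mfr into a factor whose levels are sorted by avg_sugar;
## the minus sign puts the highest value first
avg_sugar_demo$mfr <- reorder(avg_sugar_demo$mfr, -avg_sugar_demo$avg_sugar)

p1_sorted <- ggplot(avg_sugar_demo, aes(x = mfr, y = avg_sugar, fill = mfr)) +
  geom_col() + ## geom_col() is equivalent to geom_bar(stat = "identity")
  xlab("Manufacturer") + ## x axis label
  ylab("Grams of Sugar") ## y axis label

p1_sorted
```

Sorting the bars makes the ranking readable at a glance, at the cost of the manufacturers appearing in a different order from plot to plot.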

##Plot 2 - Barplot - Mean Calories by Manufacturer

p2 <- avg_cal %>% ## using avg_cal data frame
  ggplot() + ## using ggplot 
  ggtitle("Average Product Calories by Manufacturer") + ## top line title
  xlab("Manufacturer") + ## x axis label
  ylab("Calories") + ## y axis label
  geom_bar(aes(x = mfr, y = avg_cal, fill = mfr), stat = "identity") + ## bar heights taken from avg_cal
  scale_fill_discrete(name = "Manufacturer") + ## labeling for legend 
  theme(axis.text.x = element_text(angle = 30)) ## rotating x axis label text

p2

##Plot 3 - Boxplot - Sugar Content for all Products

p3 <- cereal %>% ## using original data frame
  ggplot(aes(mfr, sugars, fill = mfr)) + ## aesthetics x, y axis var + fill
  theme(axis.text.x = element_text(angle = 30)) + ## adjust x axis labeling
  ggtitle("Sugar Content for All Products by Manufacturer") + ## plot title
  xlab("Manufacturer") + ## x axis label
  ylab("Grams of Sugar") + ## y axis label
  geom_boxplot() + ## boxplot function
  scale_fill_discrete(name = "Manufacturer") ## label for the legend 

p3

##Plot 4 - Scatter Plot - Sugar and Fat Content Distribution

ggplot(data = cereal) + ## using ggplot with cereal data frame
  ggtitle("Distribution of Sugar and Fat") + ## top line title
  xlab("Grams of Fat") + ## x axis label
  ylab("Grams of Sugar") + ## y axis label
  geom_point(mapping = aes(x = fat, y = sugars, color = mfr)) + ## scatter plot function
  scale_color_discrete(name = "Brand") ## legend title; the points use color, not fill
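
The write-up mentions wanting to put a number on the relationship between fat and sugar. Base R's cor() does this, and a minimal sketch follows. It uses only the first ten fat and sugars values printed by str(), so the result says nothing reliable about the full data set. The names fat_sample, sugars_sample and r_sample are made up for illustration.

```r
## First ten fat and sugars values, copied from the str() output
fat_sample <- c(1, 5, 1, 0, 2, 2, 0, 2, 1, 0)
sugars_sample <- c(6, 8, 5, 0, 8, 10, 14, 8, 6, 5)

## Pearson correlation coefficient: ranges from -1 to 1,
## with 0 meaning no linear relationship
r_sample <- cor(fat_sample, sugars_sample)
r_sample
```

On the full data the call would be cor(cereal$fat, cereal$sugars). A negative value would support the low fat, high sugar pattern suggested by the scatter plot, and a value near zero or above would not.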

wk6_project1_cereal

neal hoage

3/5/2021
