Cereal: An Exploratory Data Analysis

Homework 4

Claire Battaglia

Wednesday, October 20, 2021

Exploratory Data Analysis

For this post I’ll be trying a style designed by Edward Tufte.

The first step in any data analysis project is exploring the data. Statistician John Tukey aptly called this process exploratory data analysis, the goal of which is to simply get a sense of the data. This ideally allows us to uncover anything suspect or worth questioning, discern relationships between the variables, and discover opportunities for further exploration and analysis.[1]

[1] Gutman, Alex J. and Jordan Goldmeier. 2021. Becoming a Data Head: How to Think, Speak, and Understand Data Science, Statistics, and Machine Learning. Indianapolis: John Wiley & Sons, Inc.

Some questions to guide this process of initial exploration are:

  1. What is the context and origin story of the data? I.e. Who collected the data? How?
  2. Are the data representative? I.e. Is there sampling bias? How were outliers handled?
  3. What data are we not seeing? I.e. How were missing values handled?

With those questions in mind, let’s read in our dataset and take a look at it.

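# load packages (read_csv, the pipe, pivot functions and ggplot all come from the tidyverse)
library(tidyverse)

# set working directory
setwd("~/Documents/DACSS601Fall21/_data")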

# read in data and assign to variable
cereal <- read_csv("cereal.csv")
## Rows: 20 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Cereal, Type
## dbl (2): Sodium, Sugar
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# print
cereal
## # A tibble: 20 × 4
##    Cereal                Sodium Sugar Type 
##    <chr>                  <dbl> <dbl> <chr>
##  1 Frosted Mini Wheats        0    11 A    
##  2 Raisin Bran              340    18 A    
##  3 All Bran                  70     5 A    
##  4 Apple Jacks              140    14 C    
##  5 Captain Crunch           200    12 C    
##  6 Cheerios                 180     1 C    
##  7 Cinnamon Toast Crunch    210    10 C    
##  8 Crackling Oat Bran       150    16 A    
##  9 Fiber One                100     0 A    
## 10 Frosted Flakes           130    12 C    
## 11 Froot Loops              140    14 C    
## 12 Honey Bunches of Oats    180     7 A    
## 13 Honey Nut Cheerios       190     9 C    
## 14 Life                     160     6 C    
## 15 Rice Krispies            290     3 C    
## 16 Honey Smacks              50    15 A    
## 17 Special K                220     4 A    
## 18 Wheaties                 180     4 A    
## 19 Corn Flakes              200     3 A    
## 20 Honeycomb                210    11 C

The Data

What are we looking at? We have a list of twenty cereals, and for each cereal we have a numeric value for both sodium and sugar and a categorical value for type. I will assume that the numeric values represent the amount of sodium and sugar in each cereal, but we don’t know whether it’s the amount per suggested serving, per average serving (these two are almost never the same), per package, or per some standard unit (e.g. 100 grams). We also don’t know what type is. Inspecting the column shows that it takes two values, A and C, but that doesn’t tell us what those values mean.
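One quick way to inspect the column is to tabulate it. The sketch below copies the Type values by hand from the table printed above into a vector (an assumption made so the snippet runs without cereal.csv); with the real data, table(cereal$Type) would do the same job.

```r
# Type values copied by hand from the printed table, in row order
type <- c("A", "A", "A", "C", "C", "C", "C", "A", "A", "C",
          "C", "A", "C", "C", "C", "A", "A", "A", "A", "C")

# distinct values in the column
unique(type)

# number of cereals of each type
type_counts <- table(type)
type_counts
```

In this table the two types turn out to be evenly split, ten cereals each.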

Let’s look back at our guiding questions.

  1. What is the context and origin story of the data? I.e. Who collected the data? How? We can’t really answer this.
  2. Are the data representative? I.e. Is there sampling bias? How were outliers handled? Anyone who’s ever been down the cereal aisle knows that this list is nowhere near comprehensive. That is, there are hundreds of types of cereal not included in this dataset. So why these twenty? Without knowing more about this dataset, we can’t say for sure if the data are representative or can answer whatever question the researcher had in mind when collecting data.
  3. What data are we not seeing? I.e. How were missing values handled? Again, we can’t really answer this.
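We can’t know how missing values were handled before the file reached us, but we can at least check whether any remain in what we read in. A minimal sketch: count_missing() is a hypothetical helper name, and the three-row toy data frame is made up purely to show the behaviour; it is not part of cereal.csv.

```r
# hypothetical helper: number of NA values in each column of a data frame
count_missing <- function(df) {
  colSums(is.na(df))
}

# made-up toy data with one missing sugar value (not from cereal.csv)
toy <- data.frame(
  Cereal = c("X", "Y", "Z"),
  Sodium = c(100, 150, 200),
  Sugar  = c(5, NA, 10)
)

missing_counts <- count_missing(toy)
missing_counts
```

Calling count_missing(cereal) on our data would return one count per column; the table printed above shows no NA entries in any of its twenty rows.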

Summarizing the Data

Moving forward despite all that we don’t know about this dataset, let’s get some summary statistics. I know we’ll want the mean amount of sodium and sugar so let’s start there.

# get mean of sodium using base r method
mean(cereal$Sodium)
## [1] 167
# get mean of sugar using base r method
mean(cereal$Sugar)
## [1] 8.75
# get mean of both using colMeans()
colMeans(cereal[sapply(cereal, is.numeric)])
## Sodium  Sugar 
## 167.00   8.75

Good. I’d also like to see the median amounts of sodium and sugar. This time I’ll use the summary() function. This returns the minimum, quartiles, median, mean, and maximum of each numeric column at once, which is more useful.

# summarize dataset
summary(cereal)
##     Cereal              Sodium          Sugar           Type          
##  Length:20          Min.   :  0.0   Min.   : 0.00   Length:20         
##  Class :character   1st Qu.:137.5   1st Qu.: 4.00   Class :character  
##  Mode  :character   Median :180.0   Median : 9.50   Mode  :character  
##                     Mean   :167.0   Mean   : 8.75                     
##                     3rd Qu.:202.5   3rd Qu.:12.50                     
##                     Max.   :340.0   Max.   :18.00

Good. We can see this function also returns information about non-numeric columns, such as the length of the column (number of rows) and the class (the data type R has assigned to the column, here character).

Visualizing the Data

Visualizing the data is an important part of the exploratory process. Visualizations can help us see outliers or anything that’s wonky about our dataset. They can also help us discern relationships between the variables.
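Alongside plots, a numeric rule of thumb can flag candidate outliers. The sketch below applies the common 1.5 × IQR fence to sodium. The rule is chosen here for illustration only; nothing about the dataset prescribes it. The values are copied by hand from the table printed earlier so the snippet runs on its own.

```r
# cereal names and sodium values copied from the printed table
sodium <- c(
  "Frosted Mini Wheats" = 0, "Raisin Bran" = 340, "All Bran" = 70,
  "Apple Jacks" = 140, "Captain Crunch" = 200, "Cheerios" = 180,
  "Cinnamon Toast Crunch" = 210, "Crackling Oat Bran" = 150,
  "Fiber One" = 100, "Frosted Flakes" = 130, "Froot Loops" = 140,
  "Honey Bunches of Oats" = 180, "Honey Nut Cheerios" = 190, "Life" = 160,
  "Rice Krispies" = 290, "Honey Smacks" = 50, "Special K" = 220,
  "Wheaties" = 180, "Corn Flakes" = 200, "Honeycomb" = 210
)

# first and third quartiles, and the interquartile range between them
q <- quantile(sodium, probs = c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

# fences 1.5 IQRs beyond the quartiles
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

# names of cereals falling outside the fences
flagged <- names(sodium)[sodium < lower | sodium > upper]
flagged
```

With the quartiles reported by summary() (137.5 and 202.5), the fences fall at 40 and 300, so this flags Frosted Mini Wheats (0) and Raisin Bran (340) as values worth a second look.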

First I’d like to simply see the sodium and sugar content of each cereal. To do this, I’m going to pivot the Sodium and Sugar columns into a pair of columns: Nutrient, which holds the name, and Amount, which holds the value.

# pivot_longer
cereal <- cereal %>% 
  pivot_longer(
    cols = c(`Sodium`, `Sugar`),
    names_to = "Nutrient",
    values_to = "Amount"
  )

# print
cereal
## # A tibble: 40 × 4
##    Cereal              Type  Nutrient Amount
##    <chr>               <chr> <chr>     <dbl>
##  1 Frosted Mini Wheats A     Sodium        0
##  2 Frosted Mini Wheats A     Sugar        11
##  3 Raisin Bran         A     Sodium      340
##  4 Raisin Bran         A     Sugar        18
##  5 All Bran            A     Sodium       70
##  6 All Bran            A     Sugar         5
##  7 Apple Jacks         C     Sodium      140
##  8 Apple Jacks         C     Sugar        14
##  9 Captain Crunch      C     Sodium      200
## 10 Captain Crunch      C     Sugar        12
## # … with 30 more rows

Good.

# create geom_col plot
ggplot(data = cereal) +
  geom_col(mapping = aes(x = Cereal, y = Amount, color = Nutrient)) +
  labs(title = "Sugar and Sodium Content in Some Cereal Brands", y = "Amount", x = "Cereal")

Looks good, minus the label madness at the bottom. One solution might be to change the font size but I think I’d have to make it so small that it wouldn’t be any more readable than it currently is. Instead I’m going to try and rotate the axis labels. I’m also going to fill in the colors of the columns for effect.

# rotate axis labels on geom_col plot
ggplot(data = cereal) +
  geom_col(mapping = aes(x = Cereal, y = Amount, fill = Nutrient)) +
  labs(title = "Sugar and Sodium Content in Some Cereal Brands", y = "Amount", x = "Cereal") +
  theme(axis.text.x = element_text(angle=90, vjust=.5, hjust=1))

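Much better. This makes it much easier to see that every cereal in our dataset but one (Frosted Mini Wheats) has a much larger value for sodium than for sugar. Since we don’t know the units, though, we can’t conclude that the cereals actually contain more sodium than sugar; the two columns may not be measured on the same scale. While looking at our data in a table would provide the same information, it’s much easier to see when represented visually.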
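The one-exception reading of the plot can be checked directly against the numbers. A minimal sketch, with the values copied by hand from the table printed earlier so that it runs without cereal.csv; the comparison is purely numerical, since the units of the two columns are unknown.

```r
# values copied from the printed table, in row order
cereal_names <- c(
  "Frosted Mini Wheats", "Raisin Bran", "All Bran", "Apple Jacks",
  "Captain Crunch", "Cheerios", "Cinnamon Toast Crunch",
  "Crackling Oat Bran", "Fiber One", "Frosted Flakes", "Froot Loops",
  "Honey Bunches of Oats", "Honey Nut Cheerios", "Life", "Rice Krispies",
  "Honey Smacks", "Special K", "Wheaties", "Corn Flakes", "Honeycomb"
)
sodium <- c(0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
            140, 180, 190, 160, 290, 50, 220, 180, 200, 210)
sugar <- c(11, 18, 5, 14, 12, 1, 10, 16, 0, 12,
           14, 7, 9, 6, 3, 15, 4, 4, 3, 11)

# cereals whose sodium value does not exceed their sugar value
exceptions <- cereal_names[sodium <= sugar]
exceptions
```

For the values in the table, the only cereal returned is Frosted Mini Wheats.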

This isn’t the best option for uncovering any correlation between sodium and sugar, however. For that I think we need a scatterplot.

# pivot_wider
cerealwide <- cereal %>% 
  pivot_wider(
    names_from = "Nutrient",
    values_from = "Amount"
    )

# print
cerealwide
## # A tibble: 20 × 4
##    Cereal                Type  Sodium Sugar
##    <chr>                 <chr>  <dbl> <dbl>
##  1 Frosted Mini Wheats   A          0    11
##  2 Raisin Bran           A        340    18
##  3 All Bran              A         70     5
##  4 Apple Jacks           C        140    14
##  5 Captain Crunch        C        200    12
##  6 Cheerios              C        180     1
##  7 Cinnamon Toast Crunch C        210    10
##  8 Crackling Oat Bran    A        150    16
##  9 Fiber One             A        100     0
## 10 Frosted Flakes        C        130    12
## 11 Froot Loops           C        140    14
## 12 Honey Bunches of Oats A        180     7
## 13 Honey Nut Cheerios    C        190     9
## 14 Life                  C        160     6
## 15 Rice Krispies         C        290     3
## 16 Honey Smacks          A         50    15
## 17 Special K             A        220     4
## 18 Wheaties              A        180     4
## 19 Corn Flakes           A        200     3
## 20 Honeycomb             C        210    11

Good. Note to self: in the future, when I use pivot_longer(), I should assign the long dataset to a new variable so I don’t have to recreate the original wide format.
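That note to self might look something like the following. This is only a sketch: cereal_small and cereal_long are hypothetical names, and the data frame holds just the first three rows of the printed table so the snippet runs without cereal.csv.

```r
library(tidyr)

# first three rows of the printed table, enough to show the idea
cereal_small <- data.frame(
  Cereal = c("Frosted Mini Wheats", "Raisin Bran", "All Bran"),
  Sodium = c(0, 340, 70),
  Sugar  = c(11, 18, 5),
  Type   = c("A", "A", "A")
)

# store the long version under a new name, leaving the wide data untouched
cereal_long <- pivot_longer(
  cereal_small,
  cols = c(Sodium, Sugar),
  names_to = "Nutrient",
  values_to = "Amount"
)

dim(cereal_small)  # still wide: 3 rows, 4 columns
dim(cereal_long)   # long: 6 rows, 4 columns
```

Both shapes then stay available, so the bar chart can use the long version and the scatterplot the wide one without a pivot_wider() step in between.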

# create scatterplot
ggplot(data = cerealwide) +
  geom_point(
    mapping = aes(x = Sodium, y = Sugar, color = Cereal)
  )

Good. I don’t see a strong correlation, if any, between sodium content and sugar content. Thinking back to our dataset, though, I remember that each cereal is assigned a type: A or C. I have no idea what those values mean but I wonder if grouping the cereal by type might reveal anything different.
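Before splitting by type, the visual impression of little or no correlation can be checked numerically. A minimal sketch using base R’s cor(), which computes the Pearson coefficient by default; the values are copied by hand from the table printed above so the snippet is self-contained.

```r
# sodium and sugar values copied from the printed table, in row order
sodium <- c(0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
            140, 180, 190, 160, 290, 50, 220, 180, 200, 210)
sugar <- c(11, 18, 5, 14, 12, 1, 10, 16, 0, 12,
           14, 7, 9, 6, 3, 15, 4, 4, 3, 11)

# Pearson correlation coefficient between the two columns
r <- cor(sodium, sugar)
r
```

A coefficient near zero would be consistent with what the scatterplot suggests, although with only twenty cereals it is a rough indication at best.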

# facet scatterplot by type, one panel per row
ggplot(data = cerealwide) +
  geom_point(mapping = aes(x = Sodium, y = Sugar, color = Cereal)) +
  facet_wrap(~ Type, nrow = 2)

Type A cereals appear to be more spread out than type C cereals, but viewing them separately still doesn’t make clear what distinguishes the two types.
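A numeric summary by type is one more way to compare the two groups. The sketch below uses base R’s aggregate() on values copied by hand from the printed table; cereal_by_type and type_means are hypothetical names introduced for this example.

```r
# sodium, sugar and type copied from the printed table, in row order
cereal_by_type <- data.frame(
  Sodium = c(0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
             140, 180, 190, 160, 290, 50, 220, 180, 200, 210),
  Sugar = c(11, 18, 5, 14, 12, 1, 10, 16, 0, 12,
            14, 7, 9, 6, 3, 15, 4, 4, 3, 11),
  Type = c("A", "A", "A", "C", "C", "C", "C", "A", "A", "C",
           "C", "A", "C", "C", "C", "A", "A", "A", "A", "C")
)

# mean sodium and mean sugar within each type
type_means <- aggregate(cbind(Sodium, Sugar) ~ Type,
                        data = cereal_by_type, FUN = mean)
type_means
```

For the values in the table, this gives a mean sodium of 149 for type A and 185 for type C, and a mean sugar of 8.3 and 9.2 respectively: a modest gap that still doesn’t tell us what the labels stand for.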

Conclusion

Exploratory data analysis is the process by which we utilize various tools, such as calculating summary statistics and visualizing our data in different ways, to explore our data. The goal is to understand whether it will be useful for the questions we want to answer and discover opportunities for further exploration and analysis. More next week!