For this post I’ll be trying a style designed by Edward Tufte.
The first step in any data analysis project is exploring the data. Statistician John Tukey aptly called this process exploratory data analysis, the goal of which is simply to get a sense of the data. This ideally allows us to uncover anything suspect or worth questioning, discern relationships between the variables, and discover opportunities for further exploration and analysis.¹
¹ Gutman, Alex J. and Jordan Goldmeier. 2021. Becoming a Data Head: How to Think, Speak, and Understand Data Science, Statistics, and Machine Learning. Indianapolis: John Wiley & Sons, Inc.
Some questions to guide this process of initial exploration are:
With those questions in mind, let’s read in our dataset and take a look at it.
# load the tidyverse, which provides read_csv(), pivot_longer(), and ggplot()
library(tidyverse)
# set working directory
setwd("~/Documents/DACSS601Fall21/_data")
# read in data and assign to variable
cereal <- read_csv("cereal.csv")
## Rows: 20 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Cereal, Type
## dbl (2): Sodium, Sugar
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# print
cereal
## # A tibble: 20 × 4
## Cereal Sodium Sugar Type
## <chr> <dbl> <dbl> <chr>
## 1 Frosted Mini Wheats 0 11 A
## 2 Raisin Bran 340 18 A
## 3 All Bran 70 5 A
## 4 Apple Jacks 140 14 C
## 5 Captain Crunch 200 12 C
## 6 Cheerios 180 1 C
## 7 Cinnamon Toast Crunch 210 10 C
## 8 Crackling Oat Bran 150 16 A
## 9 Fiber One 100 0 A
## 10 Frosted Flakes 130 12 C
## 11 Froot Loops 140 14 C
## 12 Honey Bunches of Oats 180 7 A
## 13 Honey Nut Cheerios 190 9 C
## 14 Life 160 6 C
## 15 Rice Krispies 290 3 C
## 16 Honey Smacks 50 15 A
## 17 Special K 220 4 A
## 18 Wheaties 180 4 A
## 19 Corn Flakes 200 3 A
## 20 Honeycomb 210 11 C
What are we looking at? We have a list of twenty cereals, and for each cereal we have a numeric value for both sodium and sugar and a categorical value for type. I will assume that the numeric values represent the amount of sodium and sugar in each cereal, but we don’t know whether it’s the amount per suggested serving, per average serving (these two are almost never the same), per package, or per some standard unit (e.g. 100 grams). We also don’t know what type is. Inspecting the column shows that this variable takes two values, A and C, but that doesn’t tell us what they mean.
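One quick way to inspect a categorical column like Type is to tabulate it. The sketch below rebuilds just the Type column from the table printed above so that it runs on its own (the names `type`, `type_values`, and `type_counts` are my own); with the data already loaded, `table(cereal$Type)` would do the same job.

```r
# Type column copied from the printed tibble above (rows 1 to 20)
type <- c("A", "A", "A", "C", "C", "C", "C", "A", "A", "C",
          "C", "A", "C", "C", "C", "A", "A", "A", "A", "C")

# distinct values that appear in the column
type_values <- unique(type)
print(type_values)

# number of cereals recorded under each value
type_counts <- table(type)
print(type_counts)
```

The counts show how the cereals split between the two values, but they still say nothing about what A and C stand for.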
Let’s look back at our guiding questions.
Moving forward despite all that we don’t know about this dataset, let’s get some summary statistics. I know we’ll want the mean amount of sodium and sugar so let’s start there.
# get mean of sodium using base r method
mean(cereal$Sodium)
## [1] 167
# get mean of sugar using base r method
mean(cereal$Sugar)
## [1] 8.75
# get mean of both using colMeans()
colMeans(cereal[sapply(cereal, is.numeric)])
## Sodium Sugar
## 167.00 8.75
Good. I’d also like to see the median amounts of sodium and sugar. This time I’ll use the summary() function, which is more useful because it returns the minimum, quartiles, median, mean, and maximum of each numeric column at once.
# summarize dataset
summary(cereal)
## Cereal Sodium Sugar Type
## Length:20 Min. : 0.0 Min. : 0.00 Length:20
## Class :character 1st Qu.:137.5 1st Qu.: 4.00 Class :character
## Mode :character Median :180.0 Median : 9.50 Mode :character
## Mean :167.0 Mean : 8.75
## 3rd Qu.:202.5 3rd Qu.:12.50
## Max. :340.0 Max. :18.00
Good. We can see that this function also returns information about non-numeric columns, such as the length of the column (the number of rows) and its class (the data type R assigns to the column, here character).
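If only the medians are needed, base R’s median() can be applied to each numeric column, much like the mean() calls earlier. This is a minimal sketch that recreates the two numeric columns from the printed table so it is self-contained; the data frame name `nutrients` is my own. The result should agree with the Median row of the summary() output above.

```r
# Sodium and Sugar columns copied from the printed tibble above
nutrients <- data.frame(
  Sodium = c(0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
             140, 180, 190, 160, 290, 50, 220, 180, 200, 210),
  Sugar  = c(11, 18, 5, 14, 12, 1, 10, 16, 0, 12,
             14, 7, 9, 6, 3, 15, 4, 4, 3, 11)
)

# sapply() applies median() to each column and returns a named vector
medians <- sapply(nutrients, median)
print(medians)
```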
Visualizing the data is an important part of the exploratory process. Visualizations can help us see outliers or anything that’s wonky about our dataset. They can also help us discern relationships between the variables.
First I’d like to simply see the sodium and sugar content of each cereal. To do this, I’m going to pivot sodium and sugar into a single column Nutrient.
# pivot_longer
cereal <- cereal %>%
  pivot_longer(
    cols = c(Sodium, Sugar),
    names_to = "Nutrient",
    values_to = "Amount"
  )
# print
cereal
## # A tibble: 40 × 4
## Cereal Type Nutrient Amount
## <chr> <chr> <chr> <dbl>
## 1 Frosted Mini Wheats A Sodium 0
## 2 Frosted Mini Wheats A Sugar 11
## 3 Raisin Bran A Sodium 340
## 4 Raisin Bran A Sugar 18
## 5 All Bran A Sodium 70
## 6 All Bran A Sugar 5
## 7 Apple Jacks C Sodium 140
## 8 Apple Jacks C Sugar 14
## 9 Captain Crunch C Sodium 200
## 10 Captain Crunch C Sugar 12
## # … with 30 more rows
Good.
# create geom_col plot
ggplot(data = cereal) +
  geom_col(mapping = aes(x = Cereal, y = Amount, color = Nutrient)) +
  labs(title = "Sugar and Sodium Content in Some Cereal Brands", y = "Amount", x = "Cereal")
Looks good, minus the label madness at the bottom. One solution might be to change the font size, but I think I’d have to make it so small that it wouldn’t be any more readable than it currently is. Instead I’m going to try rotating the axis labels. I’m also going to switch from color, which only outlines the columns, to fill, which colors them in.
# rotate axis labels on geom_col plot
ggplot(data = cereal) +
  geom_col(mapping = aes(x = Cereal, y = Amount, fill = Nutrient)) +
  labs(title = "Sugar and Sodium Content in Some Cereal Brands", y = "Amount", x = "Cereal") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
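Much better. This makes it much easier to see that, for every cereal in our dataset but one (Frosted Mini Wheats), the sodium value is far larger than the sugar value. Since we don’t know how either amount is measured, this compares the recorded numbers rather than quantities we know to be comparable. While looking at our data in a table would provide the same information, it’s much easier to see when represented visually.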
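What the plot suggests can also be checked directly against the numbers. The sketch below rebuilds the wide table from the values printed earlier (the names `cereal_check` and `exceptions` are my own) and lists every cereal whose sodium value is not larger than its sugar value.

```r
# wide-format data copied from the tibble printed at the start of the post
cereal_check <- data.frame(
  Cereal = c("Frosted Mini Wheats", "Raisin Bran", "All Bran", "Apple Jacks",
             "Captain Crunch", "Cheerios", "Cinnamon Toast Crunch",
             "Crackling Oat Bran", "Fiber One", "Frosted Flakes", "Froot Loops",
             "Honey Bunches of Oats", "Honey Nut Cheerios", "Life",
             "Rice Krispies", "Honey Smacks", "Special K", "Wheaties",
             "Corn Flakes", "Honeycomb"),
  Sodium = c(0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
             140, 180, 190, 160, 290, 50, 220, 180, 200, 210),
  Sugar  = c(11, 18, 5, 14, 12, 1, 10, 16, 0, 12,
             14, 7, 9, 6, 3, 15, 4, 4, 3, 11),
  stringsAsFactors = FALSE
)

# cereals whose sodium value is NOT larger than their sugar value
exceptions <- cereal_check$Cereal[cereal_check$Sodium <= cereal_check$Sugar]
print(exceptions)
```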
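A column chart isn’t the best option for uncovering any correlation between sodium and sugar, however. For that I think we need a scatterplot, which means pivoting the data back to wide format so that sodium and sugar each have their own column again.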
# pivot_wider
cerealwide <- cereal %>%
  pivot_wider(
    names_from = "Nutrient",
    values_from = "Amount"
  )
# print
cerealwide
## # A tibble: 20 × 4
## Cereal Type Sodium Sugar
## <chr> <chr> <dbl> <dbl>
## 1 Frosted Mini Wheats A 0 11
## 2 Raisin Bran A 340 18
## 3 All Bran A 70 5
## 4 Apple Jacks C 140 14
## 5 Captain Crunch C 200 12
## 6 Cheerios C 180 1
## 7 Cinnamon Toast Crunch C 210 10
## 8 Crackling Oat Bran A 150 16
## 9 Fiber One A 100 0
## 10 Frosted Flakes C 130 12
## 11 Froot Loops C 140 14
## 12 Honey Bunches of Oats A 180 7
## 13 Honey Nut Cheerios C 190 9
## 14 Life C 160 6
## 15 Rice Krispies C 290 3
## 16 Honey Smacks A 50 15
## 17 Special K A 220 4
## 18 Wheaties A 180 4
## 19 Corn Flakes A 200 3
## 20 Honeycomb C 210 11
Good. Note to self that in the future when I pivot_longer, I may want to assign the long dataset to a new variable so I don’t have to recreate the original wide format.
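That note to self might look something like the sketch below. To keep it short it uses only the first three rows of the dataset, and the names `cereal_small` and `cereal_small_long` are my own; the point is simply that the long version gets a new name, so the wide original stays available.

```r
library(tidyr)

# first three rows of the dataset, copied from the printed tibble above
cereal_small <- data.frame(
  Cereal = c("Frosted Mini Wheats", "Raisin Bran", "All Bran"),
  Sodium = c(0, 340, 70),
  Sugar  = c(11, 18, 5),
  Type   = c("A", "A", "A"),
  stringsAsFactors = FALSE
)

# store the long version under a new name instead of overwriting the original
cereal_small_long <- pivot_longer(
  cereal_small,
  cols = c(Sodium, Sugar),
  names_to = "Nutrient",
  values_to = "Amount"
)

# the wide data is untouched, so no pivot_wider() is needed later
print(cereal_small)
print(cereal_small_long)
```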
# create scatterplot
ggplot(data = cerealwide) +
  geom_point(
    mapping = aes(x = Sodium, y = Sugar, color = Cereal)
  )
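A scatterplot can be complemented with a number. As a rough sketch, base R’s cor() computes the Pearson correlation coefficient, which runs from -1 to 1 and sits near 0 when there is little linear relationship. The vectors `sodium` and `sugar` below are copied from the printed table so that the snippet runs on its own; with the data loaded, `cor(cerealwide$Sodium, cerealwide$Sugar)` would be the equivalent call.

```r
# values copied from the printed tibble, in row order
sodium <- c(0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
            140, 180, 190, 160, 290, 50, 220, 180, 200, 210)
sugar  <- c(11, 18, 5, 14, 12, 1, 10, 16, 0, 12,
            14, 7, 9, 6, 3, 15, 4, 4, 3, 11)

# Pearson correlation (the default method of cor())
r <- cor(sodium, sugar)
print(r)
```

Pearson correlation only captures linear association, so it is a supplement to looking at the plot rather than a replacement for it.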
Good. I don’t see a strong correlation, if any, between sodium content and sugar content. Thinking back to our dataset, though, I remember that each cereal is assigned a type: A or C. I have no idea what those values mean but I wonder if grouping the cereal by type might reveal anything different.
# facet the scatterplot by type
ggplot(data = cerealwide) +
  geom_point(mapping = aes(x = Sodium, y = Sugar, color = Cereal)) +
  facet_wrap(~ Type, nrow = 2)
Type A cereals appear to be less clustered than type C cereals but viewing them separately still doesn’t make the difference between the two types clear.
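One way to put a number on “less clustered” is to compare the standard deviation of each nutrient within each type: a larger standard deviation means the values are more spread out. This is only a sketch using base R’s aggregate(), with the data recreated from the printed table and the names `cereal_by_type` and `spread_by_type` chosen by me; standard deviation is my choice of measure, not something the dataset prescribes.

```r
# numeric columns and Type copied from the printed tibble, in row order
cereal_by_type <- data.frame(
  Sodium = c(0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
             140, 180, 190, 160, 290, 50, 220, 180, 200, 210),
  Sugar  = c(11, 18, 5, 14, 12, 1, 10, 16, 0, 12,
             14, 7, 9, 6, 3, 15, 4, 4, 3, 11),
  Type   = c("A", "A", "A", "C", "C", "C", "C", "A", "A", "C",
             "C", "A", "C", "C", "C", "A", "A", "A", "A", "C"),
  stringsAsFactors = FALSE
)

# standard deviation of Sodium and Sugar within each Type
spread_by_type <- aggregate(cbind(Sodium, Sugar) ~ Type,
                            data = cereal_by_type, FUN = sd)
print(spread_by_type)
```

Swapping `FUN = sd` for `FUN = mean` or `FUN = median` gives per-type centers instead of spreads.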
Exploratory data analysis is the process by which we utilize various tools, such as calculating summary statistics and visualizing our data in different ways, to explore our data. The goal is to understand whether it will be useful for the questions we want to answer and discover opportunities for further exploration and analysis. More next week!