Challenge 5

Read in the Data

library(tidyverse)
cereal_raw <- read.csv("../challenge_datasets/cereal.csv")
cereal_raw
                  Cereal Sodium Sugar Type
1    Frosted Mini Wheats      0    11    A
2            Raisin Bran    340    18    A
3               All Bran     70     5    A
4            Apple Jacks    140    14    C
5         Captain Crunch    200    12    C
6               Cheerios    180     1    C
7  Cinnamon Toast Crunch    210    10    C
8     Crackling Oat Bran    150    16    A
9              Fiber One    100     0    A
10        Frosted Flakes    130    12    C
11           Froot Loops    140    14    C
12 Honey Bunches of Oats    180     7    A
13    Honey Nut Cheerios    190     9    C
14                  Life    160     6    C
15         Rice Krispies    290     3    C
16          Honey Smacks     50    15    A
17             Special K    220     4    A
18              Wheaties    180     4    A
19           Corn Flakes    200     3    A
20             Honeycomb    210    11    C

Briefly describe the data

The data set lists 20 popular cereals along with the sodium and sugar content of each. It also gives a "Type" for each cereal, either "A" or "C", though the data does not say what Type actually refers to.

Tidy Data (as needed)

I tidied the data by converting Type from a character string to a factor. No other tidying was needed.

cereal <- mutate(cereal_raw, Type = factor(Type))

Univariate Visualizations

First, we can visualize the sodium content across cereals. The first attempt uses the default of 30 bins, which is too fine for only 20 cereals and does not capture the general shape of the data. On the second attempt I set the bin width to 60, which does a better job of showing that sodium values concentrate around 200 and thin out above and below.

ggplot(cereal, aes(Sodium)) +
  geom_histogram(fill = "blue", color = "black") +
  labs(title = "Histogram of Sodium Content",
       x = "Sodium Content",
       y = "Frequency")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(cereal, aes(Sodium)) +
  geom_histogram(binwidth = 60, fill = "blue", color = "black", alpha = 0.5) +
  labs(title = "Histogram of Sodium Content",
       x = "Sodium Content",
       y = "Frequency")
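
As a rough numeric check on the second histogram, the sketch below counts the cereals falling in each 60-wide bin using base R. The sodium values are copied from the table printed above so the snippet runs without the CSV. The break points (-30, 30, 90, ...) are an assumption chosen to center bins on multiples of 60, and they may not match the plotted bins exactly.

```r
# Sodium values copied from the printed table (rows 1 to 20)
sodium <- c(0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
            140, 180, 190, 160, 290, 50, 220, 180, 200, 210)

# Assumed break points: 60-wide bins centered on multiples of 60
breaks <- seq(-30, 390, by = 60)

# Count how many cereals land in each (lower, upper] interval
bin_counts <- table(cut(sodium, breaks = breaks))
print(bin_counts)

# Simple measures of center to compare against the histogram peak
sodium_median <- median(sodium)
sodium_mean <- mean(sodium)
print(c(median = sodium_median, mean = sodium_mean))
```

If the busiest interval is the one spanning 150 to 210, that agrees with the concentration seen in the histogram.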

We can visualize the sugar content of the cereals in the same way. The sugar values are not concentrated around a single number. Instead, they are spread fairly evenly across the range from 0 to 18.

ggplot(cereal, aes(x = Sugar)) +
  geom_histogram(binwidth = 2, fill = "blue", color = "black", alpha = 0.5) +
  labs(title = "Histogram of Sugar Content",
       x = "Sugar Content",
       y = "Frequency")
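
To put a number on how spread out the sugar values are, the sketch below tabulates them in 2-wide bins with base R. The values are copied from the printed table, and the odd-numbered break points are an assumption meant to approximate the plotted bins.

```r
# Sugar values copied from the printed table (rows 1 to 20)
sugar <- c(11, 18, 5, 14, 12, 1, 10, 16, 0, 12,
           14, 7, 9, 6, 3, 15, 4, 4, 3, 11)

# Assumed break points: 2-wide bins centered on even numbers
sugar_breaks <- seq(-1, 19, by = 2)

# Count how many cereals land in each (lower, upper] interval
sugar_counts <- table(cut(sugar, breaks = sugar_breaks))
print(sugar_counts)

# Largest share of the 20 cereals held by any single bin
largest_share <- max(sugar_counts) / length(sugar)
print(largest_share)
```

A small largest share means no single bin dominates, which is what a flat, spread-out histogram looks like numerically.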

We can also visualize the distribution of cereal Type using a bar chart. There are exactly as many Type A cereals as Type C cereals, 10 of each.

# Count the number of cereals of each Type
ggplot(cereal, aes(Type)) +
  geom_bar(fill = "skyblue", color = "black", alpha = 0.7) +
  labs(title = "Number of Cereals by Type",
       x = "Cereal Type",
       y = "Count")
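
The counts behind the bar chart can be confirmed directly. This sketch rebuilds the Type column from the printed table (so it runs without the CSV) and tabulates it with base R.

```r
# Type values copied from the printed table (rows 1 to 20)
type <- factor(c("A", "A", "A", "C", "C", "C", "C", "A", "A", "C",
                 "C", "A", "C", "C", "C", "A", "A", "A", "A", "C"))

# Number of cereals at each factor level
type_counts <- table(type)
print(type_counts)

# TRUE when every level has the same count
balanced <- length(unique(as.integer(type_counts))) == 1
print(balanced)
```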

Bivariate Visualization(s)

We can create a scatter plot which compares the sodium content of each cereal to its sugar content, with points labeled by cereal name and colored by Type. There is no strong relationship between the two measures.

ggplot(cereal, aes(x = Sodium, y = Sugar, label = Cereal, color = Type)) +
  geom_point(size = 3, alpha = 0.7) +
  geom_text(hjust = -0.1, vjust = 0.5, size = 3) +
  labs(title = "Scatter Plot of Sodium vs Sugar Content, Colored by Type",
       x = "Sodium Content",
       y = "Sugar Content") +
  scale_color_discrete(name = "Type")
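
The visual impression of a weak relationship can be checked with a correlation coefficient. The sketch below uses base R's `cor()` on values copied from the printed table. Using the Pearson correlation is an assumption of this sketch, not something the analysis above computed.

```r
# Sodium and sugar values copied from the printed table (rows 1 to 20)
sodium <- c(0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
            140, 180, 190, 160, 290, 50, 220, 180, 200, 210)
sugar <- c(11, 18, 5, 14, 12, 1, 10, 16, 0, 12,
           14, 7, 9, 6, 3, 15, 4, 4, 3, 11)

# Pearson correlation (the default method for cor())
r <- cor(sodium, sugar)
print(round(r, 3))
```

A coefficient close to zero would be consistent with the lack of any clear trend in the scatter plot.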

We can also compare sugar and sodium content between the two types of cereal. For both sugar and sodium, type C cereals are more tightly concentrated around the median than type A cereals. In other words, they have noticeably less spread.

ggplot(cereal, aes(Type, Sodium)) +
  geom_boxplot() +
  labs(title = "Sodium Content Distributions by Type",
       x = "Type",
       y = "Sodium Content")

ggplot(cereal, aes(Type, Sugar)) +
  geom_boxplot() +
  labs(title = "Sugar Content Distributions by Type",
       x = "Type",
       y = "Sugar Content")
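
The difference in spread can also be summarized numerically. This sketch computes the interquartile range and standard deviation of each measure by Type with base R's `tapply()`, using values copied from the printed table. The choice of IQR and standard deviation as spread measures is an assumption of the sketch.

```r
# Columns copied from the printed table (rows 1 to 20)
type <- factor(c("A", "A", "A", "C", "C", "C", "C", "A", "A", "C",
                 "C", "A", "C", "C", "C", "A", "A", "A", "A", "C"))
sodium <- c(0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
            140, 180, 190, 160, 290, 50, 220, 180, 200, 210)
sugar <- c(11, 18, 5, 14, 12, 1, 10, 16, 0, 12,
           14, 7, 9, 6, 3, 15, 4, 4, 3, 11)

# Spread of each measure within each Type
sodium_iqr <- tapply(sodium, type, IQR)
sugar_iqr <- tapply(sugar, type, IQR)
sodium_sd <- tapply(sodium, type, sd)
sugar_sd <- tapply(sugar, type, sd)

# One row per statistic, one column per Type
spread <- rbind(sodium_iqr, sodium_sd, sugar_iqr, sugar_sd)
print(round(spread, 1))
```

Smaller values in the C column than in the A column would match what the boxplots show.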