Introduction

We investigated a 77x16 dataset containing nutritional information about 77 cereals from 7 manufacturers. The dataset includes each cereal’s name, manufacturer, and nutritional information such as calories, protein, sugar, and potassium. We focused on sugar content, both for individual cereals and averaged by manufacturer, and examined how it relates to customer rating and shelf location. We hypothesized that cereals with a higher sugar content would have a higher rating and be more likely to be placed on the shelves closer to eye level (2 or 3).

We dealt with quantitative and categorical variables as follows:

FACTOR (a category represented by name instead of number)
Cereal name - 100% Bran, Corn Chex, Shredded Wheat, etc.
Manufacturer - American Home Food Products, General Mills, Kellogg, Nabisco, Post, Quaker Oats, and Ralston Purina
Type of cereal - cold or hot

INTEGER (a whole number)
Calories - per serving
Fat - grams
Potassium - milligrams
Protein - grams
Shelf position - 1, 2, or 3, counting up from the floor (the floor itself being 0)
Sodium - milligrams
Sugar - grams
Vitamins - vitamins and minerals - 0, 25, or 100, indicating the typical percentage of the FDA recommendation

NUMERIC (a number with a decimal point, not a whole number)
Carbohydrates - grams
Cereal rating - suggested as being from Consumer Reports
Cups - number of cups in one serving
Fiber - grams
Weight - weight in ounces of one serving

Pipe to calculate the mean and standard deviation of the numeric data, by manufacturer

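# PIPE TO FIND MEAN AND SD OF ALL NUMERIC VARIABLES, BY MANUFACTURER
library(dplyr)

mean_df <- 
  cereal %>%
  group_by(mfr) %>%
  # naming the list entries gives readable columns such as sugars_mean and sugars_sd
  summarize(across(where(is.numeric), 
                   list(mean = ~ mean(.x, na.rm = TRUE), 
                        sd = ~ sd(.x, na.rm = TRUE)))) 
mean_df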

Plots

Graph 1 is a bar plot that shows the average sugar content (calculated in the pipe above) of the cereals produced by each manufacturer. The bars are colored by manufacturer and ordered by ascending sugar content. The same data could be shown as a Cleveland dot plot (an ordered display of means), but we kept the bars because they make the differences in mean sugar content between manufacturers easier to see.
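The ordering described for Graph 1 can be sketched as follows. This is a minimal illustration and not the report's original plotting code: the toy data frame and its values are invented, and only the column names mfr and sugars are taken from the code shown elsewhere in this report.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the cereal data; these values are invented for illustration.
toy <- data.frame(
  mfr    = c("G", "G", "K", "K", "N", "N"),
  sugars = c(10, 12, 7, 9, 0, 3)
)

# Mean sugar content per manufacturer.
sugar_means <- toy %>%
  group_by(mfr) %>%
  summarize(mean_sugar = mean(sugars, na.rm = TRUE))

# reorder() sorts the manufacturers by mean sugar, so the bars ascend left to right.
bar_plot <- ggplot(sugar_means,
                   aes(x = reorder(mfr, mean_sugar), y = mean_sugar, fill = mfr)) +
  geom_col() +
  labs(x = "Manufacturer", y = "Mean sugar (g)")
bar_plot
```

Swapping geom_col() for geom_point() on the same reordered axis would give the Cleveland-style dot plot mentioned above.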

Graph 2 is a series of boxplots that shows the five-number summary of sugar content for each manufacturer. This graph helps us identify which manufacturers have the greatest variation in sugar content; it also shows that the median amount of sugar varies noticeably between manufacturers.
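As a rough sketch of what the boxplots summarize, the five-number summary can also be computed directly with base R's fivenum(). The toy data below is invented for illustration, and the plotting call is an assumption about how such a graph could be drawn, not the report's original code.

```r
library(ggplot2)

# Toy stand-in for the cereal data; these values are invented for illustration.
toy <- data.frame(
  mfr    = rep(c("G", "K"), each = 5),
  sugars = c(8, 9, 10, 12, 14, 0, 3, 6, 9, 15)
)

# Minimum, lower hinge, median, upper hinge, and maximum for each manufacturer.
five_num <- tapply(toy$sugars, toy$mfr, fivenum)
five_num

# One box per manufacturer, filled by manufacturer.
box_plot <- ggplot(toy, aes(x = mfr, y = sugars, fill = mfr)) +
  geom_boxplot() +
  labs(x = "Manufacturer", y = "Sugar (g)")
box_plot
```

Note that fivenum() uses Tukey's hinges while geom_boxplot() uses quartiles, so the two can differ slightly on some data.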

Graph 3 is a ridgeline density plot (created using geom_density_ridges()) that allows us to look more closely at the data summarized in the boxplots above. The bar plot shows mean sugar content by manufacturer and the boxplots show the variation around it; the ridgeline plot presents that same variation as a full distribution for each manufacturer, in a more visually appealing format.

To create the ridge plot, we first loaded the ggridges package with library(ggridges). We then used geom_density_ridges(color = "black") to show the variation in sugar content per manufacturer, outlining each ridge in black for clarity; the plot automatically scales the ridges in relation to each other. While some of the ridges do overlap, we do not lose any important data.
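A minimal sketch of these steps is shown below. It assumes the ggridges package is already installed, uses an invented toy data frame in place of the cereal data, and places sugars on the x axis and mfr on the y axis as an assumed mapping.

```r
library(ggplot2)
library(ggridges)

# Toy stand-in for the cereal data; these values are invented for illustration.
toy <- data.frame(
  mfr    = rep(c("G", "K", "N"), each = 4),
  sugars = c(9, 10, 12, 13, 3, 7, 9, 14, 0, 0, 3, 6)
)

# One density ridge per manufacturer, each outlined in black.
ridge_plot <- ggplot(toy, aes(x = sugars, y = mfr)) +
  geom_density_ridges(color = "black") +
  labs(x = "Sugar (g)", y = "Manufacturer")
ridge_plot
```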

Graph 4 plots sugar content against mean rating for each cereal manufacturer, giving seven data points. We faceted the data by manufacturer in order to look more closely, in an easy-to-read format, at the relationship between sugar content and rating for the individual cereals.

Graph 5 investigates a few specific points in the graph above. We focused on one manufacturer with some of the highest sugar content values: Kellogg’s. This lollipop plot (using geom_segment()) shows each Kellogg’s cereal according to its sugar content. While the sugar content of Kellogg’s cereals is widely dispersed, “Smacks” and “Apple Jacks” appear to have the highest sugar content.

To draw the lollipop stems, we used geom_segment(aes(x = name, xend = name, y = 0, yend = sugars, color = name)) to define where each stem starts and ends on each axis. Although the finished graph shows sugars on the x axis and name on the y axis, the code maps name to x and sugars to y because the graph is flipped afterwards using coord_flip(). We used coord_flip() so that the names of the individual cereals read horizontally instead of vertically. The stems are colored by cereal name.
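The stem-and-flip approach can be sketched as follows. The cereal names and sugar values here are hypothetical placeholders, not rows from the dataset, and the geom_point() layer for the lollipop heads is an assumption, since the report only describes the stems.

```r
library(ggplot2)

# Hypothetical cereals with invented sugar values, for illustration only.
toy_k <- data.frame(
  name   = c("Cereal A", "Cereal B", "Cereal C"),
  sugars = c(3, 14, 8)
)

lollipop <- ggplot(toy_k, aes(x = name, y = sugars, color = name)) +
  # Stems run from 0 up to each cereal's sugar value.
  geom_segment(aes(xend = name, y = 0, yend = sugars)) +
  # Assumed layer: a point at the end of each stem forms the lollipop head.
  geom_point(size = 3) +
  # Flip the axes so that the cereal names read horizontally.
  coord_flip() +
  labs(x = "Cereal", y = "Sugar (g)")
lollipop
```

Because coord_flip() is applied last, the labels in labs() still refer to the pre-flip axes: x is the cereal name and y is the sugar content.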

Graph 6 is a scatterplot that displays the shelf levels on which each manufacturer’s products sit. While the graph does not show how many of each manufacturer’s cereals sit on each shelf, it reveals that every manufacturer except American Home Food Products and Ralston Purina has at least one cereal on every shelf. Most manufacturers’ products therefore appear on most shelf levels, which suggests that shelf location likely does not determine a manufacturer’s popularity.
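One possible way to show how many cereals sit on each shelf, which Graph 6 does not, is to count the cereals per manufacturer and shelf and map that count to point size. This is a sketch under assumptions rather than the report's code: the toy data is invented and the shelf column name is assumed.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the cereal data; these values are invented for illustration.
toy <- data.frame(
  mfr   = c("G", "G", "G", "K", "K", "N"),
  shelf = c(1, 3, 3, 2, 3, 1)
)

# Number of cereals for each manufacturer and shelf combination.
shelf_counts <- toy %>% count(mfr, shelf)
shelf_counts

# Larger points mark shelves that hold more of a manufacturer's cereals.
shelf_plot <- ggplot(shelf_counts, aes(x = mfr, y = factor(shelf), size = n)) +
  geom_point() +
  labs(x = "Manufacturer", y = "Shelf", size = "Cereals")
shelf_plot
```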

Graph 7 is a faceted scatterplot that helps determine whether a relationship exists between sugar content, shelf level, and cereal rating. Coloring the points by rating shows us that, within manufacturers, the shelf level of a particular cereal does not correspond to its sugar content or rating.
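A faceted scatterplot of this kind could be built roughly as follows. The toy values are invented, the shelf and rating column names are assumed, and the choice of axes (sugars on x, shelf on y) is a guess at the layout rather than a record of the original graph.

```r
library(ggplot2)

# Toy stand-in for the cereal data; these values are invented for illustration.
toy <- data.frame(
  mfr    = rep(c("G", "K"), each = 3),
  sugars = c(10, 12, 3, 14, 2, 7),
  shelf  = c(2, 3, 1, 2, 1, 3),
  rating = c(30, 25, 50, 33, 46, 40)
)

# One panel per manufacturer; point color encodes the cereal's rating.
facet_plot <- ggplot(toy, aes(x = sugars, y = factor(shelf), color = rating)) +
  geom_point(size = 3) +
  facet_wrap(~ mfr) +
  labs(x = "Sugar (g)", y = "Shelf", color = "Rating")
facet_plot
```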

Discussion

While this dataset does not reveal any groundbreaking discoveries or strong relationships between variables, it provides some helpful findings for manufacturers and useful information for consumers. Manufacturers who might be concerned that shelf placement or sugar content corresponds to cereal rating can dismiss this concern, given the weak to absent relationship we observed.

These findings also help correct false assumptions that consumers might hold when purchasing cereal. Consumers have a wide range of cereals to choose from, and there is no pattern or formula they can use to determine sugar content based on shelf placement. Similarly, consumers cannot determine a cereal’s rating based on shelf placement. Consumers who assume that shelf placement corresponds to a cereal’s sugar level, and that sugar level corresponds to a cereal’s rating, should therefore reconsider these assumptions.

This dataset and its visual representations have some notable limitations. First, the author of the dataset did not provide a clear origin or definition for the cereal rating variable, suggesting only that it may have come from Consumer Reports. Second, the graph “Shelf Placement for each Manufacturer” does not show the number of cereals placed on each shelf. Additionally, the graph “Sugar content v. Shelf Level” would be easier to read if the rating legend used a more distinctive color scheme for each of the three shelf levels.

For future research, it would be interesting to find a cereal dataset that records these variables over time, to see whether rating, sugar content, and shelf placement show patterns over time.

Conclusion

We concluded that a cereal’s sugar content does not appear to be related to its shelf position or consumer rating; there was great variation in the shelf location and sugar content of the cereals produced by each manufacturer. We suspect that high consumer ratings do not depend on high sugar content but rather on the overall desirability of the cereal. Ultimately, although the media often singles out sugary cereals, a high sugar content does not mean that consumers will rate a cereal highly.

Data Source

The dataset used in this analysis is titled “80 Cereals” and came from Kaggle. It can be found at the following link: https://www.kaggle.com/crawford/80-cereals