Breakfast Cereals

Clusters and Visualizations

David Hardister

Description

Hundreds of millions of people eat breakfast cereal every day. This data product visualizes clusters of breakfast cereals from some of the most popular brands in the United States.

The data set was chosen because it is simple and clean. A larger data set would give more interesting results, but cleaning it would take considerably more work. The data product was built with the 'shiny' library, and the presentation was constructed with 'R Presentation'.

Details

The data presented here come from the 'UScereal' data set in the MASS library in R.

The data product can cluster cereals according to their values on two selected variables.
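As a rough illustration of clustering on two selected variables, here is a minimal sketch using k-means from base R. The presentation does not say which clustering algorithm the data product uses, so kmeans() is an assumption, as are the variable choices (x_var, y_var) and the cluster count (k).

```r
library(MASS)  # provides the UScereal data set

# Hypothetical stand-ins for the X Variable, Y Variable and cluster count.
x_var <- "sugars"
y_var <- "calories"
k <- 3

# Keep only the two selected columns.
selected <- UScereal[, c(x_var, y_var)]

# k-means starts from randomly chosen centres, so fix the seed
# to make the result repeatable.
set.seed(42)
fit <- kmeans(selected, centers = k)

# Cluster sizes and the centre of each cluster.
print(table(fit$cluster))
print(fit$centers)
```

Calling plot(selected, col = fit$cluster) afterwards would draw the cereals coloured by cluster.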

Among the key variables of interest are Manufacturer and Brand. Manufacturers of cereals in this data set include the following:

G=General Mills, K=Kelloggs, N=Nabisco, P=Post, Q=Quaker Oats, R=Ralston Purina
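The one-letter codes above can be translated into full names with a named vector. This is only a sketch: the names mfr_names and manufacturer are illustrative and not taken from the data product.

```r
library(MASS)  # provides the UScereal data set

# Lookup from the one-letter codes in UScereal$mfr to manufacturer names.
mfr_names <- c(G = "General Mills", K = "Kelloggs", N = "Nabisco",
               P = "Post", Q = "Quaker Oats", R = "Ralston Purina")

# Translate each cereal's code. as.character() makes the lookup go by
# code letter rather than by the factor's internal integer codes.
manufacturer <- unname(mfr_names[as.character(UScereal$mfr)])

# Number of cereals per manufacturer.
print(table(manufacturer))
```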

Numeric Variables

  • Calories
  • Protein (grams)
  • Fat (grams)
  • Sodium (milligrams)
  • Fiber (grams)
  • Carbohydrates (grams)
  • Sugar (grams)
  • Shelf (display shelf, counting from the floor)
  • Potassium (milligrams)

Example

Here is the structure of the first three rows of the data set, to give an idea of the expected values:

'data.frame':   3 obs. of  11 variables:
 $ mfr      : Factor w/ 6 levels "G","K","N","P",..: 3 2 2
 $ calories : num  212 212 100
 $ protein  : num  12.1 12.1 8
 $ fat      : num  3.03 3.03 0
 $ sodium   : num  394 788 280
 $ fibre    : num  30.3 27.3 28
 $ carbo    : num  15.2 21.2 16
 $ sugars   : num  18.2 15.2 0
 $ shelf    : int  3 3 3
 $ potassium: num  848 970 660
 $ vitamins : Factor w/ 3 levels "100%","enriched",..: 2 2 2
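A summary like the one above can be produced with str() on the first three rows. This sketch assumes only that the MASS library is installed; the name cereal_head is illustrative.

```r
library(MASS)  # provides the UScereal data set

# First three rows of the data set. Brand names are stored as row names,
# not as a column, which is why no Brand variable appears in the summary.
cereal_head <- head(UScereal, 3)

# One line per variable: its type and its first values.
str(cereal_head)

# Brand names for the same three rows.
print(rownames(cereal_head))
```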

How to Use the Data Product

The user can choose a specific Manufacturer or look at all of the data. An X Variable and a Y Variable must be chosen. The user can also choose a cluster count; however, if the cluster count is greater than the number of data points, the clusters will not be computed.
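The manufacturer filter and the cluster count limit could be handled along these lines. This is a sketch under assumptions: cluster_cereals() is a hypothetical helper, kmeans() stands in for whatever clustering the data product actually uses, and requiring strictly fewer clusters than distinct data points is a deliberately conservative guard.

```r
library(MASS)  # provides the UScereal data set

# Hypothetical helper: cluster the cereals of one manufacturer (or all of
# them) on two variables. Returns NULL when the data cannot support k clusters.
cluster_cereals <- function(mfr = "All", x_var, y_var, k) {
  rows <- if (mfr == "All") UScereal else UScereal[UScereal$mfr == mfr, ]
  selected <- rows[, c(x_var, y_var)]

  # kmeans() stops with an error when asked for more centres than there
  # are distinct points, so check before calling it.
  n_distinct <- nrow(unique(selected))
  if (k >= n_distinct) {
    return(NULL)
  }

  set.seed(42)  # repeatable starting centres
  kmeans(selected, centers = k)
}

# Three clusters among Kelloggs cereals.
fit_k <- cluster_cereals("K", "sugars", "calories", k = 3)
print(table(fit_k$cluster))

# Far more clusters than cereals: nothing is computed.
print(is.null(cluster_cereals("All", "sugars", "calories", k = 100)))
```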

By hovering over a data point, the user is able to look at more detailed information. This information includes specifics on the X and Y variables chosen. It also includes the Manufacturer of the cereal and the specific Brand of cereal.
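The text shown on hover could be assembled per cereal as in the sketch below. How the data product actually renders its hover details is not described here, so this only builds the label strings; x_var, y_var and hover_text are illustrative names.

```r
library(MASS)  # provides the UScereal data set

# Hypothetical stand-ins for the chosen X and Y variables.
x_var <- "sugars"
y_var <- "calories"

# One label per cereal: brand (from the row names), manufacturer code,
# and the values of the two selected variables.
hover_text <- sprintf(
  "Brand: %s\nManufacturer: %s\n%s: %.1f\n%s: %.1f",
  rownames(UScereal),
  as.character(UScereal$mfr),
  x_var, as.numeric(UScereal[[x_var]]),
  y_var, as.numeric(UScereal[[y_var]])
)

# Label for the first cereal in the data set.
cat(hover_text[1], "\n")
```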

Conclusion

Although this data set and product are limited to breakfast cereals sold in the United States, the same approach applies to other data. Clustering does not have to be limited to two variables, but two variables suit the simple comparison presented here.
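Clustering on more than two variables might look like the sketch below. It again assumes kmeans(), and it standardizes the columns first so that variables on large scales (such as sodium) do not dominate the distances. The choice of variables and of three clusters is arbitrary.

```r
library(MASS)  # provides the UScereal data set

# Numeric nutrition columns, using the column names from the data set.
num_vars <- c("calories", "protein", "fat", "sodium",
              "fibre", "carbo", "sugars", "potassium")

# Standardize each column to mean 0 and standard deviation 1.
scaled <- scale(UScereal[, num_vars])

# Several random starts make the result less sensitive to the initial centres.
set.seed(42)
fit_all <- kmeans(scaled, centers = 3, nstart = 10)

# Cross-tabulate clusters against manufacturer codes.
print(table(cluster = fit_all$cluster, mfr = UScereal$mfr))
```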

Additionally, the information presented does not have to be limited to what is shown here. Additional charts could accompany the clusters, especially for larger data sets. Using this data product should be easy, and anyone who benefits from it deserves a healthy and/or delicious breakfast!