Clusters and Visualizations
David Hardister
Hundreds of millions of people eat breakfast cereals on a daily basis. This data product is a way of visualizing clusters of breakfast cereals for some of the most popular brands in the United States.
This data set was selected because it is simple and clean. A larger data set would likely yield more interesting results, but cleaning it would take considerably more work. This data product was built with the 'shiny' library, and the presentation was constructed with 'R Presentation'.
The data presented here comes from the MASS library in R, specifically its 'UScereal' data set.
The data product can cluster cereals according to any two selected variables.
Among the key variables of interest are Manufacturer and Brand. The manufacturers of cereals in this data set include the following:
G=General Mills, K=Kelloggs, N=Nabisco, P=Post, Q=Quaker Oats, R=Ralston Purina
Here is the beginning of the data set, to give an idea of the expected values:
'data.frame': 3 obs. of 11 variables:
$ mfr : Factor w/ 6 levels "G","K","N","P",..: 3 2 2
$ calories : num 212 212 100
$ protein : num 12.1 12.1 8
$ fat : num 3.03 3.03 0
$ sodium : num 394 788 280
$ fibre : num 30.3 27.3 28
$ carbo : num 15.2 21.2 16
$ sugars : num 18.2 15.2 0
$ shelf : int 3 3 3
$ potassium: num 848 970 660
$ vitamins : Factor w/ 3 levels "100%","enriched",..: 2 2 2
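A listing like the one above could be reproduced with a few lines of R. This is a minimal sketch, assuming the MASS package is installed; the `cereals` data frame and its `brand` column are illustrative names, not part of the original product.

```r
# Load the UScereal data set that ships with the MASS package
library(MASS)
data(UScereal)

# Brand names are stored as row names; copy them into a column
# (the name `brand` is an assumption made for this sketch)
cereals <- UScereal
cereals$brand <- rownames(UScereal)

# Show the structure of the first three rows
str(head(UScereal, 3))

# List the manufacturer codes present in the data
levels(UScereal$mfr)
```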
The user has the option of choosing a specific Manufacturer or looking at all data. An X Variable and a Y Variable must be chosen. The user can also choose a cluster count; however, if the cluster count is greater than the number of data points, the clusters will not compute.
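The selection and clustering steps described above might look roughly like the following. The presentation does not name the clustering algorithm, so this sketch assumes R's built-in `kmeans`; the helper `cluster_cereals` and its arguments are hypothetical. The guard is deliberately conservative: it refuses any cluster count that is not smaller than the number of distinct points, because the default `kmeans` method rejects a count equal to the number of rows.

```r
library(MASS)

# Hypothetical helper: cluster cereals on two chosen variables.
# kmeans is an assumption; the original app may use another method.
cluster_cereals <- function(x_var, y_var, k, manufacturer = "All") {
  data <- UScereal

  # Optionally restrict the data to one manufacturer code (e.g. "K")
  if (manufacturer != "All") {
    data <- data[data$mfr == manufacturer, , drop = FALSE]
  }

  points <- data[, c(x_var, y_var), drop = FALSE]

  # Guard: too many clusters for the available distinct points
  if (k >= nrow(unique(points))) {
    return(NULL)
  }

  fit <- kmeans(points, centers = k, nstart = 10)
  data$cluster <- factor(fit$cluster)
  data
}

set.seed(42)
all_cereals <- cluster_cereals("sugars", "calories", k = 3)
table(all_cereals$cluster)

# Nabisco has very few cereals here, so a large cluster count is refused
cluster_cereals("sugars", "calories", k = 50, manufacturer = "N")
```

Returning `NULL` instead of raising an error would let an interactive app show a message rather than crash, which is one possible design choice among several.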
By hovering over a data point, the user is able to look at more detailed information. This information includes specifics on the X and Y variables chosen. It also includes the Manufacturer of the cereal and the specific Brand of cereal.
Although this specific data set and product are limited to breakfast cereals sold in the United States, the same approach applies to many other data sets. Clustering does not have to be limited to two variables, but two make the most sense for the simple comparison presented here.
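As an illustration of clustering on more than two variables, the sketch below runs `kmeans` on all eight numeric nutrition columns. Both the use of `kmeans` and the choice to standardize the columns with `scale` are assumptions made for this example, not something the original product is known to do; the cluster count of 4 is arbitrary.

```r
library(MASS)

# Numeric nutrition columns of UScereal
numeric_vars <- c("calories", "protein", "fat", "sodium",
                  "fibre", "carbo", "sugars", "potassium")

# Standardize so that large-valued columns (e.g. sodium, potassium)
# do not dominate the distance calculation
scaled <- scale(UScereal[, numeric_vars])

set.seed(1)
fit <- kmeans(scaled, centers = 4, nstart = 10)

# Cross-tabulate cluster membership against manufacturer
table(cluster = fit$cluster, manufacturer = UScereal$mfr)
```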
Additionally, the information presented does not have to stop here. Additional charts could be shown alongside the clusters, especially for larger data sets. Using this data product should be easy, and anyone who benefits from it deserves a healthy and/or delicious breakfast!