An Analysis of Cereal Data

Research Question

Do less healthy cereals tend to be displayed on the upper, lower, or middle shelf?

Note: To evaluate relative healthiness of cereals, we will look at calories and sugar specifically. There will be two separate analyses performed.

Background Information

This dataset contains nutrition information about 77 cereals from 7 different manufacturers. The data displays the measurements of various items you would read on a nutrition label as well as information about serving sizes and the cereal’s overall rating. This dataset might specifically appeal to people who eat cereal regularly and want to compare different types of cereal to find the healthiest option.

Variables

The fields in this dataset include:

name: the name of the cereal
mfr: the manufacturer of the cereal
- A = American Home Food Products
- G = General Mills
- K = Kellogg's
- N = Nabisco
- P = Post
- Q = Quaker Oats
- R = Ralston Purina
type: the type of cereal
- C = cold cereal
- H = hot cereal
calories: the number of calories per serving
protein: the number of grams of protein per serving
fat: the number of grams of fat per serving
sodium: the number of milligrams of sodium per serving
fiber: the number of grams of dietary fiber per serving
carbo: the number of grams of complex carbohydrates per serving
sugars: the number of grams of sugars per serving
potass: the number of milligrams of potassium
vitamins: the amount of vitamins and minerals
- values = 0, 25, or 100 -> indicates the typical percentage of FDA recommended
shelf: display shelf
- values = 1, 2, or 3 - counting from the floor
weight: the weight in ounces of one serving
cups: the number of cups in one serving
rating: a rating of the cereals - from consumer reports

The names of the cereals comprise the rows of the dataset whereas all of the other variables comprise the columns.

References to Previous Research

Research has shown that more sugary cereals tend to be placed on the lower shelves for two reasons. First, the lower shelves are taller and deeper and can hold the larger boxes that the best-selling brands tend to manufacture. Second, younger kids walk by the lower shelves and are attracted to these cereals. They then ask their parents to purchase their favorite Lucky Charms or Froot Loops boxes. The adult cereals tend to be on the higher shelves and do not sell out as frequently. The less healthy cereals are shelved at eye level, which typically means shelf 2.

Source: https://www.deseret.com/1990/5/8/18860646/top-shelve-those-sugary-cereals

Exploratory Data Analysis

Setup & Loading Data

# Set working directory
# setwd('C:/Users/student/Documents/DS4001/PracticeOfDataScience')

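# Read in dataset
# stringsAsFactors = TRUE reproduces the factor columns shown in the output below (it is no longer the default in R 4.0 and later)
cereal <- read.csv("cereal.csv", stringsAsFactors = TRUE)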

# Load libraries
library(ggplot2)
library(psych)
library(dplyr)

Data Cleanup

# Renaming the columns
colnames(cereal) <- c("Name", "Manufacturer", "Type", "Calories", "Protein", "Fat", "Sodium", "Fiber", "Carbohydrates", "Sugar", "Potassium", "Vitamins", "Shelf", "Weight", "Cups", "Rating")

# Renaming the manufacturers with a lookup table (one-letter code -> full name)
# A lookup avoids chained gsub() calls, whose result depends on the order of the replacements
mfr_names <- c(A = "American Home Food Products", G = "General Mills", K = "Kellogs", N = "Nabisco", P = "Post", Q = "Quaker Oats", R = "Ralston Purina")

cereal$Manufacturer_Name <- unname(mfr_names[as.character(cereal$Manufacturer)])

# Recoding negative values as NA
cereal$Carbohydrates[cereal$Carbohydrates < 0] <- NA
cereal$Sugar[cereal$Sugar < 0] <- NA
cereal$Potassium[cereal$Potassium < 0] <- NA

# Recode cereal types
cereal$Type <- gsub("H", "Hot", x = cereal$Type)
cereal$Type <- gsub("C", "Cold", x = cereal$Type)

# Change cereal type, shelf, and manufacturer to factors
cereal$Type <- factor(cereal$Type)
cereal$Shelf <- factor(cereal$Shelf)
cereal$Manufacturer <- factor(cereal$Manufacturer)
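
The cleanup above treats negative measurements as missing. As a minimal sketch of the same idea applied to every numeric column at once, the example below uses a made-up data frame (the `toy` values are hypothetical and are not taken from cereal.csv):

```r
# Hypothetical toy data: -1 stands in for a missing measurement
toy <- data.frame(
  Sugar     = c(6, -1, 8),
  Potassium = c(280, 135, -1)
)

# Replace negative values with NA in every numeric column
num_cols <- sapply(toy, is.numeric)
toy[num_cols] <- lapply(toy[num_cols], function(v) replace(v, v < 0, NA))

# Count the missing values per column
na_counts <- colSums(is.na(toy))
na_counts
```

Recoding column by column, as the report does, is just as valid; the loop form only helps when many columns share the same sentinel value.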

Data Summary

# taking a look at the first rows of the cleaned data
head(cereal)

##                        Name Manufacturer Type Calories Protein Fat Sodium Fiber
## 1                 100% Bran            N Cold       70       4   1    130  10.0
## 2         100% Natural Bran            Q Cold      120       3   5     15   2.0
## 3                  All-Bran            K Cold       70       4   1    260   9.0
## 4 All-Bran with Extra Fiber            K Cold       50       4   0    140  14.0
## 5            Almond Delight            R Cold      110       2   2    200   1.0
## 6   Apple Cinnamon Cheerios            G Cold      110       2   2    180   1.5
##   Carbohydrates Sugar Potassium Vitamins Shelf Weight Cups   Rating
## 1           5.0     6       280       25     3      1 0.33 68.40297
## 2           8.0     8       135        0     3      1 1.00 33.98368
## 3           7.0     5       320       25     3      1 0.33 59.42551
## 4           8.0     0       330       25     3      1 0.50 93.70491
## 5          14.0     8        NA       25     3      1 0.75 34.38484
## 6          10.5    10        70       25     1      1 0.75 29.50954
##   Manufacturer_Name
## 1           Nabisco
## 2       Quaker Oats
## 3           Kellogs
## 4           Kellogs
## 5    Ralston Purina
## 6     General Mills

# summary of all cereal data
summary(cereal)

##                         Name    Manufacturer   Type       Calories    
##  100% Bran                : 1   A: 1         Cold:74   Min.   : 50.0  
##  100% Natural Bran        : 1   G:22         Hot : 3   1st Qu.:100.0  
##  All-Bran                 : 1   K:23                   Median :110.0  
##  All-Bran with Extra Fiber: 1   N: 6                   Mean   :106.9  
##  Almond Delight           : 1   P: 9                   3rd Qu.:110.0  
##  Apple Cinnamon Cheerios  : 1   Q: 8                   Max.   :160.0  
##  (Other)                  :71   R: 8                                  
##     Protein           Fat            Sodium          Fiber       
##  Min.   :1.000   Min.   :0.000   Min.   :  0.0   Min.   : 0.000  
##  1st Qu.:2.000   1st Qu.:0.000   1st Qu.:130.0   1st Qu.: 1.000  
##  Median :3.000   Median :1.000   Median :180.0   Median : 2.000  
##  Mean   :2.545   Mean   :1.013   Mean   :159.7   Mean   : 2.152  
##  3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:210.0   3rd Qu.: 3.000  
##  Max.   :6.000   Max.   :5.000   Max.   :320.0   Max.   :14.000  
##                                                                  
##  Carbohydrates      Sugar          Potassium         Vitamins      Shelf 
##  Min.   : 5.0   Min.   : 0.000   Min.   : 15.00   Min.   :  0.00   1:20  
##  1st Qu.:12.0   1st Qu.: 3.000   1st Qu.: 42.50   1st Qu.: 25.00   2:21  
##  Median :14.5   Median : 7.000   Median : 90.00   Median : 25.00   3:36  
##  Mean   :14.8   Mean   : 7.026   Mean   : 98.67   Mean   : 28.25         
##  3rd Qu.:17.0   3rd Qu.:11.000   3rd Qu.:120.00   3rd Qu.: 25.00         
##  Max.   :23.0   Max.   :15.000   Max.   :330.00   Max.   :100.00         
##  NA's   :1      NA's   :1        NA's   :2                               
##      Weight          Cups           Rating      Manufacturer_Name 
##  Min.   :0.50   Min.   :0.250   Min.   :18.04   Length:77         
##  1st Qu.:1.00   1st Qu.:0.670   1st Qu.:33.17   Class :character  
##  Median :1.00   Median :0.750   Median :40.40   Mode  :character  
##  Mean   :1.03   Mean   :0.821   Mean   :42.67                     
##  3rd Qu.:1.00   3rd Qu.:1.000   3rd Qu.:50.83                     
##  Max.   :1.50   Max.   :1.500   Max.   :93.70                     
##

# structure of dataset
str(cereal)

## 'data.frame':    77 obs. of  17 variables:
##  $ Name             : Factor w/ 77 levels "100% Bran","100% Natural Bran",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Manufacturer     : Factor w/ 7 levels "A","G","K","N",..: 4 6 3 3 7 2 3 2 7 5 ...
##  $ Type             : Factor w/ 2 levels "Cold","Hot": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Calories         : int  70 120 70 50 110 110 110 130 90 90 ...
##  $ Protein          : int  4 3 4 4 2 2 2 3 2 3 ...
##  $ Fat              : int  1 5 1 0 2 2 0 2 1 0 ...
##  $ Sodium           : int  130 15 260 140 200 180 125 210 200 210 ...
##  $ Fiber            : num  10 2 9 14 1 1.5 1 2 4 5 ...
##  $ Carbohydrates    : num  5 8 7 8 14 10.5 11 18 15 13 ...
##  $ Sugar            : int  6 8 5 0 8 10 14 8 6 5 ...
##  $ Potassium        : int  280 135 320 330 NA 70 30 100 125 190 ...
##  $ Vitamins         : int  25 0 25 25 25 25 25 25 25 25 ...
##  $ Shelf            : Factor w/ 3 levels "1","2","3": 3 3 3 3 3 1 2 3 1 3 ...
##  $ Weight           : num  1 1 1 1 1 1 1 1.33 1 1 ...
##  $ Cups             : num  0.33 1 0.33 0.5 0.75 0.75 1 0.75 0.67 0.67 ...
##  $ Rating           : num  68.4 34 59.4 93.7 34.4 ...
##  $ Manufacturer_Name: chr  "Nabisco" "Quaker Oats" "Kellogs" "Kellogs" ...

Summary Statistics for Health Metrics: Calories

# summary across all manufacturers
summaryCals <- summary(cereal$Calories) 
summaryCals

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    50.0   100.0   110.0   106.9   110.0   160.0

# histogram across all manufacturers
calHist <- ggplot(cereal, aes(x=Calories))
bnsize <- diff(range(cereal$Calories))/8
caloriehistogram <- calHist + geom_histogram(binwidth=bnsize, fill="light pink", colour="black", alpha=0.6) + 
ggtitle("Histogram of Calories with Normal Curve") 
# overlay a normal density scaled to counts (density * number of cereals * bin width)
calhistogram <- caloriehistogram + stat_function(fun = function(x){ dnorm(x, mean = mean(cereal$Calories), sd = sd(cereal$Calories)) * length(cereal$Calories) * bnsize })
calhistogram

From the summary of the distribution, we see that the mean number of calories per serving is 106.9 whereas the median is 110.0. Since the mean is less than the median, the distribution appears slightly left skewed. The range of the distribution of calories is 110 (from 50 to 160). From the graph, we can see that the distribution of calories for all 77 cereals is roughly bell-shaped. The overlaid curve is a normal distribution with the same mean and standard deviation as the data, scaled to the histogram's counts, so it shows what the data would look like if they were exactly normal. The curve does not map the data perfectly: the first and third quartiles are 100 and 110, so the middle half of the cereals is packed into a much narrower band than a normal distribution would predict.
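
The normal curve in the histogram is multiplied by the number of observations and the bin width so that it sits on the same scale as the bar counts. The sketch below illustrates why that scaling works; the vector `x` is simulated for illustration only and is not the cereal data.

```r
# Hypothetical calorie-like values, used only to illustrate the scaling
set.seed(1)
x <- round(rnorm(77, mean = 107, sd = 19))

binwidth <- diff(range(x)) / 8
n <- length(x)

# dnorm() integrates to 1, so dnorm() * n * binwidth integrates to n * binwidth,
# which equals the total area of the histogram bars (count times bar width)
scaled_curve <- function(v) dnorm(v, mean = mean(x), sd = sd(x)) * n * binwidth

area_curve <- integrate(scaled_curve,
                        lower = mean(x) - 8 * sd(x),
                        upper = mean(x) + 8 * sd(x))$value
area_bars <- n * binwidth

c(curve = area_curve, bars = area_bars)
```

The two areas agree up to numerical integration error, which is what lets the curve and the bars be compared directly on a count axis.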

Summary Statistics for Health Metrics: Sugar

# summary across all manufacturers
summarySugar <- summary(cereal$Sugar) 
summarySugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   3.000   7.000   7.026  11.000  15.000       1

# histogram across all manufacturers
sugHist <- ggplot(cereal, aes(x=Sugar))
sugarhistogram <- sugHist + geom_histogram(binwidth=2, fill="light blue", colour="black", alpha=0.6) + 
ggtitle("Histogram of Sugar Distribution with Normal Curve") 
# overlay a normal density scaled to counts; only the non-missing sugar values are counted
sughistogram <- sugarhistogram + stat_function(fun = function(x){ dnorm(x, mean = mean(cereal$Sugar, na.rm = TRUE), sd = sd(cereal$Sugar, na.rm = TRUE)) * sum(!is.na(cereal$Sugar)) * 2 })
sughistogram

From the summary of the distribution, we see that the mean number of grams of sugar per serving is 7.026 whereas the median is 7. The mean is only marginally greater than the median, so by this measure the distribution shows almost no skew. The range of the distribution of sugar is 15 (from 0 to 15). From the graph, we can see that the distribution of sugar for the 76 cereals with a recorded value is not very normal. The data probably do not meet the normality assumption, so the results below should be read with some caution; the analysis will still be performed.

Correlation Analysis

# taking the subset of data relevant to analysis
cereal_healthMetrics <- subset(cereal, select=c(Calories, Protein, Fat, Carbohydrates, Vitamins, Sugar))

# creating the scatterplot and correlation matrix
pairs.panels(cereal_healthMetrics[,1:6],
             method = "pearson", #correlation method
             hist.col = "red",
             main="Nutritional Scatterplot and Correlation Matrix",
             density = TRUE, # show density plots
             ellipses = TRUE, # show correlation ellipses
             lm=TRUE #linear regression fits 
             )

The plots above show the pairwise correlations between the nutritional factors. The strongest correlation is a positive 0.57 correlation between calories and sugar, which are the two factors included in this analysis. For reference, a perfect linear relationship would have a Pearson correlation of 1 or -1, depending on its direction, so 0.57 indicates a moderate positive relationship.
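
As a small illustration of how a Pearson correlation behaves, the sketch below uses made-up vectors (`calories` and `sugar` here are hypothetical values, not the dataset). It shows the effect of a missing value and the two extreme cases of a perfect linear relationship.

```r
# Hypothetical values; the NA mimics the cereal with no recorded sugar
calories <- c(70, 120, 110, 50, 110, 130)
sugar    <- c(6, 8, NA, 0, 10, 8)

# cor() returns NA if any value is missing, unless incomplete pairs are dropped
r_default  <- cor(calories, sugar)
r_complete <- cor(calories, sugar, use = "complete.obs", method = "pearson")

# A perfectly linear relationship gives 1, or -1 when the slope is negative
r_pos <- cor(1:10, 2 * (1:10) + 3)
r_neg <- cor(1:10, -2 * (1:10) + 3)

c(default = r_default, complete = r_complete, positive = r_pos, negative = r_neg)
```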

Methods: Data Visualization & Analysis of Variance

Pie Chart Analysis: Proportion of Cereals on Each Shelf

### Proportion of Cereals on Each Shelf

# calculating total number of cereals on each shelf
shelf1 <- cereal[cereal$Shelf == 1,]
num_shelf1 <- nrow(shelf1) 
num_shelf1 # 20

shelf2 <- cereal[cereal$Shelf == 2,]
num_shelf2 <- nrow(shelf2) 
num_shelf2 # 21

shelf3 <- cereal[cereal$Shelf == 3,]
num_shelf3 <- nrow(shelf3) 
num_shelf3 # 36

# calculating total number of cereals
total <- num_shelf1 + num_shelf2 + num_shelf3
total # 77

# proportion of cereals on each shelf
prop_shelf1 <- num_shelf1 / total 
prop_shelf1 # 0.2597

prop_shelf2 <- num_shelf2 / total
prop_shelf2 # 0.2727

prop_shelf3 <- num_shelf3 / total 
prop_shelf3 # 0.4675

# creating data frame of quantities and proportion on each shelf
num_shelves <- c(num_shelf1, num_shelf2, num_shelf3)
percentage <- c(round(prop_shelf1,4)*100, round(prop_shelf2,4)*100, round(prop_shelf3,4)*100)

shelf_quantities <- as.data.frame(cbind(sort(unique(cereal$Shelf)), num_shelves, percentage))
colnames(shelf_quantities) <- c("Shelf", "Quantity", "Percentage")

# creating labels for pie chart
lbl<-paste("Shelf ", shelf_quantities$Shelf," (", shelf_quantities$Quantity, ") ", shelf_quantities$Percentage,"%",sep="")

# creating pie chart
pie(shelf_quantities$Quantity,
    labels = lbl,
    cex=0.8,
    col=rainbow(length(shelf_quantities$Quantity)),
    main="Proportion of Cereals on Each Shelf")

I wanted to create this visualization to show that the proportion of cereals on each shelf is not equal. We can see that shelf 3 has 36 cereals placed on it, making up almost half of all the cereals (46.75%). Shelf 2 follows with 21 cereals, making up 27.27% of the total and then shelf 1 has 20 cereals or 25.97% of the cereals. These proportions will serve as the benchmark.
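
The same counts and percentages can also be obtained in two calls with `table()` and `prop.table()`. The sketch below rebuilds a stand-in `shelf` vector from the counts reported above (20, 21, and 36) so that it runs without the CSV file; with the real data, `cereal$Shelf` would take its place.

```r
# Stand-in for cereal$Shelf, rebuilt from the reported counts per shelf
shelf <- factor(rep(1:3, times = c(20, 21, 36)))

shelf_counts <- table(shelf)                           # cereals per shelf
shelf_pct <- round(100 * prop.table(shelf_counts), 2)  # percentage per shelf

data.frame(Shelf = names(shelf_counts),
           Quantity = as.vector(shelf_counts),
           Percentage = as.vector(shelf_pct))
```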

Pie Chart Analysis: Proportion of Caloric Cereals on Each Shelf

### Proportion of Caloric Cereals on Each Shelf
### * a caloric cereal is defined as one whose calorie content is above the third quartile

summaryCals[5] # 3rd quartile = 110 calories per serving

high_caloric_content <- cereal[cereal$Calories > 110,]

shelf1_cal <- high_caloric_content[high_caloric_content$Shelf == 1,]
num_shelf1_cal <- nrow(shelf1_cal) 
num_shelf1_cal # 0

shelf2_cal <- high_caloric_content[high_caloric_content$Shelf == 2,]
num_shelf2_cal <- nrow(shelf2_cal) 
num_shelf2_cal # 5

shelf3_cal <- high_caloric_content[high_caloric_content$Shelf == 3,]
num_shelf3_cal <- nrow(shelf3_cal) 
num_shelf3_cal # 13

# calculating total number of cereals
total_cal <- num_shelf1_cal + num_shelf2_cal + num_shelf3_cal
total_cal # 18

# proportion of cereals on each shelf
prop_shelf1_cal <- num_shelf1_cal / total_cal
prop_shelf1_cal # 0

prop_shelf2_cal <- num_shelf2_cal / total_cal
prop_shelf2_cal # 0.2778

prop_shelf3_cal <- num_shelf3_cal / total_cal
prop_shelf3_cal # 0.7222

# creating data frame of quantities and proportion on each shelf
num_shelves_cal <- c(num_shelf1_cal, num_shelf2_cal, num_shelf3_cal)
percentage_cal <- c(round(prop_shelf1_cal,4)*100, round(prop_shelf2_cal,4)*100, round(prop_shelf3_cal,4)*100)

shelf_quantities_cal <- as.data.frame(cbind(sort(unique(cereal$Shelf)), num_shelves_cal, percentage_cal))
colnames(shelf_quantities_cal) <- c("Shelf", "Quantity", "Percentage")

# creating labels for pie chart
lbl<-paste("Shelf ", shelf_quantities_cal$Shelf," (", shelf_quantities_cal$Quantity, ") ", shelf_quantities_cal$Percentage,"%",sep="")

# creating pie chart
pie(shelf_quantities_cal$Quantity,
    labels = lbl,
    cex=0.8,
    col=rainbow(length(shelf_quantities_cal$Quantity)),
    main="Proportion of Caloric Cereals on Each Shelf")

This pie chart shows the proportion of caloric cereals on each shelf. Shelf 3 holds the majority of the caloric cereals, with 13 cereals or 72.22%. Shelf 2 follows with 5 cereals, making up 27.78% of the total, and shelf 1 has none. Most of the caloric cereals are therefore on the top shelf, which holds a larger share of them than its 46.75% benchmark share of all cereals.
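
Because the shelves do not hold equal numbers of cereals, it helps to compare each shelf's share of caloric cereals with its benchmark share of all cereals. A minimal sketch of that arithmetic, using only the counts reported above:

```r
all_counts <- c(shelf1 = 20, shelf2 = 21, shelf3 = 36)  # all cereals
cal_counts <- c(shelf1 = 0,  shelf2 = 5,  shelf3 = 13)  # cereals above 110 calories

benchmark_pct <- 100 * all_counts / sum(all_counts)
caloric_pct   <- 100 * cal_counts / sum(cal_counts)

# Positive values mean the shelf holds more than its benchmark share
share_gap <- round(caloric_pct - benchmark_pct, 2)
share_gap
```

The gap is in percentage points; it is a descriptive comparison only and does not by itself test whether the difference is statistically significant.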

Pie Chart Analysis: Proportion of Sugary Cereals on Each Shelf

### Proportion of Sugary Cereals on Each Shelf
### * a sugary cereal is defined as one whose sugar content is above the third quartile

summarySugar[5] # 3rd quartile = 11g of sugar

# which() drops the cereal with missing sugar; plain logical indexing would keep it
# as an all-NA row and add one phantom cereal to every shelf count below
high_sugar_content <- cereal[which(cereal$Sugar > 11),]

shelf1_sug <- high_sugar_content[high_sugar_content$Shelf == 1,]
num_shelf1_sug <- nrow(shelf1_sug) 
num_shelf1_sug # 1

shelf2_sug <- high_sugar_content[high_sugar_content$Shelf == 2,]
num_shelf2_sug <- nrow(shelf2_sug) 
num_shelf2_sug # 11

shelf3_sug <- high_sugar_content[high_sugar_content$Shelf == 3,]
num_shelf3_sug <- nrow(shelf3_sug) 
num_shelf3_sug # 4

# calculating total number of cereals
total_sug <- num_shelf1_sug + num_shelf2_sug + num_shelf3_sug
total_sug # 16

# proportion of cereals on each shelf
prop_shelf1_sug <- num_shelf1_sug / total_sug
prop_shelf1_sug # 0.0625

prop_shelf2_sug <- num_shelf2_sug / total_sug
prop_shelf2_sug # 0.6875

prop_shelf3_sug <- num_shelf3_sug / total_sug
prop_shelf3_sug # 0.25

# creating data frame of quantities and proportion on each shelf
num_shelves_sug <- c(num_shelf1_sug, num_shelf2_sug, num_shelf3_sug)
percentage_sug <- c(round(prop_shelf1_sug,4)*100, round(prop_shelf2_sug,4)*100, round(prop_shelf3_sug,4)*100)

shelf_quantities_sug <- as.data.frame(cbind(sort(unique(cereal$Shelf)), num_shelves_sug, percentage_sug))
colnames(shelf_quantities_sug) <- c("Shelf", "Quantity", "Percentage")

# creating labels for pie chart
lbl<-paste("Shelf ", shelf_quantities_sug$Shelf," (", shelf_quantities_sug$Quantity, ") ", shelf_quantities_sug$Percentage,"%",sep="")

# creating pie chart
pie(shelf_quantities_sug$Quantity,
    labels = lbl,
    cex=0.8,
    col=rainbow(length(shelf_quantities_sug$Quantity)),
    main="Proportion of Sugary Cereals on Each Shelf")

This pie chart shows the proportion of sugary cereals on each shelf. Shelf 2 holds the majority of the sugary cereals, with 11 cereals or 68.75%. Shelf 3 follows with 4 cereals, making up 25% of the total, and shelf 1 has 1 cereal, or 6.25%. Compared with the benchmark, in which shelf 2 holds 27.27% of all cereals, the sugary cereals are heavily concentrated on the middle shelf.
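
Because Sugar has one missing value, a comparison such as `Sugar > 11` evaluates to NA for that cereal, and base R handles NA in a logical row index in a way that is easy to overlook. A minimal sketch on made-up data (the `toy` data frame is hypothetical):

```r
# Hypothetical toy data with one missing sugar value
toy <- data.frame(
  Sugar = c(12, 3, NA, 15),
  Shelf = c(2, 1, 1, 2)
)

# Logical subsetting keeps one all-NA row for every NA in the condition
with_na <- toy[toy$Sugar > 11, ]
nrow(with_na)      # two real rows plus one all-NA row

# which() returns only the positions that are TRUE, so NA rows are dropped
without_na <- toy[which(toy$Sugar > 11), ]
nrow(without_na)   # only the two real rows
```

Counting rows with `nrow()` after a logical subset therefore overstates the count by one for each missing value, which is why `which()` (or an explicit `!is.na()` condition) is the safer choice when counting.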

Pie Chart Analysis: Side by Side Comparison

All Cereals

pie(shelf_quantities$Quantity,
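    labels = paste("Shelf ", shelf_quantities$Shelf," (", shelf_quantities$Quantity, ") ", shelf_quantities$Percentage,"%",sep=""), # lbl was last overwritten with the sugar labels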
    cex=0.8,
    col=rainbow(length(shelf_quantities$Quantity)),
    main="All Cereals")

Caloric Cereals

pie(shelf_quantities_cal$Quantity,
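    labels = paste("Shelf ", shelf_quantities_cal$Shelf," (", shelf_quantities_cal$Quantity, ") ", shelf_quantities_cal$Percentage,"%",sep=""), # lbl was last overwritten with the sugar labels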
    cex=0.8,
    col=rainbow(length(shelf_quantities_cal$Quantity)),
    main="Caloric Cereals")

Sugary Cereals

pie(shelf_quantities_sug$Quantity,
    labels = lbl,
    cex=0.8,
    col=rainbow(length(shelf_quantities_sug$Quantity)),
    main="Sugary Cereals")

These are the same graphs as above, but showing them side by side makes the differences in the percentages on each shelf easier to see.

Dotchart Visualizations

# setup
grps <- factor(cereal$Shelf)
cols <- c("#FF0000", "#FF6600", "#CC00CC")

# calories and shelf
dotchart(cereal$Calories,
         groups = grps, color = cols[grps], gcolor = cols,  cex = 1.5,  pch = 18, xlab = "Calories",
         ylab = "Cereals (grouped by shelf)", main = "Relationship Between Calories and Shelf")

# sugars and shelf
dotchart(cereal$Sugar,
         groups = grps, color = cols[grps], gcolor = cols,  cex = 1.5,  pch = 18, xlab = "Sugar",
         ylab = "Cereals (grouped by shelf)", main = "Relationship Between Sugar and Shelf")

The dotcharts give a different view of the shelf levels and their relationship to calories and sugar, respectively. The x-axis shows the calorie or sugar amount, whereas the y-axis lists the individual cereals, grouped by shelf. The names of the cereals are omitted to give a cleaner look. Each point on the graph represents one cereal. The cereals are colored by shelf and are located within one of the three distinct banded regions that correspond to the shelf numbers shown on the y-axis. Although the number of cereals on each shelf is not the same, the visualization shows that the most sugary cereals (> 11 grams) are most highly concentrated on shelf 2 whereas the most caloric cereals (> 110 calories) are most highly concentrated on shelf 3.

Boxplot Visualizations

# Calories
# note: fun replaces the deprecated fun.y argument (ggplot2 3.3.0 and later)
bplot <- ggplot(cereal, aes(x=Shelf, y=Calories))
boxplot_cal <- bplot +
  geom_boxplot(outlier.color = "purple", outlier.size = 2) +
  stat_summary(fun=mean, color = "orange", geom="point", size = 3) +
  ggtitle("Boxplot of Calories Distribution\nfor Each Shelf") +
  labs(x="Shelf", y="Calories")
boxplot_cal

# Sugar
bplot <- ggplot(cereal, aes(x=Shelf, y=Sugar))
boxplot_sug <- bplot +
  geom_boxplot(outlier.color = "purple", outlier.size = 2) +
  stat_summary(fun=mean, color = "orange", geom="point", size = 3) +
  ggtitle("Boxplot of Sugar Distribution\nfor Each Shelf") +
  labs(x="Shelf", y="Sugar (grams)")
boxplot_sug

These graphs show the distribution of calories and sugar, respectively, on each shelf. The orange points represent the means for each shelf whereas the purple dots represent outliers. When looking at the calorie distribution, we see that the medians for shelves 2 and 3 are the same. However, shelf 2 does not show a box because its first and third quartiles are both equal to the median of 110 calories. The boxplots for sugar tell a slightly different story. The median for shelf 2 is noticeably higher than the medians for shelves 1 and 3, and the mean is also higher for shelf 2 than for shelves 1 and 3. Whether these differences are statistically significant is tested in the next section.

Analysis of Variance

anova_cal <- aov(Calories ~ Shelf, data = cereal)
summary(anova_cal)

##             Df Sum Sq Mean Sq F value Pr(>F)
## Shelf        2    559   279.7   0.732  0.485
## Residuals   74  28292   382.3

anova_sug <- aov(Sugar ~ Shelf, data = cereal)
summary(anova_sug)

##             Df Sum Sq Mean Sq F value  Pr(>F)   
## Shelf        2  220.2  110.12   6.601 0.00232 **
## Residuals   73 1217.7   16.68                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness

Analysis of Variance (ANOVA) is a test used to evaluate whether the means of several groups differ. I am running two separate ANOVA tests (one for calories and another for sugar) to see whether either one suggests a difference in mean calories or mean sugar among the shelves. The null hypothesis for ANOVA is that the mean is the same for all groups (all three shelves), whereas the alternative hypothesis is that at least one group mean differs from the others. The p-value indicates whether the results are statistically significant. If the p-value is greater than our significance level of 0.05, we fail to reject the null hypothesis. If it is less than 0.05, we reject the null hypothesis and conclude that the mean is not the same for all groups. In that case, at least two groups differ from each other, so we would perform a Tukey test to see which groups are significantly different.
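
To make the link between the ANOVA table and the decision rule concrete, the sketch below recomputes the F value and p-value for sugar from the sums of squares and degrees of freedom printed above. It uses only numbers copied from that table, so small differences from the printed values are due to rounding.

```r
# Values copied from the Sugar ANOVA table
ss_between <- 220.2;  df_between <- 2    # Shelf row
ss_within  <- 1217.7; df_within  <- 73   # Residuals row

ms_between <- ss_between / df_between    # mean square between shelves
ms_within  <- ss_within / df_within      # mean square within shelves
f_value    <- ms_between / ms_within

# p-value: upper-tail probability of the F distribution
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

alpha <- 0.05
c(F = f_value, p = p_value)
p_value < alpha   # TRUE means we reject the null hypothesis
```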

Results

Calories: Since the p-value of 0.485 is greater than our significance level of 0.05, we fail to reject the null hypothesis. There is not enough evidence to conclude that the mean calories per serving differ across the three shelves.

Sugar: Since the p-value of 0.00232 is less than our significance level of 0.05, we reject the null hypothesis. We conclude that the mean sugar content is not the same across all three shelves. We will proceed with a Tukey test to see which shelves have significantly different means.

Tukey Test for Sugar

TukeyTest <- TukeyHSD(anova_sug)
TukeyTest

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Sugar ~ Shelf, data = cereal)
## 
## $Shelf
##          diff       lwr        upr     p adj
## 2-1  4.513784  1.419965  7.6076039 0.0023434
## 3-1  1.422515 -1.348282  4.1933116 0.4404910
## 3-2 -3.091270 -5.774315 -0.4082249 0.0199413

plot(TukeyTest)

From the results, we can see that shelves one and two have significantly different means since the adjusted p-value (0.0023) is less than 0.05. Additionally, shelves two and three have significantly different means since the adjusted p-value (0.0199) is less than 0.05. Shelves one and three are not significantly different (adjusted p-value of 0.4405); this is depicted in the graph, where the confidence interval for that pair crosses the dotted line at zero. As a result, we can say that shelf two is significantly different from the other two shelves, with a mean sugar content about 4.5 grams higher than shelf one and about 3.1 grams higher than shelf three.
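
The two ways of reading the Tukey output, the adjusted p-value and whether the confidence interval contains zero, always agree at the 95% level. The sketch below checks this using the values copied from the output above (the data frame `tukey` is just a hand-typed copy of that table):

```r
# Values copied from the TukeyHSD output for Sugar
tukey <- data.frame(
  pair  = c("2-1", "3-1", "3-2"),
  diff  = c(4.513784, 1.422515, -3.091270),
  lwr   = c(1.419965, -1.348282, -5.774315),
  upr   = c(7.6076039, 4.1933116, -0.4082249),
  p_adj = c(0.0023434, 0.4404910, 0.0199413)
)

# An interval that lies entirely above or entirely below zero excludes zero
tukey$excludes_zero <- tukey$lwr > 0 | tukey$upr < 0
tukey$significant   <- tukey$p_adj < 0.05

tukey[, c("pair", "excludes_zero", "significant")]
```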

Conclusions

In conclusion, there is statistically significant evidence that shelf two holds more sugary cereals. This supports our initial hypothesis that less healthy cereals are placed on shelves that are at eye level, if we assume that more sugar in cereals means they are less healthy. This result was consistent across all of the visualizations and analyses performed. From the pie charts, we could see that the majority of sugary cereals were on shelf 2, and the dotchart showed the same result. The boxplot showed that shelf 2 had the highest mean sugar per serving. Lastly, the ANOVA test showed that the mean sugar content was not the same across the three shelves, which was further supported by the Tukey test showing that shelf 2 specifically was statistically different from the other two shelves. There was not enough evidence to conclude that calories differed among the three shelves. This was suggested by the boxplots and most concretely shown by the ANOVA test, which found no statistically significant difference in mean calories across the shelves.

Future Work

Regarding future work, more data would be needed from other grocery stores to see if the results hold. Additionally, if a store has more than 3 shelves, it would be interesting to see where the most sugary cereals are placed. Would the placement change from shelf 2 to shelf 3? One limitation of this study is that the sample was relatively small; a larger pool of data would provide more reliable results. Additionally, I would like to look into other nutritional data besides calories and sugar in a future analysis.