Do less healthy cereals tend to be displayed on the upper, lower, or middle shelf?
Note: To evaluate relative healthiness of cereals, we will look at calories and sugar specifically. There will be two separate analyses performed.
This dataset contains nutrition information about 77 cereals from 7 different manufacturers. The data displays the measurements of various items you would read on a nutrition label as well as information about serving sizes and the cereal’s overall rating. This dataset might specifically appeal to people who eat cereal regularly and want to compare different types of cereal to find the healthiest option.
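The fields in this dataset include the cereal name, manufacturer, type (hot or cold), calories, protein, fat, sodium, fiber, carbohydrates, sugar, potassium, vitamins, shelf, weight, cups, and rating.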
The names of the cereals comprise the rows of the dataset whereas all of the other variables comprise the columns.
Research has shown that more sugary cereals tend to be placed on the lower shelves for two reasons. First, the lower shelves are taller and deeper and can hold the larger boxes that the best-selling brands tend to manufacture. Second, younger kids walk by the lower shelves and are attracted to these cereals. They then ask their parents to purchase their favorite Lucky Charms or Froot Loops boxes. The adult cereals tend to be on the higher shelves and do not sell out as frequently. The less healthy cereals are shelved at eye level, which typically means shelf 2.
Source: https://www.deseret.com/1990/5/8/18860646/top-shelve-those-sugary-cereals
# Set working directory
# setwd('C:/Users/student/Documents/DS4001/PracticeOfDataScience')
# Read in dataset
cereal <- read.csv("cereal.csv")
# Load libraries
library(ggplot2)
library(psych)
library(dplyr)
# Renaming the columns
colnames(cereal) <- c("Name", "Manufacturer", "Type", "Calories", "Protein", "Fat", "Sodium", "Fiber", "Carbohydrates", "Sugar", "Potassium", "Vitamins", "Shelf", "Weight", "Cups", "Rating")
# Renaming the manufacturers
cereal$Manufacturer_Name <- cereal$Manufacturer
cereal$Manufacturer_Name <- gsub(pattern = "P", replacement = "Post", x = cereal$Manufacturer_Name)
cereal$Manufacturer_Name <- gsub(pattern = "A", replacement = "American Home Food Products", x = cereal$Manufacturer_Name)
cereal$Manufacturer_Name <- gsub(pattern = "G", replacement = "General Mills", x = cereal$Manufacturer_Name)
cereal$Manufacturer_Name <- gsub(pattern = "K", replacement = "Kellogs", x = cereal$Manufacturer_Name)
cereal$Manufacturer_Name <- gsub(pattern = "N", replacement = "Nabisco", x = cereal$Manufacturer_Name)
cereal$Manufacturer_Name <- gsub(pattern = "Q", replacement = "Quaker Oats", x = cereal$Manufacturer_Name)
cereal$Manufacturer_Name <- gsub(pattern = "R", replacement = "Ralston Purina", x = cereal$Manufacturer_Name)
# Recoding negative values as NA
cereal$Carbohydrates[cereal$Carbohydrates < 0] <- NA
cereal$Sugar[cereal$Sugar < 0] <- NA
cereal$Potassium[cereal$Potassium < 0] <- NA
# Recode cereal types
cereal$Type <- gsub("H", "Hot", x = cereal$Type)
cereal$Type <- gsub("C", "Cold", x = cereal$Type)
# Convert cereal type, shelf, and manufacturer to factors
cereal$Type <- factor(cereal$Type)
cereal$Shelf <- factor(cereal$Shelf)
cereal$Manufacturer <- factor(cereal$Manufacturer)
# taking a look at the raw data
head(cereal)
## Name Manufacturer Type Calories Protein Fat Sodium Fiber
## 1 100% Bran N Cold 70 4 1 130 10.0
## 2 100% Natural Bran Q Cold 120 3 5 15 2.0
## 3 All-Bran K Cold 70 4 1 260 9.0
## 4 All-Bran with Extra Fiber K Cold 50 4 0 140 14.0
## 5 Almond Delight R Cold 110 2 2 200 1.0
## 6 Apple Cinnamon Cheerios G Cold 110 2 2 180 1.5
## Carbohydrates Sugar Potassium Vitamins Shelf Weight Cups Rating
## 1 5.0 6 280 25 3 1 0.33 68.40297
## 2 8.0 8 135 0 3 1 1.00 33.98368
## 3 7.0 5 320 25 3 1 0.33 59.42551
## 4 8.0 0 330 25 3 1 0.50 93.70491
## 5 14.0 8 NA 25 3 1 0.75 34.38484
## 6 10.5 10 70 25 1 1 0.75 29.50954
## Manufacturer_Name
## 1 Nabisco
## 2 Quaker Oats
## 3 Kellogs
## 4 Kellogs
## 5 Ralston Purina
## 6 General Mills
# summary of all cereal data
summary(cereal)
## Name Manufacturer Type Calories
## 100% Bran : 1 A: 1 Cold:74 Min. : 50.0
## 100% Natural Bran : 1 G:22 Hot : 3 1st Qu.:100.0
## All-Bran : 1 K:23 Median :110.0
## All-Bran with Extra Fiber: 1 N: 6 Mean :106.9
## Almond Delight : 1 P: 9 3rd Qu.:110.0
## Apple Cinnamon Cheerios : 1 Q: 8 Max. :160.0
## (Other) :71 R: 8
## Protein Fat Sodium Fiber
## Min. :1.000 Min. :0.000 Min. : 0.0 Min. : 0.000
## 1st Qu.:2.000 1st Qu.:0.000 1st Qu.:130.0 1st Qu.: 1.000
## Median :3.000 Median :1.000 Median :180.0 Median : 2.000
## Mean :2.545 Mean :1.013 Mean :159.7 Mean : 2.152
## 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:210.0 3rd Qu.: 3.000
## Max. :6.000 Max. :5.000 Max. :320.0 Max. :14.000
##
## Carbohydrates Sugar Potassium Vitamins Shelf
## Min. : 5.0 Min. : 0.000 Min. : 15.00 Min. : 0.00 1:20
## 1st Qu.:12.0 1st Qu.: 3.000 1st Qu.: 42.50 1st Qu.: 25.00 2:21
## Median :14.5 Median : 7.000 Median : 90.00 Median : 25.00 3:36
## Mean :14.8 Mean : 7.026 Mean : 98.67 Mean : 28.25
## 3rd Qu.:17.0 3rd Qu.:11.000 3rd Qu.:120.00 3rd Qu.: 25.00
## Max. :23.0 Max. :15.000 Max. :330.00 Max. :100.00
## NA's :1 NA's :1 NA's :2
## Weight Cups Rating Manufacturer_Name
## Min. :0.50 Min. :0.250 Min. :18.04 Length:77
## 1st Qu.:1.00 1st Qu.:0.670 1st Qu.:33.17 Class :character
## Median :1.00 Median :0.750 Median :40.40 Mode :character
## Mean :1.03 Mean :0.821 Mean :42.67
## 3rd Qu.:1.00 3rd Qu.:1.000 3rd Qu.:50.83
## Max. :1.50 Max. :1.500 Max. :93.70
##
# structure of dataset
str(cereal)
## 'data.frame': 77 obs. of 17 variables:
## $ Name : Factor w/ 77 levels "100% Bran","100% Natural Bran",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Manufacturer : Factor w/ 7 levels "A","G","K","N",..: 4 6 3 3 7 2 3 2 7 5 ...
## $ Type : Factor w/ 2 levels "Cold","Hot": 1 1 1 1 1 1 1 1 1 1 ...
## $ Calories : int 70 120 70 50 110 110 110 130 90 90 ...
## $ Protein : int 4 3 4 4 2 2 2 3 2 3 ...
## $ Fat : int 1 5 1 0 2 2 0 2 1 0 ...
## $ Sodium : int 130 15 260 140 200 180 125 210 200 210 ...
## $ Fiber : num 10 2 9 14 1 1.5 1 2 4 5 ...
## $ Carbohydrates : num 5 8 7 8 14 10.5 11 18 15 13 ...
## $ Sugar : int 6 8 5 0 8 10 14 8 6 5 ...
## $ Potassium : int 280 135 320 330 NA 70 30 100 125 190 ...
## $ Vitamins : int 25 0 25 25 25 25 25 25 25 25 ...
## $ Shelf : Factor w/ 3 levels "1","2","3": 3 3 3 3 3 1 2 3 1 3 ...
## $ Weight : num 1 1 1 1 1 1 1 1.33 1 1 ...
## $ Cups : num 0.33 1 0.33 0.5 0.75 0.75 1 0.75 0.67 0.67 ...
## $ Rating : num 68.4 34 59.4 93.7 34.4 ...
## $ Manufacturer_Name: chr "Nabisco" "Quaker Oats" "Kellogs" "Kellogs" ...
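The chained gsub() calls used to build Manufacturer_Name give the right names only because no replacement string contains a capital letter that a later pattern matches, so the result depends on the order of the calls. A minimal sketch of an order-independent alternative is a named lookup vector. The names `manufacturer_lookup`, `codes`, and `manufacturer_name` are illustrative and not part of the original analysis, and `codes` is a toy stand-in for cereal$Manufacturer.

```r
# Hypothetical lookup table: one-letter code -> full manufacturer name
# (spellings follow the names used in this report)
manufacturer_lookup <- c(
  A = "American Home Food Products",
  G = "General Mills",
  K = "Kellogs",
  N = "Nabisco",
  P = "Post",
  Q = "Quaker Oats",
  R = "Ralston Purina"
)

# Toy input standing in for cereal$Manufacturer
codes <- c("N", "Q", "K", "K", "R", "G")

# Index the lookup by code. as.character() guards against factor input,
# which would otherwise index by integer level; unname() drops the code labels.
manufacturer_name <- unname(manufacturer_lookup[as.character(codes)])
print(manufacturer_name)
```

Because each code is looked up exactly once, the entries can be listed in any order without one replacement affecting another.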
# summary across all manufacturers
summaryCals <- summary(cereal$Calories)
summaryCals
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 50.0 100.0 110.0 106.9 110.0 160.0
# histogram across all manufacturers
calHist <- ggplot(cereal, aes(x=Calories))
bnsize <- diff(range(cereal$Calories))/8
caloriehistogram <- calHist + geom_histogram(binwidth=bnsize, fill="light pink", colour="black", alpha=0.6) +
ggtitle("Histogram of Calories with Normal Curve")
# normal density scaled to counts: density * number of cereals * bin width
calhistogram <- caloriehistogram + stat_function(fun = function(x) dnorm(x, mean = mean(cereal$Calories), sd = sd(cereal$Calories)) * length(cereal$Calories) * bnsize)
calhistogram
From the summary of the distribution, we see that the mean number of calories per serving is 106.9 whereas the median is 110.0. Since the mean is less than the median, the distribution is slightly left skewed. The range of calories is 110 (from 50 to 160). From the graph, we can see that the distribution of calories for all 77 cereals is roughly bell-shaped. The overlaid curve is a normal distribution with the same mean and standard deviation as the data, scaled to the histogram counts, so it shows what the histogram would look like if calories were exactly normal. The curve does not map the data perfectly, and a larger sample of cereals might or might not follow it more closely.
# summary across all manufacturers
summarySugar <- summary(cereal$Sugar)
summarySugar
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 3.000 7.000 7.026 11.000 15.000 1
# histogram across all manufacturers
sugHist <- ggplot(cereal, aes(x=Sugar))
sugarhistogram <- sugHist + geom_histogram(binwidth=2, fill="light blue", colour="black", alpha=0.6) +
ggtitle("Histogram of Sugar Distribution with Normal Curve")
# normal density scaled to counts; n counts only the cereals with non-missing sugar
sughistogram <- sugarhistogram + stat_function(fun = function(x) dnorm(x, mean = mean(cereal$Sugar, na.rm = TRUE), sd = sd(cereal$Sugar, na.rm = TRUE)) * sum(!is.na(cereal$Sugar)) * 2)
sughistogram
From the summary of the distribution, we see that the mean number of grams of sugar per serving is 7.026 whereas the median is 7. The mean is only marginally greater than the median, so the summary statistics suggest at most a very slight right skew. The range of the distribution of sugar is 15 (from 0 to 15). From the graph, we can see that the distribution of sugar for the 76 cereals with a recorded sugar value does not look normal. This distribution probably does not meet the normality assumption; however, the analysis will still be performed.
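The normality judgments above are made by eye. As a hedged sketch of a more formal check, base R's shapiro.test() can be applied to a numeric vector. The two vectors below are deterministic toy samples (evenly spaced quantiles of a normal and of an exponential distribution), not the cereal data; with the dataset loaded, cereal$Calories or cereal$Sugar would take their place.

```r
# Deterministic toy samples of size 76 (not the cereal data)
p <- ppoints(76)
normal_like <- qnorm(p)  # quantiles of a standard normal distribution
skewed      <- qexp(p)   # quantiles of a right-skewed exponential distribution

# Shapiro-Wilk test: the null hypothesis is that the sample is normal
sw_normal <- shapiro.test(normal_like)
sw_skewed <- shapiro.test(skewed)

print(sw_normal$p.value)  # large p-value: no evidence against normality
print(sw_skewed$p.value)  # small p-value: normality is rejected
```

A small p-value signals a departure from normality. shapiro.test() drops missing values itself, so the NA in Sugar would not need special handling.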
# taking the subset of data relevant to analysis
cereal_healthMetrics <- subset(cereal, select=c(Calories, Protein, Fat, Carbohydrates, Vitamins, Sugar))
# creating the scatterplot and correlation matrix
pairs.panels(cereal_healthMetrics[,1:6],
method = "pearson", #correlation method
hist.col = "red",
main="Nutritional Scatterplot and Correlation Matrix",
density = TRUE, # show density plots
ellipses = TRUE, # show correlation ellipses
lm=TRUE #linear regression fits
)
The plots above show the pairwise correlations between various nutritional factors. The strongest correlation is a positive 0.57 correlation between calories and sugar, which are the two factors included in this analysis. A perfect linear relationship would have a Pearson correlation of 1 or -1, depending on the direction of the relationship.
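To make the Pearson coefficient concrete, the sketch below computes it directly from its definition and compares the result with base R's cor(). The helper name `pearson_r` is illustrative, and the two vectors hold only the calories and sugar values of the six cereals printed by head(cereal), so the result is not the 0.57 reported for the full dataset.

```r
# Pearson correlation from its definition:
# sum of co-deviations divided by the product of the root sums of squares
pearson_r <- function(x, y) {
  keep <- complete.cases(x, y)  # drop pairs with a missing value
  x <- x[keep]
  y <- y[keep]
  sum((x - mean(x)) * (y - mean(y))) /
    sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
}

# Calories and sugar for the six cereals shown by head(cereal)
calories <- c(70, 120, 70, 50, 110, 110)
sugar    <- c(6, 8, 5, 0, 8, 10)

r_manual  <- pearson_r(calories, sugar)
r_builtin <- cor(calories, sugar, method = "pearson")
print(c(manual = r_manual, builtin = r_builtin))
```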
### Proportion of Cereals on Each Shelf
# calculating total number of cereals on each shelf
shelf1 <- cereal[cereal$Shelf == 1,]
num_shelf1 <- nrow(shelf1)
num_shelf1 # 20
shelf2 <- cereal[cereal$Shelf == 2,]
num_shelf2 <- nrow(shelf2)
num_shelf2 # 21
shelf3 <- cereal[cereal$Shelf == 3,]
num_shelf3 <- nrow(shelf3)
num_shelf3 # 36
# calculating total number of cereals
total <- num_shelf1 + num_shelf2 + num_shelf3
total # 77
# proportion of cereals on each shelf
prop_shelf1 <- num_shelf1 / total
prop_shelf1 # 0.2597
prop_shelf2 <- num_shelf2 / total
prop_shelf2 # 0.2727
prop_shelf3 <- num_shelf3 / total
prop_shelf3 # 0.4675
# creating data frame of quantities and proportion on each shelf
num_shelves <- c(num_shelf1, num_shelf2, num_shelf3)
percentage <- c(round(prop_shelf1,4)*100, round(prop_shelf2,4)*100, round(prop_shelf3,4)*100)
shelf_quantities <- as.data.frame(cbind(sort(unique(cereal$Shelf)), num_shelves, percentage))
colnames(shelf_quantities) <- c("Shelf", "Quantity", "Percentage")
# creating labels for pie chart
lbl<-paste("Shelf ", shelf_quantities$Shelf," (", shelf_quantities$Quantity, ") ", shelf_quantities$Percentage,"%",sep="")
# creating pie chart
pie(shelf_quantities$Quantity,
labels = lbl,
cex=0.8,
col=rainbow(length(shelf_quantities$Quantity)),
main="Proportion of Cereals on Each Shelf")
I wanted to create this visualization to show that the proportion of cereals on each shelf is not equal. We can see that shelf 3 has 36 cereals placed on it, making up almost half of all the cereals (46.75%). Shelf 2 follows with 21 cereals, making up 27.27% of the total and then shelf 1 has 20 cereals or 25.97% of the cereals. These proportions will serve as the benchmark.
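The same counts and percentages can be obtained more compactly with table() and prop.table(). This is a sketch under the assumption that only the shelf counts reported above (20, 21, 36) are available, so the `shelf` vector is rebuilt from those counts instead of being read from the dataset; with the data loaded, cereal$Shelf would be used in its place.

```r
# Rebuild the shelf column from the counts reported above (20, 21, 36)
shelf <- factor(rep(1:3, times = c(20, 21, 36)))

counts  <- table(shelf)                         # cereals per shelf
percent <- round(100 * prop.table(counts), 2)   # share of all cereals, in percent

shelf_summary <- data.frame(
  Shelf      = names(counts),
  Quantity   = as.vector(counts),
  Percentage = as.vector(percent)
)
print(shelf_summary)
```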
### Proportion of Caloric Cereals on Each Shelf
### * a caloric cereal is defined as one whose calorie content is above the third quartile
summaryCals[5] # 3rd quartile = 110 calories per serving
high_caloric_content <- cereal[cereal$Calories > 110,]
shelf1_cal <- high_caloric_content[high_caloric_content$Shelf == 1,]
num_shelf1_cal <- nrow(shelf1_cal)
num_shelf1_cal # 0
shelf2_cal <- high_caloric_content[high_caloric_content$Shelf == 2,]
num_shelf2_cal <- nrow(shelf2_cal)
num_shelf2_cal # 5
shelf3_cal <- high_caloric_content[high_caloric_content$Shelf == 3,]
num_shelf3_cal <- nrow(shelf3_cal)
num_shelf3_cal # 13
# calculating total number of cereals
total_cal <- num_shelf1_cal + num_shelf2_cal + num_shelf3_cal
total_cal # 18
# proportion of cereals on each shelf
prop_shelf1_cal <- num_shelf1_cal / total_cal
prop_shelf1_cal # 0
prop_shelf2_cal <- num_shelf2_cal / total_cal
prop_shelf2_cal # 0.2778
prop_shelf3_cal <- num_shelf3_cal / total_cal
prop_shelf3_cal # 0.7222
# creating data frame of quantities and proportion on each shelf
num_shelves_cal <- c(num_shelf1_cal, num_shelf2_cal, num_shelf3_cal)
percentage_cal <- c(round(prop_shelf1_cal,4)*100, round(prop_shelf2_cal,4)*100, round(prop_shelf3_cal,4)*100)
shelf_quantities_cal <- as.data.frame(cbind(sort(unique(cereal$Shelf)), num_shelves_cal, percentage_cal))
colnames(shelf_quantities_cal) <- c("Shelf", "Quantity", "Percentage")
# creating labels for pie chart
lbl<-paste("Shelf ", shelf_quantities_cal$Shelf," (", shelf_quantities_cal$Quantity, ") ", shelf_quantities_cal$Percentage,"%",sep="")
# creating pie chart
pie(shelf_quantities_cal$Quantity,
labels = lbl,
cex=0.8,
col=rainbow(length(shelf_quantities_cal$Quantity)),
main="Proportion of Caloric Cereals on Each Shelf")
This pie chart shows the proportion of caloric cereals on each shelf. We can see that shelf 3 has the majority of the caloric cereals with 13 cereals placed on it or 72.22%. Shelf 2 follows with 5 cereals, making up 27.78% of the total and then shelf 1 has 0 cereals. This shows that most of the caloric cereals are on the top shelf.
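Since shelf 3 also holds the most cereals overall, it can help to express each shelf's share of caloric cereals relative to the benchmark share of all cereals. The sketch below does this with the counts reported above; the variable names are illustrative and the ratio is a descriptive aid, not a statistical test.

```r
# Counts reported above: all cereals and caloric cereals (> 110 calories) per shelf
all_counts     <- c(shelf1 = 20, shelf2 = 21, shelf3 = 36)
caloric_counts <- c(shelf1 = 0,  shelf2 = 5,  shelf3 = 13)

benchmark_share <- all_counts / sum(all_counts)          # share of all cereals
caloric_share   <- caloric_counts / sum(caloric_counts)  # share of caloric cereals

# Ratio > 1: the shelf holds more caloric cereals than its size alone would suggest
ratio <- caloric_share / benchmark_share
print(round(ratio, 2))
```

With these counts the ratio is about 1.54 for shelf 3 and about 1.02 for shelf 2, so shelf 3 is over-represented among caloric cereals even after accounting for its larger size.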
### Proportion of Sugary Cereals on Each Shelf
### * a sugary cereal is defined as one whose sugar content is above the third quartile
summarySugar[5] # 3rd quartile = 11g of sugar
high_sugar_content <- cereal[cereal$Sugar > 11,]
shelf1_sug <- high_sugar_content[high_sugar_content$Shelf == 1,]
num_shelf1_sug <- nrow(shelf1_sug)
num_shelf1_sug # 2
shelf2_sug <- high_sugar_content[high_sugar_content$Shelf == 2,]
num_shelf2_sug <- nrow(shelf2_sug)
num_shelf2_sug # 12
shelf3_sug <- high_sugar_content[high_sugar_content$Shelf == 3,]
num_shelf3_sug <- nrow(shelf3_sug)
num_shelf3_sug # 5
# calculating total number of cereals
total_sug <- num_shelf1_sug + num_shelf2_sug + num_shelf3_sug
total_sug # 19
# proportion of cereals on each shelf
prop_shelf1_sug <- num_shelf1_sug / total_sug
prop_shelf1_sug # 0.10526
prop_shelf2_sug <- num_shelf2_sug / total_sug
prop_shelf2_sug # 0.6316
prop_shelf3_sug <- num_shelf3_sug / total_sug
prop_shelf3_sug # 0.2632
# creating data frame of quantities and proportion on each shelf
num_shelves_sug <- c(num_shelf1_sug, num_shelf2_sug, num_shelf3_sug)
percentage_sug <- c(round(prop_shelf1_sug,4)*100, round(prop_shelf2_sug,4)*100, round(prop_shelf3_sug,4)*100)
shelf_quantities_sug <- as.data.frame(cbind(sort(unique(cereal$Shelf)), num_shelves_sug, percentage_sug))
colnames(shelf_quantities_sug) <- c("Shelf", "Quantity", "Percentage")
# creating labels for pie chart
lbl<-paste("Shelf ", shelf_quantities_sug$Shelf," (", shelf_quantities_sug$Quantity, ") ", shelf_quantities_sug$Percentage,"%",sep="")
# creating pie chart
pie(shelf_quantities_sug$Quantity,
labels = lbl,
cex=0.8,
col=rainbow(length(shelf_quantities_sug$Quantity)),
main="Proportion of Sugary Cereals on Each Shelf")
This pie chart shows the proportion of sugary cereals on each shelf. We can see that shelf 2 has the majority of the sugary cereals with 12 cereals placed on it or 63.16%. Shelf 3 follows with 5 cereals, making up 26.32% of the total and then shelf 1 has 2 cereals, 10.53%. This shows that most of the sugary cereals are on the middle shelf.
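One point worth checking in the sugar counts is how R treats missing values in logical subsetting. Sugar contains one NA, and a comparison such as Sugar > 11 returns NA for that row; indexing a data frame with NA yields an all-NA row, which a later Shelf == k filter keeps again. The counts above may therefore each include such a row. The sketch below shows the behaviour on a toy data frame (not the cereal data) together with an NA-safe alternative using which().

```r
# Toy data: one missing sugar value, as in the cereal dataset
toy <- data.frame(
  Sugar = c(14, NA, 3, 12),
  Shelf = factor(c(2, 1, 1, 3))
)

# Naive filter: the NA comparison produces an extra all-NA row
naive <- toy[toy$Sugar > 11, ]
n_naive <- nrow(naive)

# NA-safe filter: which() keeps only the positions that are TRUE
safe <- toy[which(toy$Sugar > 11), ]
n_safe <- nrow(safe)

# Counting shelf 1: the all-NA row is carried through the second filter too
n_naive_shelf1 <- nrow(naive[naive$Shelf == 1, ])
n_safe_shelf1  <- nrow(safe[which(safe$Shelf == 1), ])

print(c(naive = n_naive, safe = n_safe,
        naive_shelf1 = n_naive_shelf1, safe_shelf1 = n_safe_shelf1))
```

In the toy data no sugary cereal sits on shelf 1, yet the naive count for shelf 1 is 1 because of the all-NA row. Rerunning the sugar counts with which(), or after dropping rows where Sugar is NA, would show whether the reported figures are affected.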
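# helper that builds the pie labels for one shelf summary, so each chart gets its own labels
pie_labels <- function(df) paste("Shelf ", df$Shelf, " (", df$Quantity, ") ", df$Percentage, "%", sep="")
# draw the three pie charts in one row
par(mfrow = c(1, 3))
pie(shelf_quantities$Quantity,
labels = pie_labels(shelf_quantities),
cex=0.8,
col=rainbow(length(shelf_quantities$Quantity)),
main="All Cereals")
pie(shelf_quantities_cal$Quantity,
labels = pie_labels(shelf_quantities_cal),
cex=0.8,
col=rainbow(length(shelf_quantities_cal$Quantity)),
main="Caloric Cereals")
pie(shelf_quantities_sug$Quantity,
labels = pie_labels(shelf_quantities_sug),
cex=0.8,
col=rainbow(length(shelf_quantities_sug$Quantity)),
main="Sugary Cereals")
# restore the default single-plot layout
par(mfrow = c(1, 1))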
These are the same graphs as above, but showing them side by side makes the differences in the percentages on each shelf easier to see.
# setup
grps <- factor(cereal$Shelf)
cols <- c("#FF0000", "#FF6600", "#CC00CC")
# calories and shelf
dotchart(cereal$Calories,
groups = grps, color = cols[grps], gcolor = cols, cex = 1.5, pch = 18, xlab = "Calories",
ylab = "Cereal (grouped by shelf)", main = "Relationship Between Calories and Shelf")
# sugars and shelf
dotchart(cereal$Sugar,
groups = grps, color = cols[grps], gcolor = cols, cex = 1.5, pch = 18, xlab = "Sugar",
ylab = "Cereal (grouped by shelf)", main = "Relationship Between Sugar and Shelf")
The dotcharts give a different view of the shelf levels and their relationship to calories and sugar, respectively. The x-axis shows the calorie or sugar amount, whereas the y-axis has one row per cereal, grouped by shelf. The names of the cereals are hidden to give a cleaner look. Each point on the graph represents one cereal. The cereals are colored by shelf and are located within one of the three distinct banded regions that correspond to the shelf numbers shown on the y-axis. Although the number of cereals on each shelf is not the same, the visualization shows that the most sugary cereals (> 11 grams) are most highly concentrated on shelf 2, whereas the most caloric cereals (> 110 calories) are most highly concentrated on shelf 3.
# Calories
bplot <- ggplot(cereal, aes(x=Shelf, y=Calories))
# fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
boxplot_cal <- bplot + geom_boxplot(outlier.color = "purple", outlier.size = 2) +
stat_summary(fun = mean, color = "orange", geom = "point", size = 3) +
ggtitle("Boxplot of Calories Distribution\nfor Each Shelf") + labs(x="Shelf", y="Calories")
boxplot_cal
# Sugar
bplot <- ggplot(cereal, aes(x=Shelf, y=Sugar))
# fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
boxplot_sug <- bplot + geom_boxplot(outlier.color = "purple", outlier.size = 2) +
stat_summary(fun = mean, color = "orange", geom = "point", size = 3) +
ggtitle("Boxplot of Sugar Distribution\nfor Each Shelf") + labs(x="Shelf", y="Sugar (grams)")
boxplot_sug
These graphs show the distribution of calories and sugar, respectively, on each shelf. The orange points represent the means for each shelf whereas the purple dots represent outliers. When looking at the calorie distribution, we see that the medians for shelves 2 and 3 are the same. However, shelf 2 does not show a box because its first and third quartiles are the same as its median of 110 calories. The boxplots for sugar tell a slightly different story. The median for shelf 2 is noticeably higher than the medians for shelves 1 and 3, and the mean is also higher for shelf 2 than for shelves 1 and 3.
anova_cal <- aov(Calories ~ Shelf, data = cereal)
summary(anova_cal)
## Df Sum Sq Mean Sq F value Pr(>F)
## Shelf 2 559 279.7 0.732 0.485
## Residuals 74 28292 382.3
anova_sug <- aov(Sugar ~ Shelf, data = cereal)
summary(anova_sug)
## Df Sum Sq Mean Sq F value Pr(>F)
## Shelf 2 220.2 110.12 6.601 0.00232 **
## Residuals 73 1217.7 16.68
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness
Analysis of Variance (ANOVA) is a test used to evaluate whether the means of several groups differ. I am running two separate ANOVA tests (one for calories and another for sugar) to see whether one or both suggest a difference in mean calories or mean sugar among shelves. The null hypothesis for ANOVA is that the mean is the same for all groups (all three shelves), whereas the alternative hypothesis is that the mean is not the same for all groups. The p-value indicates whether the results are statistically significant. If the p-value is greater than our significance level of 0.05, we fail to reject the null hypothesis. If it is less than 0.05, we reject the null hypothesis and conclude that the mean is not the same for all groups. In that case, at least two groups differ from each other, so we would want to perform a Tukey test to see which groups are significantly different.
RESULTS
Calories: Since the p-value is 0.485, which is greater than our significance level of 0.05, we fail to reject the null hypothesis. We do not have evidence that the mean calories differ across the three shelves.
Sugar: Since the p-value is 0.00232, which is less than our significance level of 0.05, we reject the null hypothesis. We conclude that our results are statistically significant and that the mean sugar content is not the same across all three shelves. We will proceed with a Tukey test to see which shelves have significantly different means.
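The F values and p-values in the two ANOVA tables follow directly from the sums of squares and degrees of freedom. As a sketch, the helper below (the name `f_and_p` is illustrative) recomputes them from the rounded figures printed above, so small differences in the last digit are expected.

```r
# F statistic = (between-group mean square) / (within-group mean square);
# the p-value is the upper tail of the F distribution
f_and_p <- function(ss_between, df_between, ss_within, df_within) {
  f_value <- (ss_between / df_between) / (ss_within / df_within)
  p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)
  c(f_value = f_value, p_value = p_value)
}

# Sums of squares and degrees of freedom as printed in the ANOVA tables
anova_cal_check <- f_and_p(559,   2, 28292,  74)  # Calories ~ Shelf
anova_sug_check <- f_and_p(220.2, 2, 1217.7, 73)  # Sugar ~ Shelf

print(round(anova_cal_check, 3))
print(round(anova_sug_check, 5))
```

The recomputed values agree with the reported F values (0.732 and 6.601) and p-values (0.485 and 0.00232) up to rounding of the printed sums of squares.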
TukeyTest <- TukeyHSD(anova_sug)
TukeyTest
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Sugar ~ Shelf, data = cereal)
##
## $Shelf
## diff lwr upr p adj
## 2-1 4.513784 1.419965 7.6076039 0.0023434
## 3-1 1.422515 -1.348282 4.1933116 0.4404910
## 3-2 -3.091270 -5.774315 -0.4082249 0.0199413
plot(TukeyTest)
From the results, we can see that shelves one and two have significantly different means since the adjusted p-value (0.0023) is less than 0.05. Additionally, shelves two and three have significantly different means since the adjusted p-value (0.0199) is less than 0.05. Shelves one and three are not significantly different (adjusted p-value 0.4405). This is depicted in the graph, where the confidence interval for the pair of shelves one and three crosses the dotted line at zero. As a result, we can say that shelf two was significantly different from the other two shelves.
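Reading the Tukey plot amounts to checking whether each 95% confidence interval excludes zero. The sketch below applies that rule to the values printed in the TukeyHSD output; the data frame `tukey` is typed in by hand from that output and is not extracted from the TukeyTest object.

```r
# Values copied from the TukeyHSD output above
tukey <- data.frame(
  pair  = c("2-1", "3-1", "3-2"),
  diff  = c(4.513784, 1.422515, -3.091270),
  lwr   = c(1.419965, -1.348282, -5.774315),
  upr   = c(7.6076039, 4.1933116, -0.4082249),
  p_adj = c(0.0023434, 0.4404910, 0.0199413)
)

# A pair differs significantly when its interval lies entirely on one side of zero
tukey$excludes_zero <- tukey$lwr > 0 | tukey$upr < 0
# Equivalent decision based on the adjusted p-value
tukey$significant <- tukey$p_adj < 0.05

print(tukey[, c("pair", "excludes_zero", "significant")])
```

Both criteria flag the pairs 2-1 and 3-2 and leave 3-1 unflagged, which matches the interpretation above.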
In conclusion, there is statistically significant evidence that shelf two holds more sugary cereals. This is consistent with our initial hypothesis that less healthy cereals are placed on shelves that are at eye level, if we assume that more sugar in cereals means they are less healthy. This result was consistent across all of the visualizations and analyses performed. From the pie charts, we could see that the majority of sugary cereals (63.16%) were on shelf 2. Similarly, the dotchart showed the same result. The boxplot showed that shelf 2 had the highest mean sugar per serving. Lastly, the ANOVA test showed that the mean sugar content was not the same across all three shelves, which was further supported by the Tukey test showing that shelf 2 specifically was statistically different from the other two shelves. There was not enough evidence to conclude that the calories among the three shelves were significantly different. This was apparent in the boxplots and most concretely shown by the ANOVA test, which found no statistically significant difference in mean calories across the three shelves.
Regarding future work, more data would be needed from other grocery stores to see if the results remain consistent. Additionally, if there are more than 3 shelves, it would be interesting to see where the most sugary cereals are placed. Would the placement change from shelf 2 to shelf 3? One limitation of this study is that the sample was relatively small. A larger pool of data would provide more reliable results. Additionally, I would like to look into other nutritional data besides calories and sugar in a future analysis.