Each student in a large statistics class of 600 students is asked to toss a fair coin 100 times, count the resulting number of Heads, and construct a 0.95-level confidence interval for the probability of Heads. Assume that each student uses a fair coin and constructs the confidence interval correctly. True or False: We would expect approximately 570 of the confidence intervals to contain the number 0.5.
The Hypotheses are:
Ho: The coins are fair.
Ha: The coins are not fair.
We evaluated this question by replicating 60,000 tosses by tossing 600 fair coins 100 times each to get the proportion of heads for each coin, which can be seen below. We used 100 one-sample proportion tests to statistically determine if the results are consistent with the expectation of a fair coin at a 95% confidence interval.
set.seed(25)
total.tosses = 600 * 100 #600 students toss a fair coin 100 times
total.tosses
[1] 60000
binom.tosses = rbinom(600, 100, prob = 0.5) # Calculates the Number of heads tosses
table(binom.tosses)
binom.tosses
36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58
1 4 1 7 10 3 14 18 27 40 34 43 38 46 49 30 56 43 39 29 19 12 9
59 60 61 62 63
11 9 4 1 3
binom.tosses
[1] 52 45 44 49 53 60 52 47 50 53 46 45 57 54 61 52 52 46 45 46 45
[22] 48 52 50 50 45 50 45 52 48 44 52 43 47 53 46 47 52 52 53 53 49
[43] 50 44 45 40 46 45 53 50 49 52 44 46 46 45 60 51 45 50 49 47 53
[64] 63 51 50 51 45 44 52 49 49 54 55 46 47 50 54 44 55 51 56 39 48
[85] 52 40 54 55 63 39 50 47 53 50 45 42 44 52 47 52 55 47 47 44 51
[106] 59 50 48 46 52 50 56 48 55 51 55 52 44 45 54 53 50 48 45 55 43
[127] 45 39 45 48 40 53 52 61 56 44 48 50 46 52 47 60 45 49 50 47 58
[148] 55 45 57 40 49 54 56 51 49 51 49 56 55 53 51 58 45 46 49 52 50
[169] 48 45 58 47 53 54 48 48 49 49 53 42 47 56 52 58 46 49 55 49 45
[190] 48 47 45 49 52 48 37 50 56 48 39 46 48 38 54 50 47 49 45 57 51
[211] 45 42 43 54 41 42 47 50 54 52 50 46 37 56 46 52 52 50 50 54 53
[232] 47 51 57 54 40 44 50 47 52 44 55 47 51 43 57 45 52 40 54 55 53
[253] 46 43 56 48 44 59 48 40 47 54 53 44 46 50 53 46 48 52 49 55 49
[274] 53 48 53 42 54 57 46 53 49 54 56 53 52 53 45 49 49 59 53 42 50
[295] 51 51 59 47 43 54 60 55 45 52 50 60 56 55 48 46 41 56 52 45 50
[316] 51 43 56 59 63 47 50 52 57 47 53 48 39 53 48 45 54 52 53 50 39
[337] 48 48 49 42 47 45 52 53 52 51 47 45 49 56 50 53 52 45 54 49 54
[358] 42 44 54 60 49 53 44 52 52 54 55 46 52 44 52 47 54 48 50 50 54
[379] 51 47 53 45 44 57 54 50 47 55 48 56 47 49 40 55 44 47 53 51 44
[400] 49 51 45 47 50 53 57 52 51 48 50 49 53 55 50 62 48 46 43 48 41
[421] 50 54 55 49 53 44 59 49 43 55 49 44 43 56 52 49 47 37 59 54 43
[442] 47 46 43 46 50 42 55 55 55 51 55 46 52 40 51 54 58 59 51 48 47
[463] 44 49 51 49 43 52 45 50 52 49 57 40 47 49 54 43 48 50 43 44 49
[484] 43 50 54 44 54 55 51 46 60 47 51 52 48 55 51 47 46 58 47 53 48
[505] 49 56 52 54 52 50 54 54 44 39 54 52 46 49 46 53 48 51 53 45 43
[526] 52 59 47 50 48 52 49 54 50 56 47 58 48 60 45 53 53 49 42 49 45
[547] 60 61 53 49 52 58 57 48 53 52 59 42 54 37 47 50 50 36 49 47 54
[568] 51 52 52 57 53 58 46 46 48 45 47 46 46 43 42 49 55 53 42 46 44
[589] 55 52 54 50 59 52 42 50 51 45 56 61
barplot(table(binom.tosses), xlab = 'X = Number of Heads', ylab = 'Frequency' ,col = 'light green')
qqnorm(binom.tosses, col = 'blue')
qqline(binom.tosses, col = "red", lwd = 3)
mean(binom.tosses) # average number heads tossed # sum(binom.tosses)/total.tosses
[1] 49.76167
No.Heads<-sum(binom.tosses) # Number of heads tosses
No.Heads
[1] 29857
summary(binom.tosses)
Min. 1st Qu. Median Mean 3rd Qu. Max.
36.00 46.00 50.00 49.76 53.00 63.00
sd(table(binom.tosses))
[1] 17.42163
binom.test(c(No.Heads, 60000-No.Heads), p = 0.95)
Exact binomial test
data: c(No.Heads, 60000 - No.Heads)
number of successes = 29857, number of trials = 60000, p-value
< 2.2e-16
alternative hypothesis: true probability of success is not equal to 0.95
95 percent confidence interval:
0.4936078 0.5016257
sample estimates:
probability of success
0.4976167
Results of the experiment included the mean number of Heads = 49.8%, Sum of heads = 29,857, summary statistics, and the standard deviation = 17.42163. A binomial test, shown in the last segment above, presents a p-value < 2.2e-16, confirmed the previous results with the exact same outcome in which the probability of success was 0.4976167. The problem states the coin is a fair coin, which means that the probability of a bias causing the coin’s true mean to be anything other than 0.5. Plotted results showed a normal distribution, displayed in the qq-plots above, although there were a few potential outliers indicated at the extremes. At least 95%, 570/600 = 0.95, of the tests contain the experiment true mean of 0.5 where Heads results shown in Table 3 were 0.4936078 / 0.5016257 = 0.9840162, or 98.4 %: Therefore, fail to reject the null hypothesis, Ho, that the coins are fair.
A company that manufactures light bulbs has advertised that its 75-watt bulbs burn an average of 800 hours before failing. This problem involved testing the accuracy by consumer watchdog organization of a company leader’s claim that their light bulbs lasted an average of 800 hours. The test was to determine whether the number of hours that the sample lights burned was statically significant different than the claim. Results of the test showed that the light bulb sample set lasted an average of 745.1 hours before failing, with a sample standard deviation of 238.0 hours. The null hypothesis is the claim that the light bulb average lifespan equals 800. The alternative hypothesis is that the light bulb average lifespan does not equal 800. The research question and hypotheses include:
RQ: Will the light bulb sample set have an average lifespan like the company claim?
Ho: There is no statistically significant difference between the company light bulb average lifespan and the light bulb sample set average lifespan.
Ha: There is a statistically significant difference between the company light bulb average lifespan and the light bulb sample set average lifespan.
The objective of this problem was to compute the critical value at .05 significance level. The Z test statistic – (-2.306723) was less than the critical value – (-1.644854), Table 2. Therefore, at a .05 significance level, reject the claim, Ho, that average light bulb lifetime is 800 hours. A p-value, Table 3, of 0.01053514 was less than the 0.05 significance level, therefore reject the null hypothesis, Ho. Results led to the conclusion that company leaders exaggerated the lifespan of their light bulbs; However, it cannot be concluded whether the exaggeration was deliberate considering the information provided.
Xbar = 745.1 # sample mean (average hours until failure)
Muo = 800 # hypothesized value Ho
pop = 238.0 # population standard deviation (provided above)
N = 100 # sample size
z = (Xbar - Muo)/(pop/sqrt(N))
z # test statistic
[1] -2.306723
Next, we compute the critical value at .05 significance level:
significance = .05 # significance Level
z.significance = qnorm(1-significance)
-z.significance # critical value
[1] -1.644854
Test statistic of -2.306723 is less than the critical value of -1.644854 Therefore, at a .05 significance level, reject the claim that average light bulb lifetime is 800 hours.
Alternative - we calculate the p-value:
pvalue = pnorm(z)
pvalue # lower tail p-value
[1] 0.01053514
The P value is less than the .05 significance level, therefore reject the null hypothesis that Muo = 800. The average lightbulb lifespan is not equal to 800.
For this section we evaluate Cereal metrics to answer a few questions about some of our favorite cereal brands. The data contains the following factors for interpretation:
cereal name [name]
manufacturer (e.g., Kellogg’s) [mfr]
type (cold/hot) [type]
calories (number) [calories]
protein(g) [protein]
fat(g) [fat]
sodium(mg) [sodium]
dietary fiber(g) [fiber]
complex carbohydrates(g) [carbo]
sugars(g) [sugars]
display shelf (1, 2, or 3, counting from the floor) [shelf]
potassium(mg) [potass]
vitamins & minerals (0, 25, or 100, respectively indicating ‘none added’; ‘enriched, often to 25% FDA recommended’; ‘100% of FDA recommended’) [vitamins]
weight (in ounces) of one serving (serving size) [weight]
cups per serving [cups]
In our analysis, words in the brackets are the names that we use for our factors.
We begin by reading in the data via a csv that was taken from the site: http://lib.stat.cmu.edu/datasets/1993.expo/cereal
Our hypothesis being kids will look at lower shelves and kid’s cereal will have more sugar.
setwd('C:\\Users\\JP\\Documents\\School\\PredictiveAnalytics')
cereal<-read.csv("cereals.csv",header=TRUE)
library("ggplot2")
library("caret")
library("car")
We begin by taking a look at the data presented.
head(cereal)
summary(cereal)
name mfr type calories
100%_Bran : 1 A: 1 C:74 Min. : 50.0
100%_Natural_Bran : 1 G:22 H: 3 1st Qu.:100.0
All-Bran : 1 K:23 Median :110.0
All-Bran_with_Extra_Fiber: 1 N: 6 Mean :106.9
Almond_Delight : 1 P: 9 3rd Qu.:110.0
Apple_Cinnamon_Cheerios : 1 Q: 8 Max. :160.0
(Other) :71 R: 8
protein fat sodium fiber
Min. :1.000 Min. :0.000 Min. : 0.0 Min. : 0.000
1st Qu.:2.000 1st Qu.:0.000 1st Qu.:130.0 1st Qu.: 1.000
Median :3.000 Median :1.000 Median :180.0 Median : 2.000
Mean :2.545 Mean :1.013 Mean :159.7 Mean : 2.152
3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:210.0 3rd Qu.: 3.000
Max. :6.000 Max. :5.000 Max. :320.0 Max. :14.000
carbo sugars shelf potass
Min. :-1.0 Min. :-1.000 Min. :1.000 Min. : -1.00
1st Qu.:12.0 1st Qu.: 3.000 1st Qu.:1.000 1st Qu.: 40.00
Median :14.0 Median : 7.000 Median :2.000 Median : 90.00
Mean :14.6 Mean : 6.922 Mean :2.208 Mean : 96.08
3rd Qu.:17.0 3rd Qu.:11.000 3rd Qu.:3.000 3rd Qu.:120.00
Max. :23.0 Max. :15.000 Max. :3.000 Max. :330.00
vitamins weight cups
Min. : 0.00 Min. :-1.0000 Min. :-1.0000
1st Qu.: 25.00 1st Qu.: 1.0000 1st Qu.: 0.5000
Median : 25.00 Median : 1.0000 Median : 0.7500
Mean : 28.25 Mean : 0.9777 Mean : 0.5873
3rd Qu.: 25.00 3rd Qu.: 1.0000 3rd Qu.: 1.0000
Max. :100.00 Max. : 1.5000 Max. : 1.5000
str(cereal)
'data.frame': 77 obs. of 15 variables:
$ name : Factor w/ 77 levels "100%_Bran","100%_Natural_Bran",..: 1 2 3 4 5 6 7 8 9 10 ...
$ mfr : Factor w/ 7 levels "A","G","K","N",..: 4 6 3 3 7 2 3 2 7 5 ...
$ type : Factor w/ 2 levels "C","H": 1 1 1 1 1 1 1 1 1 1 ...
$ calories: int 70 120 70 50 110 110 110 130 90 90 ...
$ protein : int 4 3 4 4 2 2 2 3 2 3 ...
$ fat : int 1 5 1 0 2 2 0 2 1 0 ...
$ sodium : int 130 15 260 140 200 180 125 210 200 210 ...
$ fiber : num 10 2 9 14 1 1.5 1 2 4 5 ...
$ carbo : num 5 8 7 8 14 10.5 11 18 15 13 ...
$ sugars : int 6 8 5 0 8 10 14 8 6 5 ...
$ shelf : int 3 3 3 3 3 1 2 3 1 3 ...
$ potass : int 280 135 320 330 -1 70 30 100 125 190 ...
$ vitamins: int 25 0 25 25 25 25 25 25 25 25 ...
$ weight : num 1 1 1 1 1 1 1 1.33 1 1 ...
$ cups : num 0.33 -1 0.33 0.5 0.75 0.75 1 0.75 0.67 0.67 ...
The summary shows we have -1 in the data. We will remove these -1’s. A -1 signifies that the data is null or not provided.
cereal[cereal == -1] <- NA
cereal=na.omit(cereal)
The first step in our evaluation is to do some exploratory data analysis. We will start with a box plot of the various factors.
qplot(as.factor(shelf),sugars, data = cereal,
geom = "boxplot", main="Sugar Vs Shelf Level")+
theme_bw() #versus sugars
It looks like there is a strong correlation between the level of the shelf and the amount of sugar. This is displayed as the boxplot for the second level of shelving is much higher than levels 1 and 3. We will continue on by looking at a similar box plot, instead evaluating the Manufacturer instead of shelf level.
qplot(mfr,sugars, data = cereal,
geom = "boxplot", main="Sugar Vs Manufacturer Level")+
theme_bw()
Based on this box plot, certain manufacturers sell products with a larger sugar content. Perhaps they target children? Our next step is to see if higher sugar levels indicate lower nutritional content.
qplot(fiber, sugars, data=cereal)+
geom_smooth(method = "lm")
The linear model indicates a negative correlation but the plotted points do not appear promising. There are not many cereals with high fiber content, which skews our ability to interpret this model.
qplot(vitamins, sugars, data=cereal)+
geom_smooth(method = "lm")
This time is no relationship between sugars and vitamins. We suspect the cereal companies add both so they can better sell their products.
qplot(potass, sugars, data=cereal)+
geom_smooth(method = "lm")
It appears that we have a relationship between potassium levels and sugar levels. For completeness sake, we will print all the plots for continuous variables, its going to be hard to read but it might shine light on important relationships to sugar.
spm(cereal[,c(4:10,12)], reg.line=lm,
diagonal="histogram",smoother=TRUE, spread=TRUE)
While there are several interesting plot, we have clear correlations with calories, protein, fat and carbohydrates. We will build a model below to test this. We would like to test more thoroughly the relationship between sugar and shelf levels.
cereal.shelf.mod <- aov(sugars ~ as.factor(shelf),
data = cereal)
summary(cereal.shelf.mod)
Df Sum Sq Mean Sq F value Pr(>F)
as.factor(shelf) 2 300.8 150.38 10.35 0.000132 ***
Residuals 62 900.6 14.53
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
TukeyHSD(cereal.shelf.mod)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = sugars ~ as.factor(shelf), data = cereal)
$`as.factor(shelf)`
diff lwr upr p adj
2-1 5.388889 2.338221 8.439557 0.0002199
3-1 1.128352 -1.617833 3.874538 0.5880147
3-2 -4.260536 -7.006722 -1.514351 0.0012168
Based on the output form the above there are a few things that stand out. First the relationship between the shelf level and sugar amounts is significant for levels 1 to 2 and 3 to 2.
Next we interpret an ANOVA for the manufacturer.
cereal.mfr.mod = lm(sugars ~ mfr,
data = cereal)
summary(cereal.mfr.mod)
Call:
lm(formula = sugars ~ mfr, data = cereal)
Residuals:
Min 1Q Median 3Q Max
-7.9048 -3.7778 0.0455 4.0000 7.0952
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.95455 0.88608 8.977 1.25e-12 ***
mfrK -0.04978 1.26794 -0.039 0.9688
mfrN -5.95455 2.55791 -2.328 0.0234 *
mfrP 0.82323 1.64450 0.501 0.6185
mfrQ -0.95455 2.05907 -0.464 0.6447
mfrR -4.15455 2.05907 -2.018 0.0482 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.156 on 59 degrees of freedom
Multiple R-squared: 0.1517, Adjusted R-squared: 0.07982
F-statistic: 2.11 on 5 and 59 DF, p-value: 0.0767
anova(cereal.mfr.mod)
Analysis of Variance Table
Response: sugars
Df Sum Sq Mean Sq F value Pr(>F)
mfr 5 182.26 36.453 2.1104 0.0767 .
Residuals 59 1019.12 17.273
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
confint(cereal.mfr.mod)
2.5 % 97.5 %
(Intercept) 6.181494 9.72759684
mfrK -2.586932 2.48736504
mfrN -11.072904 -0.83618699
mfrP -2.467412 4.11387642
mfrQ -5.074745 3.16565363
mfrR -8.274745 -0.03434637
Based on the results of the ANOVA above, there is no significant relationship between sugar and the manufacturer.
cereal.many.mod = lm(sugars ~ calories+protein+fat+
sodium+fiber+carbo+potass,
data = cereal)
summary(cereal.many.mod)
Call:
lm(formula = sugars ~ calories + protein + fat + sodium + fiber +
carbo + potass, data = cereal)
Residuals:
Min 1Q Median 3Q Max
-3.2535 -0.5569 -0.0633 0.6403 2.8434
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.444157 1.148128 3.871 0.000282 ***
calories 0.198943 0.011241 17.698 < 2e-16 ***
protein -1.069138 0.165966 -6.442 2.69e-08 ***
fat -1.896190 0.213634 -8.876 2.46e-12 ***
sodium 0.000183 0.002079 0.088 0.930156
fiber -0.622339 0.180278 -3.452 0.001055 **
carbo -0.978777 0.047746 -20.500 < 2e-16 ***
potass 0.017867 0.005931 3.013 0.003859 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.139 on 57 degrees of freedom
Multiple R-squared: 0.9384, Adjusted R-squared: 0.9308
F-statistic: 124 on 7 and 57 DF, p-value: < 2.2e-16
We can see that most of the continuous variables in the data have some sort of relationship with sugar. We can predict 93% of the variance in the sugar data with the continuous variables.
Our hypothesis is on an aggregate basis, the products will all be about the same. The alternate hypothesis is that a specific manufacturer will be significantly different than the others based on their cereals nutritional factors.
We begin this analysis with a box plot of the various continuous variables versus manufacturer.
qplot(mfr,calories, data = cereal,
geom = "boxplot", main="Calories Vs Manufacturer Level")+
theme_bw()
qplot(mfr,protein, data = cereal,
geom = "boxplot", main="Protein Vs Manufacturer Level")+
theme_bw()
qplot(mfr,fat, data = cereal,
geom = "boxplot", main="Fat Vs Manufacturer Level")+
theme_bw()
qplot(mfr,sodium, data = cereal,
geom = "boxplot", main="Sodium Vs Manufacturer Level")+
theme_bw()
qplot(mfr,vitamins, data = cereal,
geom = "boxplot", main="Vitamins Vs Manufacturer Level")+
theme_bw()
On order of the code above it looks like their are differences between manufacturers in the following items:
Protein, Fat, Sodium and vitamins
We will now perform an analysis of variance on each of the nutrition metrics.
cal.mfr.aov <- aov(calories ~ mfr, data = cereal)
summary(cal.mfr.aov)
Df Sum Sq Mean Sq F value Pr(>F)
mfr 5 2786 557.1 1.881 0.111
Residuals 59 17470 296.1
Here the p-value is just over .05 meaning we should not evaluate it further as there is unlikely a significant difference.
Next, we take a look at protein versus manufacturer.
protein.mfr.aov <- aov(protein ~ mfr, data = cereal)
summary(protein.mfr.aov,conf.level=0.95)
Df Sum Sq Mean Sq F value Pr(>F)
mfr 5 5.03 1.005 0.834 0.531
Residuals 59 71.13 1.206
Here we have no significance right off the bat. Moving forward - fat versus mfr:
fat.mfr.aov <- aov(fat ~ mfr, data = cereal)
summary(fat.mfr.aov)
Df Sum Sq Mean Sq F value Pr(>F)
mfr 5 9.76 1.9510 3.198 0.0127 *
Residuals 59 36.00 0.6101
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We have our first case where there manufacturer is significant, we can go further by performing the Tukey test.
TukeyHSD(fat.mfr.aov,conf.level=0.95)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = fat ~ mfr, data = cereal)
$mfr
diff lwr upr p adj
K-G -0.74458874 -1.4464957 -0.04268176 0.0314295
N-G -1.03030303 -2.4463066 0.38570058 0.2800858
P-G -0.47474747 -1.3851104 0.43561548 0.6428285
Q-G 0.03636364 -1.1034973 1.17622456 0.9999989
R-G -0.96363636 -2.1034973 0.17622456 0.1437442
N-K -0.28571429 -1.7057579 1.13432929 0.9911547
P-K 0.26984127 -0.6467929 1.18647545 0.9528269
Q-K 0.78095238 -0.3639233 1.92582811 0.3496508
R-K -0.21904762 -1.3639233 0.92582811 0.9929857
P-N 0.55555556 -0.9782668 2.08937792 0.8924565
Q-N 1.06666667 -0.6135515 2.74688488 0.4305356
R-N 0.06666667 -1.6135515 1.74688488 0.9999968
Q-P 0.51111111 -0.7721767 1.79439897 0.8477395
R-P -0.48888889 -1.7721767 0.79439897 0.8702683
R-Q -1.00000000 -2.4551117 0.45511166 0.3414106
We have one example of a significant difference between two manufacturers here in K-G. However for the most part, the differences are minimal or not significant.
sodium.mfr.aov <- aov(sodium ~ mfr, data = cereal)
summary(sodium.mfr.aov)
Df Sum Sq Mean Sq F value Pr(>F)
mfr 5 92311 18462 4.042 0.0032 **
Residuals 59 269455 4567
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
While evaluating sodium, the p-value has come back low, which indicates that there could be a significant difference between manufacturers. Let’s take a look using the same methodology:
TukeyHSD(sodium.mfr.aov,conf.level=0.95)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = sodium ~ mfr, data = cereal)
$mfr
diff lwr upr p adj
K-G -25.216450 -85.94295 35.510050 0.8238532
N-G -157.121212 -279.62882 -34.613607 0.0047673
P-G -54.343434 -133.10481 24.417937 0.3370251
Q-G -55.454545 -154.07127 43.162175 0.5654902
R-G 27.545455 -71.07127 126.162175 0.9621603
N-K -131.904762 -254.76189 -9.047633 0.0284047
P-K -29.126984 -108.43092 50.176951 0.8866822
Q-K -30.238095 -129.28868 68.812489 0.9451902
R-K 52.761905 -46.28868 151.812489 0.6217872
P-N 102.777778 -29.92309 235.478643 0.2181154
Q-N 101.666667 -43.69985 247.033182 0.3222238
R-N 184.666667 39.30015 330.033182 0.0053250
Q-P -1.111111 -112.13662 109.914399 1.0000000
R-P 81.888889 -29.13662 192.914399 0.2660781
R-Q 83.000000 -42.89109 208.891095 0.3877310
Manufacturer N appears three times in our Tukey test. If we were only evaluating manufactures based on sodium, manufacturer N produces cereals with significantly less sodium from the others. Finally, we can look at vitamins to see if they are significantly different across manufacturers.
vitamins.mfr.aov <- aov(vitamins ~ mfr, data = cereal)
summary(vitamins.mfr.aov)
Df Sum Sq Mean Sq F value Pr(>F)
mfr 5 2946 589.3 1.354 0.255
Residuals 59 25669 435.1
Vitamins do not appear to be significantly different across manufacturers.
Cereal manufacturers create very similar products in terms of their nutrituional value. Even when significance was found in a nutritional group it was between a minimal number of manufacturers. Our strongest case was for sodium, and manufacturer N, but as a whole, nothing stuck out, not enough at least to reject our null hypothesis.
This question evaluates whether or not a shelf level is dominated by a certain manufacturer or not.
RQ: Will there be a manufacturer that produces cereal for a certain shelf level?
Ho: There is not a significant relationship between the manufacturer and a cereal’s shelf level.
Ha: There is a significant relationship between the manufacturer and a cereal’s shelf level.
This question can be evaluated using the chisquared test to determine if the there is a relationship between manufacturers and shelf level, but first we need to reorganize the data. When we removed the records with missing values, manufacturer “A” was removed from the set. To perform this analysis correctly, we need to remove that as a factor level in order to avoid skewing the results.
cereal$mfr<-as.character(cereal$mfr)
cereals<-cereal[cereal$mfr!="A",]
cereal$mfr<-as.factor(cereal$mfr)
ManfShelf = table(cereals$mfr, as.factor(cereals$shelf))
ManfShelf
1 2 3
G 6 7 9
K 4 7 10
N 2 0 1
P 2 1 6
Q 0 3 2
R 4 0 1
chisq.test(ManfShelf)
Chi-squared approximation may be incorrect
Pearson's Chi-squared test
data: ManfShelf
X-squared = 15.885, df = 10, p-value = 0.103
We did get a warning, likely because there are too few records, but our p-value would indicate that there is no significant relationship between manufacturer and shelf level. We can try other configurations however, such as evaluating the problem as 3 vs other shelves:
ManfShelf <- cbind(ManfShelf[,"3"], ManfShelf[,"1"] + ManfShelf[,"2"])
ManfShelf
[,1] [,2]
G 9 13
K 10 11
N 1 2
P 6 3
Q 2 3
R 1 4
chisq.test(ManfShelf)
Chi-squared approximation may be incorrect
Pearson's Chi-squared test
data: ManfShelf
X-squared = 3.3937, df = 5, p-value = 0.6395
Again, this yields no significance. We can evaluate 1 and 3 together:
ManfShelf = table(cereals$mfr, as.factor(cereals$shelf))
ManfShelf <- cbind(ManfShelf[,"2"], ManfShelf[,"1"] + ManfShelf[,"3"])
ManfShelf
[,1] [,2]
G 7 15
K 7 14
N 0 3
P 1 8
Q 3 2
R 0 5
chisq.test(ManfShelf)
Chi-squared approximation may be incorrect
Pearson's Chi-squared test
data: ManfShelf
X-squared = 7.4267, df = 5, p-value = 0.1908
Again, nothing, but we do evaluate the last combination of shelves.
ManfShelf = table(cereals$mfr, as.factor(cereals$shelf))
ManfShelf <- cbind(ManfShelf[,"1"], ManfShelf[,"2"] + ManfShelf[,"3"])
ManfShelf
[,1] [,2]
G 6 16
K 4 17
N 2 1
P 2 7
Q 0 5
R 4 1
chisq.test(ManfShelf)
Chi-squared approximation may be incorrect
Pearson's Chi-squared test
data: ManfShelf
X-squared = 11.943, df = 5, p-value = 0.03558
In our final scenario, we have found something, though we still do not have many records, when we evaluate the problem as a dependence on manufacturer and shelf level, we have determined that a specific dependency does likely exist. In this case we had to group the top two shelves and compare those to the bottom shelf.
We have rejected our null hypothesis and accepted that there is a relationship between manufacturer and shelf level.
To answer this question, we need to determine what makes a cereal nutritionally superior. The data set for this exercise does come with some recommendations for determining which cereals are the healthiest.
Adults should consume between 20 and 35 grams of dietary fiber per day. The recommended daily intake (RDI) for calories is 2200 for women and 2900 for men. Calories come in three food components. There are 9 calories per gram of fat, and 4 calories per gram of carbohydrate and protein. Overall, in your diet, no more than 10% of your calories should be consumed from simple carbohydrates (sugars), and no more than 30% should come from fat. The RDI of protein is 50 grams for women and 63 grams for men. The balance of calories should be consumed in the form of complex carbohydrates (starches). The average adult with no defined risk factors or other dietary restrictions should consume between 1800 and 2400 mg of sodium per day.
To answer this question, we need to define some logic in accordance to the guidelines above. The first thing we do, is scale the cereals to match the average number of calories per day, given the guidelines. In scenarios where two different numbers are proved for males and females, we will take the mean of those two numbers.
AllDayCereal<-cereal[,c(1,4:10,12,13)]
AllDayCereal$portions<-2550/AllDayCereal$calories
We continue applying the logic of the description above, to get an objective measurement for measuring nutritional value. To specify, our goal is to match the guideline above. If a cereal contains more than the recommended amount of a nutrient, then it would be considered less nutritional that a cereal that has the recommended amount. Below we apply some distance functions to measure each nutrient.
AllDayCereal[,c(2:10)]<-AllDayCereal[,c(2:10)]*AllDayCereal$portions
AllDayCereal$targetFiber<-c(27.5)
AllDayCereal$targetSugar <- c(0)
AllDayCereal$targetFat<- c(0)
AllDayCereal$targetPotein<- c(56.5)
AllDayCereal$targetCarbs<-c(2550-56.5)
AllDayCereal$targetSodium<- c(2100)
AllDayCereal$fiberMeasure<-abs(AllDayCereal$targetFiber-AllDayCereal$fiber)
AllDayCereal$sugarMeasure<-abs(AllDayCereal$targetSugar-AllDayCereal$sugars)
AllDayCereal$fatMeasure<-abs(AllDayCereal$targetFat-AllDayCereal$fat)
AllDayCereal$proteinMeasure<-abs(AllDayCereal$targetPotein-AllDayCereal$protein)
AllDayCereal$carbsMeasure<-abs(AllDayCereal$targetCarbs-AllDayCereal$carbo)
AllDayCereal$sodiumMeasure<-abs(AllDayCereal$targetSodium-AllDayCereal$sodium)
AllDayCereal$nutrients<-AllDayCereal$fiberMeasure+AllDayCereal$sugarMeasure+AllDayCereal$fatMeasure+AllDayCereal$proteinMeasure+AllDayCereal$carbsMeasure+AllDayCereal$sodiumMeasure-AllDayCereal$vitamins
After compiling the numbers, we take the sum of the absolute value of the differences in recommended vs actual nutrients for each cereal. This is what we will use to see if a cereal is significantly better. Values closest to 0, will be the most in line with the nutritional recommendations above.
boxplot(AllDayCereal$nutrients, col=("gold"),
main="Nutrients", ylab="Nutrient Rating")
From the boxplot we can determine that is unlikely that a cereal performs significantly better, but we can see that there is a cereal that is way off the mark. We can find out which one that is. First we check for normality in the results.
shapiro.test(AllDayCereal$nutrients)
Shapiro-Wilk normality test
data: AllDayCereal$nutrients
W = 0.96091, p-value = 0.03826
qqnorm(AllDayCereal$nutrients)
The Shapiro-Wilk normality test could be a warning sign here, but looking at the qq-plot, it seems safe to assume our data is, for the most part, normally distributed.
We then need to evaluate the mean and standard deviation to determine if a cereal falls outside of the confidence intervals.
x<-mean(AllDayCereal$nutrients)
deviation<-sd(AllDayCereal$nutrients)
Lbound<-x-(2*deviation)
Ubound<-x+(2*deviation)
Results<-AllDayCereal[AllDayCereal$nutrients<Lbound,c(1,24)]
Results
We did not return any results, which means that we have failed to reject the null hypothesis and we cannot determine if one cereal is better than the rest in terms of nutritional values. We can still see which brand did the worst.
Results2<-AllDayCereal[AllDayCereal$nutrients>Ubound,c(1,24)]
Results2
Surprisingly All-Bran performed the worst, and is an outlier in our logical solution. This is a result of having much higher than recommended nutrients at scale.
To evaluate this question, we will need to determine if there is a correlation between the two factors. Before we can begin, we need to break up our data to look at these relationships for each manufacturer.
Ho = There is not a relationship between sugars and weight by manufacturers.
Ha = There is a relationship between sugars and weight for at least one manufacturer.
Before we get started, we take a look at the relationship in a graphical format, which can be seen below.
Next, we can determine if there is a relationship at a manufacturer by runnign a correlation test. Below we break up our data by manufacturer and perform the analysis.
unique(cereal$mfr)
[1] N K G R P Q
Levels: G K N P Q R
N<-cereal[cereal$mfr=="N",c(10,14)]
K<-cereal[cereal$mfr=="K",c(10,14)]
G<-cereal[cereal$mfr=="G",c(10,14)]
R<-cereal[cereal$mfr=="R",c(10,14)]
P<-cereal[cereal$mfr=="P",c(10,14)]
Q<-cereal[cereal$mfr=="Q",c(10,14)]
cor.test(N$sugars,N$weight)
the standard deviation is zero
Pearson's product-moment correlation
data: N$sugars and N$weight
t = NA, df = 1, p-value = NA
alternative hypothesis: true correlation is not equal to 0
sample estimates:
cor
NA
cor.test(K$sugars,K$weight)
Pearson's product-moment correlation
data: K$sugars and K$weight
t = 1.7024, df = 19, p-value = 0.105
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.08054844 0.68750690
sample estimates:
cor
0.3637879
cor.test(G$sugars,G$weight)
Pearson's product-moment correlation
data: G$sugars and G$weight
t = 1.5607, df = 20, p-value = 0.1343
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.1069780 0.6594864
sample estimates:
cor
0.3294914
cor.test(R$sugars,R$weight)
the standard deviation is zero
Pearson's product-moment correlation
data: R$sugars and R$weight
t = NA, df = 3, p-value = NA
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
NA NA
sample estimates:
cor
NA
cor.test(P$sugars,P$weight)
Pearson's product-moment correlation
data: P$sugars and P$weight
t = 1.2535, df = 7, p-value = 0.2503
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.3297336 0.8504494
sample estimates:
cor
0.4281446
cor.test(Q$sugars,Q$weight)
Pearson's product-moment correlation
data: Q$sugars and Q$weight
t = 2.4445, df = 3, p-value = 0.09213
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.2367708 0.9874004
sample estimates:
cor
0.8159417
There is a correlation for manufacturer Q, but the p-value is not sufficient enough to prove that there is a correlation. As a result, we do not reject the null hypothesis and accept that there is no relationship between sugar and weight by manufacturer.