Load the data, check the variables, and find those that you will need for this assignment. Optional: Subset the dataset to only retain the variables that you need for this analysis.
library (dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library (sjstats)
## Warning: package 'sjstats' was built under R version 4.0.3
library (car)
## Warning: package 'car' was built under R version 4.0.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.0.3
## Registered S3 methods overwritten by 'car':
## method from
## influence.merMod lme4
## cooks.distance.influence.merMod lme4
## dfbeta.influence.merMod lme4
## dfbetas.influence.merMod lme4
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library (pwr)
## Warning: package 'pwr' was built under R version 4.0.3
df<-read.csv("Cereal.csv")
ls(df)
## [1] "Calories" "Carbs" "Company" "Fat" "Fiber" "Name"
## [7] "Protein" "Serving" "Sodium" "Sugars"
attach(df)
dim(df)
## [1] 30 10
names(df)
## [1] "Name" "Company" "Serving" "Calories" "Fat" "Sodium"
## [7] "Carbs" "Fiber" "Sugars" "Protein"
str(df)
## 'data.frame': 30 obs. of 10 variables:
## $ Name : chr "AppleJacks" "Boo Berry" "Cap'n Crunch" "Cinnamon Toast Crunch" ...
## $ Company : chr "K" "G" "Q" "G" ...
## $ Serving : num 1 1 0.75 0.75 1 1 1 1 1 1 ...
## $ Calories: int 117 118 144 169 130 117 117 101 117 113 ...
## $ Fat : num 0.6 0.8 2.1 4.4 1.2 1 0.9 0.1 0.2 0.3 ...
## $ Sodium : int 143 211 269 408 135 171 178 202 120 229 ...
## $ Carbs : int 27 27 31 32 29 26 26 24 28 26 ...
## $ Fiber : num 0.5 0.1 1.1 1.7 0.8 0.8 0.5 0.8 0.3 0.1 ...
## $ Sugars : num 15 14 16 13.3 16 14 13 3 15 3 ...
## $ Protein : num 1 1 1.3 2.7 1 1 1 2 1 2 ...
head(df)
summary(df$Fiber)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.575 0.950 1.797 2.800 7.300
sd(df$Fiber)
## [1] 1.879744
tapply (Fiber, Company, mean)
## G K Q
## 1.469231 1.763636 2.566667
tapply (Fiber, Company, sd)
## G K Q
## 1.247921 2.349158 2.174090
tapply (Fiber, Company, summary)
## $G
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.500 1.300 1.469 2.800 3.700
##
## $K
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.400 0.800 1.764 1.900 7.300
##
## $Q
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.800 0.950 1.900 2.567 3.300 6.400
group_by(df, Company) %>%
summarise(
count = n(),
mean = mean(Fiber, na.rm = TRUE),
sd = sd(Fiber, na.rm = TRUE)
)
## `summarise()` ungrouping output (override with `.groups` argument)
boxplot (Fiber ~ Company)
Looking at the side-by-side box plots, there looks to be a very slight difference in the means of grams of Fiber per serving between the 3 companies. G has a mean of 1.4, K has a mean of 1.8 and Q has a mean of 2.6.
We are checking the normality of treatment groups.
tapply (Fiber, Company, qqnorm)
## $G
## $G$x
## [1] -1.1983797 0.3957253 -0.3957253 -0.6151411 0.0000000 0.6151411
## [7] -0.1940281 0.1940281 0.8694238 -1.7688250 -0.8694238 1.7688250
## [13] 1.1983797
##
## $G$y
## [1] 0.1 1.7 0.8 0.5 1.3 2.8 0.8 1.5 2.8 0.0 0.1 3.7 3.0
##
##
## $K
## $K$x
## [1] -0.4727891 -0.2298841 -0.7478586 -1.0968036 0.0000000 1.0968036
## [7] 0.4727891 1.6906216 -1.6906216 0.7478586 0.2298841
##
## $K$y
## [1] 0.5 0.8 0.3 0.1 0.8 5.0 1.0 7.3 0.0 2.8 0.8
##
##
## $Q
## $Q$x
## [1] -0.2018935 -1.2815516 1.2815516 0.6433454 -0.6433454 0.2018935
##
## $Q$y
## [1] 1.1 0.8 6.4 3.5 0.9 2.7
With the leveneTest, we check the homogeneity of variance. The null is no difference in variances across groups.
leveneTest(Fiber ~ Company, data = df)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
One limitation is that not all the observations have the minumim 10 observations, so we need to use the average instead. Another limiation is that the data is not normally distrubuted for the normal distribution, which will change the data.
Run one-way ANOVA for randomly designed experiment
Ho: muG_Fiber = muK_Fiber = muQ_Fiber, all the same Ha: mu≠mu, at least one difference
aov.all <- aov(Fiber ~ Company, data = df)
summary (aov.all)
## Df Sum Sq Mean Sq F value Pr(>F)
## Company 2 4.96 2.482 0.687 0.512
## Residuals 27 97.51 3.611
The ANOVA table test is used to compares the variability between groups to the variability within groups, looking at the results we can see that there is slight variability within the companies but not much between the companies.
TukeyHSD(aov.all)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Fiber ~ Company, data = df)
##
## $Company
## diff lwr upr p adj
## K-G 0.2944056 -1.635883 2.224694 0.9244212
## Q-G 1.0974359 -1.228052 3.422924 0.4807525
## Q-K 0.8030303 -1.588286 3.194347 0.6863728
The two-way comparison gives the difference between the possibilities of each other. If the lower and upper interval contains it has a 0, there really is no difference. This is also supported from large p-values. So looking at K-G comparison, the CI is (-1.64, 2.22), which contains a 0 and looking at the large p-value it means there is no difference. Looking at Q-G comparison, the CI is (-1.22, 3.42), which contains a 0 and looking at the large p-value it means there is no difference. Looking at Q-K comparison, the CI is (-1.59, 3.19), which contains a 0 and looking at the large p-value it means there is no difference.
eta_sq(aov.all)
pwr.anova.test(k=3, sig.level=0.05, f=.048, n=10, power=NULL)
##
## Balanced one-way analysis of variance power calculation
##
## k = 3
## n = 10
## f = 0.048
## sig.level = 0.05
## power = 0.05467764
##
## NOTE: n is number in each group
10 is used for n because we take the average of the observations because not all companies had 10 observations.
pwr.anova.test(k=3, sig.level=0.05, f=.048, n=NULL, power=.8)
##
## Balanced one-way analysis of variance power calculation
##
## k = 3
## n = 1394.907
## f = 0.048
## sig.level = 0.05
## power = 0.8
##
## NOTE: n is number in each group
One thing to improve the power. If more observations are added the n value will increase and in turn it will make the power increase as well.
kruskal.test(Fiber ~ Company, data = df)
##
## Kruskal-Wallis rank sum test
##
## data: Fiber by Company
## Kruskal-Wallis chi-squared = 2.0418, df = 2, p-value = 0.3603
A non-parametric test is the better option for this data because from the output of the Kruskal-Wallis test, we know that there is a significant difference between groups, we just need to find the pairs of groups that have this significant difference. Another reason is because not all conditions are met.