INST 314 - Week 12 Lab 5

Load the data, check the variables, and find those that you will need for this assignment. Optional: Subset the dataset to only retain the variables that you need for this analysis.

library (dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library (sjstats)

## Warning: package 'sjstats' was built under R version 4.0.3

library (car)

## Warning: package 'car' was built under R version 4.0.3

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.0.3

## Registered S3 methods overwritten by 'car':
##   method                          from
##   influence.merMod                lme4
##   cooks.distance.influence.merMod lme4
##   dfbeta.influence.merMod         lme4
##   dfbetas.influence.merMod        lme4

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

library (pwr)

## Warning: package 'pwr' was built under R version 4.0.3

df<-read.csv("Cereal.csv")
ls(df)

##  [1] "Calories" "Carbs"    "Company"  "Fat"      "Fiber"    "Name"    
##  [7] "Protein"  "Serving"  "Sodium"   "Sugars"

Summarize your data.

Create a table of summary statistics, showing the sample size, mean, and standard deviation for each company.

attach(df)
dim(df)

## [1] 30 10

names(df)

##  [1] "Name"     "Company"  "Serving"  "Calories" "Fat"      "Sodium"  
##  [7] "Carbs"    "Fiber"    "Sugars"   "Protein"

str(df)

## 'data.frame':    30 obs. of  10 variables:
##  $ Name    : chr  "AppleJacks" "Boo Berry" "Cap'n Crunch" "Cinnamon Toast Crunch" ...
##  $ Company : chr  "K" "G" "Q" "G" ...
##  $ Serving : num  1 1 0.75 0.75 1 1 1 1 1 1 ...
##  $ Calories: int  117 118 144 169 130 117 117 101 117 113 ...
##  $ Fat     : num  0.6 0.8 2.1 4.4 1.2 1 0.9 0.1 0.2 0.3 ...
##  $ Sodium  : int  143 211 269 408 135 171 178 202 120 229 ...
##  $ Carbs   : int  27 27 31 32 29 26 26 24 28 26 ...
##  $ Fiber   : num  0.5 0.1 1.1 1.7 0.8 0.8 0.5 0.8 0.3 0.1 ...
##  $ Sugars  : num  15 14 16 13.3 16 14 13 3 15 3 ...
##  $ Protein : num  1 1 1.3 2.7 1 1 1 2 1 2 ...

head(df)

summary(df$Fiber)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.575   0.950   1.797   2.800   7.300

sd(df$Fiber)

## [1] 1.879744

tapply (Fiber, Company, mean)

##        G        K        Q 
## 1.469231 1.763636 2.566667

tapply (Fiber, Company, sd)

##        G        K        Q 
## 1.247921 2.349158 2.174090

tapply (Fiber, Company, summary)

## $G
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.500   1.300   1.469   2.800   3.700 
## 
## $K
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.400   0.800   1.764   1.900   7.300 
## 
## $Q
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.800   0.950   1.900   2.567   3.300   6.400

group_by(df, Company) %>%
  summarise(
    count = n(),
    mean = mean(Fiber, na.rm = TRUE),
    sd = sd(Fiber, na.rm = TRUE)
  )

## `summarise()` ungrouping output (override with `.groups` argument)

Create a plot of side-by-side boxplots. Comment on whether you are seeing any potential differences in the mean number of grams of fiber per serving, based on the three different companies.

boxplot (Fiber ~ Company)

Looking at the side-by-side box plots, there looks to be a very slight difference in the means of grams of Fiber per serving between the 3 companies. G has a mean of 1.4, K has a mean of 1.8 and Q has a mean of 2.6.

Check to see if the conditions for ANOVA are met. Interpret your results in the context of this problem, using complete sentence(s). State any limitations to using the traditional ANOVA test.

We are checking the normality of treatment groups.

tapply (Fiber, Company, qqnorm)

## $G
## $G$x
##  [1] -1.1983797  0.3957253 -0.3957253 -0.6151411  0.0000000  0.6151411
##  [7] -0.1940281  0.1940281  0.8694238 -1.7688250 -0.8694238  1.7688250
## [13]  1.1983797
## 
## $G$y
##  [1] 0.1 1.7 0.8 0.5 1.3 2.8 0.8 1.5 2.8 0.0 0.1 3.7 3.0
## 
## 
## $K
## $K$x
##  [1] -0.4727891 -0.2298841 -0.7478586 -1.0968036  0.0000000  1.0968036
##  [7]  0.4727891  1.6906216 -1.6906216  0.7478586  0.2298841
## 
## $K$y
##  [1] 0.5 0.8 0.3 0.1 0.8 5.0 1.0 7.3 0.0 2.8 0.8
## 
## 
## $Q
## $Q$x
## [1] -0.2018935 -1.2815516  1.2815516  0.6433454 -0.6433454  0.2018935
## 
## $Q$y
## [1] 1.1 0.8 6.4 3.5 0.9 2.7

With the leveneTest, we check the homogeneity of variance. The null is no difference in variances across groups.

leveneTest(Fiber ~ Company, data = df)

## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.

One limitation is that not all the observations have the minumim 10 observations, so we need to use the average instead. Another limiation is that the data is not normally distrubuted for the normal distribution, which will change the data.

Run one-way ANOVA for randomly designed experiment

State the hypotheses for this ANOVA using either words or symbols.

Ho: muG_Fiber = muK_Fiber = muQ_Fiber, all the same Ha: mu≠mu, at least one difference

Perform the ANOVA and two way comparisons.

Interpret the results of your ANOVA table in the context of this problem, using complete sentence(s).

aov.all <- aov(Fiber ~ Company, data = df)
summary (aov.all)

##             Df Sum Sq Mean Sq F value Pr(>F)
## Company      2   4.96   2.482   0.687  0.512
## Residuals   27  97.51   3.611

The ANOVA table test is used to compares the variability between groups to the variability within groups, looking at the results we can see that there is slight variability within the companies but not much between the companies.

Interpret the results of your two-way comparisons in the context of this problem, using complete sentence(s).

TukeyHSD(aov.all)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Fiber ~ Company, data = df)
## 
## $Company
##          diff       lwr      upr     p adj
## K-G 0.2944056 -1.635883 2.224694 0.9244212
## Q-G 1.0974359 -1.228052 3.422924 0.4807525
## Q-K 0.8030303 -1.588286 3.194347 0.6863728

The two-way comparison gives the difference between the possibilities of each other. If the lower and upper interval contains it has a 0, there really is no difference. This is also supported from large p-values. So looking at K-G comparison, the CI is (-1.64, 2.22), which contains a 0 and looking at the large p-value it means there is no difference. Looking at Q-G comparison, the CI is (-1.22, 3.42), which contains a 0 and looking at the large p-value it means there is no difference. Looking at Q-K comparison, the CI is (-1.59, 3.19), which contains a 0 and looking at the large p-value it means there is no difference.

Perform an analysis of the power of this test. Comment on the power of this test. What could be done to improve the power? What other limitations, if any, do you note in this test?

eta_sq(aov.all)

pwr.anova.test(k=3, sig.level=0.05, f=.048, n=10, power=NULL)

## 
##      Balanced one-way analysis of variance power calculation 
## 
##               k = 3
##               n = 10
##               f = 0.048
##       sig.level = 0.05
##           power = 0.05467764
## 
## NOTE: n is number in each group

10 is used for n because we take the average of the observations because not all companies had 10 observations.

pwr.anova.test(k=3, sig.level=0.05, f=.048, n=NULL, power=.8)

## 
##      Balanced one-way analysis of variance power calculation 
## 
##               k = 3
##               n = 1394.907
##               f = 0.048
##       sig.level = 0.05
##           power = 0.8
## 
## NOTE: n is number in each group

One thing to improve the power. If more observations are added the n value will increase and in turn it will make the power increase as well.

Refer to your answers in #2 and 5. Comment on whether or not a non-parametric test would be a better option for this data. Why or why not? If needed, perform the non-parametric tests to compare the means, with pairwise comparisons.

kruskal.test(Fiber ~ Company, data = df)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Fiber by Company
## Kruskal-Wallis chi-squared = 2.0418, df = 2, p-value = 0.3603

A non-parametric test is the better option for this data because from the output of the Kruskal-Wallis test, we know that there is a significant difference between groups, we just need to find the pairs of groups that have this significant difference. Another reason is because not all conditions are met.

INST 314 - Week 12 Lab 5 - Fall 2020

11-20-20