Lab_6_Assignment

Author

Micah Lohr

Essentials

1.

Load PalmerPenguins and other necessary packages

library(tidyverse) 
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.3.0      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(ggsci) 
library(patchwork)  
library(performance) 
library(see)

Attaching package: 'see'

The following objects are masked from 'package:ggsci':

    scale_color_material, scale_colour_material, scale_fill_material
library(rstatix)

Attaching package: 'rstatix'

The following object is masked from 'package:stats':

    filter
library(car) 
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some
library(palmerpenguins)

penguins<- palmerpenguins::penguins %>% 
drop_na()

2.

Using the penguins data, perform a 1-way ANOVA involving the effect of a categorical variable (x) on a numerical variable (y). Group and filter the data (remove NA, for example), calculate means and error, then make a graph. Pair that graph with an ANOVA test. Use the graph + statistical test to assess the null hypothesis of ANOVA.

pengaov <- aov(body_mass_g~island, data=penguins)
summary(pengaov)
             Df    Sum Sq  Mean Sq F value Pr(>F)    
island        2  83740680 41870340   105.1 <2e-16 ***
Residuals   330 131518986   398542                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
meanmass<-penguins %>% 
group_by(island) %>% 
summarize(meanmass = mean(body_mass_g), sd=sd(body_mass_g),n=n(),se=sd/sqrt(n)) 

meanmass
# A tibble: 3 × 5
  island    meanmass    sd     n    se
  <fct>        <dbl> <dbl> <int> <dbl>
1 Biscoe       4719.  791.   163  61.9
2 Dream        3719.  413.   123  37.2
3 Torgersen    3709.  452.    47  65.9
ggplot(data=meanmass, aes(x=island, y=meanmass, color=island))+
  geom_point()+
  geom_errorbar(data=meanmass,aes(x=island,ymin=meanmass-se,ymax=meanmass+se, width=0.2))+
  theme_classic()

H0: no difference in the average body masses between the islands HA: there is some difference in the body masses between the islands alpha = 0.05

Our H0 would be that there is no difference in the average body masses of penguins between islands. The graph visually indicates that we might be able to reject the null because of the difference between the average body mass on Biscoe compared to the other two islands. The ANOVA further confirms this: the computed p-value is less than 0.05, and the F value is 110. This tells us that there is some statistically significant difference between average body masses among the island, though we can’t say specifically between which ones without a posthoc test.

3.

Test your assumptions (individually– do not use check_model) and interpret your assumption checks 1. Independence We cannot really tell if there is independence without knowing more about the experimental design for this data set. That said, I will continue anyways as this is a lab assignment. Also I trust palmerpenguins with my life.

  1. Outliers
#penguins$body_mass_g=as.numeric(penguins$body_mass_g)

penguins %>% 
  group_by(island) %>%
  identify_outliers(body_mass_g)
# A tibble: 1 × 10
  island species   bill_le…¹ bill_…² flipp…³ body_…⁴ sex    year is.ou…⁵ is.ex…⁶
  <fct>  <fct>         <dbl>   <dbl>   <int>   <int> <fct> <int> <lgl>   <lgl>  
1 Dream  Chinstrap        52    20.7     210    4800 male   2008 TRUE    FALSE  
# … with abbreviated variable names ¹​bill_length_mm, ²​bill_depth_mm,
#   ³​flipper_length_mm, ⁴​body_mass_g, ⁵​is.outlier, ⁶​is.extreme
ggplot(data=penguins, aes(x=island, y=body_mass_g, fill=island))+
  geom_boxplot()+
  theme_classic()

The outlier test and the graph indicate that there is an outlier. However, I will keep this in the data because the mass of the penguin is not anything particularly crazy for a penguin. It is around the same as the median on Biscoe island, which tells me it is not an outlandish mass for the penguin. You go chunky penguin you’re my hero.

  1. Normality
model1<-aov(body_mass_g~island, data=penguins)
shapiro_test(residuals(model1))
# A tibble: 1 × 3
  variable          statistic p.value
  <chr>                 <dbl>   <dbl>
1 residuals(model1)     0.986 0.00227
penguins %>% 
  group_by(island) %>%
  shapiro_test(body_mass_g)
# A tibble: 3 × 4
  island    variable    statistic       p
  <fct>     <chr>           <dbl>   <dbl>
1 Biscoe    body_mass_g     0.971 0.00159
2 Dream     body_mass_g     0.988 0.367  
3 Torgersen body_mass_g     0.973 0.341  

According to this test for normality, we cannot assume normality for the entire data set as p<0.05. However, the Shapiro-Wilk assessment for normality by groups tells us that we can assume normality for the Torgersen and Dream island data based on those p being >0.05, but we cannot assume this for the data from Biscoe. Despite the violation of this assumption, the dataset has a sufficiently large sample size, so I think that we can proceed regardless.

  1. Homogeneity of Variance
leveneTest(body_mass_g~island, data=penguins)
Levene's Test for Homogeneity of Variance (center = median)
       Df F value    Pr(>F)    
group   2  25.969 3.358e-11 ***
      330                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

According to the Levene test, we do not have homogeneity of variance with this data as indicated by p<0.05. Therefore, we should run a Welch test one-way ANOVA. See below for this.

welchpenguin<-welch_anova_test(body_mass_g~island, data=penguins) 
welchpenguin
# A tibble: 1 × 7
  .y.             n statistic   DFn   DFd        p method     
* <chr>       <int>     <dbl> <dbl> <dbl>    <dbl> <chr>      
1 body_mass_g   333      102.     2  137. 6.06e-28 Welch ANOVA

According to the Welch test, there is a statistically significant difference in body mass by island, as indicated by the p<0.05.

4.

Run a TukeyHSD test on your ANOVA and interpret the results. Make sure your comparisons are easily visible and comparable in your graph (above).

TukeyHSD(pengaov)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = body_mass_g ~ island, data = penguins)

$island
                       diff        lwr       upr    p adj
Dream-Biscoe     -1000.2693 -1177.7875 -822.7512 0.000000
Torgersen-Biscoe -1010.6611 -1256.7392 -764.5830 0.000000
Torgersen-Dream    -10.3918  -265.2678  244.4842 0.994933

According to the Tukey post-hoc test, there is a significant difference in average body mass between the penguins on the Dream and Biscoe islands (p-value=0.000), as well as between the Torgersen and Biscoe islands (p-value=0.000). There is no statistically significant difference between the average body masses for penguins on Torgersen and Dream islands (p-value=0.997). This is consistent with the graph that we made earlier!

Depth

Repeat 1-4 above using 2 or more explanatory variables from Palmer Penguins. Assess the effecs of multiple variable on a single numerical variable in the data frame. Make a graph or graphs (if needed). You will need to group, filter, and summarize your data to do this. Perform the necessary ANOVA and TukeyHSD test. Interpret your results (using the graph(s) and stats outputs). Check your assumptions and interpret your assumption check (you can do this individually or with check_model() if you can get the later to work with your ANOVA– it is not optimized for ANOVA and often does not work)

depthaov<- aov(body_mass_g~species*sex, data=penguins)
summary(depthaov)
             Df    Sum Sq  Mean Sq F value   Pr(>F)    
species       2 145190219 72595110 758.358  < 2e-16 ***
sex           1  37090262 37090262 387.460  < 2e-16 ***
species:sex   2   1676557   838278   8.757 0.000197 ***
Residuals   327  31302628    95727                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pengraph2<-penguins %>% 
group_by(species, sex) %>% 
summarize(mean=mean(body_mass_g), sd=sd(body_mass_g),n=n(),se=sd/sqrt(n)) 
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
ggplot(data=pengraph2, aes(x=species, y=mean, color=sex))+
  geom_point()+
  geom_errorbar(data=pengraph2,aes(x=species,ymin=mean-se,ymax=mean+se, width=0.2))+
  theme_classic()

H0: there is no effect of species on the average body mass, no effect of sex on average body mass, and no interactive effect of sex and species on average body mass

HA: there is an interactive effect of species and sex on average body mass

alpha = 0.05

TukeyHSD(depthaov)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = body_mass_g ~ species * sex, data = penguins)

$species
                       diff       lwr       upr     p adj
Chinstrap-Adelie   26.92385  -80.0258  133.8735 0.8241288
Gentoo-Adelie    1386.27259 1296.3070 1476.2382 0.0000000
Gentoo-Chinstrap 1359.34874 1248.6108 1470.0866 0.0000000

$sex
                diff      lwr      upr p adj
male-female 667.4577 600.7462 734.1692     0

$`species:sex`
                                     diff       lwr       upr     p adj
Chinstrap:female-Adelie:female   158.3703  -25.7874  342.5279 0.1376213
Gentoo:female-Adelie:female     1310.9058 1154.8934 1466.9181 0.0000000
Adelie:male-Adelie:female        674.6575  527.8486  821.4664 0.0000000
Chinstrap:male-Adelie:female     570.1350  385.9773  754.2926 0.0000000
Gentoo:male-Adelie:female       2116.0004 1962.1408 2269.8601 0.0000000
Gentoo:female-Chinstrap:female  1152.5355  960.9603 1344.1107 0.0000000
Adelie:male-Chinstrap:female     516.2873  332.1296  700.4449 0.0000000
Chinstrap:male-Chinstrap:female  411.7647  196.6479  626.8815 0.0000012
Gentoo:male-Chinstrap:female    1957.6302 1767.8040 2147.4564 0.0000000
Adelie:male-Gentoo:female       -636.2482 -792.2606 -480.2359 0.0000000
Chinstrap:male-Gentoo:female    -740.7708 -932.3460 -549.1956 0.0000000
Gentoo:male-Gentoo:female        805.0947  642.4300  967.7594 0.0000000
Chinstrap:male-Adelie:male      -104.5226 -288.6802   79.6351 0.5812048
Gentoo:male-Adelie:male         1441.3429 1287.4832 1595.2026 0.0000000
Gentoo:male-Chinstrap:male      1545.8655 1356.0392 1735.6917 0.0000000
check_model(depthaov)
Variable `Component` is not in your data frame :/

ggplot(data=penguins, aes(x=species, y=body_mass_g, fill=sex))+
  geom_boxplot()+
  theme_classic()

For independence, we cannot definitively say that these data are independent without knowing the experimental design of the dataset. We shall persist nonetheless. According to check_model, homogeneity of variance and normality/homodescascity appear to look ok. For the former, there is a bit of a wave in the line, but it is mostly ok. For normality, the dots fall along the line, indicating that this assumption is met. Finally, for outliers, the boxplot above indicates that there are two of them. However, I feel it is ok to move forward with this dataset because they represent one large and one small penguin that are still overall ok sizes for penguins to be. In other words, they do not appear to be the results of experimental error.

I did this wildy out of order so my interpretation is all going to be here.

According to the ANOVA, there is a significant effect of species and sex separately on body mass, as well as an interactive effect of species and sex on body mass (all p-values are <0.05). This tells us that there is at least one statistically significant difference in the body mass of different species. We cannot say specifically WHERE this difference lies until we run a posthoc test. Additionally, there is a statistically significant difference in the body mass of male and female penguins. Finally, there is a significant interactive effect of species and sex on body mass.

The Tukey HSD test indicates that the body masses are significantly different between Gentoo and Adelie (p-value=0.000) and Gentoo and Chinstrap penguins (p-value=0.000), respectively. There is not a significant effect on species for Chinstrap and Adelie penguins (p-value=0.824). In terms of the interactive effect, every specific interaction of species and sex has a significant effect on average body mass except for Chinstrap males vs Adelie males and Chinstrap females vs Adelie females. Given that the effect of species was not significant for this pairing of species, this result makes sense. These differences are visually represented in the graph above! It demonstrates differences between species, as well as within species as delineated by sex. While maybe a bit harder, it also visualizes the interactive effect.