Lab_6_Assignment

Author

Kathleen Strachota

Essentials

Load PalmerPenguins and other necessary packages

library(palmerpenguins)
library(tidyverse) 
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.1      ✔ purrr   1.0.1 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.3.0      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.1 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(ggsci) 
library(patchwork)
library(performance)
library(see)

Attaching package: 'see'

The following objects are masked from 'package:ggsci':

    scale_color_material, scale_colour_material, scale_fill_material
library(rstatix)

Attaching package: 'rstatix'

The following object is masked from 'package:stats':

    filter
library(car)
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

Using the penguins data, perform a 1-way ANOVA involving the effect of a categorical variable (x) on a numerical variable (y). Group and filter the data (remove NA, for example), calculate means and error, then make a graph. Pair that graph with an ANOVA test. Use the graph + statistical test to assess the null hypothesis of ANOVA.

penguin <- penguins %>%
  drop_na()

aov_penguins <- aov(body_mass_g ~ island, penguin)
summary(aov_penguins)
             Df    Sum Sq  Mean Sq F value Pr(>F)    
island        2  83740680 41870340   105.1 <2e-16 ***
Residuals   330 131518986   398542                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
penguin_graph <- penguin %>%
  group_by(island) %>%
  summarise(mean = mean(body_mass_g), sd = sd(body_mass_g), n = n(), se = sd/sqrt(n))

ggplot(data = penguin_graph, aes(x = island, y = mean, color = island)) +
  geom_point() +
  geom_errorbar(data = penguin_graph, aes(x = island, ymin = mean-se, ymax = mean+se), width = 0.2) +
  theme_classic() +
  labs(y = "Mean body mass", x = "Island") +
  scale_color_npg()

\(H_0:\) There is no difference between the mean body masses found on each island. \(H_A:\) At least one island has a different mean body mass. \(\alpha = 0.05\) The F-statistic is 105.1 and the p-value is less than 2e-16 which is less than our significance threshold (\(\alpha = 0.05\)). There is extremely strong evidence against the null hypothesis. We would conclude that at least one island has a mean body mass that has a statistically significant difference than the others.

Test your assumptions (individually– do not use check_model) and interpret your assumption checks

Independence

We don’t know much about the experimental design so I cannot say anything concrete regarding whether or not we will be violating the independence assumption. That being said, I will continue to run this test because it is a lab assignment.

Outliers

ggplot(data = penguin, aes(x = island, y = body_mass_g, fill = island)) +
  geom_boxplot() +
  theme_classic() +
  scale_fill_npg()

penguin %>%
  group_by(island) %>%
  identify_outliers(body_mass_g)
# A tibble: 1 × 10
  island species bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
  <fct>  <fct>            <dbl>         <dbl>            <int>       <int> <fct>
1 Dream  Chinst…             52          20.7              210        4800 male 
# … with 3 more variables: year <int>, is.outlier <lgl>, is.extreme <lgl>

There is one outlier: a penguin found on Dream island who has a higher body mass than the others. However, his/her body mass is approximately the same as the median on Biscoe Island and it is possible that this is just a large penguin in comparison to the rest on the island. Therefore, I will keep the single outlier.

Normality

shapiro_test(residuals(aov_penguins))
# A tibble: 1 × 3
  variable                statistic p.value
  <chr>                       <dbl>   <dbl>
1 residuals(aov_penguins)     0.986 0.00227
penguin %>%
  group_by(island) %>%
  shapiro_test(body_mass_g)
# A tibble: 3 × 4
  island    variable    statistic       p
  <fct>     <chr>           <dbl>   <dbl>
1 Biscoe    body_mass_g     0.971 0.00159
2 Dream     body_mass_g     0.988 0.367  
3 Torgersen body_mass_g     0.973 0.341  

According to these tests, we cannot assume normality for the whole dataset as the p-value is less than 0.05. The Shapiro-Wilk test for each group indicates that we can assume normality for the Torgersen and Dream island data since the p-values are higher than 0.05 but we cannot assume normality for Biscoe island because the p-value is less than 0.05. However, the sample size for this dataset is 276 which is sufficiently large so we can run an ANOVA and TukeyHSD even though the assumption regarding normality is violated.

Homoscedaticity

leveneTest(body_mass_g ~ island, data = penguin)
Levene's Test for Homogeneity of Variance (center = median)
       Df F value    Pr(>F)    
group   2  25.969 3.358e-11 ***
      330                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

According to the levene test, we do not have homogeneity of variance because we have a p-value of 3.358e-11 which is smaller than 0.05. So we should run a welch test one-way anova.

welchpenguin <- welch_anova_test(body_mass_g ~ island, data = penguin)
welchpenguin
# A tibble: 1 × 7
  .y.             n statistic   DFn   DFd        p method     
* <chr>       <int>     <dbl> <dbl> <dbl>    <dbl> <chr>      
1 body_mass_g   333      102.     2  137. 6.06e-28 Welch ANOVA

The p-value is less than 0.05 so we reject the null hypothesis and conclude that there is a significant difference of mean body masses between islands even when using a test that does not require homogeneity of variance.

Run a TukeyHSD test on your ANOVA and interpret the results. Make sure your comparisons are easily visible and comparable in your graph (above).

TukeyHSD(aov_penguins)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = body_mass_g ~ island, data = penguin)

$island
                       diff        lwr       upr    p adj
Dream-Biscoe     -1000.2693 -1177.7875 -822.7512 0.000000
Torgersen-Biscoe -1010.6611 -1256.7392 -764.5830 0.000000
Torgersen-Dream    -10.3918  -265.2678  244.4842 0.994933

According to the TukeyHSD, there is a statistically significant difference between the mean body mass found on Dream and Biscoe islands (p-value = 0.00) and between Torgersen and Bisoce islands (p-value = 0.00). There is not a statistically significant difference between the amen body mass found on Torgersen and Dream islands (p-value = 0.99). This is consistent with our graph from above where visually the mean body mass for Torgersen island is much higher than the other two.

Depth

Repeat 1-4 above using 2 or more explanatory variables from Palmer Penguins. Assess the effecs of multiple variable on a single numerical variable in the data frame. Make a graph or graphs (if needed). You will need to group, filter, and summarize your data to do this. Perform the necessary ANOVA and TukeyHSD test. Interpret your results (using the graph(s) and stats outputs). Check your assumptions and interpret your assumption check (you can do this individually or with check_model() if you can get the later to work with your ANOVA– it is not optimized for ANOVA and often does not work)

penguin_graph2 <- penguin %>%
  group_by(species, sex) %>%
  summarise(mean = mean(body_mass_g), sd = sd(body_mass_g), n = n(), se = sd/sqrt(n))
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
ggplot(data = penguin_graph2, aes(x = species, y = mean, color = sex)) +
  geom_point() +
  geom_errorbar(data = penguin_graph2, aes(x = species, ymin = mean-se, ymax = mean+se), width = 0.2) +
  theme_classic() +
  labs(y = "Mean body mass", x = "Species") +
  scale_color_npg()

aov_penmss <- aov(body_mass_g ~ species*sex, penguin)
check_model(aov_penmss)
Variable `Component` is not in your data frame :/

ggplot(data = penguin, aes(x = species, y = body_mass_g, fill = sex)) +
  geom_boxplot() +
  theme_classic() +
  scale_fill_npg()

penguin %>%
  group_by(species, sex) %>%
  identify_outliers(body_mass_g)
# A tibble: 2 × 10
  species sex   island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
  <fct>   <fct> <fct>           <dbl>         <dbl>            <int>       <int>
1 Chinst… fema… Dream            46.9          16.6              192        2700
2 Chinst… male  Dream            52            20.7              210        4800
# … with 3 more variables: year <int>, is.outlier <lgl>, is.extreme <lgl>

According to these graphs the assumption of normality and homoscedasticity are both met. There are two outliers but I think that it is okay in this case because it is probably just one larger male Chinstrap and one smaller female Chinstrap. In terms of the assumption of independence, we still don’t know much about the experimental design so we cannot make a definite conclusion but I will continue with this anova anyways.

summary(aov_penmss)
             Df    Sum Sq  Mean Sq F value   Pr(>F)    
species       2 145190219 72595110 758.358  < 2e-16 ***
sex           1  37090262 37090262 387.460  < 2e-16 ***
species:sex   2   1676557   838278   8.757 0.000197 ***
Residuals   327  31302628    95727                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\(H_0:\) There is no interactive effect of species and sex on mean body mass, or an effect of species or sex on mean body mass.

\(H_A:\) There is an interactive effect between species and sex on mean body mass.

\(\alpha = 0.05\)

According to this ANOVA, at least one species has a significantly different mean body mass (p-value <2e-16) and male and female penguins have a significantly different mean body masses (p-value <2e-16). There is a significant interactive effect of species and sex on mean body mass (p-value = 0.000197). Therefore we reject the null hypothesis that says

TukeyHSD(aov_penmss)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = body_mass_g ~ species * sex, data = penguin)

$species
                       diff       lwr       upr     p adj
Chinstrap-Adelie   26.92385  -80.0258  133.8735 0.8241288
Gentoo-Adelie    1386.27259 1296.3070 1476.2382 0.0000000
Gentoo-Chinstrap 1359.34874 1248.6108 1470.0866 0.0000000

$sex
                diff      lwr      upr p adj
male-female 667.4577 600.7462 734.1692     0

$`species:sex`
                                     diff       lwr       upr     p adj
Chinstrap:female-Adelie:female   158.3703  -25.7874  342.5279 0.1376213
Gentoo:female-Adelie:female     1310.9058 1154.8934 1466.9181 0.0000000
Adelie:male-Adelie:female        674.6575  527.8486  821.4664 0.0000000
Chinstrap:male-Adelie:female     570.1350  385.9773  754.2926 0.0000000
Gentoo:male-Adelie:female       2116.0004 1962.1408 2269.8601 0.0000000
Gentoo:female-Chinstrap:female  1152.5355  960.9603 1344.1107 0.0000000
Adelie:male-Chinstrap:female     516.2873  332.1296  700.4449 0.0000000
Chinstrap:male-Chinstrap:female  411.7647  196.6479  626.8815 0.0000012
Gentoo:male-Chinstrap:female    1957.6302 1767.8040 2147.4564 0.0000000
Adelie:male-Gentoo:female       -636.2482 -792.2606 -480.2359 0.0000000
Chinstrap:male-Gentoo:female    -740.7708 -932.3460 -549.1956 0.0000000
Gentoo:male-Gentoo:female        805.0947  642.4300  967.7594 0.0000000
Chinstrap:male-Adelie:male      -104.5226 -288.6802   79.6351 0.5812048
Gentoo:male-Adelie:male         1441.3429 1287.4832 1595.2026 0.0000000
Gentoo:male-Chinstrap:male      1545.8655 1356.0392 1735.6917 0.0000000

The Tukey Post-Hoc Honest Significant Difference indicates that there is a significant difference in mean body mass between male and female penguins (p-value = 0) which is consistent with the ANOVA test. There is not a significant difference in mean body mass between Chinstrap and Adelie (p-value = 0.824) which follows what the graph shows us visually but there is a significant difference in mean body mass between Gentoo and Adelie penguins (p= 0.00) and between Gentoo and Chinstrap (p=0.00). There is a significant interactive effects between sex and species on mean body mass for all groups except Chinstrap and Adelie males (p-value = 0.5812) and Chinstrap and Adelie females (p-value = 0.1376). This makes sense because when one looks at the effect of species alone on mean body mass, there is not a significant difference between Chinstrap and Adelie. Additionally, in the graph, we can see that male Chinstrap and male Adelie appear to have similar mean body masses and the same is true for female Chinstrap and Adelie penguins.