Lab_6_Assignment

Author

DINAH MARION ABEJA

1.

Load PalmerPenguins and other necessary packages

library(tidyverse) #For data manipulation and graphing
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.2.1     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(ggsci) #for color scales
library(patchwork)  #to panel our graphs
library(performance) #check stat/model assumptions
library(see)

Attaching package: 'see'

The following objects are masked from 'package:ggsci':

    scale_color_material, scale_colour_material, scale_fill_material
library(rstatix) #test for outliers, welch_anova_test

Attaching package: 'rstatix'

The following object is masked from 'package:stats':

    filter
library(palmerpenguins)
library(car)
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some
library(Hmisc)
Loading required package: lattice
Loading required package: survival
Loading required package: Formula

Attaching package: 'Hmisc'

The following objects are masked from 'package:dplyr':

    src, summarize

The following objects are masked from 'package:base':

    format.pval, units

2.

Using the penguins data, perform a 1-way ANOVA involving the effect of a categorical variable (x) on a numerical variable (y). Group and filter the data (remove NA, for example), calculate means and error, then make a graph. Pair that graph with an ANOVA test. Use the graph + statistical test to assess the null hypothesis of ANOVA.

General notes: \(H_O\): \(\mu_1\) = \(\mu_2\) = \(\mu_3\) and

\(H_A\):at least one mean is different from the others.

penguins_2 <-penguins %>%
group_by(species) %>% 
  drop_na(body_mass_g)  
head(penguins_2)
# A tibble: 6 × 8
# Groups:   species [1]
  species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
  <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <fct> <int>
1 Adelie  Torgersen           39.1          18.7         181    3750 male   2007
2 Adelie  Torgersen           39.5          17.4         186    3800 fema…  2007
3 Adelie  Torgersen           40.3          18           195    3250 fema…  2007
4 Adelie  Torgersen           36.7          19.3         193    3450 fema…  2007
5 Adelie  Torgersen           39.3          20.6         190    3650 male   2007
6 Adelie  Torgersen           38.9          17.8         181    3625 fema…  2007
# … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g
penguin_i <- penguins_2 %>%
  group_by(species) %>% 
  drop_na(body_mass_g)%>%
  dplyr::summarize(mean_bm = mean(body_mass_g), sd = sd(body_mass_g),n=n(),se= sd/sqrt(n)) %>%
  as.data.frame()
head(penguin_i)
    species  mean_bm       sd   n       se
1    Adelie 3700.662 458.5661 151 37.31758
2 Chinstrap 3733.088 384.3351  68 46.60747
3    Gentoo 5076.016 504.1162 123 45.45463
p1<-ggplot(data = penguin_i, aes(x=species, y=mean_bm, color=species))+
  geom_point()+
  geom_jitter(data = penguins_2, aes(x = species, y = body_mass_g))+
  geom_errorbar(aes(x=species, ymin=mean_bm-se, ymax=mean_bm+se), width=0.2)+
  theme_classic()
p1

model_2 <- aov(body_mass_g~species, data=penguins_2)
summary(model_2)
             Df    Sum Sq  Mean Sq F value Pr(>F)    
species       2 146864214 73432107   343.6 <2e-16 ***
Residuals   339  72443483   213698                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value for species = 2e-16 < 0.05. there is statistical significant difference in the mean body mass of at least one species. The plot also shows that Gentoo has higher mean body mass as compared to the Chinstrap and Adelie

3.

Test your assumptions (individually– do not use check_model) and interpret your assumption checks

a) Outliers

p2<- ggplot(data = penguins_2, aes(x=species, y=body_mass_g, color = species), group = species)+
  geom_boxplot()+
    theme_classic()
p2

penguins_2$species=as.factor(penguins_2$species)

# identifying outliers using rstatix package 
penguins_2 %>%
  group_by(species) %>%
  identify_outliers(body_mass_g)
# A tibble: 2 × 10
  species   island bill_le…¹ bill_…² flipp…³ body_…⁴ sex    year is.ou…⁵ is.ex…⁶
  <fct>     <fct>      <dbl>   <dbl>   <int>   <int> <fct> <int> <lgl>   <lgl>  
1 Chinstrap Dream       52      20.7     210    4800 male   2008 TRUE    FALSE  
2 Chinstrap Dream       46.9    16.6     192    2700 fema…  2008 TRUE    FALSE  
# … with abbreviated variable names ¹​bill_length_mm, ²​bill_depth_mm,
#   ³​flipper_length_mm, ⁴​body_mass_g, ⁵​is.outlier, ⁶​is.extreme

There are two outliers all from Chinstrap on Dream Island. I did not remove these outliers because they are not extreme and thus most likely not very influential. Besides these two values might hold very meaningful information/relevance in the actual population

b) Homoscedasticity

model_2 <- aov(body_mass_g~species, data=penguins_2)
check_model(model_2)

leveneTest(body_mass_g~species, data=penguins_2)
Levene's Test for Homogeneity of Variance (center = median)
       Df F value   Pr(>F)   
group   2  5.1203 0.006445 **
      339                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p.value<0.05 and so we cannot assume equal variance. For p<0.05 indicates that there IS a significant difference in variances between the treatment groups. However, our sample size for individual species are big and so we can assume that this condition is met

c) Normality

shapiro_test(residuals(model_2))
# A tibble: 1 × 3
  variable           statistic p.value
  <chr>                  <dbl>   <dbl>
1 residuals(model_2)     0.992  0.0512
penguins_2 %>%
  group_by(species) %>%
  shapiro_test(body_mass_g)
# A tibble: 3 × 4
  species   variable    statistic      p
  <fct>     <chr>           <dbl>  <dbl>
1 Adelie    body_mass_g     0.981 0.0324
2 Chinstrap body_mass_g     0.984 0.561 
3 Gentoo    body_mass_g     0.986 0.234 
p3<-ggplot(data = penguins_2, mapping =aes(x=body_mass_g, color = species))+
  geom_density()+
  theme_classic() +
  facet_wrap(~ species)+
  scale_color_aaas()
p3

Since p.value = 0.05118 > 0.05, then normality CAN be assumed. From the normality tests for each groups, we see that the body mass for Gentoo and Chinstrap (p values are greater than 0.05) but Adelie’s p value = 0.032 is less than 0.05. The Histogram plots show that the distributions of body mass for the species are normal except for the Gentoo. This is the best we can get with normality and so we can say that this assumption is met for all species

summary(model_2)
             Df    Sum Sq  Mean Sq F value Pr(>F)    
species       2 146864214 73432107   343.6 <2e-16 ***
Residuals   339  72443483   213698                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

d) Independence

We do not have enough information about the experimental design however, the sample sizes of all the species studied are fairly large and so we can assume that this condition is met.

4.

Run a TukeyHSD test on your ANOVA and interpret the results. Make sure your comparisons are easily visible and comparable in your graph (above).

model_3 <- aov(body_mass_g~species, data=penguins_2)
TukeyHSD(model_3)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = body_mass_g ~ species, data = penguins_2)

$species
                       diff       lwr       upr     p adj
Chinstrap-Adelie   32.42598 -126.5002  191.3522 0.8806666
Gentoo-Adelie    1375.35401 1243.1786 1507.5294 0.0000000
Gentoo-Chinstrap 1342.92802 1178.4810 1507.3750 0.0000000

The Tukey test above shows that there is significant differences between only two groups: Gentoo-Adelie and Gentoo-Chinstrap. Overall Gentoo has the highest body mass than the other species. The difference in the body mass of Adelie and Chinstrap is not significant and this is supported by the first plot

Depth

Similar plots as essentials!

penguin_tw <- penguins 
  fit1<-aov(body_mass_g~species*island, data=penguin_tw)
  summary(fit1)
             Df    Sum Sq  Mean Sq F value Pr(>F)    
species       2 146864214 73432107 341.663 <2e-16 ***
island        2     13655     6827   0.032  0.969    
Residuals   337  72429829   214925                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2 observations deleted due to missingness

Assumptions

  • Independence

We do not have enough information about the experimental design however, the sample sizes of all the species studied are fairly large and so we can assume that this condition is met.

  • outliers
penguin_tw <- penguins %>%
  drop_na(body_mass_g, island, island) %>% 
  group_by("species","island", "body_mass_g") 

p6<- ggplot(data = penguin_tw, aes(x=species, y=body_mass_g, color =species)) +
  geom_boxplot() +
  theme_classic()
p6

penguin_tw %>% 
  group_by(species, island) %>%
  identify_outliers(body_mass_g)
# A tibble: 2 × 13
  species   island bill_le…¹ bill_…² flipp…³ body_…⁴ sex    year "spec…⁵ "isla…⁶
  <fct>     <fct>      <dbl>   <dbl>   <int>   <int> <fct> <int> <chr>   <chr>  
1 Chinstrap Dream       52      20.7     210    4800 male   2008 species island 
2 Chinstrap Dream       46.9    16.6     192    2700 fema…  2008 species island 
# … with 3 more variables: `"body_mass_g"` <chr>, is.outlier <lgl>,
#   is.extreme <lgl>, and abbreviated variable names ¹​bill_length_mm,
#   ²​bill_depth_mm, ³​flipper_length_mm, ⁴​body_mass_g, ⁵​`"species"`, ⁶​`"island"`

There are two outliers all from Chinstrap on Dream Island. I did not remove these outliers because they are not extreme and thus most likely not very influential. Besides these two values might hold very meaningful information/relevance in the actual population

  • Normality [If p<0.05 then normality CANNOT be assumed]
penguin_tw %>% 
  group_by(species, island) %>%
 shapiro_test(body_mass_g)
# A tibble: 5 × 5
  species   island    variable    statistic     p
  <fct>     <fct>     <chr>           <dbl> <dbl>
1 Adelie    Biscoe    body_mass_g     0.977 0.526
2 Adelie    Dream     body_mass_g     0.966 0.110
3 Adelie    Torgersen body_mass_g     0.972 0.264
4 Chinstrap Dream     body_mass_g     0.984 0.561
5 Gentoo    Biscoe    body_mass_g     0.986 0.234
p4<-ggplot(data = penguin_tw, mapping =aes(x=body_mass_g, color = species))+
  geom_density()+
  theme_classic() +
  facet_wrap(~ island)+
  scale_color_aaas()
p4

The p.values are all greater than 0.05. Therefore we can assume normality is met. This is also supported by the histogram plots

  • Homoscedasticity
leveneTest(body_mass_g~species*island, data=penguin_tw)
Levene's Test for Homogeneity of Variance (center = median)
       Df F value Pr(>F)  
group   4  2.4947 0.0428 *
      337                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A levene test p<0.05 indicates that there IS a significant difference in variances between the treatment groups. The p-value is less than 0.05 and so there is significant difference in the bodymass of penguins. However, since our sample size is relatively big, we can assume equal variance is met

TukeyHSD(fit1)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = body_mass_g ~ species * island, data = penguin_tw)

$species
                       diff       lwr       upr     p adj
Chinstrap-Adelie   32.42598 -126.9602  191.8122 0.8813058
Gentoo-Adelie    1375.35401 1242.7960 1507.9120 0.0000000
Gentoo-Chinstrap 1342.92802 1178.0050 1507.8511 0.0000000

$island
                      diff       lwr      upr     p adj
Dream-Biscoe     -7.911442 -137.2861 121.4632 0.9886403
Torgersen-Biscoe  3.339873 -171.2651 177.9448 0.9988827
Torgersen-Dream  11.251314 -170.2981 192.8007 0.9883345

The Tukey shows us comparisons of each species with one another. We see that there are significant differences between the Gentoo- Adelie, Gentoo- Chinstrap, but not Chinstrap-Adelie.This also tells us, that the body mass of Gentoo is highest as compared to the other species. For comparison between islands, the body mass of penguins is overall not significantly different between islands. However, we see that Torgersen island tends to have penguins with the biggest body mass.