Attaching package: 'see'
The following objects are masked from 'package:ggsci':
scale_color_material, scale_colour_material, scale_fill_material
library(rstatix) #test for outliers, welch_anova_test
Attaching package: 'rstatix'
The following object is masked from 'package:stats':
filter
library(palmerpenguins)
library(car)
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
The following object is masked from 'package:purrr':
some
library(Hmisc)
Loading required package: lattice
Loading required package: survival
Loading required package: Formula
Attaching package: 'Hmisc'
The following objects are masked from 'package:dplyr':
src, summarize
The following objects are masked from 'package:base':
format.pval, units
2.
Using the penguins data, perform a 1-way ANOVA involving the effect of a categorical variable (x) on a numerical variable (y). Group and filter the data (remove NA, for example), calculate means and error, then make a graph. Pair that graph with an ANOVA test. Use the graph + statistical test to assess the null hypothesis of ANOVA.
General notes: \(H_0\): \(\mu_1 = \mu_2 = \mu_3\) and
\(H_A\): at least one mean is different from the others.
             Df    Sum Sq  Mean Sq F value Pr(>F)
species       2 146864214 73432107   343.6 <2e-16 ***
Residuals   339  72443483   213698
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value for species is < 2e-16, which is below 0.05, so we reject the null hypothesis: there is a statistically significant difference in mean body mass for at least one species. The plot also shows that Gentoo has a higher mean body mass than Chinstrap and Adelie.
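The code behind this table is not shown above. A minimal sketch of how it could be produced follows, assuming `penguins_2` is simply `penguins` with rows missing `body_mass_g` removed (which matches the 339 residual degrees of freedom) and that `model_2` is the `aov` fit named later in the report. The name `summary_2` is hypothetical. `dplyr::summarise` is called with its namespace because Hmisc masks `summarize` in this session.

```r
library(palmerpenguins)
library(dplyr)

# Assumption: penguins_2 is penguins with missing body mass rows dropped
penguins_2 <- penguins %>%
  filter(!is.na(body_mass_g))

# Mean, standard error and sample size of body mass per species
# (summary_2 is a hypothetical name, not taken from the report)
summary_2 <- penguins_2 %>%
  group_by(species) %>%
  dplyr::summarise(
    mean_mass = mean(body_mass_g),
    se_mass   = sd(body_mass_g) / sqrt(n()),
    n         = n()
  )
summary_2

# One-way ANOVA of body mass on species
model_2 <- aov(body_mass_g ~ species, data = penguins_2)
summary(model_2)
```

The `mean_mass` and `se_mass` columns are what a bar or point plot with error bars would be drawn from.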
3.
Test your assumptions (individually; do not use check_model) and interpret your assumption checks
a) Outliers
p2 <- ggplot(data = penguins_2, aes(x = species, y = body_mass_g, color = species)) +
  geom_boxplot() +
  theme_classic()
p2
penguins_2$species <- as.factor(penguins_2$species)
# identify outliers using the rstatix package
penguins_2 %>%
  group_by(species) %>%
  identify_outliers(body_mass_g)
# A tibble: 2 × 10
species island bill_le…¹ bill_…² flipp…³ body_…⁴ sex year is.ou…⁵ is.ex…⁶
<fct> <fct> <dbl> <dbl> <int> <int> <fct> <int> <lgl> <lgl>
1 Chinstrap Dream 52 20.7 210 4800 male 2008 TRUE FALSE
2 Chinstrap Dream 46.9 16.6 192 2700 fema… 2008 TRUE FALSE
# … with abbreviated variable names ¹bill_length_mm, ²bill_depth_mm,
# ³flipper_length_mm, ⁴body_mass_g, ⁵is.outlier, ⁶is.extreme
There are two outliers, both Chinstrap penguins from Dream Island. I did not remove them because neither is flagged as extreme (is.extreme = FALSE), so they are unlikely to be very influential. These two values may also carry meaningful information about the actual population.
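To make the outlier and extreme flags concrete, here is a base R sketch of the boxplot rule that `identify_outliers` applies: a value is an outlier if it lies more than 1.5 × IQR beyond the quartiles and extreme if it lies more than 3 × IQR beyond them. The helper `flag_outliers` is hypothetical, and `penguins_2` is again assumed to be `penguins` without missing body mass values.

```r
library(palmerpenguins)

# Assumption: penguins_2 is penguins with missing body mass rows dropped
penguins_2 <- penguins[!is.na(penguins$body_mass_g), ]

# Hypothetical helper: flag values outside the 1.5 * IQR and 3 * IQR fences
flag_outliers <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75)))
  iqr <- q[2] - q[1]
  data.frame(
    value      = x,
    is_outlier = x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr,
    is_extreme = x < q[1] - 3 * iqr | x > q[2] + 3 * iqr
  )
}

chinstrap <- penguins_2$body_mass_g[penguins_2$species == "Chinstrap"]
flags <- flag_outliers(chinstrap)

# Rows flagged as outliers for Chinstrap
flags[flags$is_outlier, ]
```

For Chinstrap this flags the same two body masses as the table above (2700 g and 4800 g), and neither crosses the 3 × IQR fence.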
b) Equal variance
Levene's Test for Homogeneity of Variance (center = median)
       Df F value   Pr(>F)
group   2  5.1203 0.006445 **
      339
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
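Levene's test gives p = 0.0064 < 0.05, which indicates a significant difference in variances between the species, so strictly we cannot assume equal variance. ANOVA is fairly robust to moderate differences in variance when the groups are large, so I proceed as if this condition is met. Because the group sizes are not equal, a Welch ANOVA (welch_anova_test in rstatix) would be the more cautious choice.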
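As a cross-check on the equal-variance concern, a Welch ANOVA can be run in base R with `oneway.test` and `var.equal = FALSE`. This is a sketch rather than part of the original analysis; `penguins_2` is assumed to be `penguins` without missing body mass values, and `welch` and `sd_by_species` are hypothetical names.

```r
library(palmerpenguins)

# Assumption: penguins_2 is penguins with missing body mass rows dropped
penguins_2 <- penguins[!is.na(penguins$body_mass_g), ]

# Standard deviation and sample size per species, to see how unequal the groups are
sd_by_species <- tapply(penguins_2$body_mass_g, penguins_2$species, sd)
sd_by_species
table(penguins_2$species)

# Welch one-way ANOVA: does not assume equal variances across groups
welch <- oneway.test(body_mass_g ~ species, data = penguins_2, var.equal = FALSE)
welch
```

If the Welch test also rejects the null hypothesis, the conclusion about species does not depend on the equal-variance assumption.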
c) Normality
# A tibble: 3 × 4
  species   variable    statistic      p
  <fct>     <chr>           <dbl>  <dbl>
1 Adelie    body_mass_g     0.981 0.0324
2 Chinstrap body_mass_g     0.984 0.561
3 Gentoo    body_mass_g     0.986 0.234
p3 <- ggplot(data = penguins_2, mapping = aes(x = body_mass_g, color = species)) +
  geom_density() +
  theme_classic() +
  facet_wrap(~ species) +
  scale_color_aaas()
p3
Since p-value = 0.05118 > 0.05, normality can be assumed. From the Shapiro-Wilk tests for each group, body mass is consistent with normality for Gentoo and Chinstrap (p-values greater than 0.05), but not for Adelie (p = 0.032 < 0.05). The density plots show that the distributions of body mass are approximately normal except for Gentoo. This is about as close to normality as these data get, and with groups this large ANOVA is reasonably robust to mild departures, so I treat the assumption as met for all species.
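The per-species table above can be approximated in base R with `shapiro.test`. The sketch below assumes `penguins_2` is `penguins` without missing body mass values; `p_by_species` is a hypothetical name. The last call tests the model residuals, which is one common way to get a single overall normality p-value; whether that is where the 0.05118 above comes from is an assumption.

```r
library(palmerpenguins)

# Assumption: penguins_2 is penguins with missing body mass rows dropped
penguins_2 <- penguins[!is.na(penguins$body_mass_g), ]

# Shapiro-Wilk p-value for body mass within each species
p_by_species <- sapply(
  split(penguins_2$body_mass_g, penguins_2$species),
  function(x) shapiro.test(x)$p.value
)
round(p_by_species, 4)

# Shapiro-Wilk test on the residuals of the one-way model
model_2 <- aov(body_mass_g ~ species, data = penguins_2)
shapiro.test(residuals(model_2))
```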
summary(model_2)
             Df    Sum Sq  Mean Sq F value Pr(>F)
species       2 146864214 73432107   343.6 <2e-16 ***
Residuals   339  72443483   213698
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
d) Independence
We do not have enough information about the sampling design to check independence directly. Large sample sizes do not by themselves ensure independence, so this condition is an assumption: I treat each measurement as coming from a different penguin and proceed as if it is met.
4.
Run a TukeyHSD test on your ANOVA and interpret the results. Make sure your comparisons are easily visible and comparable in your graph (above).
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = body_mass_g ~ species, data = penguins_2)

$species
                       diff       lwr       upr     p adj
Chinstrap-Adelie   32.42598 -126.5002  191.3522 0.8806666
Gentoo-Adelie    1375.35401 1243.1786 1507.5294 0.0000000
Gentoo-Chinstrap 1342.92802 1178.4810 1507.3750 0.0000000
The Tukey test above shows significant differences for only two pairs: Gentoo-Adelie and Gentoo-Chinstrap. Gentoo has a higher body mass than both other species. The difference in body mass between Adelie and Chinstrap is not significant (p adj = 0.88), which is consistent with the first plot.
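A short sketch of how the comparisons above can be turned into a table that flags the significant pairs, assuming `penguins_2` is `penguins` without missing body mass values. The names `tukey_2` and `pairs_2` are hypothetical.

```r
library(palmerpenguins)

# Assumption: penguins_2 is penguins with missing body mass rows dropped
penguins_2 <- penguins[!is.na(penguins$body_mass_g), ]

model_2 <- aov(body_mass_g ~ species, data = penguins_2)
tukey_2 <- TukeyHSD(model_2)

# One row per pairwise comparison, with a flag for adjusted p < 0.05
pairs_2 <- as.data.frame(tukey_2$species)
pairs_2$significant <- pairs_2[["p adj"]] < 0.05
pairs_2
```

Calling `plot(tukey_2)` draws the confidence intervals for each pair; intervals that do not cross zero correspond to the significant comparisons.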
             Df    Sum Sq  Mean Sq F value Pr(>F)
species       2 146864214 73432107 341.663 <2e-16 ***
island        2     13655     6827   0.032  0.969
Residuals   337  72429829   214925
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
2 observations deleted due to missingness
Assumptions
Independence
We do not have enough information about the sampling design to check independence directly. Large sample sizes do not by themselves ensure independence, so this condition is an assumption: I treat each measurement as coming from a different penguin and proceed as if it is met.
Outliers
penguin_tw <- penguins %>%
  drop_na(body_mass_g, species, island) %>%
  group_by(species, island)
p6 <- ggplot(data = penguin_tw, aes(x = species, y = body_mass_g, color = island)) +
  geom_boxplot() +
  theme_classic()
p6
# A tibble: 2 × 10
  species   island bill_le…¹ bill_…² flipp…³ body_…⁴ sex   year is.ou…⁵ is.ex…⁶
  <fct>     <fct>      <dbl>   <dbl>   <int>   <int> <fct> <int> <lgl>   <lgl>
1 Chinstrap Dream       52      20.7     210    4800 male   2008 TRUE    FALSE
2 Chinstrap Dream       46.9    16.6     192    2700 fema…  2008 TRUE    FALSE
# … with abbreviated variable names ¹bill_length_mm, ²bill_depth_mm,
# ³flipper_length_mm, ⁴body_mass_g, ⁵is.outlier, ⁶is.extreme
There are two outliers, both Chinstrap penguins from Dream Island. I did not remove them because neither is flagged as extreme (is.extreme = FALSE), so they are unlikely to be very influential. These two values may also carry meaningful information about the actual population.
Normality [If p<0.05 then normality CANNOT be assumed]
Equal variance
Levene's Test for Homogeneity of Variance (center = median)
       Df F value Pr(>F)
group   4  2.4947 0.0428 *
      337
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A Levene test with p < 0.05 indicates a significant difference in variances between the groups. Here p = 0.0428 < 0.05, so the variance of body mass differs significantly among the species-island groups. The violation is marginal and the samples are relatively large, so I proceed as if equal variance is met, although the unequal group sizes mean this should be treated with caution.
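Levene's test with `center = median` is a one-way ANOVA on the absolute deviations of each observation from its group median. The base R sketch below reproduces that calculation for the species-island groups, assuming `penguin_tw` is `penguins` without missing body mass values. The names `grp`, `abs_dev`, and `levene_tab` are hypothetical.

```r
library(palmerpenguins)

# Assumption: penguin_tw is penguins with missing body mass rows dropped
penguin_tw <- penguins[!is.na(penguins$body_mass_g), ]

# One group per observed species-island combination (empty combinations dropped)
grp <- droplevels(interaction(penguin_tw$species, penguin_tw$island))

# Absolute deviation of each body mass from its group median
med <- tapply(penguin_tw$body_mass_g, grp, median)
abs_dev <- abs(penguin_tw$body_mass_g - unname(med[as.character(grp)]))

# Levene's test (median-centred) is the ANOVA F test on those deviations
levene_tab <- anova(lm(abs_dev ~ grp))
levene_tab
```

The degrees of freedom (4 and 337) reflect the five species-island combinations that actually occur in the data.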
TukeyHSD(fit1)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = body_mass_g ~ species * island, data = penguin_tw)

$species
                       diff       lwr       upr     p adj
Chinstrap-Adelie   32.42598 -126.9602  191.8122 0.8813058
Gentoo-Adelie    1375.35401 1242.7960 1507.9120 0.0000000
Gentoo-Chinstrap 1342.92802 1178.0050 1507.8511 0.0000000

$island
                       diff       lwr      upr     p adj
Dream-Biscoe      -7.911442 -137.2861 121.4632 0.9886403
Torgersen-Biscoe   3.339873 -171.2651 177.9448 0.9988827
Torgersen-Dream   11.251314 -170.2981 192.8007 0.9883345
The Tukey test compares each pair of species. There are significant differences for Gentoo-Adelie and Gentoo-Chinstrap, but not for Chinstrap-Adelie, which again shows that Gentoo has the highest body mass of the three species. For islands, body mass does not differ significantly between any pair of islands (all adjusted p-values are above 0.98). Torgersen has the largest estimate, but the differences are only about 3 to 11 g.
Island was not a good variable to use for this ANOVA: some species are missing from some islands, and the sample sizes are not comparable (Gentoo on Biscoe = 123, whereas the other species-island combinations have smaller samples or none at all).
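The imbalance described above can be seen directly in a species-by-island table of counts. This is a sketch assuming `penguin_tw` is `penguins` without missing body mass values; `cell_counts` is a hypothetical name.

```r
library(palmerpenguins)

# Assumption: penguin_tw is penguins with missing body mass rows dropped
penguin_tw <- penguins[!is.na(penguins$body_mass_g), ]

# Number of penguins in each species-island combination
cell_counts <- table(penguin_tw$species, penguin_tw$island)
cell_counts

# Number of combinations that actually contain data
sum(cell_counts > 0)
```

Only five of the nine combinations contain data. Five groups give four degrees of freedom, which are used up by species (2) and island (2), so nothing is left to estimate an interaction. That is consistent with the ANOVA table showing no species:island row even though the model was fitted as `species * island`.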