At Palmer Station in Antarctica, three species of penguins–Adelie, Gentoo, and Chinstrap–are observed within the surrounding islands. The researchers at Palmer Station have sampled over 300 of these penguins, gathering data for specific physical measurements, species classification, and observed location. For this project, our goal is to analyze how the biological sex of the Palmer penguins within each species affects body mass. In doing so, we aim to better understand how the complex relationship between sex and species influence physical traits, contributing to the overall understanding of the ecological dynamics of the Palmer penguins. To achieve this goal, we hope to complete a Two-Way ANOVA to determine the impacts of the categorical, explanatory variables (species and biological sex) on the numerical, response variable (body mass).
Loading the data:
#install.packages("tidytuesdayR")
tuesdata <- tidytuesdayR::tt_load('2020-07-28')
## ---- Compiling #TidyTuesday Information for 2020-07-28 ----
## --- There are 2 files available ---
##
##
## ── Downloading files ───────────────────────────────────────────────────────────
##
## 1 of 2: "penguins.csv"
## 2 of 2: "penguins_raw.csv"
PenguinData <- tuesdata$penguins
To remove penguins whose sex could not be identified from our data set, we must omit any missing values within the Palmer penguins data frame
PenguinData <- na.omit(PenguinData)
Before further exploration, let’s observe the body mass from each penguin group within our sample of 333 penguins with a confirmed sex:
is.numeric(PenguinData$body_mass_g)
## [1] TRUE
PenguinData %>% group_by(species,sex) %>% summarize(mean = mean(body_mass_g),
median=median(body_mass_g),
sd=sd(body_mass_g),
se=(sd(body_mass_g)/333))
## `summarise()` has grouped output by 'species'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 6
## # Groups: species [3]
## species sex mean median sd se
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie female 3369. 3400 269. 0.809
## 2 Adelie male 4043. 4000 347. 1.04
## 3 Chinstrap female 3527. 3550 285. 0.857
## 4 Chinstrap male 3939. 3950 362. 1.09
## 5 Gentoo female 4680. 4700 282. 0.846
## 6 Gentoo male 5485. 5500 313. 0.940
The Two-Way ANOVA test assumes that the data has equal variance and normal distribution, so before we can run our test, we must confirm that these assumptions match our data.
To test for equal variance, we may use Levene’s test:
leveneTest(body_mass_g ~ sex*species, data = PenguinData)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 5 1.3908 0.2272
## 327
# The variances are not statistically significant (P=0.2272)
To test for normal distribution, let’s explore the residuals from a linear model through a histogram, Q-Q plot, and a shapiro-wilk:
model.1 <- lm(body_mass_g ~ sex*species, data = PenguinData)
hist(resid(model.1))
# This histogram appears to have a normal distribution, appearing in a standard bell-curve
qqnorm(model.1$residuals)
qqline(model.1$residuals)
# Points in this Q-Q plot appear linear, signifying again that our data is normal
shapiro.test(model.1$residuals)
##
## Shapiro-Wilk normality test
##
## data: model.1$residuals
## W = 0.99776, p-value = 0.9367
# The result from the shapiro-wilk test also confirms that the distribution between our groups is normal (P=0.9367)
Let’s visualize the data in a box plot to better show the differences between groups before running our statistical test.
Unfinished.boxplot <- ggplot(PenguinData, aes(x=species, y=body_mass_g, fill=sex, na.rm=TRUE)) + geom_boxplot()
Penguin.boxplot <- Unfinished.boxplot + geom_jitter(shape=1, position=position_jitter(0.2)) + theme_classic() + labs(x="Species", y="Body Mass (g)", fill="Sex")
Penguin.boxplot
# On initial observation, the Gentoo species appears to have a significantly higher body mass then the Adelie and Chinstrap species in both sexes.
Now that we have met the assumptions for a Two-Way ANOVA, we are able to run the test. But first, we must understand our hypotheses.
H0 states: within each species, sex has no effect on body mass. HA states: within each species, sex has an effect on body mass.
anova.2 <- aov(model.1)
summary(anova.2)
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 38878897 38878897 406.145 < 2e-16 ***
## species 2 143401584 71700792 749.016 < 2e-16 ***
## sex:species 2 1676557 838278 8.757 0.000197 ***
## Residuals 327 31302628 95727
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# According to the ANOVA, within each species, sex has an effect on body mass (F(2)=8.757, P=0.000197)
Post Hoc - pairwise comparisons to determine which specific groups are statistically significant from each other:
Custom comparisons need to be made in order to ensure that only valid comparisons (comparisons that differ by one variable). A Bonferroni correction also needs to be included to adjust the p-values when using a statistical test where many individual tests are being performed at once.
To compare the effects of sex within each species on body mass, we set up this comparison: Comparisons within each species:
SexContrasts <- PenguinData %>%
group_by(species) %>%
emmeans_test(body_mass_g ~ sex, p.adjust.method = "bonferroni")
SexContrasts
## # A tibble: 3 × 10
## species term .y. group1 group2 df statistic p p.adj
## * <fct> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie sex body_mass_g female male 327 -13.2 4.37e-32 4.37e-32
## 2 Chinstrap sex body_mass_g female male 327 -5.49 8.19e- 8 8.19e- 8
## 3 Gentoo sex body_mass_g female male 327 -14.2 6.13e-36 6.13e-36
## # ℹ 1 more variable: p.adj.signif <chr>
# Within each species, male and female penguins statistically significantly differ in their body mass (P<0.0001)
For further analysis, we’ll compare each species to one another within each sex Comparisons of species across sexes:
SpeciesContrasts <- PenguinData %>%
group_by(sex) %>%
emmeans_test(body_mass_g ~ species, p.adjust.method = "bonferroni")
SpeciesContrasts
## # A tibble: 6 × 10
## sex term .y. group1 group2 df statistic p p.adj p.adj.signif
## * <fct> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 fema… spec… body… Adelie Chins… 327 -2.47 1.42e- 2 4.26e- 2 *
## 2 fema… spec… body… Adelie Gentoo 327 -24.1 1.92e-74 5.76e-74 ****
## 3 fema… spec… body… Chins… Gentoo 327 -17.2 7.42e-48 2.22e-47 ****
## 4 male spec… body… Adelie Chins… 327 1.63 1.05e- 1 3.14e- 1 ns
## 5 male spec… body… Adelie Gentoo 327 -26.9 1.03e-84 3.08e-84 ****
## 6 male spec… body… Chins… Gentoo 327 -23.3 1.26e-71 3.78e-71 ****
Comparisons that were strongly statistically significant: 1. Female Adelie to Gentoo (P<0.0001) 2. Female Chinstrap to Gentoo (P<0.0001) 3. Male Adelie to Gentoo (P<0.0001) 4. Male Chinstrap to Gentoo (P<0.0001) Comparisons that were weakly statistically significant: 1. Female Adelie to Chinstrap (P=0.00426) Comparisons that were NOT statistically significant: 1. Male Adelie to Chinstrap (P=0.314)
From these comparisons, we can see that species has a strong influence on body mass between both sexes, especially when comparing to the larger Gentoo penguins. Additionally, the difference between Adelie and Chinstrap penguins is heavily impacted by the sex.
Lets look at How bill depth is related to body size. we will first observe these two data points separately and then together
# A histogram of bill depth
ggplot(PenguinData, aes(x = bill_depth_mm)) + geom_histogram(binwidth = 1, fill = "blue", color = "black") +
labs(title = "Histogram of Bill Depth", x = "Bill Depth (mm)", y = "Frequency")
# A histogram of body size
ggplot(PenguinData, aes(x = body_mass_g)) + geom_histogram(binwidth = 200, fill = "blue", color = "black") +
labs(title = "Histogram of Body Mass", x = "Body Mass (g)", y = "Frequency")
# A scatterplot of bill depth and body size
ggplot(PenguinData, aes(x = bill_depth_mm, y = body_mass_g)) + geom_point(alpha = 0.6, color = "blue") + labs(title = "Scatterplot of Bill Depth vs Body Mass", x = "Bill Depth (mm)", y = "Body Mass (g)")
The bill depth histogram shows that most penguins have a bill depth
between 15 mm and 19 mm, and about a normal distribution but it is
slighty skewed to the right. The size histogram shows that Body mass
clusters between 3000–5000 grams, with most penguins falling around
4000g. This demonstrates some variability across individuals and
species. On initial observation of the scatter plot, there appears to be
a negative linear correlation between penguins with deeper bills and
slightly lower body mass, though ever so slight it is present.
Now we’re going to run a correlation test and a linear regression to learn of the relationship of these two variables. These two test assume normality and linearity of the residuals of these two variables so we will check that first. The variable body mass was already proven linear so we won’t look at that.
model.2 <- lm(body_mass_g ~ bill_depth_mm, data = PenguinData)
hist(resid(model.2))
# The histogram looks about normal
qqnorm(model.2$residuals)
qqline(model.2$residuals)
# Points in this Q-Q plot appear linear. there is a slight deviation
shapiro.test(model.2$residuals)
##
## Shapiro-Wilk normality test
##
## data: model.2$residuals
## W = 0.98887, p-value = 0.01204
# my P value is low suggesting that we reject the null hypothesis of normality (0.01204. The shapiro test is highly sensitive so it should be taken with a grain of salt in large data sets
outlierTest(model.2) #this test is from online.
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## 179 2.572927 0.010521 NA
# I proceeded with an outlier test to see if I have any extreme points that might be affecting my data. There is only one outlier and it is statistically significant. The data is real biological data so I decided to keep it.
Based off of the histogram, the scatterplot, and the qqplot, The data is normal enough to do a parametric correlations test. A Pearson test is not too sensitive to non-normality when the sample size is as large as it is in our data, N > 300. To be safe we will do both a parametric and nonparametric test and compare to see if the outlier is really affecting our data. we will also do a linear Regression test
#pearsons parametric correlations test
cor.test(PenguinData$bill_depth_mm, PenguinData$body_mass_g)
##
## Pearson's product-moment correlation
##
## data: PenguinData$bill_depth_mm and PenguinData$body_mass_g
## t = -9.741, df = 331, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5515130 -0.3840214
## sample estimates:
## cor
## -0.4720157
# The test shows that there is a moderate negative correlation (p = -0.472). As bill depth increases, body mass tends to decrease. The result was statistically significant (t = -9.741, df = 331, p-value = 2.2e-16.)
#spearmans non-Parametric correlations test
cor.test(PenguinData$bill_depth_mm, PenguinData$body_mass_g, method = "spearman")
## Warning in cor.test.default(PenguinData$bill_depth_mm, PenguinData$body_mass_g,
## : Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: PenguinData$bill_depth_mm and PenguinData$body_mass_g
## S = 8796211, p-value = 2.307e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.4292826
#There is a statistically significant and moderate negative correlation (t = -9.741, df = 331, p-value = 2.2e-16) between bill depth and body mass
#linear Regressions
summary(model.2)
##
## Call:
## lm(formula = body_mass_g ~ bill_depth_mm, data = PenguinData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1616.08 -510.38 -68.67 486.51 1811.12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7519.98 342.33 21.967 <2e-16 ***
## bill_depth_mm -193.01 19.81 -9.741 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 710.9 on 331 degrees of freedom
## Multiple R-squared: 0.2228, Adjusted R-squared: 0.2205
## F-statistic: 94.89 on 1 and 331 DF, p-value: < 2.2e-16
The pearsons’s and Spearman’s test resulted in about the same conclusion. There is a slight statically significant correlation between Body mass and bill depth. As bill depth goes up, body mass goes down.
We will Remake with the scatterplot with the correlation test added + linear equation added
ggscatter(PenguinData, x = "bill_depth_mm", y = "body_mass_g",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "Bill Depth (mm)", ylab = "Body Mass (g)",
title = "Bill Depth vs Body Mass with Correlation and Regression")
#This plot overlays the regression line and correlation coefficient, visualy showing the statistically significant but weak negative relationship.
This plot overlays the regression line and correlation coefficient, visualy showing the statistically significant but weak negative relationship.
The goal of our project was to analyze how the biological sex of the Palmer penguins within each species affects body mass. To do so, we decided to test two hypotheses: H0 states: within each species, sex has no effect on body mass. HA states: within each species, sex has an effect on body mass. To then test our hypotheses, we had to prepare our data to be analysed. We first eliminated any data set where the sex could not be identified. We then tested for equal variance and normal distribution, as we needed to run a Two-Way ANOVA, a statistical test that assumes equal variance. We did a Levene’s test and found that the variance was not statistically significant (P=0.2272). We also did a histogram, Q-Q plot, and finally a Shapiro test, which showed us our data had a normal distribution(P=0.9367). Our data was now ready for a Two-Way ANOVA. Based on our Two-Way ANOVA analysis, we can reject the null hypothesis that both species and biological sex don’t affect the body mass of Palmer penguins. We can also conclude that both species and biological sex significantly affect the body mass of Palmer penguins. The interaction between sex and species was statistically significant (F(2) = 8.757, P = 0.000197), meaning that the effect of sex on body mass varies by species. Post-hoc comparisons with Bonferroni corrections revealed that male and female body mass sizes vary widely by species (P < 0.0001). Moreover, there are statistically significant differences in the sizes of male and female penguins between the species. For example, we found that Gentoo penguins were significantly larger than Adelie and Chinstrap penguins across both sexes. Female Adelie to Chinstrap penguins were weakly significant (P=0.00426), and male Adelie to Chinstrap, were not statistically significant (P = 0.314).
This analysis is important because it provides insight into the physical differences among penguin species at Palmer Station. It helps us to better understand how the complex relationship between sex and species influences physical traits, contributing to the overall understanding of the ecological dynamics of the Palmer penguins. This better understanding of the relationship between body mass and sex goes a long way when looking at larger biological questions like why they differ in size evolutionarily. What pros and cons does their size produce? It also serves as a basis for how we monitor these creatures, as body mass is a key indicator of health.
However, the analysis has limitations. The dataset was filtered to exclude missing values which reduced the sample size and could reduce its accuracy. This data also did not consider other factors, such as age, diet, or season, which may influence body mass. Regardless of these limitations, the results are a good starting point for many studies to come after it.
Gorman, K. (2020, July 28). Palmer Penguins. GitHub. https://github.com/rfordatascience/tidytuesday/blob/main/data/2020/2020-07-28/readme.md Horst, A. (n.d.). Palmerpenguins R Data Package. GitHub. https://allisonhorst.github.io/palmerpenguins/