This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

library(palmerpenguins)
## Warning: package 'palmerpenguins' was built under R version 4.1.2
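# List the data sets shipped with the package
data(package = "palmerpenguins")
# Keep only complete cases (drops every row with a missing value)
penguins = na.omit(penguins)
# Open the documentation for the data set
help(penguins)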

I elected to use the Palmer Penguins dataset, collected by Dr. Kristen Gorman and the Palmer Station in Antarctica. After rows with missing values are removed, it contains the bill length, bill depth, flipper length, body mass, and sex of 333 adult foraging penguins observed in Antarctica between 2007 and 2009. These penguins came from three different islands (also noted in the dataset): Biscoe, Dream, and Torgersen. They also belong to three different species: Adelie, Chinstrap, and Gentoo.

The goal of this observational study was to examine sexual dimorphism, defined as the difference in size or appearance between the sexes of an animal, in order to determine whether environmental variability is associated with differences in the pre-breeding foraging niche of males and females. Study nests with pairs of adults present were marked and monitored before eggs were laid. After one egg was laid, both adults of the pair were captured and measured.

Null Hypothesis: The mean bill length is the same for the penguin species Adelie, Chinstrap, and Gentoo.

Alternative Hypothesis: The mean bill length differs for at least one of the penguin species Adelie, Chinstrap, and Gentoo.

boxplot(bill_length_mm~species, data=penguins, ylab='Bill Length')

lmfit = lm(bill_length_mm~species, data=penguins)
anova(lmfit)

The p-value is very small (R reports it as less than 2.2e-16) and the F value is extremely large. The F value is large because the mean square between groups is large relative to the mean square within groups. In conclusion, we can confidently reject the null hypothesis because of the low p-value. There is sufficient evidence that the mean bill lengths of the three penguin species are not all equal.
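To make the link between the F value and the two mean squares concrete, here is a minimal sketch that computes a one-way ANOVA F statistic by hand and compares it with `anova()`. The helper `f_by_hand` and the made-up data frame `toy` are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical helper: one-way ANOVA F statistic from the mean squares
f_by_hand = function(y, g) {
  g = droplevels(as.factor(g))
  n = length(y)                      # total number of observations
  k = nlevels(g)                     # number of groups
  group_means = tapply(y, g, mean)
  group_sizes = tapply(y, g, length)
  # Variation of the group means around the grand mean
  ss_between = sum(group_sizes * (group_means - mean(y))^2)
  # Variation of each observation around its own group mean
  ss_within = sum((y - group_means[as.character(g)])^2)
  ms_between = ss_between / (k - 1)
  ms_within = ss_within / (n - k)
  c(ms_between = ms_between, ms_within = ms_within, F = ms_between / ms_within)
}

# Small made-up data set, used only to check the helper against anova()
toy = data.frame(
  y = c(1, 2, 3, 2, 3, 4, 5, 6, 7),
  g = factor(rep(c("a", "b", "c"), each = 3))
)
by_hand = f_by_hand(toy$y, toy$g)
from_r = anova(lm(y ~ g, data = toy))
print(by_hand)
print(from_r[["F value"]][1])
```

With the notebook's data, `f_by_hand(penguins$bill_length_mm, penguins$species)` should give the same F value as the one printed by `anova(lmfit)`.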

qqnorm(lmfit$residuals)

In a QQ-plot of the residuals, the points fall close to a straight line, which is the pattern that normally distributed residuals produce. This satisfies the normality condition for the ANOVA and supports the validity of the conclusion above.
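A reference line makes it easier to judge how close the points are to a straight line. The sketch below is one way to add it, with a Shapiro-Wilk test as a rough numerical check. `check_normality` is a hypothetical helper name, and the residuals are simulated so the snippet runs on its own; none of this is part of the original analysis.

```r
# Hypothetical helper: QQ-plot with a reference line plus a Shapiro-Wilk test
check_normality = function(res) {
  qqnorm(res)
  qqline(res)        # reference line through the first and third quartiles
  shapiro.test(res)  # null hypothesis: the values come from a normal distribution
}

# Simulated residuals, used only to demonstrate the call
set.seed(1)
sim_res = rnorm(200)
result = check_normality(sim_res)
print(result$p.value)
```

In the notebook, the equivalent call would be `check_normality(lmfit$residuals)`. With several hundred observations the test can flag departures too small to matter, so the plot remains the main diagnostic.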

plot(lmfit$fitted.values, lmfit$residuals, xlab="group means", ylab="residuals")

A plot of the residuals as a function of the fitted values (group means) shows that the standard deviation of the residuals is approximately the same for all categories.
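The visual impression can be backed up with the group standard deviations themselves. The following is a minimal sketch, assuming a hypothetical helper `sd_by_group` and made-up values; a common rule of thumb, used here as an assumption rather than a strict requirement, is that the largest standard deviation should be no more than about twice the smallest.

```r
# Hypothetical helper: standard deviation of the response within each group
sd_by_group = function(y, g) tapply(y, g, sd)

# Made-up values, used only to demonstrate the call
toy_y = c(1, 2, 3, 2, 3, 4, 5, 6, 7)
toy_g = factor(rep(c("a", "b", "c"), each = 3))

sds = sd_by_group(toy_y, toy_g)
sd_ratio = max(sds) / min(sds)   # compare against the rule-of-thumb limit of 2
print(sds)
print(sd_ratio)
```

For the bill length model, the corresponding call would be `sd_by_group(penguins$bill_length_mm, penguins$species)`.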

Next, a regression analysis of two continuous numerical variables can also be completed. For this, we choose our X axis to be bill length and our Y axis to be bill depth. The Adelie species is in blue, Gentoo in black, and Chinstrap in red.

# Map each species to its plotting colour
species_cols = c(Adelie = "blue", Gentoo = "black", Chinstrap = "red")
mycol = species_cols[as.character(penguins$species)]
plot(penguins$bill_length_mm,penguins$bill_depth_mm, xlab = "bill length", ylab = "bill depth", col=mycol)
lmfit = lm(penguins$bill_depth_mm ~ penguins$bill_length_mm)
abline(lmfit)

cor.test(penguins$bill_depth_mm, penguins$bill_length_mm)
## 
##  Pearson's product-moment correlation
## 
## data:  penguins$bill_depth_mm and penguins$bill_length_mm
## t = -4.2726, df = 331, p-value = 2.528e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3280409 -0.1242017
## sample estimates:
##        cor 
## -0.2286256

The correlation coefficient is r = -0.229, a weak negative correlation. The p-value is 2.53e-05. The 95% confidence interval is (-0.328, -0.124).

summary(lmfit)
## 
## Call:
## lm(formula = penguins$bill_depth_mm ~ penguins$bill_length_mm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1548 -1.4291  0.0122  1.3994  4.5004 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             20.78665    0.85417  24.335  < 2e-16 ***
## penguins$bill_length_mm -0.08233    0.01927  -4.273 2.53e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.92 on 331 degrees of freedom
## Multiple R-squared:  0.05227,    Adjusted R-squared:  0.04941 
## F-statistic: 18.26 on 1 and 331 DF,  p-value: 2.528e-05

The estimated slope is -0.08233, with a p-value of 2.53e-05. The estimated y-intercept is 20.78665, with a p-value of less than 2e-16. About 5.2% of the variance in bill depth is explained by bill length, as given by the multiple R-squared value of 0.05227 (the adjusted R-squared is 0.04941).
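In a simple linear regression, the correlation test and the regression summary describe the same relationship, so their numbers can be cross-checked. The sketch below uses only values copied from the output above; the variable names are illustrative.

```r
# Values copied from the cor.test() and summary(lmfit) output
r = -0.2286256        # sample correlation
slope = -0.08233      # estimated slope
se_slope = 0.01927    # standard error of the slope
df_resid = 331        # residual degrees of freedom

r_squared = r^2                                  # multiple R-squared
t_slope = slope / se_slope                       # t value for the slope
t_from_r = r * sqrt(df_resid) / sqrt(1 - r^2)    # same t, from the correlation
f_stat = t_from_r^2                              # F statistic on 1 and 331 DF

print(r_squared)
print(t_slope)
print(t_from_r)
print(f_stat)
```

Up to rounding of the printed inputs, these should agree with the reported multiple R-squared (0.05227), t value (-4.273), and F-statistic (18.26).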

qqnorm(lmfit$residuals)

The residuals appear to be normally distributed.

plot(penguins$bill_length_mm, lmfit$residuals, xlab="X: bill length", ylab="Residuals")

The residuals do not appear to strongly depend on the x variable of bill length.

Finally, I will use an ANOVA test to determine whether there is a significant difference in flipper length between Adelie, Gentoo, and Chinstrap penguins.

Null Hypothesis: The mean flipper length is the same for the penguin species Adelie, Chinstrap, and Gentoo.

Alternative Hypothesis: The mean flipper length differs for at least one of the penguin species Adelie, Chinstrap, and Gentoo.

boxplot(flipper_length_mm~species, data=penguins, ylab='Flipper Length (mm)')

lmfit = lm(flipper_length_mm~species, data=penguins)
anova(lmfit)
qqnorm(lmfit$residuals)

plot(lmfit$fitted.values, lmfit$residuals, xlab="group means", ylab="residuals")

An ANOVA comparing the flipper lengths across the three species returns a p-value that is very small (R reports it as less than 2.2e-16) and an F value that is extremely large (567.41). The F value is large because the mean square between groups is large relative to the mean square within groups. In conclusion, we can confidently reject the null hypothesis that there is no difference in mean flipper length between the three species, given the low p-value and high F value. There is sufficient evidence that the mean flipper lengths of the three penguin species are not all equal.

Additionally, the side-by-side boxplot of flipper lengths shows a clear difference in the median flipper lengths of the species. Gentoo penguins in the region appear to have the longest flippers, followed by Chinstrap penguins, with Adelie penguins having the shortest.

There are four conditions to be satisfied in order to run an ANOVA test. Based on the study design, we treat the observations in each group as a simple random sample. We also treat the samples as having been collected independently of each other. As evidenced by the approximately linear QQ-plot, the residuals are approximately normally distributed. The groups also have approximately the same standard deviation, as evidenced by the plot of residuals against the group means.
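The ANOVA only indicates that the group means are not all equal; it does not say which pairs of species differ. One common follow-up, which is not part of the original analysis, is Tukey's honest significant difference test through base R's `TukeyHSD()`. The sketch below uses a small made-up data frame (`toy`) so that it runs on its own.

```r
# Made-up data, used only to demonstrate the call
toy = data.frame(
  y = c(1, 2, 3, 2, 3, 4, 5, 6, 7),
  g = factor(rep(c("a", "b", "c"), each = 3))
)

fit = aov(y ~ g, data = toy)   # same model as lm(y ~ g), fitted with aov()
tk = TukeyHSD(fit)             # pairwise differences with adjusted p-values
print(tk$g)                    # columns: diff, lwr, upr, p adj
```

For the flipper data, the analogous call would be `TukeyHSD(aov(flipper_length_mm ~ species, data = penguins))`, which would give an interval for each pair of species.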
