Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program set out to study the impact of Ecological Sexual Dimorphism in arctic penguins. In 2014, that particular study compared deep-dive foraging habits of male and female penguins and how that resulted in different biological outcomes. To read more abut it, you can go to read the full study here. And you can find more of Dr. Gorman’s more recent work here
We only have a portion of the data from the larger scale study, so our goals and scope of analysis will be different. Our aim is to determine which factors impact the body mass of a penguin. This is useful to know as body mass is an important indicator of penguin’s health.
Our data table contains the following columns:
pendata<-read.csv("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/penguins.csv")
head(pendata)
names(pendata)
## [1] "Observation" "body_mass_g" "species"
## [4] "island" "bill_length_mm" "bill_depth_mm"
## [7] "flipper_length_mm" "sex" "year"
hist(pendata$body_mass_g, xlab="Body Mass", main="Histogram of Body Mass (in grams)")
ADD YOUR NOTES ON THE SHAPE + DISTRIBUTION OF RESPONSE:
RECALL: Our goal through the exploratory data analysis phase is to identify the individual relationships between the explanatory variables and the response.
For quantitative variables, we explore the scatter plots and correlations for each explanatory variable then classify the relationships as linear, curvilinear, other, or none.
For qualitative variables, we explore the box plots and means value of y for each level of the qualitative explanatory variables. Then classify the relationships as existent or not.
In this example we will evaluate the qualitative variables species (with three levels) and sex (with two levels)
#Summary Statistics for response variable grouped by each level of the response
tapply(pendata$body_mass_g,pendata$species,summary)
$Adelie
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
2850 3350 3700 3701 4000 4775 1
$Chinstrap
Min. 1st Qu. Median Mean 3rd Qu. Max.
2700 3488 3700 3733 3950 4800
$Gentoo
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
3950 4700 5000 5076 5500 6300 1
tapply(pendata$body_mass_g,pendata$sex,summary)
$female
Min. 1st Qu. Median Mean 3rd Qu. Max.
2700 3350 3650 3862 4550 5200
$male
Min. 1st Qu. Median Mean 3rd Qu. Max.
3250 3900 4300 4546 5312 6300
#Box plots for species and sex
boxplot(body_mass_g~species,pendata, ylab="Body Mass (grams)")
boxplot(body_mass_g~sex,pendata, ylab="Body Mass (grams)")
There appears to be a relationship with species and the response (body mass) because the mean value of y is quite different for each level. We see Adelie and Chinstrap have very similar means (3701 grams and 3733 grams respectively), but Gentoo is much higher at 5076 grams.
There appears to be a relationship with sex and the response (body mass) because the mean value of y is quite different for each level. We see males with an average body mass of 4546 grams and females much lower at 3862 grams.
For the sake of this example, we will hold off on adding the quantitative variables right now.
Therefore, our hypothesized model is:
\(E(bodymass)=\beta_0+\beta_1 SpeciesC+\beta_2 SpeciesG+\beta_3 SexM\)
What does each \(\beta\) coefficient represent?
#note: the syntax in R does not necessarily match the parameters
penmod1<-lm(body_mass_g~species+sex,data=pendata)
summary(penmod1)
Call:
lm(formula = body_mass_g ~ species + sex, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-816.87 -217.80 -16.87 227.61 882.20
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3372.39 31.43 107.308 <2e-16 ***
speciesChinstrap 26.92 46.48 0.579 0.563
speciesGentoo 1377.86 39.10 35.236 <2e-16 ***
sexmale 667.56 34.70 19.236 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 316.6 on 329 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8468, Adjusted R-squared: 0.8454
F-statistic: 606.1 on 3 and 329 DF, p-value: < 2.2e-16
The prediction equation is:
\(\widehat{bodymass}=3372.39+26.92SpeciesC+1377.86SpeciesG+667.56SexM\)
We can recode our base level using the relevel function. This is optional, but could be useful if we have a particular interest group.
In this example, Gentoo appears to be the most different from the other species, so it might be useful to define it as the base level. Then each of our estimates will compare to Gentoo.
#This is only done if we desire to change the base level
recodedSpecies<-relevel(factor(pendata$species),ref="Gentoo")
penmod6<-lm(body_mass_g~recodedSpecies,data=pendata)
summary(penmod6)
Call:
lm(formula = body_mass_g ~ recodedSpecies, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-1126.02 -333.09 -33.09 316.91 1223.98
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5076.02 41.68 121.78 <2e-16 ***
recodedSpeciesAdelie -1375.35 56.15 -24.50 <2e-16 ***
recodedSpeciesChinstrap -1342.93 69.86 -19.22 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 462.3 on 339 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.6697, Adjusted R-squared: 0.6677
F-statistic: 343.6 on 2 and 339 DF, p-value: < 2.2e-16
Recall the model:
\(E(bodymass)=\beta_0+\beta_1 SpeciesC+\beta_2 SpeciesG+\beta_3 SexM\)
Hypotheses:
Distribution of test statistic: F with 3, 329 DF
Test Statistic: F=606.1
Pvalue: <0.0000001
Decision: pvalue<0.05 -> REJECT H0
Conclusion: The model with species and sex is adequate at body mass for the penguins.
Species is determined by two parameters (since SpeciesC and SpeciesG are the dummy variables), so we would want to keep or remove BOTH of those dummy variables. We will cover this later.
We can test sex because it is only associated with one parameter in the model.
Hypotheses:
Distribution of test statistic: T with 329 DF
Test Statistic: t=19.236
Pvalue: <0.0000001
Decision: pvalue<0.05 -> REJECT H0
Conclusion: The sex of the penguin is significant at predicting the body mass of a penguin given species is a constant in the model. We will keep it in the model.
Note we can include both qualitative and quantitative predictors in the model.
penmod2<-lm(body_mass_g~species+sex+bill_depth_mm,data=pendata)
summary(penmod2)
Call:
lm(formula = body_mass_g ~ species + sex + bill_depth_mm, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-851.2 -187.7 -3.5 221.7 936.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1576.99 357.08 4.416 1.37e-05 ***
speciesChinstrap 19.44 44.87 0.433 0.665
speciesGentoo 1721.70 77.88 22.106 < 2e-16 ***
sexmale 513.99 45.24 11.360 < 2e-16 ***
bill_depth_mm 102.04 20.22 5.046 7.48e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 305.5 on 328 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8578, Adjusted R-squared: 0.8561
F-statistic: 494.8 on 4 and 328 DF, p-value: < 2.2e-16
Write the model that was fit
\(E(bodymass)=\beta_0+\beta_1 SpeciesC+\beta_2 SpeciesG+\beta_3 SexM+\beta_4BillDepth\)
Create a scatter plot of the quantitative variable vs the response. Then label the observations by the qualitative variable.
#Bill Depth / Species plot with no regression lines
plot(x=pendata$bill_depth_mm, y=pendata$body_mass_g,pch=as.numeric(factor(pendata$species)),xlab="Bill Depth",ylab="Body Mass (grams)")
legend("topright",legend = levels(factor(pendata$species)),
pch = 1:nlevels(factor(pendata$species)))
NOTE: When we group the observations on the scatter plot for Bill Depth by species, the relationships for each level are positive, despite the overall negative trend. Also we see each species would have a similar slope, but different y-intercepts.
Let’s suppose we simplify the model to just species and bill depth:
\(E(body mass)=\beta_0+\beta_1SpeciesC+ \beta_2SpeciesG+\beta_3 BillDepth\) , where SpeciesC= 1 if Chinstrap, 0 otherwise and SpeciesG = 1 if Gentoo, 0 otherwise
What if there were different slopes for each group? We will explore this next time!
We will explore the various styles of interactions through several combinations of variables.
If an interaction is present, we would note non-parallel (different) slopes for each level of the qualitative variable.
# Bill Depth / Species Interaction Scatter Plot with regression lines
plot(x=pendata$bill_depth_mm, y=pendata$body_mass_g,pch=as.numeric(factor(pendata$species)),xlab="Bill Depth",ylab="Body Mass (grams)")
legend("topright",legend = levels(factor(pendata$species)),
pch = 1:nlevels(factor(pendata$species)))
for (g in levels(as.factor(pendata$species))) {
group_subset <- subset(pendata, species == g)
model <- lm(body_mass_g ~ bill_depth_mm, data = group_subset)
abline(model,lwd = 1)
}
There does not appear to be strong evidence of an interaction because the bill depth-body mass slopes for each species appear to be the same.
How would these “different slopes” show up in the model?
Write a model for E(y) as a function of bill depth and species that hypothesizes different bill depth-body mass slopes for each species.
\(E(body mass)=\beta_0+\beta_1SpeciesC+ \beta_2SpeciesG+\beta_3 BillDepth\)
\(+\beta_4SpeciesC*BillDepth+ \beta_5SpeciesG*BillDepth\)
, where SpeciesC= 1 if Chinstrap, 0 otherwise and SpeciesG = 1 if Gentoo, 0 otherwise
The bill depth-body mass slope for Adelie (SpeciesC=0,SpeciesG=0) will be: \(\beta_3\)
The bill depth-body mass slope for Chinstrap (SpeciesC=1,SpeciesG=0) will be: \(\beta_3+\beta_4\)
The bill depth-body mass slope for Gentoo (SpeciesC=0,SpeciesG=1) will be: \(\beta_3+\beta_5\)
In the last section we saw that by just including the dummy variables, we account for different y-intercepts for each species. By adding an interaction, we allow for different slopes.
penmod3<-lm(body_mass_g~species+bill_depth_mm+species*bill_depth_mm,data=pendata)
summary(penmod3)
Call:
lm(formula = body_mass_g ~ species + bill_depth_mm + species *
bill_depth_mm, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-845.89 -254.74 -28.46 228.01 1161.41
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -283.28 437.94 -0.647 0.5182
speciesChinstrap 247.06 829.77 0.298 0.7661
speciesGentoo -175.71 658.43 -0.267 0.7897
bill_depth_mm 217.15 23.82 9.117 <2e-16 ***
speciesChinstrap:bill_depth_mm -12.53 45.01 -0.278 0.7809
speciesGentoo:bill_depth_mm 152.29 40.49 3.761 0.0002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 354.9 on 336 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.807, Adjusted R-squared: 0.8041
F-statistic: 281 on 5 and 336 DF, p-value: < 2.2e-16
#Note we can also simplify the syntax
penmod3a<-lm(body_mass_g~species*bill_depth_mm,data=pendata)
summary(penmod3a)
Call:
lm(formula = body_mass_g ~ species * bill_depth_mm, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-845.89 -254.74 -28.46 228.01 1161.41
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -283.28 437.94 -0.647 0.5182
speciesChinstrap 247.06 829.77 0.298 0.7661
speciesGentoo -175.71 658.43 -0.267 0.7897
bill_depth_mm 217.15 23.82 9.117 <2e-16 ***
speciesChinstrap:bill_depth_mm -12.53 45.01 -0.278 0.7809
speciesGentoo:bill_depth_mm 152.29 40.49 3.761 0.0002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 354.9 on 336 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.807, Adjusted R-squared: 0.8041
F-statistic: 281 on 5 and 336 DF, p-value: < 2.2e-16
Write the prediction equation
\(\widehat{body mass}=-283.28+247.06SpeciesC-175.71SpeciesG+217.15 BillDepth\)
\(-12.53SpeciesC*BillDepth+ 152.29SpeciesG*BillDepth\)
Interpret the slope estimate for bill depth
If an interaction is present, we would notice extreme crossing of levels. The relationship of one variable with the response DEPENDS on the other variable.
# We need to verify there are non-zero observations for every combination of level
table(pendata$species,pendata$sex)
female male
Adelie 73 73
Chinstrap 34 34
Gentoo 58 61
# We can plot interactions either way
interaction.plot(pendata$species, pendata$sex, pendata$body_mass_g,fun=mean,trace.label="Sex", xlab="Species",ylab="Mean Body Mass")
interaction.plot(pendata$sex, pendata$species, pendata$body_mass_g,fun=mean,trace.label="Species", xlab="Sex",ylab="Mean Body Mass")
There is not strong evidence of an interaction because there is not drastic switching of the levels.
Write a model for E(y) as a function of sex and species and its interaction.
\(E(body mass)=\beta_0+\beta_1SpeciesC+ \beta_2SpeciesG+\beta_3 SexM\)
\(+\beta_4SpeciesC*SexM+ \beta_5SpeciesG*SexM\)
Coefficient for SpeciesC = \(\beta_1 +\beta_4*SexM\)
Coefficient for SpeciesG = \(\beta_2 +\beta_5*SexM\)
Coefficient for SexM = \(\beta_3 +\beta_4*SpeciesC+\beta_5*SpeciesG\)
There are many ways to interpret these betas. For instance, the average value of males’ body mass depends on the species. Or the average body mass of a particular species depends on the sex of the penguin. Remember qualitative variables do not have “slope” coefficients. Instead we interpret as the (difference in the) average value of y, given a combination of categorical levels.
penmod4<-lm(body_mass_g~species+sex+species*sex,data=pendata)
summary(penmod4)
Call:
lm(formula = body_mass_g ~ species + sex + species * sex, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-827.21 -213.97 11.03 206.51 861.03
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3368.84 36.21 93.030 < 2e-16 ***
speciesChinstrap 158.37 64.24 2.465 0.01420 *
speciesGentoo 1310.91 54.42 24.088 < 2e-16 ***
sexmale 674.66 51.21 13.174 < 2e-16 ***
speciesChinstrap:sexmale -262.89 90.85 -2.894 0.00406 **
speciesGentoo:sexmale 130.44 76.44 1.706 0.08886 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 309.4 on 327 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8546, Adjusted R-squared: 0.8524
F-statistic: 384.3 on 5 and 327 DF, p-value: < 2.2e-16
#Note we can use simplified syntax
penmod4a<-lm(body_mass_g~species*sex,data=pendata)
summary(penmod4a)
Call:
lm(formula = body_mass_g ~ species * sex, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-827.21 -213.97 11.03 206.51 861.03
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3368.84 36.21 93.030 < 2e-16 ***
speciesChinstrap 158.37 64.24 2.465 0.01420 *
speciesGentoo 1310.91 54.42 24.088 < 2e-16 ***
sexmale 674.66 51.21 13.174 < 2e-16 ***
speciesChinstrap:sexmale -262.89 90.85 -2.894 0.00406 **
speciesGentoo:sexmale 130.44 76.44 1.706 0.08886 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 309.4 on 327 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8546, Adjusted R-squared: 0.8524
F-statistic: 384.3 on 5 and 327 DF, p-value: < 2.2e-16
Write the prediction equation
\(\widehat{body mass}=3368.84+158.37SpeciesC+1310.91SpeciesG+674.66 SexM\)
\(-262.89SpeciesC*SexM+ 130.44SpeciesG*SexM\)
Plug in the appropriate 1s and 0s for the dummy variables to interpret the coefficients.
Estimate of coefficient for SpeciesC = \(158.37 -262.89*SexM\)
The estimated mean body mass for Chinstrap penguins is (158.31-262.89)=104.58 grams lower than Adelie penguins, for male penguins.
The estimated mean body mass for Chinstrap penguins is 158.31 higher than Adelie penguins, for female penguins.
Estimate of coefficient for SpeciesG = \(1310.91 +130.44*SexM\)
The estimated mean body mass for Gentoo penguins is (1310.91+130.44)=1441.35 grams higher than Adelie penguins, for male penguins.
The estimated mean body mass for Gentoo penguins is 1310.91 grams higher than Adelie penguins, for female penguins.
Estimate of coefficient for SexM = \(674.66 -262.89*SpeciesC+130.44*SpeciesG\)
The estimated mean body mass for male penguins is (674.66-262.89)=411.77 grams higher than female penguins, for Chinstrap penguins.
The estimated mean body mass for male penguins is (674.66+130.44)=805.1 grams higher than female penguins, for Gentoo penguins.
The estimated mean body mass for male penguins is 674.66 grams higher than female penguins, for Adelie penguins.
There is no visualization for two quantitative variables having an interaction. We base our justificaiton off of intuition or research.
Write a model for E(y) as a function of bill depth and bill length that allows for an interaction.
\(E(body mass)=\beta_0+\beta_1BillLength+\beta_2 BillDepth\)
\(+\beta_3BillLength*BillDepth\)
penmod5<-lm(body_mass_g~bill_length_mm+bill_depth_mm+bill_length_mm*bill_depth_mm,data=pendata)
summary(penmod5)
Call:
lm(formula = body_mass_g ~ bill_length_mm + bill_depth_mm + bill_length_mm *
bill_depth_mm, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-1811.29 -355.81 4.35 354.80 1606.90
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -25583.278 2668.939 -9.586 <2e-16 ***
bill_length_mm 715.006 58.681 12.185 <2e-16 ***
bill_depth_mm 1484.934 149.405 9.939 <2e-16 ***
bill_length_mm:bill_depth_mm -36.079 3.297 -10.944 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 503.5 on 338 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.6093, Adjusted R-squared: 0.6058
F-statistic: 175.7 on 3 and 338 DF, p-value: < 2.2e-16
#We can used shortened syntax
penmod5a<-lm(body_mass_g~bill_length_mm*bill_depth_mm,data=pendata)
summary(penmod5a)
Call:
lm(formula = body_mass_g ~ bill_length_mm * bill_depth_mm, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-1811.29 -355.81 4.35 354.80 1606.90
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -25583.278 2668.939 -9.586 <2e-16 ***
bill_length_mm 715.006 58.681 12.185 <2e-16 ***
bill_depth_mm 1484.934 149.405 9.939 <2e-16 ***
bill_length_mm:bill_depth_mm -36.079 3.297 -10.944 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 503.5 on 338 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.6093, Adjusted R-squared: 0.6058
F-statistic: 175.7 on 3 and 338 DF, p-value: < 2.2e-16
The prediction equation is:
\(\widehat{body mass}=-25583.278+715.006BillLength+1484.934 BillDepth-36.079 BillLength*BillDepth\)
Suppose a penguin has a bill depth of 17mm. What is the slope of bill length?
Suppose a penguin has a bill length of 40mm. What is the slope of bill depth?