Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program set out to study the impact of Ecological Sexual Dimorphism in arctic penguins. In 2014, that particular study compared deep-dive foraging habits of male and female penguins and how that resulted in different biological outcomes. To read more abut it, you can go to read the full study here. And you can find more of Dr. Gorman’s more recent work here

We only have a portion of the data from the larger scale study, so our goals and scope of analysis will be different. Our aim is to determine which factors impact the body mass of a penguin. This is useful to know as body mass is an important indicator of penguin’s health.

Our data table contains the following columns:

Unit 2.1 Qualitative Variables

Step 1: Collect the Data

pendata<-read.csv("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/penguins.csv")
head(pendata)
names(pendata)
## [1] "Observation"       "body_mass_g"       "species"          
## [4] "island"            "bill_length_mm"    "bill_depth_mm"    
## [7] "flipper_length_mm" "sex"               "year"
hist(pendata$body_mass_g, xlab="Body Mass", main="Histogram of Body Mass (in grams)") 

ADD YOUR NOTES ON THE SHAPE + DISTRIBUTION OF RESPONSE:

Step 2: Hypothesize Relationship (Exploratory Data Analysis)

RECALL: Our goal through the exploratory data analysis phase is to identify the individual relationships between the explanatory variables and the response.

In this example we will evaluate the qualitative variables species (with three levels) and sex (with two levels)

#Summary Statistics for response variable grouped by each level of the response
tapply(pendata$body_mass_g,pendata$species,summary)
$Adelie
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   2850    3350    3700    3701    4000    4775       1 

$Chinstrap
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2700    3488    3700    3733    3950    4800 

$Gentoo
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   3950    4700    5000    5076    5500    6300       1 
tapply(pendata$body_mass_g,pendata$sex,summary)
$female
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2700    3350    3650    3862    4550    5200 

$male
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3250    3900    4300    4546    5312    6300 
#Box plots for species and sex
boxplot(body_mass_g~species,pendata, ylab="Body Mass (grams)")

boxplot(body_mass_g~sex,pendata, ylab="Body Mass (grams)")

\(E(bodymass)=\beta_0+\beta_1 SpeciesC+\beta_2 SpeciesG+\beta_3 SexM\)

What does each \(\beta\) coefficient represent?

Step 3: Estimate the model parameters (fit the model using R)

#note: the syntax in R does not necessarily match the parameters
penmod1<-lm(body_mass_g~species+sex,data=pendata)
summary(penmod1)

Call:
lm(formula = body_mass_g ~ species + sex, data = pendata)

Residuals:
    Min      1Q  Median      3Q     Max 
-816.87 -217.80  -16.87  227.61  882.20 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       3372.39      31.43 107.308   <2e-16 ***
speciesChinstrap    26.92      46.48   0.579    0.563    
speciesGentoo     1377.86      39.10  35.236   <2e-16 ***
sexmale            667.56      34.70  19.236   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 316.6 on 329 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8468,    Adjusted R-squared:  0.8454 
F-statistic: 606.1 on 3 and 329 DF,  p-value: < 2.2e-16

The prediction equation is:

\(\widehat{bodymass}=3372.39+26.92SpeciesC+1377.86SpeciesG+667.56SexM\)

Interpret the estimates

  • The estimated mean body mass for Adelie female penguins is 3372.39 grams.
  • The estimated mean body mass for Chinstrap penguins is 26.92 grams higher than Adelie penguins, given gender is held constant.
  • The estimated mean body mass for Gentoo penguins is 1377.86 grams higher than Adelie penguins, given gender is held constant.
  • The estimated mean body mass for male penguins is 667.56 grams higher than female penguins, given species is held constant.

Note

We can recode our base level using the relevel function. This is optional, but could be useful if we have a particular interest group.

In this example, Gentoo appears to be the most different from the other species, so it might be useful to define it as the base level. Then each of our estimates will compare to Gentoo.

#This is only done if we desire to change the base level 
recodedSpecies<-relevel(factor(pendata$species),ref="Gentoo")
penmod6<-lm(body_mass_g~recodedSpecies,data=pendata)
summary(penmod6)

Call:
lm(formula = body_mass_g ~ recodedSpecies, data = pendata)

Residuals:
     Min       1Q   Median       3Q      Max 
-1126.02  -333.09   -33.09   316.91  1223.98 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)              5076.02      41.68  121.78   <2e-16 ***
recodedSpeciesAdelie    -1375.35      56.15  -24.50   <2e-16 ***
recodedSpeciesChinstrap -1342.93      69.86  -19.22   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 462.3 on 339 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.6697,    Adjusted R-squared:  0.6677 
F-statistic: 343.6 on 2 and 339 DF,  p-value: < 2.2e-16

Step 4: Specify the distribution of the errors and find the estimate of the variance

Step 5: Evaluate the Utility of the model

Recall the model:

\(E(bodymass)=\beta_0+\beta_1 SpeciesC+\beta_2 SpeciesG+\beta_3 SexM\)

First we Perform the Global F Test:

  • Hypotheses:

    • \(H_0: \beta_1= \beta_2=\beta_3=0\) (the model is not adequate)
    • \(H_a\):at least one of \(\beta_1 , \beta_2 , \beta_3\) does not equal 0 (the model is adequate)
  • Distribution of test statistic: F with 3, 329 DF

  • Test Statistic: F=606.1

  • Pvalue: <0.0000001

  • Decision: pvalue<0.05 -> REJECT H0

  • Conclusion: The model with species and sex is adequate at body mass for the penguins.

What terms can we test “individually?”

  • Species is determined by two parameters (since SpeciesC and SpeciesG are the dummy variables), so we would want to keep or remove BOTH of those dummy variables. We will cover this later.

  • We can test sex because it is only associated with one parameter in the model.

  • Hypotheses:

    • \(H_0: \beta_3=0\)
    • \(H_a:\beta_3 \neq 0\)
  • Distribution of test statistic: T with 329 DF

  • Test Statistic: t=19.236

  • Pvalue: <0.0000001

  • Decision: pvalue<0.05 -> REJECT H0

  • Conclusion: The sex of the penguin is significant at predicting the body mass of a penguin given species is a constant in the model. We will keep it in the model.

Quantitative predictors

Note we can include both qualitative and quantitative predictors in the model.

penmod2<-lm(body_mass_g~species+sex+bill_depth_mm,data=pendata)
summary(penmod2)

Call:
lm(formula = body_mass_g ~ species + sex + bill_depth_mm, data = pendata)

Residuals:
   Min     1Q Median     3Q    Max 
-851.2 -187.7   -3.5  221.7  936.3 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       1576.99     357.08   4.416 1.37e-05 ***
speciesChinstrap    19.44      44.87   0.433    0.665    
speciesGentoo     1721.70      77.88  22.106  < 2e-16 ***
sexmale            513.99      45.24  11.360  < 2e-16 ***
bill_depth_mm      102.04      20.22   5.046 7.48e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 305.5 on 328 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8578,    Adjusted R-squared:  0.8561 
F-statistic: 494.8 on 4 and 328 DF,  p-value: < 2.2e-16

Write the model that was fit

\(E(bodymass)=\beta_0+\beta_1 SpeciesC+\beta_2 SpeciesG+\beta_3 SexM+\beta_4BillDepth\)

  • where SpeciesC= 1 if Chinstrap, 0 otherwise
  • SpeciesG - 1 if Gentoo, 0 otherwise
  • SexM = 1 if male, 0 otherwise

Interpret the estimates

  • Quantitative Variable: \(\hat{\beta_4}=102.04\) For each millimeter increase in bill depth, we expect the body weight of a penguin to increase by 102.04 grams, given species and sex are held constant.
    • Qualitative Variable: The estimated mean body mass for male penguins is 513.99 grams higher than female penguins, given species and bill depth are held constant.
  • Interpret species coefficients

Visualizing Quantitative and Qualitative Variables

Create a scatter plot of the quantitative variable vs the response. Then label the observations by the qualitative variable.

#Bill Depth / Species plot with no regression lines
plot(x=pendata$bill_depth_mm, y=pendata$body_mass_g,pch=as.numeric(factor(pendata$species)),xlab="Bill Depth",ylab="Body Mass (grams)") 
legend("topright",legend = levels(factor(pendata$species)),
       pch = 1:nlevels(factor(pendata$species)))

NOTE: When we group the observations on the scatter plot for Bill Depth by species, the relationships for each level are positive, despite the overall negative trend. Also we see each species would have a similar slope, but different y-intercepts.

Let’s suppose we simplify the model to just species and bill depth:

\(E(body mass)=\beta_0+\beta_1SpeciesC+ \beta_2SpeciesG+\beta_3 BillDepth\) , where SpeciesC= 1 if Chinstrap, 0 otherwise and SpeciesG = 1 if Gentoo, 0 otherwise

  • The body mass- bill depth y intercept for Adelie (SpeciesC=0,SpeciesG=0) will be: \(\beta_0\)
  • The body mass- bill depth y intercept for Chinstrap (SpeciesC=1,SpeciesG=0) will be: \(\beta_0+\beta_1\)
  • The body mass- bill depth y intercept for Gentoo (SpeciesC=0,SpeciesG=1) will be: \(\beta_0+\beta_2\)
  • All species will have the same body mass- bill depth slope of \(\beta_3\)

What if there were different slopes for each group? We will explore this next time!