These data were collected from 2007 - 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network. The data were imported directly from the Environmental Data Initiative (EDI) Data Portal, and are available for use by CC0 license (“No Rights Reserved”) in accordance with the Palmer Station Data Policy.

Step 1: Collect the Data

Check the appropriateness of response variable for regression: View a histogram of response variable. It should be continuous, and approximately unimodal and symmetric, with few outliers.

pendata<-read.csv("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/penguins.csv")
head(pendata)
names(pendata)
## [1] "X"                 "species"           "island"           
## [4] "bill_length_mm"    "bill_depth_mm"     "flipper_length_mm"
## [7] "body_mass_g"       "sex"               "year"
hist(pendata$body_mass_g, xlab="Body Mass", main="Histogram of Body Mass (in grams)") 

Step 2: Hypothesize Relationship (Exploratory Data Analysis)

We explore the box plots and means for each qualitative variable explanatory variable then classify the relationships as existent or not.

#Summary Statistics for response variable grouped by each level of the response
tapply(pendata$body_mass_g,pendata$species,summary)
$Adelie
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   2850    3350    3700    3701    4000    4775       1 

$Chinstrap
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2700    3488    3700    3733    3950    4800 

$Gentoo
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   3950    4700    5000    5076    5500    6300       1 
tapply(pendata$body_mass_g,pendata$sex,summary)
$female
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2700    3350    3650    3862    4550    5200 

$male
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3250    3900    4300    4546    5312    6300 
#Box plots for species and sex
boxplot(body_mass_g~species,pendata, ylab="Body Mass (grams)")

boxplot(body_mass_g~sex,pendata, ylab="Body Mass (grams)")

#Scatter plots for quantitative variables
for (i in names(pendata)[4:6]) {
  plot(pendata[,i], pendata$body_mass_g,xlab=i,ylab="Body Mass (grams)")
}

#Correlations for quantitative variables
round(cor(pendata[4:6],pendata$body_mass_g,use="complete.obs"),3)
                    [,1]
bill_length_mm     0.595
bill_depth_mm     -0.472
flipper_length_mm  0.871

\(E(bodymass)=\beta_0+\beta_1 SpeciesC+\beta_2 SpeciesG+\beta_3 Sex\)

Step 3: Estimate the model parameters (fit the model using R)

penmod1<-lm(body_mass_g~species+sex,data=pendata)
summary(penmod1)

Call:
lm(formula = body_mass_g ~ species + sex, data = pendata)

Residuals:
    Min      1Q  Median      3Q     Max 
-816.87 -217.80  -16.87  227.61  882.20 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       3372.39      31.43 107.308   <2e-16 ***
speciesChinstrap    26.92      46.48   0.579    0.563    
speciesGentoo     1377.86      39.10  35.236   <2e-16 ***
sexmale            667.56      34.70  19.236   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 316.6 on 329 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8468,    Adjusted R-squared:  0.8454 
F-statistic: 606.1 on 3 and 329 DF,  p-value: < 2.2e-16
The prediction equation is:

\(\widehat{bodymass}=3372.39+26.92SpeciesC+1377.86SpeciesG+667.56Sex\)

Interpret the estimates

  • The average body mass for Adelie female penguins is 3372.39 grams.
  • The estimated mean body mass for Chinstrap penguins is 26.92 grams higher than Adelie penguins, given gender is held constant.
  • The estimated mean body mass for Gentoo penguins is 1377.86 grams higher than Adelie penguins, given gender is held constant.
  • The estimated mean body mass for male penguins is 667.56 grams higher than female penguins, given species is held constant.

Step 4: Specify the distribution of the errors and find the estimate of the variance

Step 5: Evaluate the Utility of the model

First we Perform the Global F Test:

  • Hypotheses:

    • \(H_0: \beta_1= \beta_2=\beta_3=0\) (the model is not adequate)
    • \(H_a\):at least one of \(\beta_1 , \beta_2 , \beta_3\) does not equal 0 (the model is adequate)
  • Distribution of test statistic: F with 3,329 DF

  • Test Statistic: F=606.1

  • Pvalue: <2.2e-16

  • Decision: pvalue<0.05 -> REJECT H0

  • Conclusion: The model with species and sex is adequate at body mass for the penguins.

What terms can we test “individually?”

  • Species is determined by two parameters (SpeciesC and SpeciesG dummy variables), so we would want to keep or remove BOTH of those dummy variables. We will cover this later.

  • We can test sex because it is only one parameter in the model.

  • Hypotheses:

    • \(H_0: \beta_3=0\)
    • \(H_a:\beta_3 \neq 0\)
  • Distribution of test statistic: T with 329 DF

  • Test Statistic: t=19.236

  • Pvalue: <2.2e-16

  • Decision: pvalue<0.05 -> REJECT H0

  • Conclusion: The sex of the penguin is significant at predicting the body mass of a penguin given species is a constant in the model. We will keep it in the model.

Quantative predictors

Note we can include both qualitative and quantitative predictors in the model.

penmod2<-lm(body_mass_g~species+sex+bill_depth_mm,data=pendata)
summary(penmod2)

Call:
lm(formula = body_mass_g ~ species + sex + bill_depth_mm, data = pendata)

Residuals:
   Min     1Q Median     3Q    Max 
-851.2 -187.7   -3.5  221.7  936.3 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       1576.99     357.08   4.416 1.37e-05 ***
speciesChinstrap    19.44      44.87   0.433    0.665    
speciesGentoo     1721.70      77.88  22.106  < 2e-16 ***
sexmale            513.99      45.24  11.360  < 2e-16 ***
bill_depth_mm      102.04      20.22   5.046 7.48e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 305.5 on 328 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8578,    Adjusted R-squared:  0.8561 
F-statistic: 494.8 on 4 and 328 DF,  p-value: < 2.2e-16

Interpretation: For each millimeter increase in bill depth, we expect the body weight of a penguin to increase by 102.04 grams, given species and sex are held constant.

plot(body_mass_g~bill_depth_mm , col=factor(species),data=pendata,xlab="Bill Depth",ylab="Body Mass (grams)") 
legend("topright",legend = levels(factor(pendata$species)), pch = 19,
       col = factor(levels(factor(pendata$species))))

NOTE: When we regroup the scatter plot for Bill Depth by species, the relationships for each level are positive. Note also we see each species would have a similar slope, but different y-intercept.

Let’s suppose we simplify the model to just species and bill depth:

\(E(body mass)=\beta_0+\beta_1SpeciesC+ \beta_2SpeciesG+\beta_3 BillDepth\) , where SpeciesC= 1 if Chinstrap, 0 otherwise and SpeciesG = 1 if Gentoo, 0 otherwise

  • The body mass- bill depth y intercept for Adelie (SpeciesC=0,SpeciesG=0) will be: \(\beta_0\)
  • The body mass- bill depth y intercept for Chinstrap (SpeciesC=1,SpeciesG=0) will be: \(\beta_0+\beta_1\)
  • The body mass- bill depth y intercept for Gentoo (SpeciesC=0,SpeciesG=1) will be: \(\beta_0+\beta_2\)
  • All species will have the same body mass- bill depth slope of \(beta_3\)

What if there were different slopes for each group? We will explore this next time!

Unit 2.2 Interactions

We will explore the various styles of interactions through several combinations of variables.

Qualitative X Quantative

EDA: Grouped Scatter Plot (Species X Bill Depth)

plot(body_mass_g~bill_depth_mm , col=factor(species),data=pendata,xlab="Bill Depth",ylab="Body Mass (grams)") 
legend("topright",legend = levels(factor(pendata$species)), pch = 19,
       col = factor(levels(factor(pendata$species))))

Write a model for E(y) as a function of bill depth and species that hypothesizes different bill depth-body mass slopes for each species.

\(E(body mass)=\beta_0+\beta_1SpeciesC+ \beta_2SpeciesG+\beta_3 BillDepth\)

\(+\beta_4SpeciesC*BillDepth+ \beta_5SpeciesG*BillDepth\)

, where SpeciesC= 1 if Chinstrap, 0 otherwise and SpeciesG = 1 if Gentoo, 0 otherwise

  • The coefficient (slope) for BillDepth is \(\beta_3+\beta_4SpeciesC+ \beta_5SpeciesG\)
    • The bill depth-body mass slope for Adelie (X1=0,X2=0) will be: \(\beta_3\)

    • The bill depth-body mass slope for Chinstrap (X1=1,X2=0) will be: \(\beta_3+\beta_4\)

    • The bill depth-body mass slope for Gentoo (X1=0,X2=1) will be: \(\beta_3+\beta_5\)

Fit the model with the interaction Species X Bill Depth

penmod3<-lm(body_mass_g~species*bill_depth_mm,data=pendata)
summary(penmod3)

Call:
lm(formula = body_mass_g ~ species * bill_depth_mm, data = pendata)

Residuals:
    Min      1Q  Median      3Q     Max 
-845.89 -254.74  -28.46  228.01 1161.41 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     -283.28     437.94  -0.647   0.5182    
speciesChinstrap                 247.06     829.77   0.298   0.7661    
speciesGentoo                   -175.71     658.43  -0.267   0.7897    
bill_depth_mm                    217.15      23.82   9.117   <2e-16 ***
speciesChinstrap:bill_depth_mm   -12.53      45.01  -0.278   0.7809    
speciesGentoo:bill_depth_mm      152.29      40.49   3.761   0.0002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 354.9 on 336 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.807, Adjusted R-squared:  0.8041 
F-statistic:   281 on 5 and 336 DF,  p-value: < 2.2e-16

Write the prediction equation

\(\widehat{body mass}=-283.28+247.06SpeciesC-175.71SpeciesG+217.15 BillDepth\)

\(-12.53SpeciesC*BillDepth+ 152.29SpeciesG*BillDepth\)

Interpret the slope for bill depth

  • The slope is estimated as \(217.15 BillDepth -12.53SpeciesC+ 152.29SpeciesG\)
    • For a one millimeter increase in bill depth, the expected body mass will increase by 217.15 grams for the Adelie penguins.
    • For a one millimeter increase in bill depth, the expected body mass will increase by 204.62 (= 217.15-12.53) grams for the Chinstrap penguins.
      • For a one millimeter increase in bill depth, the expected body mass will increase by 369.44 (= 217.15+152.29) grams for the Chinstrap penguins.

Qualitative X Qualitative

EDA: Interaction Plot (Sex X Species)

# We need to verify there are observations for every combination of level
table(pendata$species,pendata$sex)
           
            female male
  Adelie        73   73
  Chinstrap     34   34
  Gentoo        58   61
# We can plot interactions either way
interaction.plot(pendata$species, pendata$sex, pendata$body_mass_g,fun=mean,trace.label="Sex", xlab="Species",ylab="Mean Body Mass")

interaction.plot(pendata$sex, pendata$species, pendata$body_mass_g,fun=mean,trace.label="Species", xlab="Sex",ylab="Mean Body Mass")

There is not strong evidence of an interaction because there is not crossing of the levels.

Write a model for E(y) as a function of sex and species and its interaction.

\(E(body mass)=\beta_0+\beta_1SpeciesC+ \beta_2SpeciesG+\beta_3 SexM\)

\(+\beta_4SpeciesC*SexM+ \beta_5SpeciesG*SexM\)

, where SpeciesC= 1 if Chinstrap, 0 otherwise and SpeciesG = 1 if Gentoo, 0 otherwise, SexM= 1 if Male, 0 if female

  • There are many ways to interpret these betas. For instance, the average value of males’ body mass depends on the species. Or the average body mass of a particular species depends on the sex of the penguin. Remember qualitative variables do not have “slope” coefficients.

Fit the model with the interaction Species X Sex

penmod4<-lm(body_mass_g~species*sex,data=pendata)
summary(penmod4)

Call:
lm(formula = body_mass_g ~ species * sex, data = pendata)

Residuals:
    Min      1Q  Median      3Q     Max 
-827.21 -213.97   11.03  206.51  861.03 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)               3368.84      36.21  93.030  < 2e-16 ***
speciesChinstrap           158.37      64.24   2.465  0.01420 *  
speciesGentoo             1310.91      54.42  24.088  < 2e-16 ***
sexmale                    674.66      51.21  13.174  < 2e-16 ***
speciesChinstrap:sexmale  -262.89      90.85  -2.894  0.00406 ** 
speciesGentoo:sexmale      130.44      76.44   1.706  0.08886 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 309.4 on 327 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8546,    Adjusted R-squared:  0.8524 
F-statistic: 384.3 on 5 and 327 DF,  p-value: < 2.2e-16

Write the prediction equation

\(\widehat{body mass}=368.84+158.37SpeciesC+1310.91SpeciesG+674.66 SexM\)

\(-262.89SpeciesC*SexM+ 130.44SpeciesG*SexM\)

Quantitative X Quantative

EDA: There is none (Bill Length X Bill Depth)

Write a model for E(y) as a function of bill depth and bill length that allows for an interaction.

\(E(body mass)=\beta_0+\beta_1BillLength+\beta_2 BillDepth\)

\(+\beta_3BillLength*BillDepth\)

  • The coefficient (slope) for BillLength is \(\beta_1+\beta_3BillDepth\)
  • The coefficient (slope) for BillDepth is \(\beta_2+\beta_3BillLength\)

Fit the model with the interaction Bill Length X Bill Depth

penmod5<-lm(body_mass_g~bill_length_mm*bill_depth_mm,data=pendata)
summary(penmod5)

Call:
lm(formula = body_mass_g ~ bill_length_mm * bill_depth_mm, data = pendata)

Residuals:
     Min       1Q   Median       3Q      Max 
-1811.29  -355.81     4.35   354.80  1606.90 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -25583.278   2668.939  -9.586   <2e-16 ***
bill_length_mm                  715.006     58.681  12.185   <2e-16 ***
bill_depth_mm                  1484.934    149.405   9.939   <2e-16 ***
bill_length_mm:bill_depth_mm    -36.079      3.297 -10.944   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 503.5 on 338 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.6093,    Adjusted R-squared:  0.6058 
F-statistic: 175.7 on 3 and 338 DF,  p-value: < 2.2e-16

\(\widehat{body mass}=-25583.278+715.006BillLength+1484.934 BillDepth+36.079 BillLength*BillDepth\)

Suppose a penguin has a bill depth of 17mm. What is the slope of bill length?

  • The slope estimate for BillLength is \(715.006-36.079*BillDepth\). When bill depth is 17, our slope is \(715.006-36.079*17= 101.663\). Therefore, for a one millimeter increase in bull length, the expected body mass will increase by 101.66 grams when bill length is held constant at 17mm.

Final Note

We can recoded our base level using the relevel function.

recodedSpecies<-relevel(factor(pendata$species),ref="Chinstrap")
penmod6<-lm(body_mass_g~recodedSpecies,data=pendata)
summary(penmod6)

Call:
lm(formula = body_mass_g ~ recodedSpecies, data = pendata)

Residuals:
     Min       1Q   Median       3Q      Max 
-1126.02  -333.09   -33.09   316.91  1223.98 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)           3733.09      56.06   66.59   <2e-16 ***
recodedSpeciesAdelie   -32.43      67.51   -0.48    0.631    
recodedSpeciesGentoo  1342.93      69.86   19.22   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 462.3 on 339 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.6697,    Adjusted R-squared:  0.6677 
F-statistic: 343.6 on 2 and 339 DF,  p-value: < 2.2e-16