These data were collected from 2007 - 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network. The data were imported directly from the Environmental Data Initiative (EDI) Data Portal, and are available for use by CC0 license (“No Rights Reserved”) in accordance with the Palmer Station Data Policy.

Unit 2.3: A Complete Analysis

Step 1: Collect the Data

Check the appropriateness of response variable for regression: View a histogram of response variable. It should be continuous, and approximately unimodal and symmetric, with few outliers.

pendata<-read.csv("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/penguins.csv")
head(pendata)
names(pendata)
## [1] "X"                 "species"           "island"           
## [4] "bill_length_mm"    "bill_depth_mm"     "flipper_length_mm"
## [7] "body_mass_g"       "sex"               "year"
hist(pendata$body_mass_g, xlab="Body Mass", main="Histogram of Body Mass (in grams)") 

Step 2: Hypothesize Relationship (Exploratory Data Analysis)

We will explore the relationship with quantitative variables with scatter plots and correlations and classify each relationship as linear, curvilinear, or none. We explore the box plots and means for each qualitative variable explanatory variable then classify the relationships as existent or not. Additionally, we can explore interactions.

#Scatter plots for quantitative variables
for (i in names(pendata)[4:6]) {
  plot(pendata[,i], pendata$body_mass_g,xlab=i,ylab="Body Mass (grams)")
}

#Correlations for quantitative variables
round(cor(pendata[4:6],pendata$body_mass_g,use="complete.obs"),3)
                    [,1]
bill_length_mm     0.595
bill_depth_mm     -0.472
flipper_length_mm  0.871
#Summary Statistics for response variable grouped by each level of the response
tapply(pendata$body_mass_g,pendata$species,summary)
$Adelie
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   2850    3350    3700    3701    4000    4775       1 

$Chinstrap
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2700    3488    3700    3733    3950    4800 

$Gentoo
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   3950    4700    5000    5076    5500    6300       1 
tapply(pendata$body_mass_g,pendata$sex,summary)
$female
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2700    3350    3650    3862    4550    5200 

$male
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3250    3900    4300    4546    5312    6300 
#Box plots for Qualitative species and sex
boxplot(body_mass_g~species,pendata, ylab="Body Mass (grams)")

boxplot(body_mass_g~sex,pendata, ylab="Body Mass (grams)")

# Interactions for species X Sex We need to verify there are observations for every combination of level
table(pendata$species,pendata$sex)
           
            female male
  Adelie        73   73
  Chinstrap     34   34
  Gentoo        58   61
# plot interaction for Species X Sex
interaction.plot(pendata$species, pendata$sex, pendata$body_mass_g,fun=mean,trace.label="Sex", xlab="Species",ylab="Mean Body Mass")

# Interaction between Bill Depth X Species
plot(body_mass_g~bill_depth_mm , col=factor(species),data=pendata,xlab="Bill Depth",ylab="Body Mass (grams)") 
legend("topright",legend = levels(factor(pendata$species)), pch = 19,
       col = factor(levels(factor(pendata$species))))

# Interaction bill depth/sex
plot(body_mass_g~bill_depth_mm , col=factor(sex),data=pendata,xlab="Bill Depth",ylab="Body Mass (grams)") 
legend("topright",
       legend = levels(factor(pendata$sex)),
       pch = 19,
       col = factor(levels(factor(pendata$sex))))

# Interaction bill length/species
plot(body_mass_g~bill_length_mm , col=factor(species),data=pendata,xlab="Bill Length",ylab="Body Mass (grams)") 
legend("topright",
       legend = levels(factor(pendata$species)),
       pch = 19,
       col = factor(levels(factor(pendata$species))))

# Interaction bill length/sex
plot(body_mass_g~bill_length_mm , col=factor(sex),data=pendata,xlab="Bill Length",ylab="Body Mass (grams)") 
legend("topright",
       legend = levels(factor(pendata$sex)),
       pch = 19,
       col = factor(levels(factor(pendata$sex))))

# Interaction flipper length/species
plot(body_mass_g~flipper_length_mm , col=factor(species),data=pendata,xlab="Flipper Length",ylab="Body Mass (grams)") 
legend("topright",
       legend = levels(factor(pendata$species)),
       pch = 19,
       col = factor(levels(factor(pendata$species))))

# Interaction flipper/sex
plot(body_mass_g~flipper_length_mm , col=factor(sex),data=pendata,xlab="Flipper Length",ylab="Body Mass (grams)") 
legend("topright",
       legend = levels(factor(pendata$sex)),
       pch = 19,
       col = factor(levels(factor(pendata$sex))))

Step 3: Estimate the model parameters (fit the model using R)

Quantitative Variables

First we will add the quantitative variables of interest and the QUANT X QUANT interactions that we believe may be important. Here, we believe Bill Depth X Bill Length may be a useful interaction.

penmod1<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm,data=pendata)
summary(penmod1)

Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm + 
    bill_depth_mm * bill_length_mm, data = pendata)

Residuals:
     Min       1Q   Median       3Q      Max 
-1161.66  -252.74   -39.99   239.65  1139.91 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -20370.313   1960.714 -10.389  < 2e-16 ***
bill_depth_mm                   858.001    114.277   7.508 5.42e-13 ***
bill_length_mm                  352.102     47.415   7.426 9.25e-13 ***
flipper_length_mm                43.350      2.486  17.441  < 2e-16 ***
bill_depth_mm:bill_length_mm    -19.071      2.585  -7.379 1.26e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 365.6 on 337 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.7946,    Adjusted R-squared:  0.7922 
F-statistic:   326 on 4 and 337 DF,  p-value: < 2.2e-16

Start with Global F test

  • Hypotheses:

    • \(H_0: \beta_1= \beta_2=\beta_3=\beta_4=0\) (the model is not adequate)
    • \(H_a\):at least one of \(\beta_1 , \beta_2 , \beta_3,\beta_4\) does not equal 0 (the model is adequate)
  • Distribution of test statistic: F with 4, 337 DF

  • Test Statistic: F=326

  • Pvalue: <0.0001

  • Decision: 0.0001<0.05 -> REJECT H0

  • Conclusion: The model with bill depth, bill length, flipper length, and the interaction of bill length and bill depth is adequate at predicting the body mass of penguins.

Test the “most important predictors”: Is the interaction significant?: We look at a t-test here because this interaction accounts for just one parameter.

  • Hypotheses:
    • \(H_0: \beta_4=0\) (the interaction does not contribute to predicting body mass)
    • \(H_a:\beta_4 \neq 0\) (the interaction contributes to predicting body mass)
  • Distribution of test statistic: T with 337 DF
  • Test Statistic: t=-7.379
  • Pvalue: <0.0001
  • Decision: <0.0001<0.05 -> REJECT H0
  • Conclusion: The interaction between bill depth and bill length is significant at predicting body mass of penguins.

We will not do any further testing as we do not have another variables of interest.

Qualitative variables

We will add the qualitative variables to the model. We did not believe the interaction between species and sex looked important, so we will not add it.

penmod2<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata)
summary(penmod2)

Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm + 
    bill_depth_mm * bill_length_mm + species + sex, data = pendata)

Residuals:
    Min      1Q  Median      3Q     Max 
-706.90 -175.12   -6.14  176.80  880.08 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -6290.446   1832.633  -3.432 0.000675 ***
bill_depth_mm                  333.924     98.231   3.399 0.000760 ***
bill_length_mm                 130.274     41.058   3.173 0.001653 ** 
flipper_length_mm               16.535      2.888   5.725 2.35e-08 ***
speciesChinstrap              -200.763     82.322  -2.439 0.015273 *  
speciesGentoo                  923.788    132.379   6.978 1.68e-11 ***
sexmale                        391.603     47.370   8.267 3.55e-15 ***
bill_depth_mm:bill_length_mm    -6.344      2.290  -2.770 0.005919 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 284.4 on 325 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8778,    Adjusted R-squared:  0.8752 
F-statistic: 333.7 on 7 and 325 DF,  p-value: < 2.2e-16

NOTE: The model that is fit to match this syntax is:

\(E(bodymass)=\beta_0+\beta_1BillDep+\beta_2BillLen+\beta_3FlipperLen+\beta_4SpeciesC+\beta_5SpeciesG+\beta_6SexM+\beta_7BillDep*BillLen\)

  • where Species C= 1 if Chinstrap, 0 otherwise
  • SpeciesG = 1 if Gentoo, 0 otherwise
  • SexM = 1 if male, 0 otherwise

Test the “most important predictors”: Is the species significant?: We look at a nested f test here because species accounts for two parameters.

redpenmod2<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+sex,data=pendata)
summary(redpenmod2)

Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm + 
    bill_depth_mm * bill_length_mm + sex, data = pendata)

Residuals:
     Min       1Q   Median       3Q      Max 
-1027.84  -216.45   -26.22   200.94   903.55 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -14086.152   1833.599  -7.682 1.83e-13 ***
bill_depth_mm                   610.319    103.472   5.898 9.16e-09 ***
bill_length_mm                  283.535     42.278   6.707 8.77e-11 ***
flipper_length_mm                34.146      2.396  14.253  < 2e-16 ***
sexmale                         492.047     49.009  10.040  < 2e-16 ***
bill_depth_mm:bill_length_mm    -15.633      2.300  -6.798 5.03e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 319.5 on 327 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8449,    Adjusted R-squared:  0.8426 
F-statistic: 356.3 on 5 and 327 DF,  p-value: < 2.2e-16
anova(redpenmod2,penmod2)
  • Hypotheses:
    • \(H_0: \beta_4=\beta_5=0\) (the species does not contribute to predicting body mass)
    • \(H_a:\beta_4, \beta_5 \neq 0\) (the species contributes to predicting body mass)
  • Distribution of test statistic: F 2 with 325 DF
  • Test Statistic: F=43.795
  • Pvalue: <0.0001
  • Decision: <0.0001<0.05 -> REJECT H0
  • Conclusion: The species is significant at predicting body mass of penguins. We will keep both dummy variables in the model and not test them individually.

We will not do any further testing as we did not have another variable of interest.

Qual X Quant Interaction

We will add the interaction between bill depth and species because we thought that looked most important.

penmod3<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex+species*bill_depth_mm,data=pendata)
summary(penmod3)

Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm + 
    bill_depth_mm * bill_length_mm + species + sex + species * 
    bill_depth_mm, data = pendata)

Residuals:
    Min      1Q  Median      3Q     Max 
-763.71 -170.22   -3.61  167.06  887.00 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    -4562.344   2259.560  -2.019   0.0443 *  
bill_depth_mm                    241.956    126.493   1.913   0.0567 .  
bill_length_mm                    90.930     55.312   1.644   0.1012    
flipper_length_mm                 15.946      2.924   5.453 9.88e-08 ***
speciesChinstrap                 742.233    911.667   0.814   0.4162    
speciesGentoo                    400.335    584.241   0.685   0.4937    
sexmale                          386.571     47.305   8.172 6.96e-15 ***
bill_depth_mm:bill_length_mm      -4.100      3.160  -1.298   0.1953    
bill_depth_mm:speciesChinstrap   -52.218     51.172  -1.020   0.3083    
bill_depth_mm:speciesGentoo       38.098     36.315   1.049   0.2949    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 283.6 on 323 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8793,    Adjusted R-squared:  0.876 
F-statistic: 261.6 on 9 and 323 DF,  p-value: < 2.2e-16

NOTE: The model that is fit to match this syntax is:

\(E(bodymass)=\beta_0+\beta_1BillDep+\beta_2BillLen+\beta_3FlipperLen+\beta_4SpeciesC+\beta_5SpeciesG+\beta_6SexM+\beta_7BillDep*BillLen+\beta_8SpeciesC*BillDep+\beta_9SpeciesG*BillDep\)

  • where Species C= 1 if Chinstrap, 0 otherwise
  • SpeciesG = 1 if Gentoo, 0 otherwise
  • SexM = 1 if male, 0 otherwise

Test the “most important predictors”: Is the species X bill depth interaction significant?: We look at a nested f test here because this interaction accounts for two parameters.

redpenmod3<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata)
summary(redpenmod3)

Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm + 
    bill_depth_mm * bill_length_mm + species + sex, data = pendata)

Residuals:
    Min      1Q  Median      3Q     Max 
-706.90 -175.12   -6.14  176.80  880.08 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -6290.446   1832.633  -3.432 0.000675 ***
bill_depth_mm                  333.924     98.231   3.399 0.000760 ***
bill_length_mm                 130.274     41.058   3.173 0.001653 ** 
flipper_length_mm               16.535      2.888   5.725 2.35e-08 ***
speciesChinstrap              -200.763     82.322  -2.439 0.015273 *  
speciesGentoo                  923.788    132.379   6.978 1.68e-11 ***
sexmale                        391.603     47.370   8.267 3.55e-15 ***
bill_depth_mm:bill_length_mm    -6.344      2.290  -2.770 0.005919 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 284.4 on 325 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8778,    Adjusted R-squared:  0.8752 
F-statistic: 333.7 on 7 and 325 DF,  p-value: < 2.2e-16
anova(redpenmod3,penmod3)
  • Hypotheses:
    • \(H_0: \beta_8=\beta_9=0\) (the interaction does not contribute to predicting body mass)
    • \(H_a:\beta_8, \beta_9 \neq 0\) (the interaction contributes to predicting body mass)
  • Distribution of test statistic: F 2 with 323 DF
  • Test Statistic: F=1.99
  • Pvalue: 0.137
  • Decision: 0.137>0.05 -> FAIL TO REJECT H0
  • Conclusion: The interaction between species and bill depth is not significant at predicting body mass of penguins. We will remove it from the model.

We will not do any further testing as we did not have another variable of interest.

Our final model is:

\(E(bodymass)=\beta_0+\beta_1BillDep+\beta_2BillLen+\beta_3FlipperLen+\beta_4SpeciesC+\beta_5SpeciesG+\beta_6SexM+\beta_7BillDep*BillLen\)

  • where Species C= 1 if Chinstrap, 0 otherwise
  • SpeciesG = 1 if Gentoo, 0 otherwise
  • SexM = 1 if male, 0 otherwise

Interpret the estimates

penmod2<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata)
summary(penmod2)

Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm + 
    bill_depth_mm * bill_length_mm + species + sex, data = pendata)

Residuals:
    Min      1Q  Median      3Q     Max 
-706.90 -175.12   -6.14  176.80  880.08 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -6290.446   1832.633  -3.432 0.000675 ***
bill_depth_mm                  333.924     98.231   3.399 0.000760 ***
bill_length_mm                 130.274     41.058   3.173 0.001653 ** 
flipper_length_mm               16.535      2.888   5.725 2.35e-08 ***
speciesChinstrap              -200.763     82.322  -2.439 0.015273 *  
speciesGentoo                  923.788    132.379   6.978 1.68e-11 ***
sexmale                        391.603     47.370   8.267 3.55e-15 ***
bill_depth_mm:bill_length_mm    -6.344      2.290  -2.770 0.005919 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 284.4 on 325 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8778,    Adjusted R-squared:  0.8752 
F-statistic: 333.7 on 7 and 325 DF,  p-value: < 2.2e-16

Step 4: Specify the distribution of the errors and find the estimate of the variance

Step 5: Evaluate the Utility of the model

First we Perform the Global F Test:

  • Hypotheses:

    • \(H_0: \beta_1= ...=\beta_7=0\) (the model is not adequate)
    • \(H_a\):at least one of \(\beta_1 , ... \beta_7\) does not equal 0 (the model is adequate)
  • Distribution of test statistic: F with 7,325 DF

  • Test Statistic: F=333.7

  • Pvalue: <2.2e-16

  • Decision: pvalue<0.05 -> REJECT H0

  • Conclusion: The model is adequate at body mass for the penguins.

Further Assessment:

  • Adjusted R-Sq: 0.87 ( it is close to R2, which indicates a good fit)
    • 87.5% of the variation in body explained by the model with bill depth, bill length, flipper length, species, sex, and the interaction between bill depth and bill length.
  • Confidence Interval for Betas

Step 6: Check the Model Assumptions

We will cover this in Unit 3

Step 7: Use the model for prediction or estimation

# Or we can create a data frame with the new values.
new<-data.frame(bill_depth_mm=17,bill_length_mm=40,flipper_length_mm=205,species="Gentoo",sex="female")
new
predict(penmod2,new, interval="prediction")
       fit      lwr      upr
1 4596.685 4009.935 5183.436

We are 95% confident that a penguin with a bill depth of 17, bill length of 40, flipper length of 205, Gentoo species, and female to have a body mass between 4009.935 and 5183 grams.