Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program set out to study the impact of Ecological Sexual Dimorphism in arctic penguins. In 2014, that particular study compared deep-dive foraging habits of male and female penguins and how that resulted in different biological outcomes. To read more abut it, you can go to read the full study here. And you can find more of Dr. Gorman’s more recent work here

We only have a portion of the data from the larger scale study, so our goals and scope of analysis will be different. Our aim is to determine which factors impact the body mass of a penguin. This is useful to know as body mass is an important indicator of penguin’s health.

Our data table contains the following columns:

Observation: the observation number in the data table
body_mass_g: an integer denoting the weight of a penguin’s body (grams) (Response Variable)
species: a factor denoting the penguin species (Adélie, Chinstrap, or Gentoo)
island: a factor denoting the Palmer Archipelago island in Antarctica where each penguin was observed (Biscoe Point, Dream Island, or Torgersen Island)
bill_length_mm: a number denoting length of the dorsal ridge of a penguin bill (millimeters)
bill_depth_mm: a number denoting the depth of a penguin bill (millimeters)
flipper_length_mm: an integer denoting the length of a penguin flipper (millimeters)
sex: a factor denoting the sex of a penguin sex (male, female) based on molecular data
year: an integer denoting the year of study (2007, 2008, or 2009)

Step 1-5 Recap

pendata<-read.csv("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/penguins.csv")
head(pendata)

names(pendata)

[1] "Observation"       "body_mass_g"       "species"          
[4] "island"            "bill_length_mm"    "bill_depth_mm"    
[7] "flipper_length_mm" "sex"               "year"

Hypothesized Linear Model After going through our model building process:

\(E(bodymass)=\beta_0+\beta_1BillDep+\beta_2BillLen+\beta_3FlipperLen+\beta_4SpeciesC+\beta_5SpeciesG+\beta_6SexM+\beta_7BillDep*BillLen\)

where Species C= 1 if Chinstrap, 0 otherwise
SpeciesG = 1 if Gentoo, 0 otherwise
SexM = 1 if male, 0 otherwise

Step 6: Check Model Assumptions (Residual Analysis)

Check the model assumptions.

penmodfinal<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata)

#There are a few options to view plots for assumptions

#Residuals Plots of explanatory variables vs residuals
residualPlots(penmodfinal,tests=F)

#Residual vs Fitted and QQ plot
plot(penmodfinal, which=c(1,2))

#histogram of residuals
hist(residuals(penmodfinal))

Lack of Fit - Plots to evaluate: Residual Plots - There were no clear trends that could be modeled in the residual plots. The mean 0 (lack of fit) assumption was not violated.

Constant Variance - Plots to evaluate: Residual Plots and Residual vs Fitted - There were no clear fanning patterns in the residual plots nor the residual vs predicted plots. For qualitative variables we look for even spread of residuals across the levels. The constant variance (homoscedasticity) assumption was not violated.

Normality - Plots to evaluate: QQ Plot and Histogram of Residuals - The observations fall linearly on the QQ plot. (The residuals are unimodal and symmetric in the histogram). The normality assumption is not violated.

Independence - Plots to evaluate: No specific plots. We could perform the Durbin Watson test, but we will not do that in the scope of this course. - We do not believe the information of one penguin depends on another penguin. The independence assumption is likely not violated.

#Within lm() we can add both transformations

# You Can Try On Your Own
penmodT<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata)
#summary(penmodT)

## Re-evaluate Assumptions
#Residuals Plots of explanatory variables vs residuals
residualPlots(penmodT,tests=F)

#Residual vs Fitted and QQ plot
plot(penmodT, which=c(1,2))

#histogram of residuals
hist(residuals(penmodT))

Interpretations on Variable Transformations (Unit 3.3):

When we transform the explanatory variables, the interpretations do not change drastically

EX: for a one unit increase in the transformed value of X (square root of flipper length), the predicted value of y changes by \(\hat{\beta_i}\)

Similarly, the transformation for the response variable, y, follow:

EX: for a one unit increase in the value of x, the predicted value of the transformed value of y (y*) changes by \(\hat{\beta_i}\)
Prediction Interval: When we have this set of explanatory variables we expect y* to be between the interval computed.
Prediction Interval Transformed: When we have this set of explanatory variables we expect the value y to be between the interval computed if we take the inverse transformation of the end points.
Confidence Interval: When we have this set of explanatory variables we expect the average of y* (the transformed value of y) to be between the interval computed.
Confidence Interval Transformed: we cannot perform the inverse of the transformation to get the original units.

Identify outliers and influential observations.

Label the observations that exceed the threshold for each statistic:

Cook’s Distance: Identifies influential observations. Threshold is 4/n
- Circles on “influence plot” are indicative of the Cook’s distance value
Leverage (Hat): Identifies outliers in the x direction (extreme values of explanatory variables). Threshold is 2(k+1)/n
Studentizied Residuals: Identifies outliers in the y direction. Threshold is +/- 2
Deleted Studentizied Residuals (R Student): Identifies general outliers and influential observations. Threshold is +/- 2

Computing the Thresholds

n will be how many observations were used to fit the model, not necessarily how many observations were in the original data table. To determine this number we can go to the model F degrees of freedom k, n-k-1. In the pen model we have 7, 325. So k=7 and n=333.

#Determine n
summary(penmodfinal)


Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm + 
    bill_depth_mm * bill_length_mm + species + sex, data = pendata)

Residuals:
    Min      1Q  Median      3Q     Max 
-706.90 -175.12   -6.14  176.80  880.08 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -6290.446   1832.633  -3.432 0.000675 ***
bill_depth_mm                  333.924     98.231   3.399 0.000760 ***
bill_length_mm                 130.274     41.058   3.173 0.001653 ** 
flipper_length_mm               16.535      2.888   5.725 2.35e-08 ***
speciesChinstrap              -200.763     82.322  -2.439 0.015273 *  
speciesGentoo                  923.788    132.379   6.978 1.68e-11 ***
sexmale                        391.603     47.370   8.267 3.55e-15 ***
bill_depth_mm:bill_length_mm    -6.344      2.290  -2.770 0.005919 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 284.4 on 325 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8778,    Adjusted R-squared:  0.8752 
F-statistic: 333.7 on 7 and 325 DF,  p-value: < 2.2e-16

#numDF + (k+1) = n-k-1+(k+1) = n
n<-df.residual(penmodfinal)+length(coef(penmodfinal))
n

[1] 333

#Leverage Cut off
levcut<-(2*(length(coef(penmodfinal))))/n
levcut

[1] 0.04804805

#Cook's Distance Cutoff
cdcut<-4/n
cdcut

[1] 0.01201201

Now we can locate these values on the influence plots to identify outliers and influential observations.

#Cooks Distance Thresholds
plot(penmodfinal,which=4)

#Leverage vs Studentized Residuals
influencePlot(penmodfinal,fill=F)

#Deleted Studentized Residals
ols_plot_resid_stud_fit(penmodfinal)

Based on the labeled observations from the plots:

With respect to Leverage, observations 314, 15, 295 (and others) are considered “leverage”=outliers in x because their values exceed the threshold of 2(7+1)/333=0.048.
With respect to Studentizied Residuals, observations 170, 40, 315, 314 (and others) are considered outliers in y because their values exceed the threshold of +/-2 .
With respect to Cook’s Distance, observations 293, 314, 315 are considered influential because their values exceed the threshold of 4/n=4/333=0.012.
With respect to Deleted Studentizied Residuals, observations that are labelled red (164, 35, 304, etc) are considered influential because their values exceed the threshold of +/-2.
Note: The “influence output” only labels some of the observations that exceed our thresholds.

Determine all of the observations that exceed our thresholds

We can use the which() function to print values that meet certain criteria from a logical expression.

#General expressions about the variables in the data table

#Quantitative Variables
which(pendata$bill_depth_mm<17.3)

# Qualitative Variables
which(pendata$species=="Gentoo")

#Multiple arguments
which(pendata$species=="Gentoo"&pendata$bill_depth_mm<17.3)
which(pendata$species=="Gentoo"|pendata$bill_depth_mm<17.3)

We have several functions that will compute the values of our influence diagnostics for every observation. The input argument for each of these functions is the model that was fit.

Leverage: hatvalues()
Studentized Residuals: stdres()
Cook’s Distance: cooks.distance()
Deleted Studentized Residuals: rstudent()

# Using our influential diagnostics
## Leverage
which(hatvalues(penmodfinal)>16/333)

 15  20 115 130 143 186 190 240 288 294 308 314 340 
 10  15 109 124 137 179 183 232 277 283 297 303 329

## Studentized Residual
which(abs(stdres(penmodfinal))>2)

  8  40  46  82  86  94 105 110 120 122 132 166 170 196 232 285 293 314 315 325 
  7  35  41  76  80  88  99 104 114 116 126 160 164 189 224 274 282 303 304 314

## Cooks Distance
which(cooks.distance(penmodfinal)>4/333)

 40  82 112 118 122 166 170 193 232 285 293 294 297 298 314 315 325 
 35  76 106 112 116 160 164 186 224 274 282 283 286 287 303 304 314

## Deleted Studentized Residual
which(abs(rstudent(penmodfinal))>2)

  8  40  46  82  86  94 105 110 120 122 132 166 170 196 232 285 293 314 315 325 
  7  35  41  76  80  88  99 104 114 116 126 160 164 189 224 274 282 303 304 314

#Multiple arguments
which(hatvalues(penmodfinal)>16/333&cooks.distance(penmodfinal)>4/333&abs(rstudent(penmodfinal))>2&abs(stdres(penmodfinal))>2)

314 
303

The top row is the original row position of the observation from the data frame, the bottom row is the observation’s position index on the list of influence diagnostics. These do not match in this example because there are NA values in the data frame.

Observations 15, 20, 115, 130, 143, 186, 190, 240, 288, 294, 308, 314, 340 exceed the threshold for leverage.
Observations 40, 82, 112, 118, 122, 166, 170, 193, 232, 285, 293, 294, 297, 298, 314, 315, 325 exceed the threshold for Cook’s Distance.
Observations 8, 40, 46, 82, 86, 94, 105, 110, 120, 122, 132, 166, 170, 196, 232, 285, 293, 314, 315, 325 exceed the threshold for Deleted Studentized Residuals.
Observations 8, 40, 46, 82, 86, 94, 105, 110, 120, 122, 132, 166, 170, 196, 232, 285, 293, 314, 315, 325 exceed the threshold for Studentized Residuals.
Observation 314 exceeds in all 4 categories.

Evaluate Extreme Observations Closer

Next steps would be to evaluate closer with dffits or dfbetas and see how those observations compare to the other observations in the data. Then look at the EDA. Determine if the observations are very influential and/or truly representative of the data. We should NOT just remove without further exploration.

#First we will store the DFFits to make them easier to work with
pendffits<-dffits(penmodfinal)

# Calling the observation numbers (row names)
# don't get too worried on the syntax
pendffits[c("15","40","170","294","314","315")]

        15         40        170        294        314        315 
 0.1286874  0.3818284  0.3833745  0.3424947  0.5436063 -0.4620501

# Now we can compare these values to the rest
summary(dffits(penmodfinal))

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-0.462050 -0.093274 -0.003084  0.001633  0.101550  0.543606

# Slightly different syntax for DFBetas
# Because DFBetas is a matrix. not just a vector
pendfbeta<-dfbetas(penmodfinal)
pendfbeta[c("15","40","170","294","314","315"),]

    (Intercept) bill_depth_mm bill_length_mm flipper_length_mm speciesChinstrap
15  -0.08557838    0.09310769     0.06910227        0.02660027     0.0655398436
40   0.02653293    0.05209415     0.03999337       -0.25112153    -0.0043988499
170 -0.12168270    0.11147375     0.13543939        0.04868884     0.0246383890
294 -0.01852371    0.02920232     0.08105004       -0.17108150    -0.1584788882
314  0.26622725   -0.32579551    -0.36372431        0.14860592    -0.0004064124
315  0.21521173   -0.21042438    -0.25248533       -0.04511119    -0.1638646983
    speciesGentoo      sexmale bill_depth_mm:bill_length_mm
15     0.02492217  0.009146321                  -0.08271890
40     0.12938768  0.170995071                  -0.04103063
170   -0.05802485  0.169876311                  -0.13743246
294   -0.04443237 -0.119146477                  -0.03129413
314    0.07753382 -0.100034429                   0.36711794
315    0.18047644 -0.001915482                   0.25222563

summary(dfbetas(penmodfinal))

  (Intercept)         bill_depth_mm        bill_length_mm      
 Min.   :-2.241e-01   Min.   :-0.3257955   Min.   :-3.637e-01  
 1st Qu.:-1.834e-02   1st Qu.:-0.0121734   1st Qu.:-9.957e-03  
 Median :-3.307e-04   Median : 0.0009279   Median : 2.128e-04  
 Mean   : 7.712e-05   Mean   :-0.0000870   Mean   :-7.541e-05  
 3rd Qu.: 1.552e-02   3rd Qu.: 0.0194839   3rd Qu.: 1.921e-02  
 Max.   : 2.662e-01   Max.   : 0.2017456   Max.   : 2.407e-01  
 flipper_length_mm    speciesChinstrap     speciesGentoo       
 Min.   :-2.511e-01   Min.   :-2.327e-01   Min.   :-3.006e-01  
 1st Qu.:-1.948e-02   1st Qu.:-1.623e-02   1st Qu.:-1.962e-02  
 Median : 1.204e-04   Median :-7.051e-05   Median : 6.127e-04  
 Mean   :-3.359e-05   Mean   :-9.018e-05   Mean   : 8.558e-06  
 3rd Qu.: 2.059e-02   3rd Qu.: 1.830e-02   3rd Qu.: 1.995e-02  
 Max.   : 2.652e-01   Max.   : 2.515e-01   Max.   : 2.159e-01  
    sexmale           bill_depth_mm:bill_length_mm
 Min.   :-2.001e-01   Min.   :-2.387e-01          
 1st Qu.:-2.291e-02   1st Qu.:-1.882e-02          
 Median :-1.235e-03   Median :-4.818e-04          
 Mean   :-9.972e-05   Mean   : 9.593e-05          
 3rd Qu.: 1.990e-02   3rd Qu.: 1.117e-02          
 Max.   : 2.072e-01   Max.   : 3.671e-01

# Look at those values in the table
pendata[c("15","40","170","294","314","315"),]

summary(pendata)

  Observation      body_mass_g     species             island         
 Min.   :  1.00   Min.   :2700   Length:344         Length:344        
 1st Qu.: 86.75   1st Qu.:3550   Class :character   Class :character  
 Median :172.50   Median :4050   Mode  :character   Mode  :character  
 Mean   :172.50   Mean   :4202                                        
 3rd Qu.:258.25   3rd Qu.:4750                                        
 Max.   :344.00   Max.   :6300                                        
                  NA's   :2                                           
 bill_length_mm  bill_depth_mm   flipper_length_mm     sex           
 Min.   :32.10   Min.   :13.10   Min.   :172.0     Length:344        
 1st Qu.:39.23   1st Qu.:15.60   1st Qu.:190.0     Class :character  
 Median :44.45   Median :17.30   Median :197.0     Mode  :character  
 Mean   :43.92   Mean   :17.15   Mean   :200.9                       
 3rd Qu.:48.50   3rd Qu.:18.70   3rd Qu.:213.0                       
 Max.   :59.60   Max.   :21.50   Max.   :231.0                       
 NA's   :2       NA's   :2       NA's   :2                           
      year     
 Min.   :2007  
 1st Qu.:2007  
 Median :2008  
 Mean   :2008  
 3rd Qu.:2009  
 Max.   :2009

Now we can compare the DFFits and DFBetas to the rest. Additionally, we can compare the x values to the rest of the data. Upon investigating, observation 314 appears to be very influential and has high values of DFFits. We may consider removing this.

Subsetting our data frame to remove influential observations

We can use the which function inside our square bracket notation to create new data frames based on subsetting requirements. We can store these as new data objects, or use the subsetting directly into another function’s arguments.

# General formatting for subsetting with the which() function
## Quantitative Variables
quantsubsetpen<-pendata[which(pendata$bill_depth_mm<17.3),]

#Example
plot((body_mass_g)~flipper_length_mm,data=quantsubsetpen)

plot((body_mass_g)~flipper_length_mm,data=pendata[which(pendata$bill_depth_mm<17.3),])

# Qualitative Variables

qualsubsetpen<-pendata[which(pendata$species=="Gentoo"),]

# Multiple Arguments

multaubsetpen1<-pendata[which(pendata$species=="Gentoo"&pendata$bill_depth_mm<17.3),]

multaubsetpen2<-pendata[which(pendata$species=="Gentoo"|pendata$bill_depth_mm<17.3),]

If you want to remove just one or more observations, or a set of observations, we can use a - (“minus”) or the which() function within a subset.

# You can do this directly into the data argument in lm
# remove one single value, based on the observation's original row number
penmodrem.1<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata[-314,])
summary(penmodrem.1)


Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm + 
    bill_depth_mm * bill_length_mm + species + sex, data = pendata[-314, 
    ])

Residuals:
   Min     1Q Median     3Q    Max 
-704.2 -170.2    0.9  175.3  875.8 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -6775.864   1838.196  -3.686 0.000267 ***
bill_depth_mm                  365.764     98.923   3.697 0.000256 ***
bill_length_mm                 145.132     41.470   3.500 0.000531 ***
flipper_length_mm               16.108      2.881   5.592 4.78e-08 ***
speciesChinstrap              -200.730     81.904  -2.451 0.014782 *  
speciesGentoo                  913.577    131.798   6.932 2.26e-11 ***
sexmale                        396.318     47.184   8.399 1.42e-15 ***
bill_depth_mm:bill_length_mm    -7.180      2.313  -3.104 0.002079 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 283 on 324 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8793,    Adjusted R-squared:  0.8767 
F-statistic: 337.1 on 7 and 324 DF,  p-value: < 2.2e-16

# note DF is different

#remove more than one, based on the observations' original row numbers
penmodrem.2<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata[-c(15,40,170,294,314,315),])
summary(penmodrem.2)


Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm + 
    bill_depth_mm * bill_length_mm + species + sex, data = pendata[-c(15, 
    40, 170, 294, 314, 315), ])

Residuals:
    Min      1Q  Median      3Q     Max 
-681.17 -171.80    1.54  171.99  688.00 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -6820.422   1833.965  -3.719 0.000236 ***
bill_depth_mm                  359.304     99.122   3.625 0.000336 ***
bill_length_mm                 142.566     41.217   3.459 0.000616 ***
flipper_length_mm               17.212      2.859   6.020 4.79e-09 ***
speciesChinstrap              -181.755     81.946  -2.218 0.027260 *  
speciesGentoo                  883.023    128.661   6.863 3.52e-11 ***
sexmale                        385.230     46.275   8.325 2.51e-15 ***
bill_depth_mm:bill_length_mm    -7.107      2.309  -3.079 0.002260 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 274.2 on 319 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8845,    Adjusted R-squared:  0.882 
F-statistic:   349 on 7 and 319 DF,  p-value: < 2.2e-16

#remove based on logical expression
penmodrem.3<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata[which(pendata$bill_depth_mm<17.3),])
summary(penmodrem.3)


Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm + 
    bill_depth_mm * bill_length_mm + species + sex, data = pendata[which(pendata$bill_depth_mm < 
    17.3), ])

Residuals:
    Min      1Q  Median      3Q     Max 
-683.38 -162.91  -14.99  155.11  860.97 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -4913.958   4618.410  -1.064  0.28901    
bill_depth_mm                  254.936    281.470   0.906  0.36650    
bill_length_mm                  82.344     94.554   0.871  0.38519    
flipper_length_mm               17.357      4.703   3.691  0.00031 ***
speciesChinstrap              -117.223    135.779  -0.863  0.38930    
speciesGentoo                 1036.777    213.654   4.853 2.98e-06 ***
sexmale                        447.063     71.778   6.228 4.34e-09 ***
bill_depth_mm:bill_length_mm    -3.869      6.021  -0.642  0.52152    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 269 on 153 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared:  0.917, Adjusted R-squared:  0.9132 
F-statistic: 241.4 on 7 and 153 DF,  p-value: < 2.2e-16

There are TONS of ways we can use R to streamline (or potentially overcomplicate) this process. We will stick with removing just a few observations at once.

Cautions on removing observations

We want to retain as much data as possible
If we remove some extreme observations then other observations may be identified as extreme

Residual Analysis Conclusions

We may need to iteratively recheck model assumptions and influential diagnostics as we transform variables and/or remove observations. It is good practice to make one change at a time There is not necessarily one right way to end model building in the residual analysis.

# Re-perform the residual analysis without observation 314

#Residuals Plots of explanatory variables vs residuals
residualPlots(penmodrem.1,tests=F)

#Residual vs Fitted and QQ plot
plot(penmodrem.1, which=c(1,2))

#Cooks Distance Thresholds
plot(penmodrem.1,which=4)

#Leverage vs Studentized Residuals
influencePlot(penmodrem.1,fill=F)

#Deleted Studentized Residals
ols_plot_resid_stud_fit(penmodrem.1)

The assumptions don’t seem much different and we still have pretty extreme values. It is likely not worth it to alter this model.

STAT 3220 Unit 3