Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program set out to study the impact of Ecological Sexual Dimorphism in arctic penguins. In 2014, that particular study compared deep-dive foraging habits of male and female penguins and how that resulted in different biological outcomes. To read more abut it, you can go to read the full study here. And you can find more of Dr. Gorman’s more recent work here
We only have a portion of the data from the larger scale study, so our goals and scope of analysis will be different. Our aim is to determine which factors impact the body mass of a penguin. This is useful to know as body mass is an important indicator of penguin’s health.
Our data table contains the following columns:
pendata<-read.csv("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/penguins.csv")
head(pendata)
names(pendata)
[1] "Observation" "body_mass_g" "species"
[4] "island" "bill_length_mm" "bill_depth_mm"
[7] "flipper_length_mm" "sex" "year"
Hypothesized Linear Model After going through our model building process:
\(E(bodymass)=\beta_0+\beta_1BillDep+\beta_2BillLen+\beta_3FlipperLen+\beta_4SpeciesC+\beta_5SpeciesG+\beta_6SexM+\beta_7BillDep*BillLen\)
penmodfinal<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata)
#There are a few options to view plots for assumptions
#Residuals Plots of explanatory variables vs residuals
residualPlots(penmodfinal,tests=F)
#Residual vs Fitted and QQ plot
plot(penmodfinal, which=c(1,2))
#histogram of residuals
hist(residuals(penmodfinal))
Lack of Fit - Plots to evaluate: Residual Plots - There were no clear trends that could be modeled in the residual plots. The mean 0 (lack of fit) assumption was not violated.
Constant Variance - Plots to evaluate: Residual Plots and Residual vs Fitted - There were no clear fanning patterns in the residual plots nor the residual vs predicted plots. For qualitative variables we look for even spread of residuals across the levels. The constant variance (homoscedasticity) assumption was not violated.
Normality - Plots to evaluate: QQ Plot and Histogram of Residuals - The observations fall linearly on the QQ plot. (The residuals are unimodal and symmetric in the histogram). The normality assumption is not violated.
Independence - Plots to evaluate: No specific plots. We could perform the Durbin Watson test, but we will not do that in the scope of this course. - We do not believe the information of one penguin depends on another penguin. The independence assumption is likely not violated.
#Within lm() we can add both transformations
# You Can Try On Your Own
penmodT<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata)
#summary(penmodT)
## Re-evaluate Assumptions
#Residuals Plots of explanatory variables vs residuals
residualPlots(penmodT,tests=F)
#Residual vs Fitted and QQ plot
plot(penmodT, which=c(1,2))
#histogram of residuals
hist(residuals(penmodT))
Interpretations on Variable Transformations (Unit 3.3):
Label the observations that exceed the threshold for each statistic:
n will be how many observations were used to fit the model, not necessarily how many observations were in the original data table. To determine this number we can go to the model F degrees of freedom k, n-k-1. In the pen model we have 7, 325. So k=7 and n=333.
#Determine n
summary(penmodfinal)
Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm +
bill_depth_mm * bill_length_mm + species + sex, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-706.90 -175.12 -6.14 176.80 880.08
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6290.446 1832.633 -3.432 0.000675 ***
bill_depth_mm 333.924 98.231 3.399 0.000760 ***
bill_length_mm 130.274 41.058 3.173 0.001653 **
flipper_length_mm 16.535 2.888 5.725 2.35e-08 ***
speciesChinstrap -200.763 82.322 -2.439 0.015273 *
speciesGentoo 923.788 132.379 6.978 1.68e-11 ***
sexmale 391.603 47.370 8.267 3.55e-15 ***
bill_depth_mm:bill_length_mm -6.344 2.290 -2.770 0.005919 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 284.4 on 325 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8778, Adjusted R-squared: 0.8752
F-statistic: 333.7 on 7 and 325 DF, p-value: < 2.2e-16
#numDF + (k+1) = n-k-1+(k+1) = n
n<-df.residual(penmodfinal)+length(coef(penmodfinal))
n
[1] 333
#Leverage Cut off
levcut<-(2*(length(coef(penmodfinal))))/n
levcut
[1] 0.04804805
#Cook's Distance Cutoff
cdcut<-4/n
cdcut
[1] 0.01201201
Now we can locate these values on the influence plots to identify outliers and influential observations.
#Cooks Distance Thresholds
plot(penmodfinal,which=4)
#Leverage vs Studentized Residuals
influencePlot(penmodfinal,fill=F)
#Deleted Studentized Residals
ols_plot_resid_stud_fit(penmodfinal)
Based on the labeled observations from the plots:
With respect to Leverage, observations 314, 15, 295 (and others) are considered “leverage”=outliers in x because their values exceed the threshold of 2(7+1)/333=0.048.
With respect to Studentizied Residuals, observations 170, 40, 315, 314 (and others) are considered outliers in y because their values exceed the threshold of +/-2 .
With respect to Cook’s Distance, observations 293, 314, 315 are considered influential because their values exceed the threshold of 4/n=4/333=0.012.
With respect to Deleted Studentizied Residuals, observations that are labelled red (164, 35, 304, etc) are considered influential because their values exceed the threshold of +/-2.
Note: The “influence output” only labels some of the observations that exceed our thresholds.
We can use the which() function to print values that
meet certain criteria from a logical expression.
#General expressions about the variables in the data table
#Quantitative Variables
which(pendata$bill_depth_mm<17.3)
# Qualitative Variables
which(pendata$species=="Gentoo")
#Multiple arguments
which(pendata$species=="Gentoo"&pendata$bill_depth_mm<17.3)
which(pendata$species=="Gentoo"|pendata$bill_depth_mm<17.3)
We have several functions that will compute the values of our influence diagnostics for every observation. The input argument for each of these functions is the model that was fit.
hatvalues()stdres()cooks.distance()rstudent()# Using our influential diagnostics
## Leverage
which(hatvalues(penmodfinal)>16/333)
15 20 115 130 143 186 190 240 288 294 308 314 340
10 15 109 124 137 179 183 232 277 283 297 303 329
## Studentized Residual
which(abs(stdres(penmodfinal))>2)
8 40 46 82 86 94 105 110 120 122 132 166 170 196 232 285 293 314 315 325
7 35 41 76 80 88 99 104 114 116 126 160 164 189 224 274 282 303 304 314
## Cooks Distance
which(cooks.distance(penmodfinal)>4/333)
40 82 112 118 122 166 170 193 232 285 293 294 297 298 314 315 325
35 76 106 112 116 160 164 186 224 274 282 283 286 287 303 304 314
## Deleted Studentized Residual
which(abs(rstudent(penmodfinal))>2)
8 40 46 82 86 94 105 110 120 122 132 166 170 196 232 285 293 314 315 325
7 35 41 76 80 88 99 104 114 116 126 160 164 189 224 274 282 303 304 314
#Multiple arguments
which(hatvalues(penmodfinal)>16/333&cooks.distance(penmodfinal)>4/333&abs(rstudent(penmodfinal))>2&abs(stdres(penmodfinal))>2)
314
303
The top row is the original row position of the observation from the data frame, the bottom row is the observation’s position index on the list of influence diagnostics. These do not match in this example because there are NA values in the data frame.
Observations 15, 20, 115, 130, 143, 186, 190, 240, 288, 294, 308, 314, 340 exceed the threshold for leverage.
Observations 40, 82, 112, 118, 122, 166, 170, 193, 232, 285, 293, 294, 297, 298, 314, 315, 325 exceed the threshold for Cook’s Distance.
Observations 8, 40, 46, 82, 86, 94, 105, 110, 120, 122, 132, 166, 170, 196, 232, 285, 293, 314, 315, 325 exceed the threshold for Deleted Studentized Residuals.
Observations 8, 40, 46, 82, 86, 94, 105, 110, 120, 122, 132, 166, 170, 196, 232, 285, 293, 314, 315, 325 exceed the threshold for Studentized Residuals.
Observation 314 exceeds in all 4 categories.
Next steps would be to evaluate closer with dffits or dfbetas and see how those observations compare to the other observations in the data. Then look at the EDA. Determine if the observations are very influential and/or truly representative of the data. We should NOT just remove without further exploration.
#First we will store the DFFits to make them easier to work with
pendffits<-dffits(penmodfinal)
# Calling the observation numbers (row names)
# don't get too worried on the syntax
pendffits[c("15","40","170","294","314","315")]
15 40 170 294 314 315
0.1286874 0.3818284 0.3833745 0.3424947 0.5436063 -0.4620501
# Now we can compare these values to the rest
summary(dffits(penmodfinal))
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.462050 -0.093274 -0.003084 0.001633 0.101550 0.543606
# Slightly different syntax for DFBetas
# Because DFBetas is a matrix. not just a vector
pendfbeta<-dfbetas(penmodfinal)
pendfbeta[c("15","40","170","294","314","315"),]
(Intercept) bill_depth_mm bill_length_mm flipper_length_mm speciesChinstrap
15 -0.08557838 0.09310769 0.06910227 0.02660027 0.0655398436
40 0.02653293 0.05209415 0.03999337 -0.25112153 -0.0043988499
170 -0.12168270 0.11147375 0.13543939 0.04868884 0.0246383890
294 -0.01852371 0.02920232 0.08105004 -0.17108150 -0.1584788882
314 0.26622725 -0.32579551 -0.36372431 0.14860592 -0.0004064124
315 0.21521173 -0.21042438 -0.25248533 -0.04511119 -0.1638646983
speciesGentoo sexmale bill_depth_mm:bill_length_mm
15 0.02492217 0.009146321 -0.08271890
40 0.12938768 0.170995071 -0.04103063
170 -0.05802485 0.169876311 -0.13743246
294 -0.04443237 -0.119146477 -0.03129413
314 0.07753382 -0.100034429 0.36711794
315 0.18047644 -0.001915482 0.25222563
summary(dfbetas(penmodfinal))
(Intercept) bill_depth_mm bill_length_mm
Min. :-2.241e-01 Min. :-0.3257955 Min. :-3.637e-01
1st Qu.:-1.834e-02 1st Qu.:-0.0121734 1st Qu.:-9.957e-03
Median :-3.307e-04 Median : 0.0009279 Median : 2.128e-04
Mean : 7.712e-05 Mean :-0.0000870 Mean :-7.541e-05
3rd Qu.: 1.552e-02 3rd Qu.: 0.0194839 3rd Qu.: 1.921e-02
Max. : 2.662e-01 Max. : 0.2017456 Max. : 2.407e-01
flipper_length_mm speciesChinstrap speciesGentoo
Min. :-2.511e-01 Min. :-2.327e-01 Min. :-3.006e-01
1st Qu.:-1.948e-02 1st Qu.:-1.623e-02 1st Qu.:-1.962e-02
Median : 1.204e-04 Median :-7.051e-05 Median : 6.127e-04
Mean :-3.359e-05 Mean :-9.018e-05 Mean : 8.558e-06
3rd Qu.: 2.059e-02 3rd Qu.: 1.830e-02 3rd Qu.: 1.995e-02
Max. : 2.652e-01 Max. : 2.515e-01 Max. : 2.159e-01
sexmale bill_depth_mm:bill_length_mm
Min. :-2.001e-01 Min. :-2.387e-01
1st Qu.:-2.291e-02 1st Qu.:-1.882e-02
Median :-1.235e-03 Median :-4.818e-04
Mean :-9.972e-05 Mean : 9.593e-05
3rd Qu.: 1.990e-02 3rd Qu.: 1.117e-02
Max. : 2.072e-01 Max. : 3.671e-01
# Look at those values in the table
pendata[c("15","40","170","294","314","315"),]
summary(pendata)
Observation body_mass_g species island
Min. : 1.00 Min. :2700 Length:344 Length:344
1st Qu.: 86.75 1st Qu.:3550 Class :character Class :character
Median :172.50 Median :4050 Mode :character Mode :character
Mean :172.50 Mean :4202
3rd Qu.:258.25 3rd Qu.:4750
Max. :344.00 Max. :6300
NA's :2
bill_length_mm bill_depth_mm flipper_length_mm sex
Min. :32.10 Min. :13.10 Min. :172.0 Length:344
1st Qu.:39.23 1st Qu.:15.60 1st Qu.:190.0 Class :character
Median :44.45 Median :17.30 Median :197.0 Mode :character
Mean :43.92 Mean :17.15 Mean :200.9
3rd Qu.:48.50 3rd Qu.:18.70 3rd Qu.:213.0
Max. :59.60 Max. :21.50 Max. :231.0
NA's :2 NA's :2 NA's :2
year
Min. :2007
1st Qu.:2007
Median :2008
Mean :2008
3rd Qu.:2009
Max. :2009
Now we can compare the DFFits and DFBetas to the rest. Additionally, we can compare the x values to the rest of the data. Upon investigating, observation 314 appears to be very influential and has high values of DFFits. We may consider removing this.
We can use the which function inside our square bracket notation to create new data frames based on subsetting requirements. We can store these as new data objects, or use the subsetting directly into another function’s arguments.
# General formatting for subsetting with the which() function
## Quantitative Variables
quantsubsetpen<-pendata[which(pendata$bill_depth_mm<17.3),]
#Example
plot((body_mass_g)~flipper_length_mm,data=quantsubsetpen)
plot((body_mass_g)~flipper_length_mm,data=pendata[which(pendata$bill_depth_mm<17.3),])
# Qualitative Variables
qualsubsetpen<-pendata[which(pendata$species=="Gentoo"),]
# Multiple Arguments
multaubsetpen1<-pendata[which(pendata$species=="Gentoo"&pendata$bill_depth_mm<17.3),]
multaubsetpen2<-pendata[which(pendata$species=="Gentoo"|pendata$bill_depth_mm<17.3),]
If you want to remove just one or more observations, or a set of
observations, we can use a - (“minus”) or the which()
function within a subset.
# You can do this directly into the data argument in lm
# remove one single value, based on the observation's original row number
penmodrem.1<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata[-314,])
summary(penmodrem.1)
Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm +
bill_depth_mm * bill_length_mm + species + sex, data = pendata[-314,
])
Residuals:
Min 1Q Median 3Q Max
-704.2 -170.2 0.9 175.3 875.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6775.864 1838.196 -3.686 0.000267 ***
bill_depth_mm 365.764 98.923 3.697 0.000256 ***
bill_length_mm 145.132 41.470 3.500 0.000531 ***
flipper_length_mm 16.108 2.881 5.592 4.78e-08 ***
speciesChinstrap -200.730 81.904 -2.451 0.014782 *
speciesGentoo 913.577 131.798 6.932 2.26e-11 ***
sexmale 396.318 47.184 8.399 1.42e-15 ***
bill_depth_mm:bill_length_mm -7.180 2.313 -3.104 0.002079 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 283 on 324 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8793, Adjusted R-squared: 0.8767
F-statistic: 337.1 on 7 and 324 DF, p-value: < 2.2e-16
# note DF is different
#remove more than one, based on the observations' original row numbers
penmodrem.2<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata[-c(15,40,170,294,314,315),])
summary(penmodrem.2)
Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm +
bill_depth_mm * bill_length_mm + species + sex, data = pendata[-c(15,
40, 170, 294, 314, 315), ])
Residuals:
Min 1Q Median 3Q Max
-681.17 -171.80 1.54 171.99 688.00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6820.422 1833.965 -3.719 0.000236 ***
bill_depth_mm 359.304 99.122 3.625 0.000336 ***
bill_length_mm 142.566 41.217 3.459 0.000616 ***
flipper_length_mm 17.212 2.859 6.020 4.79e-09 ***
speciesChinstrap -181.755 81.946 -2.218 0.027260 *
speciesGentoo 883.023 128.661 6.863 3.52e-11 ***
sexmale 385.230 46.275 8.325 2.51e-15 ***
bill_depth_mm:bill_length_mm -7.107 2.309 -3.079 0.002260 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 274.2 on 319 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8845, Adjusted R-squared: 0.882
F-statistic: 349 on 7 and 319 DF, p-value: < 2.2e-16
#remove based on logical expression
penmodrem.3<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata[which(pendata$bill_depth_mm<17.3),])
summary(penmodrem.3)
Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm +
bill_depth_mm * bill_length_mm + species + sex, data = pendata[which(pendata$bill_depth_mm <
17.3), ])
Residuals:
Min 1Q Median 3Q Max
-683.38 -162.91 -14.99 155.11 860.97
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4913.958 4618.410 -1.064 0.28901
bill_depth_mm 254.936 281.470 0.906 0.36650
bill_length_mm 82.344 94.554 0.871 0.38519
flipper_length_mm 17.357 4.703 3.691 0.00031 ***
speciesChinstrap -117.223 135.779 -0.863 0.38930
speciesGentoo 1036.777 213.654 4.853 2.98e-06 ***
sexmale 447.063 71.778 6.228 4.34e-09 ***
bill_depth_mm:bill_length_mm -3.869 6.021 -0.642 0.52152
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 269 on 153 degrees of freedom
(5 observations deleted due to missingness)
Multiple R-squared: 0.917, Adjusted R-squared: 0.9132
F-statistic: 241.4 on 7 and 153 DF, p-value: < 2.2e-16
There are TONS of ways we can use R to streamline (or potentially overcomplicate) this process. We will stick with removing just a few observations at once.
Cautions on removing observations
We may need to iteratively recheck model assumptions and influential diagnostics as we transform variables and/or remove observations. It is good practice to make one change at a time There is not necessarily one right way to end model building in the residual analysis.
# Re-perform the residual analysis without observation 314
#Residuals Plots of explanatory variables vs residuals
residualPlots(penmodrem.1,tests=F)
#Residual vs Fitted and QQ plot
plot(penmodrem.1, which=c(1,2))
#Cooks Distance Thresholds
plot(penmodrem.1,which=4)
#Leverage vs Studentized Residuals
influencePlot(penmodrem.1,fill=F)
#Deleted Studentized Residals
ols_plot_resid_stud_fit(penmodrem.1)
The assumptions don’t seem much different and we still have pretty extreme values. It is likely not worth it to alter this model.