These data were collected from 2007 - 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network. The data were imported directly from the Environmental Data Initiative (EDI) Data Portal, and are available for use by CC0 license (“No Rights Reserved”) in accordance with the Palmer Station Data Policy.
Check the appropriateness of response variable for regression: View a histogram of response variable. Primarily, it should be continuous with few outliers. Data should not be recorded over time. It is sometimes useful if the response variable is unimodal and symmetric, but that is not a requirement.
pendata<-read.csv("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/penguins.csv")
head(pendata)
names(pendata)
## [1] "X" "species" "island"
## [4] "bill_length_mm" "bill_depth_mm" "flipper_length_mm"
## [7] "body_mass_g" "sex" "year"
hist(pendata$body_mass_g, xlab="Body Mass", main="Histogram of Body Mass (in grams)")
For qualitative variables, we explore the box plots and means for each qualitative variable explanatory variable then classify the relationships as existent or not.
#Summary Statistics for response variable grouped by each level of the response
tapply(pendata$body_mass_g,pendata$species,summary)
$Adelie
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
2850 3350 3700 3701 4000 4775 1
$Chinstrap
Min. 1st Qu. Median Mean 3rd Qu. Max.
2700 3488 3700 3733 3950 4800
$Gentoo
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
3950 4700 5000 5076 5500 6300 1
tapply(pendata$body_mass_g,pendata$sex,summary)
$female
Min. 1st Qu. Median Mean 3rd Qu. Max.
2700 3350 3650 3862 4550 5200
$male
Min. 1st Qu. Median Mean 3rd Qu. Max.
3250 3900 4300 4546 5312 6300
#Box plots for species and sex
boxplot(body_mass_g~species,pendata, ylab="Body Mass (grams)")
boxplot(body_mass_g~sex,pendata, ylab="Body Mass (grams)")
There appears to be relationships with each of the qualitative explanatory variables because the mean value of y is different for each level.
For the sake of this example, we will hold off on adding the quantitative variables right now.
Therefore, our hypothesized model is:
\(E(bodymass)=\beta_0+\beta_1 SpeciesC+\beta_2 SpeciesG+\beta_3 SexM\)
#note: the syntax in R does not necessarily match the parameters
penmod1<-lm(body_mass_g~species+sex,data=pendata)
summary(penmod1)
Call:
lm(formula = body_mass_g ~ species + sex, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-816.87 -217.80 -16.87 227.61 882.20
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3372.39 31.43 107.308 <2e-16 ***
speciesChinstrap 26.92 46.48 0.579 0.563
speciesGentoo 1377.86 39.10 35.236 <2e-16 ***
sexmale 667.56 34.70 19.236 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 316.6 on 329 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8468, Adjusted R-squared: 0.8454
F-statistic: 606.1 on 3 and 329 DF, p-value: < 2.2e-16
The prediction equation is:
\(\widehat{bodymass}=3372.39+26.92SpeciesC+1377.86SpeciesG+667.56SexM\)
We can recode our base level using the relevel function.
#This is only done if we desire to change the base level
recodedSpecies<-relevel(factor(pendata$species),ref="Gentoo")
penmod6<-lm(body_mass_g~recodedSpecies,data=pendata)
summary(penmod6)
Call:
lm(formula = body_mass_g ~ recodedSpecies, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-1126.02 -333.09 -33.09 316.91 1223.98
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5076.02 41.68 121.78 <2e-16 ***
recodedSpeciesAdelie -1375.35 56.15 -24.50 <2e-16 ***
recodedSpeciesChinstrap -1342.93 69.86 -19.22 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 462.3 on 339 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.6697, Adjusted R-squared: 0.6677
F-statistic: 343.6 on 2 and 339 DF, p-value: < 2.2e-16
Hypotheses:
Distribution of test statistic: F with 3, 329 DF
Test Statistic: F=606.1
Pvalue: <2.2e-16
Decision: pvalue<0.05 -> REJECT H0
Conclusion: The model with species and sex is adequate at body mass for the penguins.
Species is determined by two parameters (SpeciesC and SpeciesG dummy variables), so we would want to keep or remove BOTH of those dummy variables. We will cover this later.
We can test sex because it is only one parameter in the model.
Hypotheses:
Distribution of test statistic: T with 329 DF
Test Statistic: t=19.236
Pvalue: <2.2e-16
Decision: pvalue<0.05 -> REJECT H0
Conclusion: The sex of the penguin is significant at predicting the body mass of a penguin given species is a constant in the model. We will keep it in the model.
Note we can include both qualitative and quantitative predictors in the model.
penmod2<-lm(body_mass_g~species+sex+bill_depth_mm,data=pendata)
summary(penmod2)
Call:
lm(formula = body_mass_g ~ species + sex + bill_depth_mm, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-851.2 -187.7 -3.5 221.7 936.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1576.99 357.08 4.416 1.37e-05 ***
speciesChinstrap 19.44 44.87 0.433 0.665
speciesGentoo 1721.70 77.88 22.106 < 2e-16 ***
sexmale 513.99 45.24 11.360 < 2e-16 ***
bill_depth_mm 102.04 20.22 5.046 7.48e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 305.5 on 328 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8578, Adjusted R-squared: 0.8561
F-statistic: 494.8 on 4 and 328 DF, p-value: < 2.2e-16
For each millimeter increase in bill depth, we expect the body weight of a penguin to increase by 102.04 grams, given species and sex are held constant.
Create a scatter plot of the quantitative variable vs the response. Then label the observations by the qualitative variable.
plot(body_mass_g~bill_depth_mm , col=factor(species),data=pendata,xlab="Bill Depth",ylab="Body Mass (grams)")
legend("topright",legend = levels(factor(pendata$species)), pch = 19,
col = factor(levels(factor(pendata$species))))
NOTE: When we group the observations on the scatter plot for Bill Depth by species, the relationships for each level are positive, despite the overall negative trend. Also we see each species would have a similar slope, but different y-intercept.
Let’s suppose we simplify the model to just species and bill depth:
\(E(body mass)=\beta_0+\beta_1SpeciesC+ \beta_2SpeciesG+\beta_3 BillDepth\) , where SpeciesC= 1 if Chinstrap, 0 otherwise and SpeciesG = 1 if Gentoo, 0 otherwise
What if there were different slopes for each group? We will explore this next time!
We will explore the various styles of interactions through several combinations of variables.
If an interaction is present, we would note non-parallel (different) slopes for each level of the qualitative variable.
plot(body_mass_g~bill_depth_mm , col=factor(species),data=pendata,xlab="Bill Depth",ylab="Body Mass (grams)")
legend("topright",legend = levels(factor(pendata$species)), pch = 19,
col = factor(levels(factor(pendata$species))))
There does not appear to be strong evidence of an interaction because the bill depth-body mass slopes for each species appear to be the same.
How would these “different slopes” show up in the model?
Write a model for E(y) as a function of bill depth and species that hypothesizes different bill depth-body mass slopes for each species.
\(E(body mass)=\beta_0+\beta_1SpeciesC+ \beta_2SpeciesG+\beta_3 BillDepth\)
\(+\beta_4SpeciesC*BillDepth+ \beta_5SpeciesG*BillDepth\)
, where SpeciesC= 1 if Chinstrap, 0 otherwise and SpeciesG = 1 if Gentoo, 0 otherwise
The bill depth-body mass slope for Adelie (SpeciesC=0,SpeciesG=0) will be: \(\beta_3\)
The bill depth-body mass slope for Chinstrap (SpeciesC=1,SpeciesG=0) will be: \(\beta_3+\beta_4\)
The bill depth-body mass slope for Gentoo (SpeciesC=0,SpeciesG=1) will be: \(\beta_3+\beta_5\)
In the last section we saw that by just including the dummy variables, we account for different y-intercepts for each species.
penmod3<-lm(body_mass_g~species+bill_depth_mm+species*bill_depth_mm,data=pendata)
summary(penmod3)
Call:
lm(formula = body_mass_g ~ species + bill_depth_mm + species *
bill_depth_mm, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-845.89 -254.74 -28.46 228.01 1161.41
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -283.28 437.94 -0.647 0.5182
speciesChinstrap 247.06 829.77 0.298 0.7661
speciesGentoo -175.71 658.43 -0.267 0.7897
bill_depth_mm 217.15 23.82 9.117 <2e-16 ***
speciesChinstrap:bill_depth_mm -12.53 45.01 -0.278 0.7809
speciesGentoo:bill_depth_mm 152.29 40.49 3.761 0.0002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 354.9 on 336 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.807, Adjusted R-squared: 0.8041
F-statistic: 281 on 5 and 336 DF, p-value: < 2.2e-16
#Note we can also simplify the syntax
penmod3a<-lm(body_mass_g~species*bill_depth_mm,data=pendata)
summary(penmod3a)
Call:
lm(formula = body_mass_g ~ species * bill_depth_mm, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-845.89 -254.74 -28.46 228.01 1161.41
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -283.28 437.94 -0.647 0.5182
speciesChinstrap 247.06 829.77 0.298 0.7661
speciesGentoo -175.71 658.43 -0.267 0.7897
bill_depth_mm 217.15 23.82 9.117 <2e-16 ***
speciesChinstrap:bill_depth_mm -12.53 45.01 -0.278 0.7809
speciesGentoo:bill_depth_mm 152.29 40.49 3.761 0.0002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 354.9 on 336 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.807, Adjusted R-squared: 0.8041
F-statistic: 281 on 5 and 336 DF, p-value: < 2.2e-16
Write the prediction equation
\(\widehat{body mass}=-283.28+247.06SpeciesC-175.71SpeciesG+217.15 BillDepth\)
\(-12.53SpeciesC*BillDepth+ 152.29SpeciesG*BillDepth\)
Interpret the slope for bill depth
If an interaction is present, we would notice extreme crossing of levels. The relationship of one variable with the response DEPENDS on the other variable.
# We need to verify there are observations for every combination of level
table(pendata$species,pendata$sex)
female male
Adelie 73 73
Chinstrap 34 34
Gentoo 58 61
# We can plot interactions either way
interaction.plot(pendata$species, pendata$sex, pendata$body_mass_g,fun=mean,trace.label="Sex", xlab="Species",ylab="Mean Body Mass")
interaction.plot(pendata$sex, pendata$species, pendata$body_mass_g,fun=mean,trace.label="Species", xlab="Sex",ylab="Mean Body Mass")
There is not strong evidence of an interaction because there is not crossing of the levels.
Write a model for E(y) as a function of sex and species and its interaction.
\(E(body mass)=\beta_0+\beta_1SpeciesC+ \beta_2SpeciesG+\beta_3 SexM\)
\(+\beta_4SpeciesC*SexM+ \beta_5SpeciesG*SexM\)
, where SpeciesC= 1 if Chinstrap, 0 otherwise and SpeciesG = 1 if Gentoo, 0 otherwise, SexM= 1 if Male, 0 if female
penmod4<-lm(body_mass_g~species+sex+species*sex,data=pendata)
summary(penmod4)
Call:
lm(formula = body_mass_g ~ species + sex + species * sex, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-827.21 -213.97 11.03 206.51 861.03
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3368.84 36.21 93.030 < 2e-16 ***
speciesChinstrap 158.37 64.24 2.465 0.01420 *
speciesGentoo 1310.91 54.42 24.088 < 2e-16 ***
sexmale 674.66 51.21 13.174 < 2e-16 ***
speciesChinstrap:sexmale -262.89 90.85 -2.894 0.00406 **
speciesGentoo:sexmale 130.44 76.44 1.706 0.08886 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 309.4 on 327 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8546, Adjusted R-squared: 0.8524
F-statistic: 384.3 on 5 and 327 DF, p-value: < 2.2e-16
#Note we can use simplified syntax
penmod4a<-lm(body_mass_g~species*sex,data=pendata)
summary(penmod4a)
Call:
lm(formula = body_mass_g ~ species * sex, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-827.21 -213.97 11.03 206.51 861.03
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3368.84 36.21 93.030 < 2e-16 ***
speciesChinstrap 158.37 64.24 2.465 0.01420 *
speciesGentoo 1310.91 54.42 24.088 < 2e-16 ***
sexmale 674.66 51.21 13.174 < 2e-16 ***
speciesChinstrap:sexmale -262.89 90.85 -2.894 0.00406 **
speciesGentoo:sexmale 130.44 76.44 1.706 0.08886 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 309.4 on 327 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8546, Adjusted R-squared: 0.8524
F-statistic: 384.3 on 5 and 327 DF, p-value: < 2.2e-16
Write the prediction equation
\(\widehat{body mass}=3368.84+158.37SpeciesC+1310.91SpeciesG+674.66 SexM\)
\(-262.89SpeciesC*SexM+ 130.44SpeciesG*SexM\)
Plug in the appropriate 1s and 0s for the dummy variables to interpret the coefficients.
The average body mass for Adelie female penguins is 3368.84 grams.
The estimated mean body mass for Chinstrap penguins is (158.31-262.89)=104.58 grams lower than Adelie penguins, for male penguins.
The estimated mean body mass for Chinstrap penguins is 158.31 higher than Adelie penguins, for female penguins.
The estimated mean body mass for Gentoo penguins is (1310.91+130.44)=1441.35 grams higher than Adelie penguins, for male penguins.
The estimated mean body mass for Gentoo penguins is 1310.91 grams higher than Adelie penguins, for female penguins.
The estimated mean body mass for male penguins is (674.66-262.89)=411.77 grams higher than female penguins, for Chinstrap penguins.
The estimated mean body mass for male penguins is (674.66+130.44)=805.1 grams higher than female penguins, for Gentoo penguins.
The estimated mean body mass for male penguins is 674.66 grams higher than female penguins, for Adelie penguins.
There is no visualization for two quantitative variables having an interaction. We base our justificaiton off of intuition or research.
Write a model for E(y) as a function of bill depth and bill length that allows for an interaction.
\(E(body mass)=\beta_0+\beta_1BillLength+\beta_2 BillDepth\)
\(+\beta_3BillLength*BillDepth\)
penmod5<-lm(body_mass_g~bill_length_mm+bill_depth_mm+bill_length_mm*bill_depth_mm,data=pendata)
summary(penmod5)
Call:
lm(formula = body_mass_g ~ bill_length_mm + bill_depth_mm + bill_length_mm *
bill_depth_mm, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-1811.29 -355.81 4.35 354.80 1606.90
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -25583.278 2668.939 -9.586 <2e-16 ***
bill_length_mm 715.006 58.681 12.185 <2e-16 ***
bill_depth_mm 1484.934 149.405 9.939 <2e-16 ***
bill_length_mm:bill_depth_mm -36.079 3.297 -10.944 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 503.5 on 338 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.6093, Adjusted R-squared: 0.6058
F-statistic: 175.7 on 3 and 338 DF, p-value: < 2.2e-16
#We can used shortened syntax
penmod5a<-lm(body_mass_g~bill_length_mm*bill_depth_mm,data=pendata)
summary(penmod5a)
Call:
lm(formula = body_mass_g ~ bill_length_mm * bill_depth_mm, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-1811.29 -355.81 4.35 354.80 1606.90
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -25583.278 2668.939 -9.586 <2e-16 ***
bill_length_mm 715.006 58.681 12.185 <2e-16 ***
bill_depth_mm 1484.934 149.405 9.939 <2e-16 ***
bill_length_mm:bill_depth_mm -36.079 3.297 -10.944 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 503.5 on 338 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.6093, Adjusted R-squared: 0.6058
F-statistic: 175.7 on 3 and 338 DF, p-value: < 2.2e-16
\(\widehat{body mass}=-25583.278+715.006BillLength+1484.934 BillDepth-36.079 BillLength*BillDepth\)
Suppose a penguin has a bill depth of 17mm. What is the slope of bill length?
Check the appropriateness of response variable for regression: View a histogram of response variable. Primarily, it should be continuous with few outliers. Data should not be recorded over time. It is sometimes useful if the response variable is unimodal and symmetric, but that is not a requirement.
pendata<-read.csv("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/penguins.csv")
head(pendata)
names(pendata)
## [1] "X" "species" "island"
## [4] "bill_length_mm" "bill_depth_mm" "flipper_length_mm"
## [7] "body_mass_g" "sex" "year"
hist(pendata$body_mass_g, xlab="Body Mass", main="Histogram of Body Mass (in grams)")
We will explore the relationship with quantitative variables with scatter plots and correlations and classify each relationship as linear, curvilinear, or none. We explore the box plots and means for each qualitative variable explanatory variable then classify the relationships as existent or not. Additionally, we can explore interactions.
#Scatter plots for quantitative variables
for (i in names(pendata)[4:6]) {
plot(pendata[,i], pendata$body_mass_g,xlab=i,ylab="Body Mass (grams)")
}
#Correlations for quantitative variables
round(cor(pendata[4:6],pendata$body_mass_g,use="complete.obs"),3)
[,1]
bill_length_mm 0.595
bill_depth_mm -0.472
flipper_length_mm 0.871
#Summary Statistics for response variable grouped by each level of the response
tapply(pendata$body_mass_g,pendata$species,summary)
$Adelie
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
2850 3350 3700 3701 4000 4775 1
$Chinstrap
Min. 1st Qu. Median Mean 3rd Qu. Max.
2700 3488 3700 3733 3950 4800
$Gentoo
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
3950 4700 5000 5076 5500 6300 1
tapply(pendata$body_mass_g,pendata$sex,summary)
$female
Min. 1st Qu. Median Mean 3rd Qu. Max.
2700 3350 3650 3862 4550 5200
$male
Min. 1st Qu. Median Mean 3rd Qu. Max.
3250 3900 4300 4546 5312 6300
#Box plots for Qualitative species and sex
boxplot(body_mass_g~species,pendata, ylab="Body Mass (grams)")
boxplot(body_mass_g~sex,pendata, ylab="Body Mass (grams)")
# Interactions for species X Sex We need to verify there are observations for every combination of level
table(pendata$species,pendata$sex)
female male
Adelie 73 73
Chinstrap 34 34
Gentoo 58 61
# plot interaction for Species X Sex
interaction.plot(pendata$species, pendata$sex, pendata$body_mass_g,fun=mean,trace.label="Sex", xlab="Species",ylab="Mean Body Mass")
# Interaction between Bill Depth X Species
plot(body_mass_g~bill_depth_mm , col=factor(species),data=pendata,xlab="Bill Depth",ylab="Body Mass (grams)")
legend("topright",legend = levels(factor(pendata$species)), pch = 19,
col = factor(levels(factor(pendata$species))))
# Interaction bill depth/sex
plot(body_mass_g~bill_depth_mm , col=factor(sex),data=pendata,xlab="Bill Depth",ylab="Body Mass (grams)")
legend("topright",
legend = levels(factor(pendata$sex)),
pch = 19,
col = factor(levels(factor(pendata$sex))))
# Interaction bill length/species
plot(body_mass_g~bill_length_mm , col=factor(species),data=pendata,xlab="Bill Length",ylab="Body Mass (grams)")
legend("topright",
legend = levels(factor(pendata$species)),
pch = 19,
col = factor(levels(factor(pendata$species))))
# Interaction bill length/sex
plot(body_mass_g~bill_length_mm , col=factor(sex),data=pendata,xlab="Bill Length",ylab="Body Mass (grams)")
legend("topright",
legend = levels(factor(pendata$sex)),
pch = 19,
col = factor(levels(factor(pendata$sex))))
# Interaction flipper length/species
plot(body_mass_g~flipper_length_mm , col=factor(species),data=pendata,xlab="Flipper Length",ylab="Body Mass (grams)")
legend("topright",
legend = levels(factor(pendata$species)),
pch = 19,
col = factor(levels(factor(pendata$species))))
# Interaction flipper/sex
plot(body_mass_g~flipper_length_mm , col=factor(sex),data=pendata,xlab="Flipper Length",ylab="Body Mass (grams)")
legend("topright",
legend = levels(factor(pendata$sex)),
pch = 19,
col = factor(levels(factor(pendata$sex))))
There appears to be relationships with each of the qualitative explanatory variables (Species and sex) because the mean value of y is different for each level.
bill length and flipper length have positive linear relationships with the response.
bill depth has a negative relationship. however it appears there might be groups in within with positive relationship.
The only interaction that appears to be relevant in EDA is Bill Depth X Species
First we will add the quantitative variables of interest and the QUANT X QUANT interactions that we believe may be important. Here, we believe Bill Depth X Bill Length may be a useful interaction.
\(E(bodymass)=\beta_0+\beta_1BillDep+\beta_2BillLen+\beta_3FlipperLen+\beta_7BillDep*BillLen\)
penmod1f<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm,data=pendata)
summary(penmod1f)
Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm +
bill_depth_mm * bill_length_mm, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-1161.66 -252.74 -39.99 239.65 1139.91
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -20370.313 1960.714 -10.389 < 2e-16 ***
bill_depth_mm 858.001 114.277 7.508 5.42e-13 ***
bill_length_mm 352.102 47.415 7.426 9.25e-13 ***
flipper_length_mm 43.350 2.486 17.441 < 2e-16 ***
bill_depth_mm:bill_length_mm -19.071 2.585 -7.379 1.26e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 365.6 on 337 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.7946, Adjusted R-squared: 0.7922
F-statistic: 326 on 4 and 337 DF, p-value: < 2.2e-16
Start with Global F test
Hypotheses:
Distribution of test statistic: F with 4, 337 DF
Test Statistic: F=326
Pvalue: <0.0001
Decision: 0.0001<0.05 -> REJECT H0
Conclusion: The model with bill depth, bill length, flipper length, and the interaction of bill length and bill depth is adequate at predicting the body mass of penguins.
Test the “most important predictors”: Is the interaction significant?: We look at a t-test here because this interaction accounts for just one parameter.
We will not do any further testing as we do not have another variables of interest.
We will add the qualitative variables to the model. We did not believe the interaction between species and sex looked important, so we will not add it.
penmod2f<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata)
summary(penmod2f)
Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm +
bill_depth_mm * bill_length_mm + species + sex, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-706.90 -175.12 -6.14 176.80 880.08
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6290.446 1832.633 -3.432 0.000675 ***
bill_depth_mm 333.924 98.231 3.399 0.000760 ***
bill_length_mm 130.274 41.058 3.173 0.001653 **
flipper_length_mm 16.535 2.888 5.725 2.35e-08 ***
speciesChinstrap -200.763 82.322 -2.439 0.015273 *
speciesGentoo 923.788 132.379 6.978 1.68e-11 ***
sexmale 391.603 47.370 8.267 3.55e-15 ***
bill_depth_mm:bill_length_mm -6.344 2.290 -2.770 0.005919 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 284.4 on 325 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8778, Adjusted R-squared: 0.8752
F-statistic: 333.7 on 7 and 325 DF, p-value: < 2.2e-16
NOTE: The model that is fit to match this syntax is:
It is ordered this way despite the order in the lm() function because R builds the model with first order terms listed first
\(E(bodymass)=\beta_0+\beta_1BillDep+\beta_2BillLen+\beta_3FlipperLen+\beta_4SpeciesC+\beta_5SpeciesG+\beta_6SexM+\beta_7BillDep*BillLen\)
Test the “most important predictors”: Is the species significant?: We look at a nested f test here because species accounts for two parameters.
#fit the reduced model
redpenmod2f<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+sex,data=pendata)
summary(redpenmod2f)
Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm +
bill_depth_mm * bill_length_mm + sex, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-1027.84 -216.45 -26.22 200.94 903.55
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -14086.152 1833.599 -7.682 1.83e-13 ***
bill_depth_mm 610.319 103.472 5.898 9.16e-09 ***
bill_length_mm 283.535 42.278 6.707 8.77e-11 ***
flipper_length_mm 34.146 2.396 14.253 < 2e-16 ***
sexmale 492.047 49.009 10.040 < 2e-16 ***
bill_depth_mm:bill_length_mm -15.633 2.300 -6.798 5.03e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 319.5 on 327 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8449, Adjusted R-squared: 0.8426
F-statistic: 356.3 on 5 and 327 DF, p-value: < 2.2e-16
#syntax for nested test is anova(REDUCED,COMPLETE)
anova(redpenmod2f,penmod2f)
We will not do any further testing as we did not have another variable of interest.
We will add the interaction between bill depth and species because we thought that looked most important.
penmod3f<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex+species*bill_depth_mm,data=pendata)
summary(penmod3f)
Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm +
bill_depth_mm * bill_length_mm + species + sex + species *
bill_depth_mm, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-763.71 -170.22 -3.61 167.06 887.00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4562.344 2259.560 -2.019 0.0443 *
bill_depth_mm 241.956 126.493 1.913 0.0567 .
bill_length_mm 90.930 55.312 1.644 0.1012
flipper_length_mm 15.946 2.924 5.453 9.88e-08 ***
speciesChinstrap 742.233 911.667 0.814 0.4162
speciesGentoo 400.335 584.241 0.685 0.4937
sexmale 386.571 47.305 8.172 6.96e-15 ***
bill_depth_mm:bill_length_mm -4.100 3.160 -1.298 0.1953
bill_depth_mm:speciesChinstrap -52.218 51.172 -1.020 0.3083
bill_depth_mm:speciesGentoo 38.098 36.315 1.049 0.2949
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 283.6 on 323 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8793, Adjusted R-squared: 0.876
F-statistic: 261.6 on 9 and 323 DF, p-value: < 2.2e-16
NOTE: The model that is fit to match this syntax is:
\(E(bodymass)=\beta_0+\beta_1BillDep+\beta_2BillLen+\beta_3FlipperLen+\beta_4SpeciesC+\beta_5SpeciesG+\beta_6SexM+\beta_7BillDep*BillLen+\beta_8SpeciesC*BillDep+\beta_9SpeciesG*BillDep\)
Test the “most important predictors”: Is the species X bill depth interaction significant?: We look at a nested f test here because this interaction accounts for two parameters.
# updated reduced model
redpenmod3f<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata)
summary(redpenmod3f)
Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm +
bill_depth_mm * bill_length_mm + species + sex, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-706.90 -175.12 -6.14 176.80 880.08
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6290.446 1832.633 -3.432 0.000675 ***
bill_depth_mm 333.924 98.231 3.399 0.000760 ***
bill_length_mm 130.274 41.058 3.173 0.001653 **
flipper_length_mm 16.535 2.888 5.725 2.35e-08 ***
speciesChinstrap -200.763 82.322 -2.439 0.015273 *
speciesGentoo 923.788 132.379 6.978 1.68e-11 ***
sexmale 391.603 47.370 8.267 3.55e-15 ***
bill_depth_mm:bill_length_mm -6.344 2.290 -2.770 0.005919 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 284.4 on 325 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8778, Adjusted R-squared: 0.8752
F-statistic: 333.7 on 7 and 325 DF, p-value: < 2.2e-16
anova(redpenmod3f,penmod3f)
We will not do any further testing as we did not have another variable of interest.
Our final model is:
\(E(bodymass)=\beta_0+\beta_1BillDep+\beta_2BillLen+\beta_3FlipperLen+\beta_4SpeciesC+\beta_5SpeciesG+\beta_6SexM+\beta_7BillDep*BillLen\)
penmod2f<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata)
summary(penmod2f)
Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm +
bill_depth_mm * bill_length_mm + species + sex, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-706.90 -175.12 -6.14 176.80 880.08
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6290.446 1832.633 -3.432 0.000675 ***
bill_depth_mm 333.924 98.231 3.399 0.000760 ***
bill_length_mm 130.274 41.058 3.173 0.001653 **
flipper_length_mm 16.535 2.888 5.725 2.35e-08 ***
speciesChinstrap -200.763 82.322 -2.439 0.015273 *
speciesGentoo 923.788 132.379 6.978 1.68e-11 ***
sexmale 391.603 47.370 8.267 3.55e-15 ***
bill_depth_mm:bill_length_mm -6.344 2.290 -2.770 0.005919 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 284.4 on 325 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8778, Adjusted R-squared: 0.8752
F-statistic: 333.7 on 7 and 325 DF, p-value: < 2.2e-16
confint(penmod2f)
2.5 % 97.5 %
(Intercept) -9895.76579 -2685.12579
bill_depth_mm 140.67493 527.17252
bill_length_mm 49.49987 211.04754
flipper_length_mm 10.85320 22.21621
speciesChinstrap -362.71468 -38.81151
speciesGentoo 663.36072 1184.21605
sexmale 298.41317 484.79374
bill_depth_mm:bill_length_mm -10.84883 -1.83918
Hypotheses:
Distribution of test statistic: F with 7,325 DF
Test Statistic: F=333.7
Pvalue: <2.2e-16
Decision: pvalue<0.05 -> REJECT H0
Conclusion: The model is adequate at body mass for the penguins.
We will cover this in Unit 3
# Or we can create a data frame with the new values.
newpen<-data.frame(bill_depth_mm=17,bill_length_mm=40,flipper_length_mm=205,species="Gentoo",sex="female")
newpen
predict(penmod2f,newpen, interval="prediction")
fit lwr upr
1 4596.685 4009.935 5183.436
We are 95% confident that a penguin with a bill depth of 17, bill length of 40, flipper length of 205, Gentoo species, and female to have a body mass between 4009.935 and 5183 grams.