Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program set out to study the impact of Ecological Sexual Dimorphism in arctic penguins. In 2014, that particular study compared deep-dive foraging habits of male and female penguins and how that resulted in different biological outcomes. To read more abut it, you can go to read the full study here. And you can find more of Dr. Gorman’s more recent work here
We only have a portion of the data from the larger scale study, so our goals and scope of analysis will be different. Our aim is to determine which factors impact the body mass of a penguin. This is useful to know as body mass is an important indicator of penguin’s health.
Our data table contains the following columns:
pendata<-read.csv("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/penguins.csv")
head(pendata)
names(pendata)
[1] "Observation" "body_mass_g" "species"
[4] "island" "bill_length_mm" "bill_depth_mm"
[7] "flipper_length_mm" "sex" "year"
hist(pendata$body_mass_g, xlab="Body Mass", main="Histogram of Body Mass (in grams)")
ADD YOUR NOTES ON THE SHAPE + DISTRIBUTION OF RESPONSE:
RECALL: Our goal through the exploratory data analysis phase is to identify the individual relationships between the explanatory variables and the response.
For quantitative variables, we explore the scatter plots and correlations for each explanatory variable then classify the relationships as linear, curvilinear, other, or none.
For qualitative variables, we explore the box plots and means value of y for each level of the qualitative explanatory variables. Then classify the relationships as existent or not.
In this example we will evaluate the qualitative variables species (with three levels) and sex (with two levels)
#Summary Statistics for response variable grouped by each level of the response
tapply(pendata$body_mass_g,pendata$species,summary)
$Adelie
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
2850 3350 3700 3701 4000 4775 1
$Chinstrap
Min. 1st Qu. Median Mean 3rd Qu. Max.
2700 3488 3700 3733 3950 4800
$Gentoo
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
3950 4700 5000 5076 5500 6300 1
tapply(pendata$body_mass_g,pendata$sex,summary)
$female
Min. 1st Qu. Median Mean 3rd Qu. Max.
2700 3350 3650 3862 4550 5200
$male
Min. 1st Qu. Median Mean 3rd Qu. Max.
3250 3900 4300 4546 5312 6300
#Box plots for species and sex
boxplot(body_mass_g~species,pendata, ylab="Body Mass (grams)")
boxplot(body_mass_g~sex,pendata, ylab="Body Mass (grams)")
There appears to be a relationship with species and the response (body mass) because the mean value of y is quite different for each level. We see Adelie and Chinstrap have very similar means (3701 grams and 3733 grams respectively), but Gentoo is much higher at 5076 grams.
There appears to be a relationship with sex and the response (body mass) because the mean value of y is quite different for each level. We see males with an average body mass of 4546 grams and females much lower at 3862 grams.
For the sake of this example, we will hold off on adding the quantitative variables right now.
Therefore, our hypothesized model is:
\(E(bodymass)=\beta_0+\beta_1 SpeciesC+\beta_2 SpeciesG+\beta_3 SexM\)
What does each \(\beta\) coefficient represent?
#note: the syntax in R does not necessarily match the parameters
penmod1<-lm(body_mass_g~species+sex,data=pendata)
summary(penmod1)
Call:
lm(formula = body_mass_g ~ species + sex, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-816.87 -217.80 -16.87 227.61 882.20
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3372.39 31.43 107.308 <2e-16 ***
speciesChinstrap 26.92 46.48 0.579 0.563
speciesGentoo 1377.86 39.10 35.236 <2e-16 ***
sexmale 667.56 34.70 19.236 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 316.6 on 329 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8468, Adjusted R-squared: 0.8454
F-statistic: 606.1 on 3 and 329 DF, p-value: < 2.2e-16
The prediction equation is:
\(\widehat{bodymass}=3372.39+26.92SpeciesC+1377.86SpeciesG+667.56SexM\)
We can recode our base level using the relevel function. This is optional, but could be useful if we have a particular interest group.
In this example, Gentoo appears to be the most different from the other species, so it might be useful to define it as the base level. Then each of our estimates will compare to Gentoo.
#This is only done if we desire to change the base level
recodedSpecies<-relevel(factor(pendata$species),ref="Gentoo")
penmod6<-lm(body_mass_g~recodedSpecies,data=pendata)
summary(penmod6)
Call:
lm(formula = body_mass_g ~ recodedSpecies, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-1126.02 -333.09 -33.09 316.91 1223.98
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5076.02 41.68 121.78 <2e-16 ***
recodedSpeciesAdelie -1375.35 56.15 -24.50 <2e-16 ***
recodedSpeciesChinstrap -1342.93 69.86 -19.22 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 462.3 on 339 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.6697, Adjusted R-squared: 0.6677
F-statistic: 343.6 on 2 and 339 DF, p-value: < 2.2e-16
Recall the model:
\(E(bodymass)=\beta_0+\beta_1 SpeciesC+\beta_2 SpeciesG+\beta_3 SexM\)
Hypotheses:
Distribution of test statistic: F with 3, 329 DF
Test Statistic: F=606.1
Pvalue: <0.0000001
Decision: pvalue<0.05 -> REJECT H0
Conclusion: The model with species and sex is adequate at body mass for the penguins.
Species is determined by two parameters (since SpeciesC and SpeciesG are the dummy variables), so we would want to keep or remove BOTH of those dummy variables. We will cover this later.
We can test sex because it is only associated with one parameter in the model.
Hypotheses:
Distribution of test statistic: T with 329 DF
Test Statistic: t=19.236
Pvalue: <0.0000001
Decision: pvalue<0.05 -> REJECT H0
Conclusion: The sex of the penguin is significant at predicting the body mass of a penguin given species is a constant in the model. We will keep it in the model.
Note we can include both qualitative and quantitative predictors in the model.
penmod2<-lm(body_mass_g~species+sex+bill_depth_mm,data=pendata)
summary(penmod2)
Call:
lm(formula = body_mass_g ~ species + sex + bill_depth_mm, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-851.2 -187.7 -3.5 221.7 936.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1576.99 357.08 4.416 1.37e-05 ***
speciesChinstrap 19.44 44.87 0.433 0.665
speciesGentoo 1721.70 77.88 22.106 < 2e-16 ***
sexmale 513.99 45.24 11.360 < 2e-16 ***
bill_depth_mm 102.04 20.22 5.046 7.48e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 305.5 on 328 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8578, Adjusted R-squared: 0.8561
F-statistic: 494.8 on 4 and 328 DF, p-value: < 2.2e-16
Write the model that was fit
\(E(bodymass)=\beta_0+\beta_1 SpeciesC+\beta_2 SpeciesG+\beta_3 SexM+\beta_4BillDepth\)
Create a scatter plot of the quantitative variable vs the response. Then label the observations by the qualitative variable.
#Bill Depth / Species plot with no regression lines
plot(x=pendata$bill_depth_mm, y=pendata$body_mass_g,pch=as.numeric(factor(pendata$species)),xlab="Bill Depth",ylab="Body Mass (grams)")
legend("topright",legend = levels(factor(pendata$species)),
pch = 1:nlevels(factor(pendata$species)))
NOTE: When we group the observations on the scatter plot for Bill Depth by species, the relationships for each level are positive, despite the overall negative trend. Also we see each species would have a similar slope, but different y-intercepts.
Let’s suppose we simplify the model to just species and bill depth:
\(E(body mass)=\beta_0+\beta_1SpeciesC+ \beta_2SpeciesG+\beta_3 BillDepth\) , where SpeciesC= 1 if Chinstrap, 0 otherwise and SpeciesG = 1 if Gentoo, 0 otherwise
What if there were different slopes for each group? We will explore this next time!
We will explore the various styles of interactions through several combinations of variables.
If an interaction is present, we would note non-parallel (different) slopes for each level of the qualitative variable.
# Bill Depth / Species Interaction Scatter Plot with regression lines
plot(x=pendata$bill_depth_mm, y=pendata$body_mass_g,pch=as.numeric(factor(pendata$species)),xlab="Bill Depth",ylab="Body Mass (grams)")
legend("topright",legend = levels(factor(pendata$species)),
pch = 1:nlevels(factor(pendata$species)))
for (g in levels(as.factor(pendata$species))) {
group_subset <- subset(pendata, species == g)
model <- lm(body_mass_g ~ bill_depth_mm, data = group_subset)
abline(model,lwd = 1)
}
There does not appear to be strong evidence of an interaction because the bill depth-body mass slopes for each species appear to be the same.
How would these “different slopes” show up in the model?
Write a model for E(y) as a function of bill depth and species that hypothesizes different bill depth-body mass slopes for each species.
\(E(body mass)=\beta_0+\beta_1SpeciesC+ \beta_2SpeciesG+\beta_3 BillDepth\)
\(+\beta_4SpeciesC*BillDepth+ \beta_5SpeciesG*BillDepth\)
, where SpeciesC= 1 if Chinstrap, 0 otherwise and SpeciesG = 1 if Gentoo, 0 otherwise
The bill depth-body mass slope for Adelie (SpeciesC=0,SpeciesG=0) will be: \(\beta_3\)
The bill depth-body mass slope for Chinstrap (SpeciesC=1,SpeciesG=0) will be: \(\beta_3+\beta_4\)
The bill depth-body mass slope for Gentoo (SpeciesC=0,SpeciesG=1) will be: \(\beta_3+\beta_5\)
In the last section we saw that by just including the dummy variables, we account for different y-intercepts for each species. By adding an interaction, we allow for different slopes.
penmod3<-lm(body_mass_g~species+bill_depth_mm+species*bill_depth_mm,data=pendata)
summary(penmod3)
Call:
lm(formula = body_mass_g ~ species + bill_depth_mm + species *
bill_depth_mm, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-845.89 -254.74 -28.46 228.01 1161.41
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -283.28 437.94 -0.647 0.5182
speciesChinstrap 247.06 829.77 0.298 0.7661
speciesGentoo -175.71 658.43 -0.267 0.7897
bill_depth_mm 217.15 23.82 9.117 <2e-16 ***
speciesChinstrap:bill_depth_mm -12.53 45.01 -0.278 0.7809
speciesGentoo:bill_depth_mm 152.29 40.49 3.761 0.0002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 354.9 on 336 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.807, Adjusted R-squared: 0.8041
F-statistic: 281 on 5 and 336 DF, p-value: < 2.2e-16
#Note we can also simplify the syntax
penmod3a<-lm(body_mass_g~species*bill_depth_mm,data=pendata)
summary(penmod3a)
Call:
lm(formula = body_mass_g ~ species * bill_depth_mm, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-845.89 -254.74 -28.46 228.01 1161.41
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -283.28 437.94 -0.647 0.5182
speciesChinstrap 247.06 829.77 0.298 0.7661
speciesGentoo -175.71 658.43 -0.267 0.7897
bill_depth_mm 217.15 23.82 9.117 <2e-16 ***
speciesChinstrap:bill_depth_mm -12.53 45.01 -0.278 0.7809
speciesGentoo:bill_depth_mm 152.29 40.49 3.761 0.0002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 354.9 on 336 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.807, Adjusted R-squared: 0.8041
F-statistic: 281 on 5 and 336 DF, p-value: < 2.2e-16
Write the prediction equation
\(\widehat{body mass}=-283.28+247.06SpeciesC-175.71SpeciesG+217.15 BillDepth\)
\(-12.53SpeciesC*BillDepth+ 152.29SpeciesG*BillDepth\)
Interpret the slope estimate for bill depth
If an interaction is present, we would notice extreme crossing of levels. The relationship of one variable with the response DEPENDS on the other variable.
# We need to verify there are non-zero observations for every combination of level
table(pendata$species,pendata$sex)
female male
Adelie 73 73
Chinstrap 34 34
Gentoo 58 61
# We can plot interactions either way
interaction.plot(pendata$species, pendata$sex, pendata$body_mass_g,fun=mean,trace.label="Sex", xlab="Species",ylab="Mean Body Mass")
interaction.plot(pendata$sex, pendata$species, pendata$body_mass_g,fun=mean,trace.label="Species", xlab="Sex",ylab="Mean Body Mass")
There is not strong evidence of an interaction because there is not drastic switching of the levels.
Write a model for E(y) as a function of sex and species and its interaction.
\(E(body mass)=\beta_0+\beta_1SpeciesC+ \beta_2SpeciesG+\beta_3 SexM\)
\(+\beta_4SpeciesC*SexM+ \beta_5SpeciesG*SexM\)
Coefficient for SpeciesC = \(\beta_1 +\beta_4*SexM\)
Coefficient for SpeciesG = \(\beta_2 +\beta_5*SexM\)
Coefficient for SexM = \(\beta_3 +\beta_4*SpeciesC+\beta_5*SpeciesG\)
There are many ways to interpret these betas. For instance, the average value of males’ body mass depends on the species. Or the average body mass of a particular species depends on the sex of the penguin. Remember qualitative variables do not have “slope” coefficients. Instead we interpret as the (difference in the) average value of y, given a combination of categorical levels.
penmod4<-lm(body_mass_g~species+sex+species*sex,data=pendata)
summary(penmod4)
Call:
lm(formula = body_mass_g ~ species + sex + species * sex, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-827.21 -213.97 11.03 206.51 861.03
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3368.84 36.21 93.030 < 2e-16 ***
speciesChinstrap 158.37 64.24 2.465 0.01420 *
speciesGentoo 1310.91 54.42 24.088 < 2e-16 ***
sexmale 674.66 51.21 13.174 < 2e-16 ***
speciesChinstrap:sexmale -262.89 90.85 -2.894 0.00406 **
speciesGentoo:sexmale 130.44 76.44 1.706 0.08886 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 309.4 on 327 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8546, Adjusted R-squared: 0.8524
F-statistic: 384.3 on 5 and 327 DF, p-value: < 2.2e-16
#Note we can use simplified syntax
penmod4a<-lm(body_mass_g~species*sex,data=pendata)
summary(penmod4a)
Call:
lm(formula = body_mass_g ~ species * sex, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-827.21 -213.97 11.03 206.51 861.03
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3368.84 36.21 93.030 < 2e-16 ***
speciesChinstrap 158.37 64.24 2.465 0.01420 *
speciesGentoo 1310.91 54.42 24.088 < 2e-16 ***
sexmale 674.66 51.21 13.174 < 2e-16 ***
speciesChinstrap:sexmale -262.89 90.85 -2.894 0.00406 **
speciesGentoo:sexmale 130.44 76.44 1.706 0.08886 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 309.4 on 327 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8546, Adjusted R-squared: 0.8524
F-statistic: 384.3 on 5 and 327 DF, p-value: < 2.2e-16
Write the prediction equation
\(\widehat{body mass}=3368.84+158.37SpeciesC+1310.91SpeciesG+674.66 SexM\)
\(-262.89SpeciesC*SexM+ 130.44SpeciesG*SexM\)
Plug in the appropriate 1s and 0s for the dummy variables to interpret the coefficients.
Estimate of coefficient for SpeciesC = \(158.37 -262.89*SexM\)
The estimated mean body mass for Chinstrap penguins is (158.37-262.89)=104.58 grams lower than Adelie penguins, for male penguins.
The estimated mean body mass for Chinstrap penguins is 158.37 higher than Adelie penguins, for female penguins.
Estimate of coefficient for SpeciesG = \(1310.91 +130.44*SexM\)
The estimated mean body mass for Gentoo penguins is (1310.91+130.44)=1441.35 grams higher than Adelie penguins, for male penguins.
The estimated mean body mass for Gentoo penguins is 1310.91 grams higher than Adelie penguins, for female penguins.
Estimate of coefficient for SexM = \(674.66 -262.89*SpeciesC+130.44*SpeciesG\)
The estimated mean body mass for male penguins is (674.66-262.89)=411.77 grams higher than female penguins, for Chinstrap penguins.
The estimated mean body mass for male penguins is (674.66+130.44)=805.1 grams higher than female penguins, for Gentoo penguins.
The estimated mean body mass for male penguins is 674.66 grams higher than female penguins, for Adelie penguins.
There is no visualization for two quantitative variables having an interaction. We base our justification off of intuition or research.
Write a model for E(y) as a function of bill depth and bill length that allows for an interaction.
\(E(body mass)=\beta_0+\beta_1BillLength+\beta_2 BillDepth\)
\(+\beta_3BillLength*BillDepth\)
penmod5<-lm(body_mass_g~bill_length_mm+bill_depth_mm+bill_length_mm*bill_depth_mm,data=pendata)
summary(penmod5)
Call:
lm(formula = body_mass_g ~ bill_length_mm + bill_depth_mm + bill_length_mm *
bill_depth_mm, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-1811.29 -355.81 4.35 354.80 1606.90
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -25583.278 2668.939 -9.586 <2e-16 ***
bill_length_mm 715.006 58.681 12.185 <2e-16 ***
bill_depth_mm 1484.934 149.405 9.939 <2e-16 ***
bill_length_mm:bill_depth_mm -36.079 3.297 -10.944 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 503.5 on 338 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.6093, Adjusted R-squared: 0.6058
F-statistic: 175.7 on 3 and 338 DF, p-value: < 2.2e-16
#We can used shortened syntax
penmod5a<-lm(body_mass_g~bill_length_mm*bill_depth_mm,data=pendata)
summary(penmod5a)
Call:
lm(formula = body_mass_g ~ bill_length_mm * bill_depth_mm, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-1811.29 -355.81 4.35 354.80 1606.90
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -25583.278 2668.939 -9.586 <2e-16 ***
bill_length_mm 715.006 58.681 12.185 <2e-16 ***
bill_depth_mm 1484.934 149.405 9.939 <2e-16 ***
bill_length_mm:bill_depth_mm -36.079 3.297 -10.944 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 503.5 on 338 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.6093, Adjusted R-squared: 0.6058
F-statistic: 175.7 on 3 and 338 DF, p-value: < 2.2e-16
The prediction equation is:
\(\widehat{body mass}=-25583.278+715.006BillLength+1484.934 BillDepth-36.079 BillLength*BillDepth\)
Suppose a penguin has a bill depth of 17mm. What is the slope of bill length?
Suppose a penguin has a bill length of 40mm. What is the slope of bill depth?
pendata<-read.csv("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/penguins.csv")
head(pendata)
names(pendata)
[1] "Observation" "body_mass_g" "species"
[4] "island" "bill_length_mm" "bill_depth_mm"
[7] "flipper_length_mm" "sex" "year"
hist(pendata$body_mass_g, xlab="Body Mass", main="Histogram of Body Mass (in grams)")
RECALL: Our goal through the exploratory data analysis phase is to identify the individual relationships between the explanatory variables and the response.
For quantitative variables, we explore the scatter plots and correlations for each explanatory variable then classify the relationships as linear, curvilinear, other, or none.
For qualitative variables, we explore the box plots and means value of y for each level of the qualitative explanatory variables. Then classify the relationships as existent or not.
[,1]
bill_length_mm 0.595
bill_depth_mm -0.472
flipper_length_mm 0.871
Bill length (0.595) and flipper length (0.871) have positive linear relationships with the response.
Bill depth has a negative relationship (r=-0.472). However it appears there might be groups in within with positive relationship.
$Adelie
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
2850 3350 3700 3701 4000 4775 1
$Chinstrap
Min. 1st Qu. Median Mean 3rd Qu. Max.
2700 3488 3700 3733 3950 4800
$Gentoo
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
3950 4700 5000 5076 5500 6300 1
$female
Min. 1st Qu. Median Mean 3rd Qu. Max.
2700 3350 3650 3862 4550 5200
$male
Min. 1st Qu. Median Mean 3rd Qu. Max.
3250 3900 4300 4546 5312 6300
There appears to be a relationship with species and the response (body mass) because the mean value of y is quite different for each level. We see Adelie and Chinstrap have very similar means (3701 grams and 3733 grams respectively), but Gentoo is much higher at 5076 grams.
There appears to be a relationship with sex and the response (body mass) because the mean value of y is quite different for each level. We see males with an average body mass of 4546 grams and females much lower at 3862 grams.
female male
Adelie 73 73
Chinstrap 34 34
Gentoo 58 61
There is not strong evidence of an interaction because there is not drastic switching of the levels.
First we will add the quantitative variables of interest and the QUANT X QUANT interactions that we believe may be important. We will add all of the quantitative variables, as they all had moderate correlations. Here, we believe Bill Depth X Bill Length may be a useful interaction, but there is not EDA.
Hypothesized Linear Model:
\(E(bodymass)=\beta_0+\beta_1BillDepth+\beta_2BillLength+\beta_3FlipperLength+\beta_4BillDepth*BillLenth\)
penmod1f<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm,data=pendata)
summary(penmod1f)
Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm +
bill_depth_mm * bill_length_mm, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-1161.66 -252.74 -39.99 239.65 1139.91
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -20370.313 1960.714 -10.389 < 2e-16 ***
bill_depth_mm 858.001 114.277 7.508 5.42e-13 ***
bill_length_mm 352.102 47.415 7.426 9.25e-13 ***
flipper_length_mm 43.350 2.486 17.441 < 2e-16 ***
bill_depth_mm:bill_length_mm -19.071 2.585 -7.379 1.26e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 365.6 on 337 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.7946, Adjusted R-squared: 0.7922
F-statistic: 326 on 4 and 337 DF, p-value: < 2.2e-16
Start with Global F test
Hypotheses:
Distribution of test statistic: F with 4, 337 DF
Test Statistic: F=326
Pvalue: <0.0001
Decision: 0.0001<0.05 -> REJECT H0
Conclusion: The model with bill depth, bill length, flipper length, and the interaction of bill length and bill depth is adequate at predicting the body mass of penguins.
Test the “most important predictors”: Complex variables and research variables:
Is the interaction significant?: We look at a t-test here because this interaction accounts for just one parameter.
We will not do any further testing as we have not specified other variables of interest.
We will add the qualitative variables to the model from above. We did not believe the interaction between species and sex looked important, so we will not add it.
penmod2f<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata)
summary(penmod2f)
Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm +
bill_depth_mm * bill_length_mm + species + sex, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-706.90 -175.12 -6.14 176.80 880.08
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6290.446 1832.633 -3.432 0.000675 ***
bill_depth_mm 333.924 98.231 3.399 0.000760 ***
bill_length_mm 130.274 41.058 3.173 0.001653 **
flipper_length_mm 16.535 2.888 5.725 2.35e-08 ***
speciesChinstrap -200.763 82.322 -2.439 0.015273 *
speciesGentoo 923.788 132.379 6.978 1.68e-11 ***
sexmale 391.603 47.370 8.267 3.55e-15 ***
bill_depth_mm:bill_length_mm -6.344 2.290 -2.770 0.005919 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 284.4 on 325 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8778, Adjusted R-squared: 0.8752
F-statistic: 333.7 on 7 and 325 DF, p-value: < 2.2e-16
NOTE: The model that is fit to match this syntax is:
It is ordered this way despite the order in the lm() function because R builds the model with first order terms listed first
Hypothesized Linear Model:
\(E(bodymass)=\beta_0+\beta_1BillDepth+\beta_2BillLength+\beta_3FlipperLength+\beta_4SpeciesC+\beta_5SpeciesG+\beta_6SexM+\beta_7BillDepth*BillLength\)
Test the “most important predictors”:
Is the species significant?: We look at a nested f test here because species accounts for two parameters.
#fit the reduced model
redpenmod2f<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+sex,data=pendata)
summary(redpenmod2f)
Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm +
bill_depth_mm * bill_length_mm + sex, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-1027.84 -216.45 -26.22 200.94 903.55
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -14086.152 1833.599 -7.682 1.83e-13 ***
bill_depth_mm 610.319 103.472 5.898 9.16e-09 ***
bill_length_mm 283.535 42.278 6.707 8.77e-11 ***
flipper_length_mm 34.146 2.396 14.253 < 2e-16 ***
sexmale 492.047 49.009 10.040 < 2e-16 ***
bill_depth_mm:bill_length_mm -15.633 2.300 -6.798 5.03e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 319.5 on 327 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8449, Adjusted R-squared: 0.8426
F-statistic: 356.3 on 5 and 327 DF, p-value: < 2.2e-16
#syntax for nested test is anova(REDUCED,COMPLETE)
anova(redpenmod2f,penmod2f)
Complete Model: \(E(bodymass)=\beta_0+\beta_1BillDepth+\beta_2BillLength+\beta_3FlipperLength+\beta_4SpeciesC+\beta_5SpeciesG+\beta_6SexM+\beta_7BillDepth*BillLength\)
We will not do any further testing as we did not have another variable of interest.
We will add the interaction between bill depth and species because we thought that looked important.
penmod3f<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex+species*bill_depth_mm,data=pendata)
summary(penmod3f)
Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm +
bill_depth_mm * bill_length_mm + species + sex + species *
bill_depth_mm, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-763.71 -170.22 -3.61 167.06 887.00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4562.344 2259.560 -2.019 0.0443 *
bill_depth_mm 241.956 126.493 1.913 0.0567 .
bill_length_mm 90.930 55.312 1.644 0.1012
flipper_length_mm 15.946 2.924 5.453 9.88e-08 ***
speciesChinstrap 742.233 911.667 0.814 0.4162
speciesGentoo 400.335 584.241 0.685 0.4937
sexmale 386.571 47.305 8.172 6.96e-15 ***
bill_depth_mm:bill_length_mm -4.100 3.160 -1.298 0.1953
bill_depth_mm:speciesChinstrap -52.218 51.172 -1.020 0.3083
bill_depth_mm:speciesGentoo 38.098 36.315 1.049 0.2949
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 283.6 on 323 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8793, Adjusted R-squared: 0.876
F-statistic: 261.6 on 9 and 323 DF, p-value: < 2.2e-16
NOTE: The model that is fit to match this syntax is:
\(E(bodymass)=\beta_0+\beta_1BillDepth+\beta_2BillLength+\beta_3FlipperLength+\beta_4SpeciesC+\beta_5SpeciesG+\beta_6SexM+\beta_7BillDepth*BillLength+\beta_8SpeciesC*BillDepth+\beta_9SpeciesG*BillDepth\)
Test the “most important predictors”: Is the species X bill depth interaction significant?: We look at a nested f test here because this interaction accounts for two parameters.
# updated reduced model
redpenmod3f<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata)
summary(redpenmod3f)
Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm +
bill_depth_mm * bill_length_mm + species + sex, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-706.90 -175.12 -6.14 176.80 880.08
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6290.446 1832.633 -3.432 0.000675 ***
bill_depth_mm 333.924 98.231 3.399 0.000760 ***
bill_length_mm 130.274 41.058 3.173 0.001653 **
flipper_length_mm 16.535 2.888 5.725 2.35e-08 ***
speciesChinstrap -200.763 82.322 -2.439 0.015273 *
speciesGentoo 923.788 132.379 6.978 1.68e-11 ***
sexmale 391.603 47.370 8.267 3.55e-15 ***
bill_depth_mm:bill_length_mm -6.344 2.290 -2.770 0.005919 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 284.4 on 325 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8778, Adjusted R-squared: 0.8752
F-statistic: 333.7 on 7 and 325 DF, p-value: < 2.2e-16
anova(redpenmod3f,penmod3f)
Complete Model: \(E(bodymass)=\beta_0+\beta_1BillDepth+\beta_2BillLength+\beta_3FlipperLength+\beta_4SpeciesC+\beta_5SpeciesG+\beta_6SexM+\beta_7BillDepth*BillLength+\beta_8SpeciesC*BillDepth+\beta_9SpeciesG*BillDepth\)
We will not do any further testing as we did not have another variable of interest.
Our final model is:
\(E(bodymass)=\beta_0+\beta_1BillDepth+\beta_2BillLength+\beta_3FlipperLenth+\beta_4SpeciesC+\beta_5SpeciesG+\beta_6SexM+\beta_7BillDepth*BillLength\)
# Refitting the final model
penmod2f<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata)
round(coef(penmod2f),2)
(Intercept) bill_depth_mm
-6290.45 333.92
bill_length_mm flipper_length_mm
130.27 16.53
speciesChinstrap speciesGentoo
-200.76 923.79
sexmale bill_depth_mm:bill_length_mm
391.60 -6.34
round(confint(penmod2f,level=.95),2)
2.5 % 97.5 %
(Intercept) -9895.77 -2685.13
bill_depth_mm 140.67 527.17
bill_length_mm 49.50 211.05
flipper_length_mm 10.85 22.22
speciesChinstrap -362.71 -38.81
speciesGentoo 663.36 1184.22
sexmale 298.41 484.79
bill_depth_mm:bill_length_mm -10.85 -1.84
The prediction Equation:
\(\widehat{bodymass}=-6290.45+333.92BillDepth+130.27 BillLength+16.53 FlipperLength-200.76SpeciesC+923.79SpeciesG+391.60SexM-6.34BillDepth*BillLength\)
where Species C= 1 if Chinstrap, 0 otherwise
SpeciesG = 1 if Gentoo, 0 otherwise
SexM = 1 if male, 0 otherwise
Quantitative Variable with No Interaction: Flipper Length
Qualitative Variable with No Interaction: SexM
Qualitative Variable with No Interaction:
Quant X Quant Interactions: Bill Length X Bill Depth
Confidence intervals when interactions are present: We cannot interpret interval estimates for main effects or interactions when there is an interaction present in the model.
\(E(bodymass)=\beta_0+\beta_1BillDepth+\beta_2BillLength+\beta_3FlipperLenth+\beta_4SpeciesC+\beta_5SpeciesG+\beta_6SexM+\beta_7BillDepth*BillLength\)
# Regression output
summary(penmod2f)
Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm +
bill_depth_mm * bill_length_mm + species + sex, data = pendata)
Residuals:
Min 1Q Median 3Q Max
-706.90 -175.12 -6.14 176.80 880.08
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6290.446 1832.633 -3.432 0.000675 ***
bill_depth_mm 333.924 98.231 3.399 0.000760 ***
bill_length_mm 130.274 41.058 3.173 0.001653 **
flipper_length_mm 16.535 2.888 5.725 2.35e-08 ***
speciesChinstrap -200.763 82.322 -2.439 0.015273 *
speciesGentoo 923.788 132.379 6.978 1.68e-11 ***
sexmale 391.603 47.370 8.267 3.55e-15 ***
bill_depth_mm:bill_length_mm -6.344 2.290 -2.770 0.005919 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 284.4 on 325 degrees of freedom
(11 observations deleted due to missingness)
Multiple R-squared: 0.8778, Adjusted R-squared: 0.8752
F-statistic: 333.7 on 7 and 325 DF, p-value: < 2.2e-16
We will cover this in Unit 3
What is the estimated body mass for a penguin with a bill depth of 17, bill length of 40, flipper length of 205, Gentoo species, and female?
# First create a data frame with the new values.
newpen<-data.frame(bill_depth_mm=17,bill_length_mm=40,flipper_length_mm=205,species="Gentoo",sex="female")
predict(penmod2f,newpen, level=.95, interval="prediction")
fit lwr upr
1 4596.685 4009.935 5183.436
We are 95% confident that a penguin with a bill depth of 17, bill length of 40, flipper length of 205, Gentoo species, and is female will have a body mass between 4009.935 and 5183 grams. Its estimated body mass is 4596.685 grams.
What is the estimated average body mass for penguins that all have a bill depth of 17, bill length of 40, flipper length of 205, Gentoo species, and female?
predict(penmod2f,newpen, level=.95, interval="confidence")
fit lwr upr
1 4596.685 4420.195 4773.176
We are 95% confident that penguins with a bill depth of 17, bill length of 40, flipper length of 205, Gentoo species, and are female will have an average body mass between 4420.195 and 4773.176 grams. Their estimated average body mass is 4596.685.
What is the estimated body mass for a penguin with a bill depth of 21, bill length of 43, flipper length of 200, Adelie species, and male?
You should get fit= 4293.63, lwr = 3726.241 and upr= 4861.019
What is the estimated average body mass for penguins that all have a bill depth of 21, bill length of 43, flipper length of 200, Adelie species, and male?
You should get fit= 4293.63, lwr = 4199.804 and upr= 4387.456
During periods of high electricity demand, especially during the hot summer months, the power output from a gas turbine engine can drop dramatically. One way to counter this drop in power is by cooling the inlet air to the gas turbine. An increasingly popular cooling method uses high pressure inlet fogging. The performance of a sample of 67 gas turbines augmented with high pressure inlet fogging was investigated in the Journal of Engineering for Gas Turbines and Power (January 2005). One measure of performance is heat rate (kilojoules per kilowatt per hour). Heat rates for the 67 gas turbines, saved in the gasturbine file. Read some of the engineering applications from this journal here:
gasturbine<-read.delim("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/GASTURBINE.txt")
head(gasturbine)
names(gasturbine)
[1] "ENGINE" "SHAFTS" "RPM" "CPRATIO" "INLET.TEMP"
[6] "EXH.TEMP" "AIRFLOW" "POWER" "HEATRATE"
hist(gasturbine$HEATRATE, xlab="Heat Rate", main="Histogram of Heat Rate")
The distribution of the response variable, heat rate, is unimodal and skewed right. It is continuous, so it should still be suitable for regression.
We will explore the relationship with quantitative variables with scatter plots and correlations and classify each relationship as linear, curvilinear, other, or none. We explore the box plots and means for each qualitative variable explanatory variable then classify the relationships as existent or not. We will not explore interactions in this example.
#Scatter plots for quantitative variables
for (i in names(gasturbine)[3:8]) {
plot((gasturbine[,i]), gasturbine$HEATRATE,xlab=i,ylab="Heat Rate")
}
#Attempt Transformations
plot(log(gasturbine$RPM), gasturbine$HEATRATE,ylab="Heat Rate")
plot(log(gasturbine$CPRATIO), gasturbine$HEATRATE,ylab="Heat Rate")
plot(log(gasturbine$POWER), gasturbine$HEATRATE,ylab="Heat Rate")
plot(log(gasturbine$AIRFLOW), gasturbine$HEATRATE,ylab="Heat Rate")
#Correlations for quantitative variables
round(cor(gasturbine[3:8],gasturbine$HEATRATE,use="complete.obs"),3)
[,1]
RPM 0.844
CPRATIO -0.735
INLET.TEMP -0.801
EXH.TEMP -0.314
AIRFLOW -0.703
POWER -0.697
round(cor(log(gasturbine[3:8]),gasturbine$HEATRATE,use="complete.obs"),3)
[,1]
RPM 0.810
CPRATIO -0.818
INLET.TEMP -0.813
EXH.TEMP -0.303
AIRFLOW -0.840
POWER -0.871
#Summary Statistics for response variable grouped by each level of the response
tapply(gasturbine$HEATRATE,gasturbine$ENGINE,summary)
$Advanced
Min. 1st Qu. Median Mean 3rd Qu. Max.
9105 9295 9669 9764 9933 11588
$Aeroderiv
Min. 1st Qu. Median Mean 3rd Qu. Max.
8714 10708 12414 12312 13697 16243
$Traditional
Min. 1st Qu. Median Mean 3rd Qu. Max.
10086 10598 11183 11544 11956 14796
tapply(gasturbine$HEATRATE,gasturbine$SHAFTS,summary)
$`1`
Min. 1st Qu. Median Mean 3rd Qu. Max.
9105 9918 10592 10930 11674 14796
$`2`
Min. 1st Qu. Median Mean 3rd Qu. Max.
10951 11223 11654 12536 13232 16243
$`3`
Min. 1st Qu. Median Mean 3rd Qu. Max.
8714 8903 9092 9092 9280 9469
#Box plots for Qualitative
boxplot(HEATRATE~ENGINE,gasturbine, ylab="Heat Rate")
boxplot(HEATRATE~SHAFTS,gasturbine, ylab="Heat Rate")
# Summary counts for qualitative variables
table(gasturbine$ENGINE,gasturbine$SHAFTS)
1 2 3
Advanced 21 0 0
Aeroderiv 1 4 2
Traditional 35 4 0
Do any of the explanatory variables have relationships with each other? We will look at pairwise correlations and VIF to evaluate multicollinearity in the quantitative explanatory variables.
Below are several methods for evaluating PAIRWISE relationships of the explanatory variables. You may use whichever is easiest to interpret for yourself.
# First Store the transformations in the data table
gasturbine$log.CPRATIO<-log(gasturbine$CPRATIO)
gasturbine$log.AIRFLOW<-log(gasturbine$AIRFLOW)
gasturbine$log.POWER<-log(gasturbine$POWER)
Method 1: Correlation Matrix
Prints the correlations of each variable with each other variable.
# Regular correlation
# Use the correlation function with no second argument
# to find the correlations with the explanatory variables with each other
gasturcor<-round(cor(gasturbine[,c(3,5,6,10:12)]),4)
gasturcor
RPM INLET.TEMP EXH.TEMP log.CPRATIO log.AIRFLOW log.POWER
RPM 1.0000 -0.5536 -0.1715 -0.5819 -0.9177 -0.9083
INLET.TEMP -0.5536 1.0000 0.7283 0.7440 0.6649 0.7221
EXH.TEMP -0.1715 0.7283 1.0000 0.1712 0.4068 0.4515
log.CPRATIO -0.5819 0.7440 0.1712 1.0000 0.5308 0.5757
log.AIRFLOW -0.9177 0.6649 0.4068 0.5308 1.0000 0.9964
log.POWER -0.9083 0.7221 0.4515 0.5757 0.9964 1.0000
Method 2: Correlation Heat Map (best in color)
# Correlation Heat Map
# The argument is the stored correlation values from the cor() function
corrplot(gasturcor)
Method 3: Scatter Plot Matrix (Best viewed on a computer or with a few number of variables and/or observations
# Scatter plot matrix
plot(gasturbine[c(3,5,6,10:12)])
Method 4: Correlations and P-vales for \(H_0: \rho=0\)
Note the top matrix in this out put are the rounded correlations, the bottom matrix are the p-values that correspond with the correlation significance test for each pair.
# A new correlation function
rcorr(as.matrix(gasturbine[,c(3,5,6,10:12)]))
RPM INLET.TEMP EXH.TEMP log.CPRATIO log.AIRFLOW log.POWER
RPM 1.00 -0.55 -0.17 -0.58 -0.92 -0.91
INLET.TEMP -0.55 1.00 0.73 0.74 0.66 0.72
EXH.TEMP -0.17 0.73 1.00 0.17 0.41 0.45
log.CPRATIO -0.58 0.74 0.17 1.00 0.53 0.58
log.AIRFLOW -0.92 0.66 0.41 0.53 1.00 1.00
log.POWER -0.91 0.72 0.45 0.58 1.00 1.00
n= 67
P
RPM INLET.TEMP EXH.TEMP log.CPRATIO log.AIRFLOW log.POWER
RPM 0.0000 0.1653 0.0000 0.0000 0.0000
INLET.TEMP 0.0000 0.0000 0.0000 0.0000 0.0000
EXH.TEMP 0.1653 0.0000 0.1660 0.0006 0.0001
log.CPRATIO 0.0000 0.0000 0.1660 0.0000 0.0000
log.AIRFLOW 0.0000 0.0000 0.0006 0.0000 0.0000
log.POWER 0.0000 0.0000 0.0001 0.0000 0.0000
There is concern of strong pairwise relationships with several pairs of variables.
Next we evaluate the severity of MULTICOLLINEARITY with VIF by fitting a model with only all first order quantitative variables.
# Multicollinearity VIF
# First fit the model with all of the quantitative variables
gasmod1<-lm(HEATRATE~RPM+INLET.TEMP+EXH.TEMP+log.CPRATIO+log.AIRFLOW+log.POWER,data=gasturbine)
#store and round the VIF values
gasmod1vif<-round(vif(gasmod1),3)
gasmod1vif
RPM INLET.TEMP EXH.TEMP log.CPRATIO log.AIRFLOW log.POWER
10.300 24.579 7.203 7.086 1085.142 1225.107
#find the average VIF
mean(gasmod1vif)
[1] 393.2362
Yes, there is evidence of severe multicollinearity because all VIFs are MUCH greater than 10 and the average VIF is MUCH greater than 3 at 393.2362.
Because we have quite a few variables and severe multicollinearity, we need to address that. It is not clear from EDA what variables should remain and which variables should be removed.
We will use iterative variable selection procedures to narrow down our quantitative variables to a best set of predictors. We will use the entry and remain significance levels of 0.15
REMEMBER: Variable Screening’s main purpose is to narrow down quantitative predictors. It can be a useful tool when multicollinearity is present, but that is not its only function.
# Note: Details=T gives all of the steps within each process
# In most cases, you can set Details=F
#Default: prem = 0.3
ols_step_backward_p(gasmod1,p_rem=0.15,details=T)
Backward Elimination Method
---------------------------
Candidate Terms:
1. RPM
2. INLET.TEMP
3. EXH.TEMP
4. log.CPRATIO
5. log.AIRFLOW
6. log.POWER
Step => 0
Model => HEATRATE ~ RPM + INLET.TEMP + EXH.TEMP + log.CPRATIO + log.AIRFLOW + log.POWER
R2 => 0.984
Initiating stepwise selection...
Step => 1
Removed => log.CPRATIO
Model => HEATRATE ~ RPM + INLET.TEMP + EXH.TEMP + log.AIRFLOW + log.POWER
R2 => 0.98358
No more variables to be removed.
Variables Removed:
=> log.CPRATIO
Stepwise Summary
--------------------------------------------------------------------------
Step Variable AIC SBC SBIC R2 Adj. R2
--------------------------------------------------------------------------
0 Full Model 917.568 935.205 729.036 0.98369 0.98206
1 log.CPRATIO 916.000 931.433 727.152 0.98358 0.98224
--------------------------------------------------------------------------
Final Model Output
------------------
Model Summary
-------------------------------------------------------------------
R 0.992 RMSE 202.837
R-Squared 0.984 MSE 41142.862
Adj. R-Squared 0.982 Coef. Var 1.921
Pred R-Squared 0.975 AIC 916.000
MAE 152.893 SBC 931.433
-------------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
---------------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
---------------------------------------------------------------------------
Regression 165140636.668 5 33028127.334 730.877 0.0000
Residual 2756571.779 61 45189.701
Total 167897208.448 66
---------------------------------------------------------------------------
Parameter Estimates
--------------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
--------------------------------------------------------------------------------------------------
(Intercept) 43134.882 1973.376 21.858 0.000 39188.873 47080.890
RPM 0.066 0.012 0.291 5.568 0.000 0.042 0.090
INLET.TEMP 1.337 0.813 0.115 1.644 0.105 -0.289 2.962
EXH.TEMP 12.538 1.098 0.347 11.422 0.000 10.343 14.733
log.AIRFLOW 7529.364 539.316 7.262 13.961 0.000 6450.935 8607.794
log.POWER -7347.619 503.882 -8.082 -14.582 0.000 -8355.194 -6340.044
--------------------------------------------------------------------------------------------------
#default: penter = 0.3
ols_step_forward_p(gasmod1,p_enter=0.15,details=T)
Forward Selection Method
------------------------
Candidate Terms:
1. RPM
2. INLET.TEMP
3. EXH.TEMP
4. log.CPRATIO
5. log.AIRFLOW
6. log.POWER
Step => 0
Model => HEATRATE ~ 1
R2 => 0
Initiating stepwise selection...
Selection Metrics Table
------------------------------------------------------------------
Predictor Pr(>|t|) R-Squared Adj. R-Squared AIC
------------------------------------------------------------------
log.POWER 0.00000 0.758 0.755 1088.159
RPM 0.00000 0.712 0.708 1099.849
log.AIRFLOW 0.00000 0.705 0.701 1101.457
log.CPRATIO 0.00000 0.670 0.665 1109.066
INLET.TEMP 0.00000 0.641 0.635 1114.713
EXH.TEMP 0.00959 0.099 0.085 1176.358
------------------------------------------------------------------
Step => 1
Selected => log.POWER
Model => HEATRATE ~ log.POWER
R2 => 0.758
Selection Metrics Table
------------------------------------------------------------------
Predictor Pr(>|t|) R-Squared Adj. R-Squared AIC
------------------------------------------------------------------
log.CPRATIO 0.00000 0.909 0.906 1024.868
log.AIRFLOW 0.00000 0.865 0.861 1051.171
INLET.TEMP 2e-05 0.820 0.814 1070.432
RPM 0.03660 0.774 0.767 1085.549
EXH.TEMP 0.14832 0.766 0.759 1087.954
------------------------------------------------------------------
Step => 2
Selected => log.CPRATIO
Model => HEATRATE ~ log.POWER + log.CPRATIO
R2 => 0.909
Selection Metrics Table
------------------------------------------------------------------
Predictor Pr(>|t|) R-Squared Adj. R-Squared AIC
------------------------------------------------------------------
log.AIRFLOW 0.00236 0.921 0.918 1016.961
RPM 0.10842 0.913 0.908 1024.106
EXH.TEMP 0.27423 0.911 0.906 1025.587
INLET.TEMP 0.45619 0.910 0.905 1026.273
------------------------------------------------------------------
Step => 3
Selected => log.AIRFLOW
Model => HEATRATE ~ log.POWER + log.CPRATIO + log.AIRFLOW
R2 => 0.921
Selection Metrics Table
----------------------------------------------------------------
Predictor Pr(>|t|) R-Squared Adj. R-Squared AIC
----------------------------------------------------------------
EXH.TEMP 0.00000 0.975 0.974 941.630
RPM 0.00000 0.947 0.943 993.145
INLET.TEMP 0.00000 0.944 0.941 995.852
----------------------------------------------------------------
Step => 4
Selected => EXH.TEMP
Model => HEATRATE ~ log.POWER + log.CPRATIO + log.AIRFLOW + EXH.TEMP
R2 => 0.975
Selection Metrics Table
----------------------------------------------------------------
Predictor Pr(>|t|) R-Squared Adj. R-Squared AIC
----------------------------------------------------------------
RPM 0.00000 0.983 0.982 916.887
INLET.TEMP 0.76662 0.975 0.973 943.533
----------------------------------------------------------------
Step => 5
Selected => RPM
Model => HEATRATE ~ log.POWER + log.CPRATIO + log.AIRFLOW + EXH.TEMP + RPM
R2 => 0.983
Selection Metrics Table
----------------------------------------------------------------
Predictor Pr(>|t|) R-Squared Adj. R-Squared AIC
----------------------------------------------------------------
INLET.TEMP 0.27902 0.984 0.982 917.568
----------------------------------------------------------------
Step => 6
Selected => INLET.TEMP
Model => HEATRATE ~ log.POWER + log.CPRATIO + log.AIRFLOW + EXH.TEMP + RPM + INLET.TEMP
R2 => 0.984
Variables Selected:
=> log.POWER
=> log.CPRATIO
=> log.AIRFLOW
=> EXH.TEMP
=> RPM
=> INLET.TEMP
Stepwise Summary
----------------------------------------------------------------------------
Step Variable AIC SBC SBIC R2 Adj. R2
----------------------------------------------------------------------------
0 Base Model 1181.327 1185.737 987.298 0.00000 0.00000
1 log.POWER 1088.159 1094.773 892.613 0.75839 0.75467
2 log.CPRATIO 1024.868 1033.687 828.649 0.90882 0.90597
3 log.AIRFLOW 1016.961 1027.984 819.495 0.92135 0.91761
4 EXH.TEMP 941.630 954.858 748.697 0.97520 0.97360
5 RPM 916.887 932.320 727.870 0.98336 0.98200
6 INLET.TEMP 917.568 935.205 729.036 0.98369 0.98206
----------------------------------------------------------------------------
Final Model Output
------------------
Model Summary
-------------------------------------------------------------------
R 0.992 RMSE 202.184
R-Squared 0.984 MSE 40878.488
Adj. R-Squared 0.982 Coef. Var 1.931
Pred R-Squared 0.975 AIC 917.568
MAE 154.185 SBC 935.205
-------------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
---------------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
---------------------------------------------------------------------------
Regression 165158349.755 6 27526391.626 603.019 0.0000
Residual 2738858.692 60 45647.645
Total 167897208.448 66
---------------------------------------------------------------------------
Parameter Estimates
--------------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
--------------------------------------------------------------------------------------------------
(Intercept) 43116.228 1983.576 21.737 0.000 39148.486 47083.970
log.POWER -7433.155 524.714 -8.176 -14.166 0.000 -8482.740 -6383.570
log.CPRATIO 157.821 253.354 0.027 0.623 0.536 -348.962 664.604
log.AIRFLOW 7624.558 563.172 7.354 13.539 0.000 6498.047 8751.070
EXH.TEMP 13.259 1.599 0.367 8.291 0.000 10.060 16.457
RPM 0.067 0.012 0.295 5.575 0.000 0.043 0.091
INLET.TEMP 1.036 0.949 0.089 1.092 0.279 -0.861 2.934
--------------------------------------------------------------------------------------------------
#Default: pent = 0.1, prem = 0.3
ols_step_both_p(gasmod1,p_ent=0.15,p_rem=0.15,details=T)
Stepwise Selection Method
-------------------------
Candidate Terms:
1. RPM
2. INLET.TEMP
3. EXH.TEMP
4. log.CPRATIO
5. log.AIRFLOW
6. log.POWER
Step => 0
Model => HEATRATE ~ 1
R2 => 0
Initiating stepwise selection...
Step => 1
Selected => log.POWER
Model => HEATRATE ~ log.POWER
R2 => 0.758
Step => 2
Selected => log.CPRATIO
Model => HEATRATE ~ log.POWER + log.CPRATIO
R2 => 0.909
Step => 3
Selected => log.AIRFLOW
Model => HEATRATE ~ log.POWER + log.CPRATIO + log.AIRFLOW
R2 => 0.921
Step => 4
Selected => EXH.TEMP
Model => HEATRATE ~ log.POWER + log.CPRATIO + log.AIRFLOW + EXH.TEMP
R2 => 0.975
Step => 5
Removed => log.CPRATIO
Model => HEATRATE ~ log.POWER + log.AIRFLOW + EXH.TEMP
=> 0.97519
Step => 6
Selected => RPM
Model => HEATRATE ~ log.POWER + log.AIRFLOW + EXH.TEMP + RPM
R2 => 0.983
Step => 7
Selected => INLET.TEMP
Model => HEATRATE ~ log.POWER + log.AIRFLOW + EXH.TEMP + RPM + INLET.TEMP
R2 => 0.984
Stepwise Summary
--------------------------------------------------------------------------------
Step Variable AIC SBC SBIC R2 Adj. R2
--------------------------------------------------------------------------------
0 Base Model 1181.327 1185.737 987.298 0.00000 0.00000
1 log.POWER (+) 1088.159 1094.773 892.613 0.75839 0.75467
2 log.CPRATIO (+) 1024.868 1033.687 828.649 0.90882 0.90597
3 log.AIRFLOW (+) 1016.961 1027.984 819.495 0.92135 0.91761
4 EXH.TEMP (+) 941.630 954.858 748.697 0.97520 0.97360
5 log.CPRATIO (-) 939.661 950.684 747.255 0.97519 0.97401
6 RPM (+) 916.905 930.133 727.384 0.98285 0.98175
7 INLET.TEMP (+) 916.000 931.433 727.152 0.98358 0.98224
--------------------------------------------------------------------------------
Final Model Output
------------------
Model Summary
-------------------------------------------------------------------
R 0.992 RMSE 202.837
R-Squared 0.984 MSE 41142.862
Adj. R-Squared 0.982 Coef. Var 1.921
Pred R-Squared 0.975 AIC 916.000
MAE 152.893 SBC 931.433
-------------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
---------------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
---------------------------------------------------------------------------
Regression 165140636.668 5 33028127.334 730.877 0.0000
Residual 2756571.779 61 45189.701
Total 167897208.448 66
---------------------------------------------------------------------------
Parameter Estimates
--------------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
--------------------------------------------------------------------------------------------------
(Intercept) 43134.882 1973.376 21.858 0.000 39188.873 47080.890
log.POWER -7347.619 503.882 -8.082 -14.582 0.000 -8355.194 -6340.044
log.AIRFLOW 7529.364 539.316 7.262 13.961 0.000 6450.935 8607.794
EXH.TEMP 12.538 1.098 0.347 11.422 0.000 10.343 14.733
RPM 0.066 0.012 0.291 5.568 0.000 0.042 0.090
INLET.TEMP 1.337 0.813 0.115 1.644 0.105 -0.289 2.962
--------------------------------------------------------------------------------------------------
Variables selected through Stepwise Regression: log.power, log.airflow, exh.temp, rpm, inlet.temp
In this instance the three procedures resulted in slightly different subsets of variables. This may not always happen.
What happens when we change the level of entry/remain?
Let’s just look at the stepwise procedure: Run this setting p_entry and p_remain to thier defaults then set both to 0.001 And set details = F.
# stepwise regression
ols_step_both_p(gasmod1,p_ent=0.1,p_rem=0.3,details=F)
Stepwise Summary
--------------------------------------------------------------------------------
Step Variable AIC SBC SBIC R2 Adj. R2
--------------------------------------------------------------------------------
0 Base Model 1181.327 1185.737 987.298 0.00000 0.00000
1 log.POWER (+) 1088.159 1094.773 892.613 0.75839 0.75467
2 log.CPRATIO (+) 1024.868 1033.687 828.649 0.90882 0.90597
3 log.AIRFLOW (+) 1016.961 1027.984 819.495 0.92135 0.91761
4 EXH.TEMP (+) 941.630 954.858 748.697 0.97520 0.97360
5 log.CPRATIO (-) 939.661 950.684 747.255 0.97519 0.97401
6 RPM (+) 916.905 930.133 727.384 0.98285 0.98175
--------------------------------------------------------------------------------
Final Model Output
------------------
Model Summary
-------------------------------------------------------------------
R 0.991 RMSE 207.283
R-Squared 0.983 MSE 42966.285
Adj. R-Squared 0.982 Coef. Var 1.947
Pred R-Squared 0.975 AIC 916.905
MAE 157.004 SBC 930.133
-------------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
---------------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
---------------------------------------------------------------------------
Regression 165018467.384 4 41254616.846 888.509 0.0000
Residual 2878741.064 62 46431.307
Total 167897208.448 66
---------------------------------------------------------------------------
Parameter Estimates
--------------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
--------------------------------------------------------------------------------------------------
(Intercept) 40313.966 988.372 40.788 0.000 38338.239 42289.693
log.POWER -6604.773 226.163 -7.265 -29.204 0.000 -7056.867 -6152.679
log.AIRFLOW 6737.029 245.470 6.498 27.445 0.000 6246.342 7227.715
EXH.TEMP 13.352 0.993 0.370 13.446 0.000 11.367 15.337
RPM 0.062 0.012 0.272 5.264 0.000 0.038 0.085
--------------------------------------------------------------------------------------------------
ols_step_both_p(gasmod1,p_ent=0.001,p_rem=0.001,details=F)
Stepwise Summary
--------------------------------------------------------------------------------
Step Variable AIC SBC SBIC R2 Adj. R2
--------------------------------------------------------------------------------
0 Base Model 1181.327 1185.737 987.298 0.00000 0.00000
1 log.POWER (+) 1088.159 1094.773 892.613 0.75839 0.75467
2 log.CPRATIO (+) 1024.868 1033.687 828.649 0.90882 0.90597
--------------------------------------------------------------------------------
Final Model Output
------------------
Model Summary
--------------------------------------------------------------------
R 0.953 RMSE 478.010
R-Squared 0.909 MSE 228493.597
Adj. R-Squared 0.906 Coef. Var 4.420
Pred R-Squared 0.897 AIC 1024.868
MAE 330.596 SBC 1033.687
--------------------------------------------------------------------
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
AIC: Akaike Information Criteria
SBC: Schwarz Bayesian Criteria
ANOVA
---------------------------------------------------------------------------
Sum of
Squares DF Mean Square F Sig.
---------------------------------------------------------------------------
Regression 152588137.418 2 76294068.709 318.949 0.0000
Residual 15309071.030 64 239204.235
Total 167897208.448 66
---------------------------------------------------------------------------
Parameter Estimates
--------------------------------------------------------------------------------------------------
model Beta Std. Error Std. Beta t Sig lower upper
--------------------------------------------------------------------------------------------------
(Intercept) 23961.951 581.450 41.211 0.000 22800.371 25123.531
log.POWER -543.491 41.970 -0.598 -12.950 0.000 -627.334 -459.647
log.CPRATIO -2737.915 266.450 -0.474 -10.276 0.000 -3270.210 -2205.620
--------------------------------------------------------------------------------------------------
Summarize your results.
| Type | p_ent/p_rem=.15 | p_ent=.1 andp_rem=.3 | p_ent/p_rem=.0001 |
|---|---|---|---|
| Selected Parameters | log.power, log.airflow, exh.temp, rpm, inlet.temp | log.power, log.airflow, exh.temp, rpm | log.power, log.CPRatio |
Since our main goal from this particular EDA was to eliminate severe multicollinearity. We will first check if this new subset of predictors does that.
#updated model
gasmod2<-lm(HEATRATE~RPM+INLET.TEMP+EXH.TEMP+log.AIRFLOW+log.POWER,data=gasturbine)
gasmod2vif<-round(vif(gasmod2),3)
gasmod2vif
RPM INLET.TEMP EXH.TEMP log.AIRFLOW log.POWER
10.163 18.231 3.429 1005.242 1141.209
mean(gasmod2vif)
[1] 435.6548
This did not resolve our problem of severe multicollinearity. We should remove log.power, log.airflow (or both).
#updated model: REMOVE LOG.POWER
gasmod3<-lm(HEATRATE~RPM+INLET.TEMP+EXH.TEMP+log.AIRFLOW,data=gasturbine)
gasmod3vif<-round(vif(gasmod3),3)
gasmod3vif
RPM INLET.TEMP EXH.TEMP log.AIRFLOW
10.057 3.575 3.334 10.683
mean(gasmod3vif)
[1] 6.91225
#updated model: REMOVE LOG.AIRFLOW
gasmod4<-lm(HEATRATE~RPM+INLET.TEMP+EXH.TEMP+log.POWER,data=gasturbine)
gasmod4vif<-round(vif(gasmod4),3)
gasmod4vif
RPM INLET.TEMP EXH.TEMP log.POWER
9.777 3.676 3.271 12.128
mean(gasmod4vif)
[1] 7.213
#updated model: REMOVE BOTH
gasmod5<-lm(HEATRATE~RPM+INLET.TEMP+EXH.TEMP,data=gasturbine)
gasmod5vif<-round(vif(gasmod5),3)
gasmod5vif
RPM INLET.TEMP EXH.TEMP
1.727 3.570 2.551
mean(gasmod5vif)
[1] 2.616
Of the three models above- which seems to resolve our concern best?
It appears the model without power and airflow have resolved the issue of severe multicollinearity. HOWEVER, we may have also decided to remove a combination of RPM and power/airflow, as that also had a strong relationship. As a researcher, or with your client, you may discuss which of those three variables would be most important to analyze moving forward. Here, we will remove the transformations as we want the most least complex model.
All of what we’ve done so far - in this gas turbine example - is still just our exploratory data analysis (EDA) phase of narrowing down predictors. We will combine our graphical interpretations with the iterative process to select quantitative variables to begin the stages of model building. Then adding in other qualitative variables and interactions that seemed important. RECALL: