Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program set out to study the impact of Ecological Sexual Dimorphism in arctic penguins. In 2014, that particular study compared deep-dive foraging habits of male and female penguins and how that resulted in different biological outcomes. To read more abut it, you can go to read the full study here. And you can find more of Dr. Gorman’s more recent work here

We only have a portion of the data from the larger scale study, so our goals and scope of analysis will be different. Our aim is to determine which factors impact the body mass of a penguin. This is useful to know as body mass is an important indicator of penguin’s health.

Our data table contains the following columns:

Unit 2.1 Qualitative Variables

Step 1: Collect the Data

pendata<-read.csv("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/penguins.csv")
head(pendata)
names(pendata)
[1] "Observation"       "body_mass_g"       "species"          
[4] "island"            "bill_length_mm"    "bill_depth_mm"    
[7] "flipper_length_mm" "sex"               "year"             
hist(pendata$body_mass_g, xlab="Body Mass", main="Histogram of Body Mass (in grams)") 

ADD YOUR NOTES ON THE SHAPE + DISTRIBUTION OF RESPONSE:

Step 2: Hypothesize Relationship (Exploratory Data Analysis)

RECALL: Our goal through the exploratory data analysis phase is to identify the individual relationships between the explanatory variables and the response.

In this example we will evaluate the qualitative variables species (with three levels) and sex (with two levels)

#Summary Statistics for response variable grouped by each level of the response
tapply(pendata$body_mass_g,pendata$species,summary)
$Adelie
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   2850    3350    3700    3701    4000    4775       1 

$Chinstrap
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2700    3488    3700    3733    3950    4800 

$Gentoo
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   3950    4700    5000    5076    5500    6300       1 
tapply(pendata$body_mass_g,pendata$sex,summary)
$female
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2700    3350    3650    3862    4550    5200 

$male
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3250    3900    4300    4546    5312    6300 
#Box plots for species and sex
boxplot(body_mass_g~species,pendata, ylab="Body Mass (grams)")

boxplot(body_mass_g~sex,pendata, ylab="Body Mass (grams)")

\(E(bodymass)=\beta_0+\beta_1 SpeciesC+\beta_2 SpeciesG+\beta_3 SexM\)

What does each \(\beta\) coefficient represent?

Step 3: Estimate the model parameters (fit the model using R)

#note: the syntax in R does not necessarily match the parameters
penmod1<-lm(body_mass_g~species+sex,data=pendata)
summary(penmod1)

Call:
lm(formula = body_mass_g ~ species + sex, data = pendata)

Residuals:
    Min      1Q  Median      3Q     Max 
-816.87 -217.80  -16.87  227.61  882.20 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       3372.39      31.43 107.308   <2e-16 ***
speciesChinstrap    26.92      46.48   0.579    0.563    
speciesGentoo     1377.86      39.10  35.236   <2e-16 ***
sexmale            667.56      34.70  19.236   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 316.6 on 329 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8468,    Adjusted R-squared:  0.8454 
F-statistic: 606.1 on 3 and 329 DF,  p-value: < 2.2e-16

The prediction equation is:

\(\widehat{bodymass}=3372.39+26.92SpeciesC+1377.86SpeciesG+667.56SexM\)

Interpret the estimates

  • The estimated mean body mass for Adelie female penguins is 3372.39 grams.
  • The estimated mean body mass for Chinstrap penguins is 26.92 grams higher than Adelie penguins, given gender is held constant.
  • The estimated mean body mass for Gentoo penguins is 1377.86 grams higher than Adelie penguins, given gender is held constant.
  • The estimated mean body mass for male penguins is 667.56 grams higher than female penguins, given species is held constant.

Note

We can recode our base level using the relevel function. This is optional, but could be useful if we have a particular interest group.

In this example, Gentoo appears to be the most different from the other species, so it might be useful to define it as the base level. Then each of our estimates will compare to Gentoo.

#This is only done if we desire to change the base level 
recodedSpecies<-relevel(factor(pendata$species),ref="Gentoo")
penmod6<-lm(body_mass_g~recodedSpecies,data=pendata)
summary(penmod6)

Call:
lm(formula = body_mass_g ~ recodedSpecies, data = pendata)

Residuals:
     Min       1Q   Median       3Q      Max 
-1126.02  -333.09   -33.09   316.91  1223.98 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)              5076.02      41.68  121.78   <2e-16 ***
recodedSpeciesAdelie    -1375.35      56.15  -24.50   <2e-16 ***
recodedSpeciesChinstrap -1342.93      69.86  -19.22   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 462.3 on 339 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.6697,    Adjusted R-squared:  0.6677 
F-statistic: 343.6 on 2 and 339 DF,  p-value: < 2.2e-16

Step 4: Specify the distribution of the errors and find the estimate of the variance

Step 5: Evaluate the Utility of the model

Recall the model:

\(E(bodymass)=\beta_0+\beta_1 SpeciesC+\beta_2 SpeciesG+\beta_3 SexM\)

First we Perform the Global F Test:

  • Hypotheses:

    • \(H_0: \beta_1= \beta_2=\beta_3=0\) (the model is not adequate)
    • \(H_a\):at least one of \(\beta_1 , \beta_2 , \beta_3\) does not equal 0 (the model is adequate)
  • Distribution of test statistic: F with 3, 329 DF

  • Test Statistic: F=606.1

  • Pvalue: <0.0000001

  • Decision: pvalue<0.05 -> REJECT H0

  • Conclusion: The model with species and sex is adequate at body mass for the penguins.

What terms can we test “individually?”

  • Species is determined by two parameters (since SpeciesC and SpeciesG are the dummy variables), so we would want to keep or remove BOTH of those dummy variables. We will cover this later.

  • We can test sex because it is only associated with one parameter in the model.

  • Hypotheses:

    • \(H_0: \beta_3=0\)
    • \(H_a:\beta_3 \neq 0\)
  • Distribution of test statistic: T with 329 DF

  • Test Statistic: t=19.236

  • Pvalue: <0.0000001

  • Decision: pvalue<0.05 -> REJECT H0

  • Conclusion: The sex of the penguin is significant at predicting the body mass of a penguin given species is a constant in the model. We will keep it in the model.

Quantitative predictors

Note we can include both qualitative and quantitative predictors in the model.

penmod2<-lm(body_mass_g~species+sex+bill_depth_mm,data=pendata)
summary(penmod2)

Call:
lm(formula = body_mass_g ~ species + sex + bill_depth_mm, data = pendata)

Residuals:
   Min     1Q Median     3Q    Max 
-851.2 -187.7   -3.5  221.7  936.3 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       1576.99     357.08   4.416 1.37e-05 ***
speciesChinstrap    19.44      44.87   0.433    0.665    
speciesGentoo     1721.70      77.88  22.106  < 2e-16 ***
sexmale            513.99      45.24  11.360  < 2e-16 ***
bill_depth_mm      102.04      20.22   5.046 7.48e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 305.5 on 328 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8578,    Adjusted R-squared:  0.8561 
F-statistic: 494.8 on 4 and 328 DF,  p-value: < 2.2e-16

Write the model that was fit

\(E(bodymass)=\beta_0+\beta_1 SpeciesC+\beta_2 SpeciesG+\beta_3 SexM+\beta_4BillDepth\)

  • where SpeciesC= 1 if Chinstrap, 0 otherwise
  • SpeciesG - 1 if Gentoo, 0 otherwise
  • SexM = 1 if male, 0 otherwise

Interpret the estimates

  • Quantitative Variable:
    • Parameter: \(\beta_4\) For each millimeter increase in bill depth, the body weight of a penguin to increases by \(\beta_4\) grams, given species and sex are held constant.
      • Estimate \(\hat{\beta_4}\) =102.04 For each millimeter increase in bill depth, we expect the body weight of a penguin to increase by 102.04 grams, given species and sex are held constant.
  • Qualitative Variable:
    • Parameter:\(\beta_3\) The true mean body mass for male penguins is β_3 grams higher than female penguins, given species and bill depth are held constant.
    • Estimate: \(\hat{\beta_3}\) =513.99 The estimated mean body mass for male penguins is 513.99 grams higher than female penguins, given species and bill depth are held constant.
  • Interpret species coefficients: \(\beta_1\) and \(\beta_2\)

Visualizing Quantitative and Qualitative Variables

Create a scatter plot of the quantitative variable vs the response. Then label the observations by the qualitative variable.

#Bill Depth / Species plot with no regression lines
plot(x=pendata$bill_depth_mm, y=pendata$body_mass_g,pch=as.numeric(factor(pendata$species)),xlab="Bill Depth",ylab="Body Mass (grams)") 
legend("topright",legend = levels(factor(pendata$species)),
       pch = 1:nlevels(factor(pendata$species)))

NOTE: When we group the observations on the scatter plot for Bill Depth by species, the relationships for each level are positive, despite the overall negative trend. Also we see each species would have a similar slope, but different y-intercepts.

Let’s suppose we simplify the model to just species and bill depth:

\(E(body mass)=\beta_0+\beta_1SpeciesC+ \beta_2SpeciesG+\beta_3 BillDepth\) , where SpeciesC= 1 if Chinstrap, 0 otherwise and SpeciesG = 1 if Gentoo, 0 otherwise

  • The body mass- bill depth y intercept for Adelie (SpeciesC=0,SpeciesG=0) will be: \(\beta_0\)
  • The body mass- bill depth y intercept for Chinstrap (SpeciesC=1,SpeciesG=0) will be: \(\beta_0+\beta_1\)
  • The body mass- bill depth y intercept for Gentoo (SpeciesC=0,SpeciesG=1) will be: \(\beta_0+\beta_2\)
  • All species will have the same body mass- bill depth slope of \(\beta_3\)

What if there were different slopes for each group? We will explore this next time!

Unit 2.2 Interactions

We will explore the various styles of interactions through several combinations of variables.

Qualitative X Quantitative

EDA: Grouped Scatter Plot (Species X Bill Depth)

If an interaction is present, we would note non-parallel (different) slopes for each level of the qualitative variable.

# Bill Depth / Species Interaction Scatter Plot with regression lines
plot(x=pendata$bill_depth_mm, y=pendata$body_mass_g,pch=as.numeric(factor(pendata$species)),xlab="Bill Depth",ylab="Body Mass (grams)") 
legend("topright",legend = levels(factor(pendata$species)),
       pch = 1:nlevels(factor(pendata$species)))
for (g in levels(as.factor(pendata$species))) {
  group_subset <- subset(pendata, species == g)
  model <- lm(body_mass_g ~ bill_depth_mm, data = group_subset)
  abline(model,lwd = 1)
}

There does not appear to be strong evidence of an interaction because the bill depth-body mass slopes for each species appear to be the same.

How would these “different slopes” show up in the model?

Write a model for E(y) as a function of bill depth and species that hypothesizes different bill depth-body mass slopes for each species.

\(E(body mass)=\beta_0+\beta_1SpeciesC+ \beta_2SpeciesG+\beta_3 BillDepth\)

\(+\beta_4SpeciesC*BillDepth+ \beta_5SpeciesG*BillDepth\)

, where SpeciesC= 1 if Chinstrap, 0 otherwise and SpeciesG = 1 if Gentoo, 0 otherwise

  • The coefficient (slope) for BillDepth is \(\beta_3+\beta_4SpeciesC+ \beta_5SpeciesG\)
    • The bill depth-body mass slope for Adelie (SpeciesC=0,SpeciesG=0) will be: \(\beta_3\)

    • The bill depth-body mass slope for Chinstrap (SpeciesC=1,SpeciesG=0) will be: \(\beta_3+\beta_4\)

    • The bill depth-body mass slope for Gentoo (SpeciesC=0,SpeciesG=1) will be: \(\beta_3+\beta_5\)

    • In the last section we saw that by just including the dummy variables, we account for different y-intercepts for each species. By adding an interaction, we allow for different slopes.

Fit the model with the interaction Species X Bill Depth

penmod3<-lm(body_mass_g~species+bill_depth_mm+species*bill_depth_mm,data=pendata)
summary(penmod3)

Call:
lm(formula = body_mass_g ~ species + bill_depth_mm + species * 
    bill_depth_mm, data = pendata)

Residuals:
    Min      1Q  Median      3Q     Max 
-845.89 -254.74  -28.46  228.01 1161.41 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     -283.28     437.94  -0.647   0.5182    
speciesChinstrap                 247.06     829.77   0.298   0.7661    
speciesGentoo                   -175.71     658.43  -0.267   0.7897    
bill_depth_mm                    217.15      23.82   9.117   <2e-16 ***
speciesChinstrap:bill_depth_mm   -12.53      45.01  -0.278   0.7809    
speciesGentoo:bill_depth_mm      152.29      40.49   3.761   0.0002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 354.9 on 336 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.807, Adjusted R-squared:  0.8041 
F-statistic:   281 on 5 and 336 DF,  p-value: < 2.2e-16
#Note we can also simplify the syntax
penmod3a<-lm(body_mass_g~species*bill_depth_mm,data=pendata)
summary(penmod3a)

Call:
lm(formula = body_mass_g ~ species * bill_depth_mm, data = pendata)

Residuals:
    Min      1Q  Median      3Q     Max 
-845.89 -254.74  -28.46  228.01 1161.41 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     -283.28     437.94  -0.647   0.5182    
speciesChinstrap                 247.06     829.77   0.298   0.7661    
speciesGentoo                   -175.71     658.43  -0.267   0.7897    
bill_depth_mm                    217.15      23.82   9.117   <2e-16 ***
speciesChinstrap:bill_depth_mm   -12.53      45.01  -0.278   0.7809    
speciesGentoo:bill_depth_mm      152.29      40.49   3.761   0.0002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 354.9 on 336 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.807, Adjusted R-squared:  0.8041 
F-statistic:   281 on 5 and 336 DF,  p-value: < 2.2e-16

Write the prediction equation

\(\widehat{body mass}=-283.28+247.06SpeciesC-175.71SpeciesG+217.15 BillDepth\)

\(-12.53SpeciesC*BillDepth+ 152.29SpeciesG*BillDepth\)

Interpret the slope estimate for bill depth

  • The slope is estimated as \(217.15 -12.53SpeciesC+ 152.29SpeciesG\)
    • For a one millimeter increase in bill depth, the expected body mass will increase by 217.15 grams for the Adelie penguins.
    • For a one millimeter increase in bill depth, the expected body mass will increase by 204.62 (= 217.15-12.53) grams for the Chinstrap penguins.
    • For a one millimeter increase in bill depth, the expected body mass will increase by 369.44 (= 217.15+152.29) grams for the Gentoo penguins.

Qualitative X Qualitative

EDA: Interaction Plot (Sex X Species)

If an interaction is present, we would notice extreme crossing of levels. The relationship of one variable with the response DEPENDS on the other variable.

# We need to verify there are non-zero observations for every combination of level
table(pendata$species,pendata$sex)
           
            female male
  Adelie        73   73
  Chinstrap     34   34
  Gentoo        58   61
# We can plot interactions either way
interaction.plot(pendata$species, pendata$sex, pendata$body_mass_g,fun=mean,trace.label="Sex", xlab="Species",ylab="Mean Body Mass")

interaction.plot(pendata$sex, pendata$species, pendata$body_mass_g,fun=mean,trace.label="Species", xlab="Sex",ylab="Mean Body Mass")

There is not strong evidence of an interaction because there is not drastic switching of the levels.

Write a model for E(y) as a function of sex and species and its interaction.

\(E(body mass)=\beta_0+\beta_1SpeciesC+ \beta_2SpeciesG+\beta_3 SexM\)

\(+\beta_4SpeciesC*SexM+ \beta_5SpeciesG*SexM\)

  • Coefficient for SpeciesC = \(\beta_1 +\beta_4*SexM\)

  • Coefficient for SpeciesG = \(\beta_2 +\beta_5*SexM\)

  • Coefficient for SexM = \(\beta_3 +\beta_4*SpeciesC+\beta_5*SpeciesG\)

  • There are many ways to interpret these betas. For instance, the average value of males’ body mass depends on the species. Or the average body mass of a particular species depends on the sex of the penguin. Remember qualitative variables do not have “slope” coefficients. Instead we interpret as the (difference in the) average value of y, given a combination of categorical levels.

Fit the model with the interaction Species X Sex

penmod4<-lm(body_mass_g~species+sex+species*sex,data=pendata)
summary(penmod4)

Call:
lm(formula = body_mass_g ~ species + sex + species * sex, data = pendata)

Residuals:
    Min      1Q  Median      3Q     Max 
-827.21 -213.97   11.03  206.51  861.03 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)               3368.84      36.21  93.030  < 2e-16 ***
speciesChinstrap           158.37      64.24   2.465  0.01420 *  
speciesGentoo             1310.91      54.42  24.088  < 2e-16 ***
sexmale                    674.66      51.21  13.174  < 2e-16 ***
speciesChinstrap:sexmale  -262.89      90.85  -2.894  0.00406 ** 
speciesGentoo:sexmale      130.44      76.44   1.706  0.08886 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 309.4 on 327 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8546,    Adjusted R-squared:  0.8524 
F-statistic: 384.3 on 5 and 327 DF,  p-value: < 2.2e-16
#Note we can use simplified syntax
penmod4a<-lm(body_mass_g~species*sex,data=pendata)
summary(penmod4a)

Call:
lm(formula = body_mass_g ~ species * sex, data = pendata)

Residuals:
    Min      1Q  Median      3Q     Max 
-827.21 -213.97   11.03  206.51  861.03 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)               3368.84      36.21  93.030  < 2e-16 ***
speciesChinstrap           158.37      64.24   2.465  0.01420 *  
speciesGentoo             1310.91      54.42  24.088  < 2e-16 ***
sexmale                    674.66      51.21  13.174  < 2e-16 ***
speciesChinstrap:sexmale  -262.89      90.85  -2.894  0.00406 ** 
speciesGentoo:sexmale      130.44      76.44   1.706  0.08886 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 309.4 on 327 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8546,    Adjusted R-squared:  0.8524 
F-statistic: 384.3 on 5 and 327 DF,  p-value: < 2.2e-16

Write the prediction equation

\(\widehat{body mass}=3368.84+158.37SpeciesC+1310.91SpeciesG+674.66 SexM\)

\(-262.89SpeciesC*SexM+ 130.44SpeciesG*SexM\)

Plug in the appropriate 1s and 0s for the dummy variables to interpret the coefficients.

  • The estimated average body mass for Adelie female penguins is 3368.84 grams.

Estimate of coefficient for SpeciesC = \(158.37 -262.89*SexM\)

  • The estimated mean body mass for Chinstrap penguins is (158.37-262.89)=104.58 grams lower than Adelie penguins, for male penguins.

  • The estimated mean body mass for Chinstrap penguins is 158.37 higher than Adelie penguins, for female penguins.

Estimate of coefficient for SpeciesG = \(1310.91 +130.44*SexM\)

  • The estimated mean body mass for Gentoo penguins is (1310.91+130.44)=1441.35 grams higher than Adelie penguins, for male penguins.

  • The estimated mean body mass for Gentoo penguins is 1310.91 grams higher than Adelie penguins, for female penguins.

Estimate of coefficient for SexM = \(674.66 -262.89*SpeciesC+130.44*SpeciesG\)

  • The estimated mean body mass for male penguins is (674.66-262.89)=411.77 grams higher than female penguins, for Chinstrap penguins.

  • The estimated mean body mass for male penguins is (674.66+130.44)=805.1 grams higher than female penguins, for Gentoo penguins.

  • The estimated mean body mass for male penguins is 674.66 grams higher than female penguins, for Adelie penguins.

Quantitative X Quantitative

EDA: There is none (Bill Length X Bill Depth)

There is no visualization for two quantitative variables having an interaction. We base our justification off of intuition or research.

Write a model for E(y) as a function of bill depth and bill length that allows for an interaction.

\(E(body mass)=\beta_0+\beta_1BillLength+\beta_2 BillDepth\)

\(+\beta_3BillLength*BillDepth\)

  • The coefficient (slope) for BillLength is \(\beta_1+\beta_3BillDepth\)
  • The coefficient (slope) for BillDepth is \(\beta_2+\beta_3BillLength\)

Fit the model with the interaction Bill Length X Bill Depth

penmod5<-lm(body_mass_g~bill_length_mm+bill_depth_mm+bill_length_mm*bill_depth_mm,data=pendata)
summary(penmod5)

Call:
lm(formula = body_mass_g ~ bill_length_mm + bill_depth_mm + bill_length_mm * 
    bill_depth_mm, data = pendata)

Residuals:
     Min       1Q   Median       3Q      Max 
-1811.29  -355.81     4.35   354.80  1606.90 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -25583.278   2668.939  -9.586   <2e-16 ***
bill_length_mm                  715.006     58.681  12.185   <2e-16 ***
bill_depth_mm                  1484.934    149.405   9.939   <2e-16 ***
bill_length_mm:bill_depth_mm    -36.079      3.297 -10.944   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 503.5 on 338 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.6093,    Adjusted R-squared:  0.6058 
F-statistic: 175.7 on 3 and 338 DF,  p-value: < 2.2e-16
#We can used shortened syntax
penmod5a<-lm(body_mass_g~bill_length_mm*bill_depth_mm,data=pendata)
summary(penmod5a)

Call:
lm(formula = body_mass_g ~ bill_length_mm * bill_depth_mm, data = pendata)

Residuals:
     Min       1Q   Median       3Q      Max 
-1811.29  -355.81     4.35   354.80  1606.90 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -25583.278   2668.939  -9.586   <2e-16 ***
bill_length_mm                  715.006     58.681  12.185   <2e-16 ***
bill_depth_mm                  1484.934    149.405   9.939   <2e-16 ***
bill_length_mm:bill_depth_mm    -36.079      3.297 -10.944   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 503.5 on 338 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.6093,    Adjusted R-squared:  0.6058 
F-statistic: 175.7 on 3 and 338 DF,  p-value: < 2.2e-16

The prediction equation is:

\(\widehat{body mass}=-25583.278+715.006BillLength+1484.934 BillDepth-36.079 BillLength*BillDepth\)

Suppose a penguin has a bill depth of 17mm. What is the slope of bill length?

  • The slope estimate for BillLength is \(715.006-36.079*BillDepth\). When bill depth is 17mm, our slope is \(715.006-36.079*17= 101.663\). Therefore, for a one millimeter increase in bill length, the expected body mass will increase by 101.66 grams when bill depth is held constant at 17mm.

Suppose a penguin has a bill length of 40mm. What is the slope of bill depth?

  • The slope estimate for BillDepth is \(1484.934-36.079*BillLength\). When bill depth is 40mm, our slope is \(1484.934-36.079*40= 41.774\). Therefore, for a one millimeter increase in bill depth, the expected body mass will increase by 41.774 grams when bill length is held constant at 40mm.

Unit 2.3: A Complete Analysis

Step 1: Collect the Data

  • Data Compilation: These data were collected from 2007 - 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network. The data were imported directly from the Environmental Data Initiative (EDI) Data Portal, and are available for use by CC0 license (“No Rights Reserved”) in accordance with the Palmer Station Data Policy. We have no reason to question the credibility of these sources.
  • Data Organization: Are the data organized for MLR?- each row contains information about one observation (experimental unit) including all of the explanatory predictors and the response.
  • Suitability of response variable for regression: View a histogram of response variable. Primarily, it should be continuous (or discrete over a wide range of values) with few outliers. Data should not be recorded over time. It is sometimes useful if the response variable is unimodal and symmetric, but that is NOT a requirement. (Further, we will not use “rank” as a response variable in MLR. We need a variable with several observations of each response, if the variable is discrete.)
pendata<-read.csv("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/penguins.csv")
head(pendata)
names(pendata)
[1] "Observation"       "body_mass_g"       "species"          
[4] "island"            "bill_length_mm"    "bill_depth_mm"    
[7] "flipper_length_mm" "sex"               "year"             
hist(pendata$body_mass_g, xlab="Body Mass", main="Histogram of Body Mass (in grams)") 

Step 2: Hypothesize Relationship (Exploratory Data Analysis)

RECALL: Our goal through the exploratory data analysis phase is to identify the individual relationships between the explanatory variables and the response.

  • For quantitative variables, we explore the scatter plots and correlations for each explanatory variable then classify the relationships as linear, curvilinear, other, or none.

  • For qualitative variables, we explore the box plots and means value of y for each level of the qualitative explanatory variables. Then classify the relationships as existent or not.

Quantitative Variables

                    [,1]
bill_length_mm     0.595
bill_depth_mm     -0.472
flipper_length_mm  0.871
  • Bill length (0.595) and flipper length (0.871) have positive linear relationships with the response.

  • Bill depth has a negative relationship (r=-0.472). However it appears there might be groups in within with positive relationship.

Qualitative Variables

$Adelie
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   2850    3350    3700    3701    4000    4775       1 

$Chinstrap
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2700    3488    3700    3733    3950    4800 

$Gentoo
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   3950    4700    5000    5076    5500    6300       1 
$female
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2700    3350    3650    3862    4550    5200 

$male
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   3250    3900    4300    4546    5312    6300 

  • There appears to be a relationship with species and the response (body mass) because the mean value of y is quite different for each level. We see Adelie and Chinstrap have very similar means (3701 grams and 3733 grams respectively), but Gentoo is much higher at 5076 grams.

  • There appears to be a relationship with sex and the response (body mass) because the mean value of y is quite different for each level. We see males with an average body mass of 4546 grams and females much lower at 3862 grams.

Interactions of Qualitative Variables

           
            female male
  Adelie        73   73
  Chinstrap     34   34
  Gentoo        58   61

There is not strong evidence of an interaction because there is not drastic switching of the levels.

Interactions of Qualitative X Quantitative Variables

  • The only interactions that appear to be relevant in EDA are Bill Depth X Species and Flipper Length X Species as they appear to have drastically different slopes.

Step 3: Estimate the model parameters (fit the model using R)

Stage 1: Add Quantitative Variables

First we will add the quantitative variables of interest and the QUANT X QUANT interactions that we believe may be important. We will add all of the quantitative variables, as they all had moderate correlations. Here, we believe Bill Depth X Bill Length may be a useful interaction, but there is not EDA.

Hypothesized Linear Model:

\(E(bodymass)=\beta_0+\beta_1BillDepth+\beta_2BillLength+\beta_3FlipperLength+\beta_4BillDepth*BillLenth\)

penmod1f<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm,data=pendata)
summary(penmod1f)

Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm + 
    bill_depth_mm * bill_length_mm, data = pendata)

Residuals:
     Min       1Q   Median       3Q      Max 
-1161.66  -252.74   -39.99   239.65  1139.91 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -20370.313   1960.714 -10.389  < 2e-16 ***
bill_depth_mm                   858.001    114.277   7.508 5.42e-13 ***
bill_length_mm                  352.102     47.415   7.426 9.25e-13 ***
flipper_length_mm                43.350      2.486  17.441  < 2e-16 ***
bill_depth_mm:bill_length_mm    -19.071      2.585  -7.379 1.26e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 365.6 on 337 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.7946,    Adjusted R-squared:  0.7922 
F-statistic:   326 on 4 and 337 DF,  p-value: < 2.2e-16

Start with Global F test

  • Hypotheses:

    • \(H_0: \beta_1= \beta_2=\beta_3=\beta_4=0\) (the model is not adequate)
    • \(H_a\):at least one of \(\beta_1 , \beta_2 , \beta_3,\beta_4\) does not equal 0 (the model is adequate)
  • Distribution of test statistic: F with 4, 337 DF

  • Test Statistic: F=326

  • Pvalue: <0.0001

  • Decision: 0.0001<0.05 -> REJECT H0

  • Conclusion: The model with bill depth, bill length, flipper length, and the interaction of bill length and bill depth is adequate at predicting the body mass of penguins.

Test the “most important predictors”: Complex variables and research variables:

Is the interaction significant?: We look at a t-test here because this interaction accounts for just one parameter.

  • Hypotheses:
    • \(H_0: \beta_4=0\) (the interaction does not contribute to predicting body mass)
    • \(H_a:\beta_4 \neq 0\) (the interaction contributes to predicting body mass)
  • Distribution of test statistic: T with 337 DF
  • Test Statistic: t=-7.379
  • Pvalue: <0.0001
  • Decision: <0.0001<0.05 -> REJECT H0
  • Conclusion: The interaction between bill depth and bill length is significant at predicting body mass of penguins.

We will not do any further testing as we have not specified other variables of interest.

Stage 2: Add Qualitative variables

We will add the qualitative variables to the model from above. We did not believe the interaction between species and sex looked important, so we will not add it.

penmod2f<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata)
summary(penmod2f)

Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm + 
    bill_depth_mm * bill_length_mm + species + sex, data = pendata)

Residuals:
    Min      1Q  Median      3Q     Max 
-706.90 -175.12   -6.14  176.80  880.08 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -6290.446   1832.633  -3.432 0.000675 ***
bill_depth_mm                  333.924     98.231   3.399 0.000760 ***
bill_length_mm                 130.274     41.058   3.173 0.001653 ** 
flipper_length_mm               16.535      2.888   5.725 2.35e-08 ***
speciesChinstrap              -200.763     82.322  -2.439 0.015273 *  
speciesGentoo                  923.788    132.379   6.978 1.68e-11 ***
sexmale                        391.603     47.370   8.267 3.55e-15 ***
bill_depth_mm:bill_length_mm    -6.344      2.290  -2.770 0.005919 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 284.4 on 325 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8778,    Adjusted R-squared:  0.8752 
F-statistic: 333.7 on 7 and 325 DF,  p-value: < 2.2e-16

NOTE: The model that is fit to match this syntax is:

It is ordered this way despite the order in the lm() function because R builds the model with first order terms listed first

Hypothesized Linear Model:

\(E(bodymass)=\beta_0+\beta_1BillDepth+\beta_2BillLength+\beta_3FlipperLength+\beta_4SpeciesC+\beta_5SpeciesG+\beta_6SexM+\beta_7BillDepth*BillLength\)

  • where Species C= 1 if Chinstrap, 0 otherwise
  • SpeciesG = 1 if Gentoo, 0 otherwise
  • SexM = 1 if male, 0 otherwise

Test the “most important predictors”:

Is the species significant?: We look at a nested f test here because species accounts for two parameters.

#fit the reduced model
redpenmod2f<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+sex,data=pendata)
summary(redpenmod2f)

Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm + 
    bill_depth_mm * bill_length_mm + sex, data = pendata)

Residuals:
     Min       1Q   Median       3Q      Max 
-1027.84  -216.45   -26.22   200.94   903.55 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -14086.152   1833.599  -7.682 1.83e-13 ***
bill_depth_mm                   610.319    103.472   5.898 9.16e-09 ***
bill_length_mm                  283.535     42.278   6.707 8.77e-11 ***
flipper_length_mm                34.146      2.396  14.253  < 2e-16 ***
sexmale                         492.047     49.009  10.040  < 2e-16 ***
bill_depth_mm:bill_length_mm    -15.633      2.300  -6.798 5.03e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 319.5 on 327 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8449,    Adjusted R-squared:  0.8426 
F-statistic: 356.3 on 5 and 327 DF,  p-value: < 2.2e-16
#syntax for nested test is anova(REDUCED,COMPLETE)
anova(redpenmod2f,penmod2f)

Complete Model: \(E(bodymass)=\beta_0+\beta_1BillDepth+\beta_2BillLength+\beta_3FlipperLength+\beta_4SpeciesC+\beta_5SpeciesG+\beta_6SexM+\beta_7BillDepth*BillLength\)

  • Hypotheses:
    • \(H_0: \beta_4=\beta_5=0\) (the species does not contribute to predicting body mass)
    • \(H_a:\beta_4, \beta_5 \neq 0\) (the species contributes to predicting body mass)
    • Reduced Model (assume null is true): \(E(bodymass)=\beta_0+\beta_1BillDepth+\beta_2BillLength+\beta_3FlipperLength+\beta_6SexM+\beta_7BillDepth*BillLength\)
  • Distribution of test statistic: F 2 with 325 DF
  • Test Statistic: F=43.795
  • Pvalue: <0.0001
  • Decision: <0.0001<0.05 -> REJECT H0
  • Conclusion: The species is significant at predicting body mass of penguins. We will keep both dummy variables in the model and not test them individually.

We will not do any further testing as we did not have another variable of interest.

Stage 3: Add Qual X Quant Interaction

We will add the interaction between bill depth and species because we thought that looked important.

penmod3f<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex+species*bill_depth_mm,data=pendata)
summary(penmod3f)

Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm + 
    bill_depth_mm * bill_length_mm + species + sex + species * 
    bill_depth_mm, data = pendata)

Residuals:
    Min      1Q  Median      3Q     Max 
-763.71 -170.22   -3.61  167.06  887.00 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    -4562.344   2259.560  -2.019   0.0443 *  
bill_depth_mm                    241.956    126.493   1.913   0.0567 .  
bill_length_mm                    90.930     55.312   1.644   0.1012    
flipper_length_mm                 15.946      2.924   5.453 9.88e-08 ***
speciesChinstrap                 742.233    911.667   0.814   0.4162    
speciesGentoo                    400.335    584.241   0.685   0.4937    
sexmale                          386.571     47.305   8.172 6.96e-15 ***
bill_depth_mm:bill_length_mm      -4.100      3.160  -1.298   0.1953    
bill_depth_mm:speciesChinstrap   -52.218     51.172  -1.020   0.3083    
bill_depth_mm:speciesGentoo       38.098     36.315   1.049   0.2949    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 283.6 on 323 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8793,    Adjusted R-squared:  0.876 
F-statistic: 261.6 on 9 and 323 DF,  p-value: < 2.2e-16

NOTE: The model that is fit to match this syntax is:

\(E(bodymass)=\beta_0+\beta_1BillDepth+\beta_2BillLength+\beta_3FlipperLength+\beta_4SpeciesC+\beta_5SpeciesG+\beta_6SexM+\beta_7BillDepth*BillLength+\beta_8SpeciesC*BillDepth+\beta_9SpeciesG*BillDepth\)

  • where Species C= 1 if Chinstrap, 0 otherwise
  • SpeciesG = 1 if Gentoo, 0 otherwise
  • SexM = 1 if male, 0 otherwise

Test the “most important predictors”: Is the species X bill depth interaction significant?: We look at a nested f test here because this interaction accounts for two parameters.

# updated reduced model
redpenmod3f<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata)
summary(redpenmod3f)

Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm + 
    bill_depth_mm * bill_length_mm + species + sex, data = pendata)

Residuals:
    Min      1Q  Median      3Q     Max 
-706.90 -175.12   -6.14  176.80  880.08 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -6290.446   1832.633  -3.432 0.000675 ***
bill_depth_mm                  333.924     98.231   3.399 0.000760 ***
bill_length_mm                 130.274     41.058   3.173 0.001653 ** 
flipper_length_mm               16.535      2.888   5.725 2.35e-08 ***
speciesChinstrap              -200.763     82.322  -2.439 0.015273 *  
speciesGentoo                  923.788    132.379   6.978 1.68e-11 ***
sexmale                        391.603     47.370   8.267 3.55e-15 ***
bill_depth_mm:bill_length_mm    -6.344      2.290  -2.770 0.005919 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 284.4 on 325 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8778,    Adjusted R-squared:  0.8752 
F-statistic: 333.7 on 7 and 325 DF,  p-value: < 2.2e-16
anova(redpenmod3f,penmod3f)

Complete Model: \(E(bodymass)=\beta_0+\beta_1BillDepth+\beta_2BillLength+\beta_3FlipperLength+\beta_4SpeciesC+\beta_5SpeciesG+\beta_6SexM+\beta_7BillDepth*BillLength+\beta_8SpeciesC*BillDepth+\beta_9SpeciesG*BillDepth\)

  • Hypotheses:
    • \(H_0: \beta_8=\beta_9=0\) (the interaction does not contribute to predicting body mass)
    • \(H_a:\beta_8, \beta_9 \neq 0\) (the interaction contributes to predicting body mass)
    • Reduced Model (assume null is true): \(E(bodymass)=\beta_0+\beta_1BillDepth+\beta_2BillLength+\beta_3FlipperLength+\beta_4SpeciesC+\beta_5SpeciesG+\beta_6SexM+\beta_7BillDepth*BillLength\)
  • Distribution of test statistic: F 2 with 323 DF
  • Test Statistic: F=1.99
  • Pvalue: 0.137
  • Decision: 0.137>0.05 -> FAIL TO REJECT H0
  • Conclusion: The interaction between species and bill depth is not significant at predicting body mass of penguins. We will remove the interaction, thus removing \(\beta_8 and \beta_9\) from the model.

We will not do any further testing as we did not have another variable of interest.

Our final model is:

\(E(bodymass)=\beta_0+\beta_1BillDepth+\beta_2BillLength+\beta_3FlipperLenth+\beta_4SpeciesC+\beta_5SpeciesG+\beta_6SexM+\beta_7BillDepth*BillLength\)

  • where Species C= 1 if Chinstrap, 0 otherwise
  • SpeciesG = 1 if Gentoo, 0 otherwise
  • SexM = 1 if male, 0 otherwise

Interpret the coefficients and confidence intervals for beta coefficients

# Refitting the final model
penmod2f<-lm(body_mass_g~bill_depth_mm+bill_length_mm+flipper_length_mm+bill_depth_mm*bill_length_mm+species+sex,data=pendata)
round(coef(penmod2f),2)
                 (Intercept)                bill_depth_mm 
                    -6290.45                       333.92 
              bill_length_mm            flipper_length_mm 
                      130.27                        16.53 
            speciesChinstrap                speciesGentoo 
                     -200.76                       923.79 
                     sexmale bill_depth_mm:bill_length_mm 
                      391.60                        -6.34 
round(confint(penmod2f,level=.95),2)
                                2.5 %   97.5 %
(Intercept)                  -9895.77 -2685.13
bill_depth_mm                  140.67   527.17
bill_length_mm                  49.50   211.05
flipper_length_mm               10.85    22.22
speciesChinstrap              -362.71   -38.81
speciesGentoo                  663.36  1184.22
sexmale                        298.41   484.79
bill_depth_mm:bill_length_mm   -10.85    -1.84

The prediction Equation:

\(\widehat{bodymass}=-6290.45+333.92BillDepth+130.27 BillLength+16.53 FlipperLength-200.76SpeciesC+923.79SpeciesG+391.60SexM-6.34BillDepth*BillLength\)

  • where Species C= 1 if Chinstrap, 0 otherwise

  • SpeciesG = 1 if Gentoo, 0 otherwise

  • SexM = 1 if male, 0 otherwise

  • Quantitative Variable with No Interaction: Flipper Length

    • Estimate \(\hat{\beta_3}\) =16.53. For each millimeter increase in bill depth, we expect the body weight of a penguin to increase by 16.53 grams, given the bill length, bill depth, and species are held constant
    • Interval: We are 95% confident that for each millimeter increase in bill depth, actual the body weight of a penguin increases by between 10.85 and 22.22 grams, given the bill length, bill depth, and species are held constant
  • Qualitative Variable with No Interaction: SexM

    • Estimate: \(\hat{\beta_6}\) =391.60 The estimated mean body mass for male penguins is 391.60 grams higher than female penguins, given bill length, bill depth, flipper length, and species are held constant.
    • Interval: We are 95% confident that the actual mean body mass for male penguins is between 298.41 and 484.79 grams higher than female penguins, given bill length, bill depth, flipper length, and species are held constant.
  • Qualitative Variable with No Interaction:

    • Interpret species coefficients: \(\hat{\beta_4}\) and \(\hat{\beta_5}\)
  • Quant X Quant Interactions: Bill Length X Bill Depth

    • We cannot interpret \(\hat{\beta_1}\) and \(\hat{\beta_2}\) individually since Bill Length and Bill Depth have an interaction in the model.
    • The slope estimate for BillDepth is \(333.92-6.34 *BillLength\). For a one millimeter increase in bill depth, the expected body mass will increase by \(333.92-6.34 *BillLength\) grams given bill length, flipper length, sex, and species are held constant
    • The slope estimate for BillLength is \(130.27-6.34 *BillDepth\). For a one millimeter increase in bill length, the expected body mass will increase by \(130.27-6.34 *BillDepth\) grams given bill depth, flipper length, sex, and species are held constant
  • Confidence intervals when interactions are present: We cannot interpret interval estimates for main effects or interactions when there is an interaction present in the model.

Step 5: Evaluate the Utility of the model

\(E(bodymass)=\beta_0+\beta_1BillDepth+\beta_2BillLength+\beta_3FlipperLenth+\beta_4SpeciesC+\beta_5SpeciesG+\beta_6SexM+\beta_7BillDepth*BillLength\)

  • where Species C= 1 if Chinstrap, 0 otherwise
  • SpeciesG = 1 if Gentoo, 0 otherwise
  • SexM = 1 if male, 0 otherwise
# Regression output
summary(penmod2f)

Call:
lm(formula = body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm + 
    bill_depth_mm * bill_length_mm + species + sex, data = pendata)

Residuals:
    Min      1Q  Median      3Q     Max 
-706.90 -175.12   -6.14  176.80  880.08 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -6290.446   1832.633  -3.432 0.000675 ***
bill_depth_mm                  333.924     98.231   3.399 0.000760 ***
bill_length_mm                 130.274     41.058   3.173 0.001653 ** 
flipper_length_mm               16.535      2.888   5.725 2.35e-08 ***
speciesChinstrap              -200.763     82.322  -2.439 0.015273 *  
speciesGentoo                  923.788    132.379   6.978 1.68e-11 ***
sexmale                        391.603     47.370   8.267 3.55e-15 ***
bill_depth_mm:bill_length_mm    -6.344      2.290  -2.770 0.005919 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 284.4 on 325 degrees of freedom
  (11 observations deleted due to missingness)
Multiple R-squared:  0.8778,    Adjusted R-squared:  0.8752 
F-statistic: 333.7 on 7 and 325 DF,  p-value: < 2.2e-16

Further Assessment:

  • Global F test: (F=333.7, p<0.00001) The model is adequate at body mass for the penguins.
  • RMSE: Approximately 95% of our predictions will be within \(2*284.4=568.8\) grams of the actual body mass for the penguins.
  • Adjusted R-Sq: 0.87
    • It is close to R2, which indicates a good fit.
    • After adjusting for the number of observations and parameters in the model, 87.5% of the variation in body mass is explained by the model with bill depth, bill length, flipper length, species, sex, and the interaction between bill depth and bill length.
  • Confidence Interval for Betas Do they appear wide or narrow?

Step 6: Check the Model Assumptions

We will cover this in Unit 3

Step 7: Use the model for prediction or estimation

Prediction:

What is the estimated body mass for a penguin with a bill depth of 17, bill length of 40, flipper length of 205, Gentoo species, and female?

# First create a data frame with the new values.
newpen<-data.frame(bill_depth_mm=17,bill_length_mm=40,flipper_length_mm=205,species="Gentoo",sex="female")
predict(penmod2f,newpen, level=.95, interval="prediction")
       fit      lwr      upr
1 4596.685 4009.935 5183.436

We are 95% confident that a penguin with a bill depth of 17, bill length of 40, flipper length of 205, Gentoo species, and is female will have a body mass between 4009.935 and 5183 grams. Its estimated body mass is 4596.685 grams.

Estimation:

What is the estimated average body mass for penguins that all have a bill depth of 17, bill length of 40, flipper length of 205, Gentoo species, and female?

predict(penmod2f,newpen, level=.95, interval="confidence")
       fit      lwr      upr
1 4596.685 4420.195 4773.176

We are 95% confident that penguins with a bill depth of 17, bill length of 40, flipper length of 205, Gentoo species, and are female will have an average body mass between 4420.195 and 4773.176 grams. Their estimated average body mass is 4596.685.

Practice for your self

What is the estimated body mass for a penguin with a bill depth of 21, bill length of 43, flipper length of 200, Adelie species, and male?

You should get fit= 4293.63, lwr = 3726.241 and upr= 4861.019

What is the estimated average body mass for penguins that all have a bill depth of 21, bill length of 43, flipper length of 200, Adelie species, and male?

You should get fit= 4293.63, lwr = 4199.804 and upr= 4387.456

Unit 2.4 Multicollinearity and Variable Screening

During periods of high electricity demand, especially during the hot summer months, the power output from a gas turbine engine can drop dramatically. One way to counter this drop in power is by cooling the inlet air to the gas turbine. An increasingly popular cooling method uses high pressure inlet fogging. The performance of a sample of 67 gas turbines augmented with high pressure inlet fogging was investigated in the Journal of Engineering for Gas Turbines and Power (January 2005). One measure of performance is heat rate (kilojoules per kilowatt per hour). Heat rates for the 67 gas turbines, saved in the gasturbine file. Read some of the engineering applications from this journal here:

Step 1: Collect the Data

  • Data Compilation: These data were collected from researchers in a reputable journal.
  • Data Organization: Are the data organized for MLR?- each row contains information about one observation (experimental unit) including all of the explanatory predictors and the response.
  • Suitability of response variable for regression: View a histogram of response variable. Primarily, it should be continuous with few outliers. Data should not be recorded over time. It is sometimes useful if the response variable is unimodal and symmetric, but that is NOT a requirement. (Further, we will not use “rank” as a response variable in MLR. We need a continuous variable with several values of each response, if the variable is discrete.)
gasturbine<-read.delim("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/GASTURBINE.txt")
head(gasturbine)
names(gasturbine)
[1] "ENGINE"     "SHAFTS"     "RPM"        "CPRATIO"    "INLET.TEMP"
[6] "EXH.TEMP"   "AIRFLOW"    "POWER"      "HEATRATE"  
hist(gasturbine$HEATRATE, xlab="Heat Rate", main="Histogram of Heat Rate") 

The distribution of the response variable, heat rate, is unimodal and skewed right. It is continuous, so it should still be suitable for regression.

Step 2: Hypothesize Relationship

Exploratory Data Analysis

We will explore the relationship with quantitative variables with scatter plots and correlations and classify each relationship as linear, curvilinear, other, or none. We explore the box plots and means for each qualitative variable explanatory variable then classify the relationships as existent or not. We will not explore interactions in this example.

#Scatter plots for quantitative variables
for (i in names(gasturbine)[3:8]) {
  plot((gasturbine[,i]), gasturbine$HEATRATE,xlab=i,ylab="Heat Rate")
}

#Attempt Transformations
plot(log(gasturbine$RPM), gasturbine$HEATRATE,ylab="Heat Rate")

plot(log(gasturbine$CPRATIO), gasturbine$HEATRATE,ylab="Heat Rate")

plot(log(gasturbine$POWER), gasturbine$HEATRATE,ylab="Heat Rate")

plot(log(gasturbine$AIRFLOW), gasturbine$HEATRATE,ylab="Heat Rate")

#Correlations for quantitative variables
round(cor(gasturbine[3:8],gasturbine$HEATRATE,use="complete.obs"),3)
             [,1]
RPM         0.844
CPRATIO    -0.735
INLET.TEMP -0.801
EXH.TEMP   -0.314
AIRFLOW    -0.703
POWER      -0.697
round(cor(log(gasturbine[3:8]),gasturbine$HEATRATE,use="complete.obs"),3)
             [,1]
RPM         0.810
CPRATIO    -0.818
INLET.TEMP -0.813
EXH.TEMP   -0.303
AIRFLOW    -0.840
POWER      -0.871
#Summary Statistics for response variable grouped by each level of the response
tapply(gasturbine$HEATRATE,gasturbine$ENGINE,summary)
$Advanced
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   9105    9295    9669    9764    9933   11588 

$Aeroderiv
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   8714   10708   12414   12312   13697   16243 

$Traditional
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10086   10598   11183   11544   11956   14796 
tapply(gasturbine$HEATRATE,gasturbine$SHAFTS,summary)
$`1`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   9105    9918   10592   10930   11674   14796 

$`2`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10951   11223   11654   12536   13232   16243 

$`3`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   8714    8903    9092    9092    9280    9469 
#Box plots for Qualitative
boxplot(HEATRATE~ENGINE,gasturbine, ylab="Heat Rate")

boxplot(HEATRATE~SHAFTS,gasturbine, ylab="Heat Rate")

# Summary counts for qualitative variables
table(gasturbine$ENGINE,gasturbine$SHAFTS)
             
               1  2  3
  Advanced    21  0  0
  Aeroderiv    1  4  2
  Traditional 35  4  0
  • Most of the variables have high correlations. Exhaust temperature has the weakest correlation.
  • CPRATIO, POWER and AIRFLOW should have a log transformation because their correlations with HEATRATE were much higher, indicating a stronger linear form with that transformation.
  • Engine type appears to have a different average heat rate for the advanced group.
  • Note: Shafts=3 only has two observations. Further, we would not be able to fit an interaction between Engine and Shaft because some level combinations do not have any observations.

Multicollinearity Assessment

Do any of the explanatory variables have relationships with each other? We will look at pairwise correlations and VIF to evaluate multicollinearity in the quantitative explanatory variables.

Pairwise Correlations

Below are several methods for evaluating PAIRWISE relationships of the explanatory variables. You may use whichever is easiest to interpret for yourself.

# First Store the transformations in the data table
gasturbine$log.CPRATIO<-log(gasturbine$CPRATIO)
gasturbine$log.AIRFLOW<-log(gasturbine$AIRFLOW)
gasturbine$log.POWER<-log(gasturbine$POWER)

Method 1: Correlation Matrix

Prints the correlations of each variable with each other variable.

# Regular correlation
# Use the correlation function with no second argument
# to find the correlations with the explanatory variables with each other

gasturcor<-round(cor(gasturbine[,c(3,5,6,10:12)]),4)
gasturcor
                RPM INLET.TEMP EXH.TEMP log.CPRATIO log.AIRFLOW log.POWER
RPM          1.0000    -0.5536  -0.1715     -0.5819     -0.9177   -0.9083
INLET.TEMP  -0.5536     1.0000   0.7283      0.7440      0.6649    0.7221
EXH.TEMP    -0.1715     0.7283   1.0000      0.1712      0.4068    0.4515
log.CPRATIO -0.5819     0.7440   0.1712      1.0000      0.5308    0.5757
log.AIRFLOW -0.9177     0.6649   0.4068      0.5308      1.0000    0.9964
log.POWER   -0.9083     0.7221   0.4515      0.5757      0.9964    1.0000

Method 2: Correlation Heat Map (best in color)

# Correlation Heat Map
# The argument is the stored correlation values from the cor() function
corrplot(gasturcor)

Method 3: Scatter Plot Matrix (Best viewed on a computer or with a few number of variables and/or observations

# Scatter plot matrix
plot(gasturbine[c(3,5,6,10:12)])

Method 4: Correlations and P-vales for \(H_0: \rho=0\)

Note the top matrix in this out put are the rounded correlations, the bottom matrix are the p-values that correspond with the correlation significance test for each pair.

# A new correlation function
rcorr(as.matrix(gasturbine[,c(3,5,6,10:12)]))
              RPM INLET.TEMP EXH.TEMP log.CPRATIO log.AIRFLOW log.POWER
RPM          1.00      -0.55    -0.17       -0.58       -0.92     -0.91
INLET.TEMP  -0.55       1.00     0.73        0.74        0.66      0.72
EXH.TEMP    -0.17       0.73     1.00        0.17        0.41      0.45
log.CPRATIO -0.58       0.74     0.17        1.00        0.53      0.58
log.AIRFLOW -0.92       0.66     0.41        0.53        1.00      1.00
log.POWER   -0.91       0.72     0.45        0.58        1.00      1.00

n= 67 


P
            RPM    INLET.TEMP EXH.TEMP log.CPRATIO log.AIRFLOW log.POWER
RPM                0.0000     0.1653   0.0000      0.0000      0.0000   
INLET.TEMP  0.0000            0.0000   0.0000      0.0000      0.0000   
EXH.TEMP    0.1653 0.0000              0.1660      0.0006      0.0001   
log.CPRATIO 0.0000 0.0000     0.1660               0.0000      0.0000   
log.AIRFLOW 0.0000 0.0000     0.0006   0.0000                  0.0000   
log.POWER   0.0000 0.0000     0.0001   0.0000      0.0000               

There is concern of strong pairwise relationships with several pairs of variables.

  • log.Power & log.airflow (r=0.92)
  • log.Power & inlet_temp (r=0.72)
  • log.airflow & RPM (r=0.92)

Multicollinearity

Next we evaluate the severity of MULTICOLLINEARITY with VIF by fitting a model with only all first order quantitative variables.

# Multicollinearity VIF
# First fit the model with all of the quantitative variables
gasmod1<-lm(HEATRATE~RPM+INLET.TEMP+EXH.TEMP+log.CPRATIO+log.AIRFLOW+log.POWER,data=gasturbine)

#store and round the VIF values
gasmod1vif<-round(vif(gasmod1),3)
gasmod1vif
        RPM  INLET.TEMP    EXH.TEMP log.CPRATIO log.AIRFLOW   log.POWER 
     10.300      24.579       7.203       7.086    1085.142    1225.107 
#find the average VIF
mean(gasmod1vif)
[1] 393.2362

Yes, there is evidence of severe multicollinearity because all VIFs are MUCH greater than 10 and the average VIF is MUCH greater than 3 at 393.2362.

Significance Based Variable Screening

Because we have quite a few variables and severe multicollinearity, we need to address that. It is not clear from EDA what variables should remain and which variables should be removed.

We will use iterative variable selection procedures to narrow down our quantitative variables to a best set of predictors. We will use the entry and remain significance levels of 0.15

REMEMBER: Variable Screening’s main purpose is to narrow down quantitative predictors. It can be a useful tool when multicollinearity is present, but that is not its only function.

Backwards Elimination

# Note: Details=T gives all of the steps within each process
# In most cases, you can set Details=F


#Default: prem = 0.3
ols_step_backward_p(gasmod1,p_rem=0.15,details=T)
Backward Elimination Method 
---------------------------

Candidate Terms: 

1. RPM 
2. INLET.TEMP 
3. EXH.TEMP 
4. log.CPRATIO 
5. log.AIRFLOW 
6. log.POWER 


Step   => 0 
Model  => HEATRATE ~ RPM + INLET.TEMP + EXH.TEMP + log.CPRATIO + log.AIRFLOW + log.POWER 
R2     => 0.984 

Initiating stepwise selection... 

Step     => 1 
Removed  => log.CPRATIO 
Model    => HEATRATE ~ RPM + INLET.TEMP + EXH.TEMP + log.AIRFLOW + log.POWER 
R2       => 0.98358 


No more variables to be removed.

Variables Removed: 

=> log.CPRATIO 

                              Stepwise Summary                              
--------------------------------------------------------------------------
Step    Variable         AIC        SBC       SBIC        R2       Adj. R2 
--------------------------------------------------------------------------
 0      Full Model     917.568    935.205    729.036    0.98369    0.98206 
 1      log.CPRATIO    916.000    931.433    727.152    0.98358    0.98224 
--------------------------------------------------------------------------

Final Model Output 
------------------

                           Model Summary                            
-------------------------------------------------------------------
R                         0.992       RMSE                 202.837 
R-Squared                 0.984       MSE                41142.862 
Adj. R-Squared            0.982       Coef. Var              1.921 
Pred R-Squared            0.975       AIC                  916.000 
MAE                     152.893       SBC                  931.433 
-------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                   ANOVA                                    
---------------------------------------------------------------------------
                     Sum of                                                
                    Squares        DF     Mean Square       F         Sig. 
---------------------------------------------------------------------------
Regression    165140636.668         5    33028127.334    730.877    0.0000 
Residual        2756571.779        61       45189.701                      
Total         167897208.448        66                                      
---------------------------------------------------------------------------

                                       Parameter Estimates                                         
--------------------------------------------------------------------------------------------------
      model         Beta    Std. Error    Std. Beta       t        Sig         lower        upper 
--------------------------------------------------------------------------------------------------
(Intercept)    43134.882      1973.376                  21.858    0.000    39188.873    47080.890 
        RPM        0.066         0.012        0.291      5.568    0.000        0.042        0.090 
 INLET.TEMP        1.337         0.813        0.115      1.644    0.105       -0.289        2.962 
   EXH.TEMP       12.538         1.098        0.347     11.422    0.000       10.343       14.733 
log.AIRFLOW     7529.364       539.316        7.262     13.961    0.000     6450.935     8607.794 
  log.POWER    -7347.619       503.882       -8.082    -14.582    0.000    -8355.194    -6340.044 
--------------------------------------------------------------------------------------------------
  • Variables selected through Backwards elimination: rpm, inlet.temp, exh.temp, log.airflow, log.power

Forward Selection

#default: penter = 0.3
ols_step_forward_p(gasmod1,p_enter=0.15,details=T)
Forward Selection Method 
------------------------

Candidate Terms: 

1. RPM 
2. INLET.TEMP 
3. EXH.TEMP 
4. log.CPRATIO 
5. log.AIRFLOW 
6. log.POWER 


Step   => 0 
Model  => HEATRATE ~ 1 
R2     => 0 

Initiating stepwise selection... 

                     Selection Metrics Table                       
------------------------------------------------------------------
Predictor      Pr(>|t|)    R-Squared    Adj. R-Squared      AIC    
------------------------------------------------------------------
log.POWER       0.00000        0.758             0.755    1088.159 
RPM             0.00000        0.712             0.708    1099.849 
log.AIRFLOW     0.00000        0.705             0.701    1101.457 
log.CPRATIO     0.00000        0.670             0.665    1109.066 
INLET.TEMP      0.00000        0.641             0.635    1114.713 
EXH.TEMP        0.00959        0.099             0.085    1176.358 
------------------------------------------------------------------

Step      => 1 
Selected  => log.POWER 
Model     => HEATRATE ~ log.POWER 
R2        => 0.758 

                     Selection Metrics Table                       
------------------------------------------------------------------
Predictor      Pr(>|t|)    R-Squared    Adj. R-Squared      AIC    
------------------------------------------------------------------
log.CPRATIO     0.00000        0.909             0.906    1024.868 
log.AIRFLOW     0.00000        0.865             0.861    1051.171 
INLET.TEMP        2e-05        0.820             0.814    1070.432 
RPM             0.03660        0.774             0.767    1085.549 
EXH.TEMP        0.14832        0.766             0.759    1087.954 
------------------------------------------------------------------

Step      => 2 
Selected  => log.CPRATIO 
Model     => HEATRATE ~ log.POWER + log.CPRATIO 
R2        => 0.909 

                     Selection Metrics Table                       
------------------------------------------------------------------
Predictor      Pr(>|t|)    R-Squared    Adj. R-Squared      AIC    
------------------------------------------------------------------
log.AIRFLOW     0.00236        0.921             0.918    1016.961 
RPM             0.10842        0.913             0.908    1024.106 
EXH.TEMP        0.27423        0.911             0.906    1025.587 
INLET.TEMP      0.45619        0.910             0.905    1026.273 
------------------------------------------------------------------

Step      => 3 
Selected  => log.AIRFLOW 
Model     => HEATRATE ~ log.POWER + log.CPRATIO + log.AIRFLOW 
R2        => 0.921 

                    Selection Metrics Table                      
----------------------------------------------------------------
Predictor     Pr(>|t|)    R-Squared    Adj. R-Squared      AIC   
----------------------------------------------------------------
EXH.TEMP       0.00000        0.975             0.974    941.630 
RPM            0.00000        0.947             0.943    993.145 
INLET.TEMP     0.00000        0.944             0.941    995.852 
----------------------------------------------------------------

Step      => 4 
Selected  => EXH.TEMP 
Model     => HEATRATE ~ log.POWER + log.CPRATIO + log.AIRFLOW + EXH.TEMP 
R2        => 0.975 

                    Selection Metrics Table                      
----------------------------------------------------------------
Predictor     Pr(>|t|)    R-Squared    Adj. R-Squared      AIC   
----------------------------------------------------------------
RPM            0.00000        0.983             0.982    916.887 
INLET.TEMP     0.76662        0.975             0.973    943.533 
----------------------------------------------------------------

Step      => 5 
Selected  => RPM 
Model     => HEATRATE ~ log.POWER + log.CPRATIO + log.AIRFLOW + EXH.TEMP + RPM 
R2        => 0.983 

                    Selection Metrics Table                      
----------------------------------------------------------------
Predictor     Pr(>|t|)    R-Squared    Adj. R-Squared      AIC   
----------------------------------------------------------------
INLET.TEMP     0.27902        0.984             0.982    917.568 
----------------------------------------------------------------

Step      => 6 
Selected  => INLET.TEMP 
Model     => HEATRATE ~ log.POWER + log.CPRATIO + log.AIRFLOW + EXH.TEMP + RPM + INLET.TEMP 
R2        => 0.984 


Variables Selected: 

=> log.POWER 
=> log.CPRATIO 
=> log.AIRFLOW 
=> EXH.TEMP 
=> RPM 
=> INLET.TEMP 

                               Stepwise Summary                               
----------------------------------------------------------------------------
Step    Variable         AIC         SBC        SBIC        R2       Adj. R2 
----------------------------------------------------------------------------
 0      Base Model     1181.327    1185.737    987.298    0.00000    0.00000 
 1      log.POWER      1088.159    1094.773    892.613    0.75839    0.75467 
 2      log.CPRATIO    1024.868    1033.687    828.649    0.90882    0.90597 
 3      log.AIRFLOW    1016.961    1027.984    819.495    0.92135    0.91761 
 4      EXH.TEMP        941.630     954.858    748.697    0.97520    0.97360 
 5      RPM             916.887     932.320    727.870    0.98336    0.98200 
 6      INLET.TEMP      917.568     935.205    729.036    0.98369    0.98206 
----------------------------------------------------------------------------

Final Model Output 
------------------

                           Model Summary                            
-------------------------------------------------------------------
R                         0.992       RMSE                 202.184 
R-Squared                 0.984       MSE                40878.488 
Adj. R-Squared            0.982       Coef. Var              1.931 
Pred R-Squared            0.975       AIC                  917.568 
MAE                     154.185       SBC                  935.205 
-------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                   ANOVA                                    
---------------------------------------------------------------------------
                     Sum of                                                
                    Squares        DF     Mean Square       F         Sig. 
---------------------------------------------------------------------------
Regression    165158349.755         6    27526391.626    603.019    0.0000 
Residual        2738858.692        60       45647.645                      
Total         167897208.448        66                                      
---------------------------------------------------------------------------

                                       Parameter Estimates                                         
--------------------------------------------------------------------------------------------------
      model         Beta    Std. Error    Std. Beta       t        Sig         lower        upper 
--------------------------------------------------------------------------------------------------
(Intercept)    43116.228      1983.576                  21.737    0.000    39148.486    47083.970 
  log.POWER    -7433.155       524.714       -8.176    -14.166    0.000    -8482.740    -6383.570 
log.CPRATIO      157.821       253.354        0.027      0.623    0.536     -348.962      664.604 
log.AIRFLOW     7624.558       563.172        7.354     13.539    0.000     6498.047     8751.070 
   EXH.TEMP       13.259         1.599        0.367      8.291    0.000       10.060       16.457 
        RPM        0.067         0.012        0.295      5.575    0.000        0.043        0.091 
 INLET.TEMP        1.036         0.949        0.089      1.092    0.279       -0.861        2.934 
--------------------------------------------------------------------------------------------------
  • Variables selected through Forward Selection: log.power, log.cpratio, log.airflow, exh.temp, inlet.temp, rpm

Stepwise Regression

#Default: pent = 0.1, prem = 0.3
ols_step_both_p(gasmod1,p_ent=0.15,p_rem=0.15,details=T)
Stepwise Selection Method 
-------------------------

Candidate Terms: 

1. RPM 
2. INLET.TEMP 
3. EXH.TEMP 
4. log.CPRATIO 
5. log.AIRFLOW 
6. log.POWER 


Step   => 0 
Model  => HEATRATE ~ 1 
R2     => 0 

Initiating stepwise selection... 

Step      => 1 
Selected  => log.POWER 
Model     => HEATRATE ~ log.POWER 
R2        => 0.758 

Step      => 2 
Selected  => log.CPRATIO 
Model     => HEATRATE ~ log.POWER + log.CPRATIO 
R2        => 0.909 

Step      => 3 
Selected  => log.AIRFLOW 
Model     => HEATRATE ~ log.POWER + log.CPRATIO + log.AIRFLOW 
R2        => 0.921 

Step      => 4 
Selected  => EXH.TEMP 
Model     => HEATRATE ~ log.POWER + log.CPRATIO + log.AIRFLOW + EXH.TEMP 
R2        => 0.975 

Step     => 5 
Removed  => log.CPRATIO 
Model    => HEATRATE ~ log.POWER + log.AIRFLOW + EXH.TEMP 
  => 0.97519 

Step      => 6 
Selected  => RPM 
Model     => HEATRATE ~ log.POWER + log.AIRFLOW + EXH.TEMP + RPM 
R2        => 0.983 

Step      => 7 
Selected  => INLET.TEMP 
Model     => HEATRATE ~ log.POWER + log.AIRFLOW + EXH.TEMP + RPM + INLET.TEMP 
R2        => 0.984 

                                 Stepwise Summary                                 
--------------------------------------------------------------------------------
Step    Variable             AIC         SBC        SBIC        R2       Adj. R2 
--------------------------------------------------------------------------------
 0      Base Model         1181.327    1185.737    987.298    0.00000    0.00000 
 1      log.POWER (+)      1088.159    1094.773    892.613    0.75839    0.75467 
 2      log.CPRATIO (+)    1024.868    1033.687    828.649    0.90882    0.90597 
 3      log.AIRFLOW (+)    1016.961    1027.984    819.495    0.92135    0.91761 
 4      EXH.TEMP (+)        941.630     954.858    748.697    0.97520    0.97360 
 5      log.CPRATIO (-)     939.661     950.684    747.255    0.97519    0.97401 
 6      RPM (+)             916.905     930.133    727.384    0.98285    0.98175 
 7      INLET.TEMP (+)      916.000     931.433    727.152    0.98358    0.98224 
--------------------------------------------------------------------------------

Final Model Output 
------------------

                           Model Summary                            
-------------------------------------------------------------------
R                         0.992       RMSE                 202.837 
R-Squared                 0.984       MSE                41142.862 
Adj. R-Squared            0.982       Coef. Var              1.921 
Pred R-Squared            0.975       AIC                  916.000 
MAE                     152.893       SBC                  931.433 
-------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                   ANOVA                                    
---------------------------------------------------------------------------
                     Sum of                                                
                    Squares        DF     Mean Square       F         Sig. 
---------------------------------------------------------------------------
Regression    165140636.668         5    33028127.334    730.877    0.0000 
Residual        2756571.779        61       45189.701                      
Total         167897208.448        66                                      
---------------------------------------------------------------------------

                                       Parameter Estimates                                         
--------------------------------------------------------------------------------------------------
      model         Beta    Std. Error    Std. Beta       t        Sig         lower        upper 
--------------------------------------------------------------------------------------------------
(Intercept)    43134.882      1973.376                  21.858    0.000    39188.873    47080.890 
  log.POWER    -7347.619       503.882       -8.082    -14.582    0.000    -8355.194    -6340.044 
log.AIRFLOW     7529.364       539.316        7.262     13.961    0.000     6450.935     8607.794 
   EXH.TEMP       12.538         1.098        0.347     11.422    0.000       10.343       14.733 
        RPM        0.066         0.012        0.291      5.568    0.000        0.042        0.090 
 INLET.TEMP        1.337         0.813        0.115      1.644    0.105       -0.289        2.962 
--------------------------------------------------------------------------------------------------
  • Variables selected through Stepwise Regression: log.power, log.airflow, exh.temp, rpm, inlet.temp

  • In this instance the three procedures resulted in slightly different subsets of variables. This may not always happen.

P_ENT and P_REM

What happens when we change the level of entry/remain?

Let’s just look at the stepwise procedure: Run this setting p_entry and p_remain to thier defaults then set both to 0.001 And set details = F.

# stepwise regression
ols_step_both_p(gasmod1,p_ent=0.1,p_rem=0.3,details=F)

                                 Stepwise Summary                                 
--------------------------------------------------------------------------------
Step    Variable             AIC         SBC        SBIC        R2       Adj. R2 
--------------------------------------------------------------------------------
 0      Base Model         1181.327    1185.737    987.298    0.00000    0.00000 
 1      log.POWER (+)      1088.159    1094.773    892.613    0.75839    0.75467 
 2      log.CPRATIO (+)    1024.868    1033.687    828.649    0.90882    0.90597 
 3      log.AIRFLOW (+)    1016.961    1027.984    819.495    0.92135    0.91761 
 4      EXH.TEMP (+)        941.630     954.858    748.697    0.97520    0.97360 
 5      log.CPRATIO (-)     939.661     950.684    747.255    0.97519    0.97401 
 6      RPM (+)             916.905     930.133    727.384    0.98285    0.98175 
--------------------------------------------------------------------------------

Final Model Output 
------------------

                           Model Summary                            
-------------------------------------------------------------------
R                         0.991       RMSE                 207.283 
R-Squared                 0.983       MSE                42966.285 
Adj. R-Squared            0.982       Coef. Var              1.947 
Pred R-Squared            0.975       AIC                  916.905 
MAE                     157.004       SBC                  930.133 
-------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                   ANOVA                                    
---------------------------------------------------------------------------
                     Sum of                                                
                    Squares        DF     Mean Square       F         Sig. 
---------------------------------------------------------------------------
Regression    165018467.384         4    41254616.846    888.509    0.0000 
Residual        2878741.064        62       46431.307                      
Total         167897208.448        66                                      
---------------------------------------------------------------------------

                                       Parameter Estimates                                         
--------------------------------------------------------------------------------------------------
      model         Beta    Std. Error    Std. Beta       t        Sig         lower        upper 
--------------------------------------------------------------------------------------------------
(Intercept)    40313.966       988.372                  40.788    0.000    38338.239    42289.693 
  log.POWER    -6604.773       226.163       -7.265    -29.204    0.000    -7056.867    -6152.679 
log.AIRFLOW     6737.029       245.470        6.498     27.445    0.000     6246.342     7227.715 
   EXH.TEMP       13.352         0.993        0.370     13.446    0.000       11.367       15.337 
        RPM        0.062         0.012        0.272      5.264    0.000        0.038        0.085 
--------------------------------------------------------------------------------------------------
ols_step_both_p(gasmod1,p_ent=0.001,p_rem=0.001,details=F)

                                 Stepwise Summary                                 
--------------------------------------------------------------------------------
Step    Variable             AIC         SBC        SBIC        R2       Adj. R2 
--------------------------------------------------------------------------------
 0      Base Model         1181.327    1185.737    987.298    0.00000    0.00000 
 1      log.POWER (+)      1088.159    1094.773    892.613    0.75839    0.75467 
 2      log.CPRATIO (+)    1024.868    1033.687    828.649    0.90882    0.90597 
--------------------------------------------------------------------------------

Final Model Output 
------------------

                           Model Summary                             
--------------------------------------------------------------------
R                         0.953       RMSE                  478.010 
R-Squared                 0.909       MSE                228493.597 
Adj. R-Squared            0.906       Coef. Var               4.420 
Pred R-Squared            0.897       AIC                  1024.868 
MAE                     330.596       SBC                  1033.687 
--------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                   ANOVA                                    
---------------------------------------------------------------------------
                     Sum of                                                
                    Squares        DF     Mean Square       F         Sig. 
---------------------------------------------------------------------------
Regression    152588137.418         2    76294068.709    318.949    0.0000 
Residual       15309071.030        64      239204.235                      
Total         167897208.448        66                                      
---------------------------------------------------------------------------

                                       Parameter Estimates                                         
--------------------------------------------------------------------------------------------------
      model         Beta    Std. Error    Std. Beta       t        Sig         lower        upper 
--------------------------------------------------------------------------------------------------
(Intercept)    23961.951       581.450                  41.211    0.000    22800.371    25123.531 
  log.POWER     -543.491        41.970       -0.598    -12.950    0.000     -627.334     -459.647 
log.CPRATIO    -2737.915       266.450       -0.474    -10.276    0.000    -3270.210    -2205.620 
--------------------------------------------------------------------------------------------------

Summarize your results.

Type p_ent/p_rem=.15 p_ent=.1 andp_rem=.3 p_ent/p_rem=.0001
Selected Parameters log.power, log.airflow, exh.temp, rpm, inlet.temp log.power, log.airflow, exh.temp, rpm log.power, log.CPRatio

Now what?

Since our main goal from this particular EDA was to eliminate severe multicollinearity. We will first check if this new subset of predictors does that.

#updated model
gasmod2<-lm(HEATRATE~RPM+INLET.TEMP+EXH.TEMP+log.AIRFLOW+log.POWER,data=gasturbine)
gasmod2vif<-round(vif(gasmod2),3)
gasmod2vif
        RPM  INLET.TEMP    EXH.TEMP log.AIRFLOW   log.POWER 
     10.163      18.231       3.429    1005.242    1141.209 
mean(gasmod2vif)
[1] 435.6548

This did not resolve our problem of severe multicollinearity. We should remove log.power, log.airflow (or both).

#updated model: REMOVE LOG.POWER
gasmod3<-lm(HEATRATE~RPM+INLET.TEMP+EXH.TEMP+log.AIRFLOW,data=gasturbine)
gasmod3vif<-round(vif(gasmod3),3)
gasmod3vif
        RPM  INLET.TEMP    EXH.TEMP log.AIRFLOW 
     10.057       3.575       3.334      10.683 
mean(gasmod3vif)
[1] 6.91225
#updated model: REMOVE LOG.AIRFLOW
gasmod4<-lm(HEATRATE~RPM+INLET.TEMP+EXH.TEMP+log.POWER,data=gasturbine)
gasmod4vif<-round(vif(gasmod4),3)
gasmod4vif
       RPM INLET.TEMP   EXH.TEMP  log.POWER 
     9.777      3.676      3.271     12.128 
mean(gasmod4vif)
[1] 7.213
#updated model: REMOVE BOTH
gasmod5<-lm(HEATRATE~RPM+INLET.TEMP+EXH.TEMP,data=gasturbine)
gasmod5vif<-round(vif(gasmod5),3)
gasmod5vif
       RPM INLET.TEMP   EXH.TEMP 
     1.727      3.570      2.551 
mean(gasmod5vif)
[1] 2.616

Of the three models above- which seems to resolve our concern best?

It appears the model without power and airflow have resolved the issue of severe multicollinearity. HOWEVER, we may have also decided to remove a combination of RPM and power/airflow, as that also had a strong relationship. As a researcher, or with your client, you may discuss which of those three variables would be most important to analyze moving forward. Here, we will remove the transformations as we want the most least complex model.

All of what we’ve done so far - in this gas turbine example - is still just our exploratory data analysis (EDA) phase of narrowing down predictors. We will combine our graphical interpretations with the iterative process to select quantitative variables to begin the stages of model building. Then adding in other qualitative variables and interactions that seemed important. RECALL:

  • Stage 1: Add Quantitative Predictors, higher order terms, and the QUANT X QUANT interactions
  • Stage 2: Add Qualitative predictors and QUAL X QUAL interactions
  • Stage 3: Add QUANT X QUAL Interactions