During periods of high electricity demand, especially during the hot summer months, the power output from a gas turbine engine can drop dramatically. One way to counter this drop in power is by cooling the inlet air to the gas turbine. An increasingly popular cooling method uses high pressure inlet fogging. The performance of a sample of 67 gas turbines augmented with high pressure inlet fogging was investigated in the Journal of Engineering for Gas Turbines and Power (January 2005). One measure of performance is heat rate (kilojoules per kilowatt per hour). Heat rates for the 67 gas turbines, saved in the gasturbine file.

Unit 3.1: Multicollinearity and Variable Screening

Step 1: Collect the Data

Check the appropriateness of response variable for regression: View a histogram of response variable. It should be continuous, and approximately unimodal and symmetric, with few outliers.

gasturbine<-read.delim("https://raw.githubusercontent.com/kvaranyak4/STAT3220/main/GASTURBINE.txt")
head(gasturbine)
names(gasturbine)
[1] "ENGINE"     "SHAFTS"     "RPM"        "CPRATIO"    "INLET.TEMP"
[6] "EXH.TEMP"   "AIRFLOW"    "POWER"      "HEATRATE"  
hist(gasturbine$HEATRATE, xlab="Heat Rate", main="Histogram of Heat Rate") 

The distribution of the response variable, heat rate, is unimodal and skewed right. It is continuous, so it should still be suitable for regression.

Step 2: Hypothesize Relationship (Exploratory Data Analysis)

We will explore the relationship with quantitative variables with scatter plots and correlations and classify each relationship as linear, curvilinear, or none. We explore the box plots and means for each qualitative variable explanatory variable then classify the relationships as existent or not. We will not explore interactions in this example.

#Scatter plots for quantitative variables
for (i in names(gasturbine)[3:8]) {
  plot(gasturbine[,i], gasturbine$HEATRATE,xlab=i,ylab="Heat Rate")
}

#Correlations for quantitative variables
round(cor(gasturbine[3:8],gasturbine$HEATRATE,use="complete.obs"),3)
             [,1]
RPM         0.844
CPRATIO    -0.735
INLET.TEMP -0.801
EXH.TEMP   -0.314
AIRFLOW    -0.703
POWER      -0.697
#Summary Statistics for response variable grouped by each level of the response
tapply(gasturbine$HEATRATE,gasturbine$ENGINE,summary)
$Advanced
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   9105    9295    9669    9764    9933   11588 

$Aeroderiv
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   8714   10708   12414   12312   13697   16243 

$Traditional
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10086   10598   11183   11544   11956   14796 
tapply(gasturbine$HEATRATE,gasturbine$SHAFTS,summary)
$`1`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   9105    9918   10592   10930   11674   14796 

$`2`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10951   11223   11654   12536   13232   16243 

$`3`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   8714    8903    9092    9092    9280    9469 
#Box plots for Qualitative
boxplot(HEATRATE~ENGINE,gasturbine, ylab="Heat Rate")

boxplot(HEATRATE~SHAFTS,gasturbine, ylab="Heat Rate")

# Summary counts for qualitative variables
table(gasturbine$ENGINE,gasturbine$SHAFTS)
             
               1  2  3
  Advanced    21  0  0
  Aeroderiv    1  4  2
  Traditional 35  4  0
  • Most of the variables have high correlations. Exhaust temperature has the weakest correlation.
  • Power has a curvilinear or potentially a negative log relationship.
  • Engine type appears to have a different average heat rate for the advanced group.
  • Note: Shafts=3 only has two observations. Further, we would not be able to fit an interaction between Engine and Shaft because some level combinations do not have any observations.

Step 2 (continued): Multicollinearity

Do any of the explanatory variables have relationships with each other? We will look at pairwise correlations and VIF to evaluate multicollinearity in the quantitative explanatory variables.

#Regular correlation
# Use the correlation function with no second argument
# to find the correlations with the explanaotry varibles with eachother
gasturcor<-round(cor(gasturbine[,3:8]),4)
gasturcor
               RPM CPRATIO INLET.TEMP EXH.TEMP AIRFLOW   POWER
RPM         1.0000 -0.4903    -0.5536  -0.1715 -0.6876 -0.6169
CPRATIO    -0.4903  1.0000     0.6851   0.1139  0.3826  0.4473
INLET.TEMP -0.5536  0.6851     1.0000   0.7283  0.6808  0.7503
EXH.TEMP   -0.1715  0.1139     0.7283   1.0000  0.5665  0.6309
AIRFLOW    -0.6876  0.3826     0.6808   0.5665  1.0000  0.9776
POWER      -0.6169  0.4473     0.7503   0.6309  0.9776  1.0000
#Correlation Visualization
# The argument is the stored correlation values from the cor() function
corrplot(gasturcor)

# Scatter plot matrix
plot(gasturbine[3:8])

#A new correlation function
# this will also output the assocaited p-values with Ho: rho=0
rcorr(as.matrix(gasturbine[,3:8]))
             RPM CPRATIO INLET.TEMP EXH.TEMP AIRFLOW POWER
RPM         1.00   -0.49      -0.55    -0.17   -0.69 -0.62
CPRATIO    -0.49    1.00       0.69     0.11    0.38  0.45
INLET.TEMP -0.55    0.69       1.00     0.73    0.68  0.75
EXH.TEMP   -0.17    0.11       0.73     1.00    0.57  0.63
AIRFLOW    -0.69    0.38       0.68     0.57    1.00  0.98
POWER      -0.62    0.45       0.75     0.63    0.98  1.00

n= 67 


P
           RPM    CPRATIO INLET.TEMP EXH.TEMP AIRFLOW POWER 
RPM               0.0000  0.0000     0.1653   0.0000  0.0000
CPRATIO    0.0000         0.0000     0.3585   0.0014  0.0001
INLET.TEMP 0.0000 0.0000             0.0000   0.0000  0.0000
EXH.TEMP   0.1653 0.3585  0.0000              0.0000  0.0000
AIRFLOW    0.0000 0.0014  0.0000     0.0000           0.0000
POWER      0.0000 0.0001  0.0000     0.0000   0.0000        

There is concern of strong pairwise relationships.

  • Power & airflow (r=0.977)
  • Power & inlet_temp (r=0.75)
  • inlet_temp & exh_temp (r=0.73)
#Multicollinearity VIF
# First fit the model with all of the quantitative variables
gasmod1<-lm(HEATRATE~.-ENGINE-SHAFTS,data=gasturbine)
# Syntax Note: We can use the . to indicate all the variables in the data frame
# And use the - to exclude a particular variable from the model
summary(gasmod1)

Call:
lm(formula = HEATRATE ~ . - ENGINE - SHAFTS, data = gasturbine)

Residuals:
     Min       1Q   Median       3Q      Max 
-1003.32  -307.35   -91.44   271.18  1405.52 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.431e+04  1.112e+03  12.869  < 2e-16 ***
RPM          8.058e-02  1.611e-02   5.002 5.25e-06 ***
CPRATIO     -6.775e+00  3.038e+01  -0.223 0.824301    
INLET.TEMP  -9.507e+00  1.529e+00  -6.217 5.33e-08 ***
EXH.TEMP     1.415e+01  3.469e+00   4.081 0.000135 ***
AIRFLOW     -2.553e+00  1.746e+00  -1.462 0.148892    
POWER        4.257e-03  4.217e-03   1.009 0.316804    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 458.8 on 60 degrees of freedom
Multiple R-squared:  0.9248,    Adjusted R-squared:  0.9173 
F-statistic:   123 on 6 and 60 DF,  p-value: < 2.2e-16
#store and round the VIF values
gasmod1vif<-round(vif(gasmod1),3)
gasmod1vif
       RPM    CPRATIO INLET.TEMP   EXH.TEMP    AIRFLOW      POWER 
     4.015      5.213     13.852      7.351     49.136     49.765 
#find the average VIF
mean(gasmod1vif)
[1] 21.55533

Yes, there is evidence of severe multicollinearity because several VIFs are much greater than 10 (inlet temp, air flow, and power) and the average VIF is greater than 3 at 21.55.

Step 2 (continued): Variable Screening

Because we have quite a few variables and severe multicollinearity, we need to address that. It is not clear from EDA what variables should remain and which variables should be removed.

We will use variable selection procedures to narrow down our quantitative variables to a best set of predictors. We will use the entry and remain significance levels of 0.15

# Note: Details=T gives all of the steps within each process
# In most cases, you can set Details=F

# backwards elimination
#Default: prem = 0.3
ols_step_backward_p(gasmod1,p_rem=0.15,details=T)
Backward Elimination Method 
---------------------------

Candidate Terms: 

1. RPM 
2. CPRATIO 
3. INLET.TEMP 
4. EXH.TEMP 
5. AIRFLOW 
6. POWER 


Step   => 0 
Model  => HEATRATE ~ RPM + CPRATIO + INLET.TEMP + EXH.TEMP + AIRFLOW + POWER 
R2     => 0.925 

Initiating stepwise selection... 

Step     => 1 
Removed  => CPRATIO 
Model    => HEATRATE ~ RPM + INLET.TEMP + EXH.TEMP + AIRFLOW + POWER 
R2       => 0.92473 

Step     => 2 
Removed  => POWER 
Model    => HEATRATE ~ RPM + INLET.TEMP + EXH.TEMP + AIRFLOW 
R2       => 0.92351 


No more variables to be removed.

Variables Removed: 

=> CPRATIO 
=> POWER 

                             Stepwise Summary                             
------------------------------------------------------------------------
Step    Variable        AIC         SBC       SBIC      R2       Adj. R2 
------------------------------------------------------------------------
 0      Full Model    1019.966    1037.604      NA    0.92479    0.91727 
 1      CPRATIO       1018.022    1033.455      NA    0.92473    0.91856 
 2      POWER         1017.095    1030.323      NA    0.92351    0.91858 
------------------------------------------------------------------------

Final Model Output 
------------------

                           Model Summary                             
--------------------------------------------------------------------
R                         0.961       RMSE                  437.803 
R-Squared                 0.924       MSE                191671.124 
Adj. R-Squared            0.919       Coef. Var               4.113 
Pred R-Squared            0.908       AIC                  1017.095 
MAE                     340.078       SBC                  1030.323 
--------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                   ANOVA                                    
---------------------------------------------------------------------------
                     Sum of                                                
                    Squares        DF     Mean Square       F         Sig. 
---------------------------------------------------------------------------
Regression    155055243.172         4    38763810.793    187.149    0.0000 
Residual       12841965.276        62      207128.472                      
Total         167897208.448        66                                      
---------------------------------------------------------------------------

                                       Parameter Estimates                                         
--------------------------------------------------------------------------------------------------
      model         Beta    Std. Error    Std. Beta       t        Sig         lower        upper 
--------------------------------------------------------------------------------------------------
(Intercept)    13617.924       813.306                  16.744    0.000    11992.148    15243.700 
        RPM        0.089         0.013        0.391      6.608    0.000        0.062        0.116 
 INLET.TEMP       -9.186         0.770       -0.791    -11.923    0.000      -10.726       -7.646 
   EXH.TEMP       14.363         2.260        0.397      6.356    0.000        9.846       18.880 
    AIRFLOW       -0.848         0.437       -0.120     -1.939    0.057       -1.721        0.026 
--------------------------------------------------------------------------------------------------
# forward selection
#default: penter = 0.3
ols_step_forward_p(gasmod1,p_enter=0.15,details=T)
Forward Selection Method 
------------------------

Candidate Terms: 

1. RPM 
2. CPRATIO 
3. INLET.TEMP 
4. EXH.TEMP 
5. AIRFLOW 
6. POWER 


Step   => 0 
Model  => HEATRATE ~ 1 
R2     => 0 

Initiating stepwise selection... 

                     Selection Metrics Table                      
-----------------------------------------------------------------
Predictor     Pr(>|t|)    R-Squared    Adj. R-Squared      AIC    
-----------------------------------------------------------------
RPM            0.00000        0.712             0.708    1099.849 
INLET.TEMP     0.00000        0.641             0.635    1114.713 
CPRATIO        0.00000        0.540             0.533    1131.347 
AIRFLOW        0.00000        0.494             0.487    1137.639 
POWER          0.00000        0.486             0.478    1138.758 
EXH.TEMP       0.00959        0.099             0.085    1176.358 
-----------------------------------------------------------------

Step      => 1 
Selected  => RPM 
Model     => HEATRATE ~ RPM 
R2        => 0.712 

                     Selection Metrics Table                      
-----------------------------------------------------------------
Predictor     Pr(>|t|)    R-Squared    Adj. R-Squared      AIC    
-----------------------------------------------------------------
INLET.TEMP     0.00000        0.873             0.869    1047.326 
CPRATIO        0.00000        0.848             0.843    1059.194 
POWER          0.00048        0.763             0.755    1088.995 
EXH.TEMP       0.00860        0.742             0.734    1094.565 
AIRFLOW        0.00991        0.741             0.733    1094.833 
-----------------------------------------------------------------

Step      => 2 
Selected  => INLET.TEMP 
Model     => HEATRATE ~ RPM + INLET.TEMP 
R2        => 0.873 

                    Selection Metrics Table                      
----------------------------------------------------------------
Predictor    Pr(>|t|)    R-Squared    Adj. R-Squared      AIC    
----------------------------------------------------------------
EXH.TEMP      0.00000        0.919             0.915    1019.041 
CPRATIO         6e-05        0.902             0.897    1032.019 
AIRFLOW       0.44970        0.874             0.868    1048.714 
POWER         0.46919        0.874             0.868    1048.764 
----------------------------------------------------------------

Step      => 3 
Selected  => EXH.TEMP 
Model     => HEATRATE ~ RPM + INLET.TEMP + EXH.TEMP 
R2        => 0.919 

                    Selection Metrics Table                      
----------------------------------------------------------------
Predictor    Pr(>|t|)    R-Squared    Adj. R-Squared      AIC    
----------------------------------------------------------------
AIRFLOW       0.05701        0.924             0.919    1017.095 
POWER         0.11395        0.922             0.917    1018.319 
CPRATIO       0.88511        0.919             0.914    1021.018 
----------------------------------------------------------------

Step      => 4 
Selected  => AIRFLOW 
Model     => HEATRATE ~ RPM + INLET.TEMP + EXH.TEMP + AIRFLOW 
R2        => 0.924 

                    Selection Metrics Table                      
----------------------------------------------------------------
Predictor    Pr(>|t|)    R-Squared    Adj. R-Squared      AIC    
----------------------------------------------------------------
POWER         0.32494        0.925             0.919    1018.022 
CPRATIO       0.99054        0.924             0.917    1019.095 
----------------------------------------------------------------


No more variables to be added.

Variables Selected: 

=> RPM 
=> INLET.TEMP 
=> EXH.TEMP 
=> AIRFLOW 

                             Stepwise Summary                             
------------------------------------------------------------------------
Step    Variable        AIC         SBC       SBIC      R2       Adj. R2 
------------------------------------------------------------------------
 0      Base Model    1181.327    1185.737      NA    0.00000    0.00000 
 1      RPM           1099.849    1106.463      NA    0.71233    0.70791 
 2      INLET.TEMP    1047.326    1056.145      NA    0.87251    0.86853 
 3      EXH.TEMP      1019.041    1030.064      NA    0.91887    0.91501 
 4      AIRFLOW       1017.095    1030.323      NA    0.92351    0.91858 
------------------------------------------------------------------------

Final Model Output 
------------------

                           Model Summary                             
--------------------------------------------------------------------
R                         0.961       RMSE                  437.803 
R-Squared                 0.924       MSE                191671.124 
Adj. R-Squared            0.919       Coef. Var               4.113 
Pred R-Squared            0.908       AIC                  1017.095 
MAE                     340.078       SBC                  1030.323 
--------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                   ANOVA                                    
---------------------------------------------------------------------------
                     Sum of                                                
                    Squares        DF     Mean Square       F         Sig. 
---------------------------------------------------------------------------
Regression    155055243.172         4    38763810.793    187.149    0.0000 
Residual       12841965.276        62      207128.472                      
Total         167897208.448        66                                      
---------------------------------------------------------------------------

                                       Parameter Estimates                                         
--------------------------------------------------------------------------------------------------
      model         Beta    Std. Error    Std. Beta       t        Sig         lower        upper 
--------------------------------------------------------------------------------------------------
(Intercept)    13617.924       813.306                  16.744    0.000    11992.148    15243.700 
        RPM        0.089         0.013        0.391      6.608    0.000        0.062        0.116 
 INLET.TEMP       -9.186         0.770       -0.791    -11.923    0.000      -10.726       -7.646 
   EXH.TEMP       14.363         2.260        0.397      6.356    0.000        9.846       18.880 
    AIRFLOW       -0.848         0.437       -0.120     -1.939    0.057       -1.721        0.026 
--------------------------------------------------------------------------------------------------
# stepwise regression
#Default: pent = 0.1, prem = 0.3
ols_step_both_p(gasmod1,p_ent=0.15,p_rem=0.15,details=T)
Stepwise Selection Method 
-------------------------

Candidate Terms: 

1. RPM 
2. CPRATIO 
3. INLET.TEMP 
4. EXH.TEMP 
5. AIRFLOW 
6. POWER 


Step   => 0 
Model  => HEATRATE ~ 1 
R2     => 0 

Initiating stepwise selection... 

Step      => 1 
Selected  => RPM 
Model     => HEATRATE ~ RPM 
R2        => 0.712 

Step      => 2 
Selected  => INLET.TEMP 
Model     => HEATRATE ~ RPM + INLET.TEMP 
R2        => 0.873 

Step      => 3 
Selected  => EXH.TEMP 
Model     => HEATRATE ~ RPM + INLET.TEMP + EXH.TEMP 
R2        => 0.919 

Step      => 4 
Selected  => AIRFLOW 
Model     => HEATRATE ~ RPM + INLET.TEMP + EXH.TEMP + AIRFLOW 
R2        => 0.924 


No more variables to be added or removed.

                               Stepwise Summary                               
----------------------------------------------------------------------------
Step    Variable            AIC         SBC       SBIC      R2       Adj. R2 
----------------------------------------------------------------------------
 0      Base Model        1181.327    1185.737      NA    0.00000    0.00000 
 1      RPM (+)           1099.849    1106.463      NA    0.71233    0.70791 
 2      INLET.TEMP (+)    1047.326    1056.145      NA    0.87251    0.86853 
 3      EXH.TEMP (+)      1019.041    1030.064      NA    0.91887    0.91501 
 4      AIRFLOW (+)       1017.095    1030.323      NA    0.92351    0.91858 
----------------------------------------------------------------------------

Final Model Output 
------------------

                           Model Summary                             
--------------------------------------------------------------------
R                         0.961       RMSE                  437.803 
R-Squared                 0.924       MSE                191671.124 
Adj. R-Squared            0.919       Coef. Var               4.113 
Pred R-Squared            0.908       AIC                  1017.095 
MAE                     340.078       SBC                  1030.323 
--------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                   ANOVA                                    
---------------------------------------------------------------------------
                     Sum of                                                
                    Squares        DF     Mean Square       F         Sig. 
---------------------------------------------------------------------------
Regression    155055243.172         4    38763810.793    187.149    0.0000 
Residual       12841965.276        62      207128.472                      
Total         167897208.448        66                                      
---------------------------------------------------------------------------

                                       Parameter Estimates                                         
--------------------------------------------------------------------------------------------------
      model         Beta    Std. Error    Std. Beta       t        Sig         lower        upper 
--------------------------------------------------------------------------------------------------
(Intercept)    13617.924       813.306                  16.744    0.000    11992.148    15243.700 
        RPM        0.089         0.013        0.391      6.608    0.000        0.062        0.116 
 INLET.TEMP       -9.186         0.770       -0.791    -11.923    0.000      -10.726       -7.646 
   EXH.TEMP       14.363         2.260        0.397      6.356    0.000        9.846       18.880 
    AIRFLOW       -0.848         0.437       -0.120     -1.939    0.057       -1.721        0.026 
--------------------------------------------------------------------------------------------------
  • Backwards elimination: rpm, inlet.temp, exh.temp, airflow
  • Forward Selection: rpm, inlet.temp, exh.temp, airflow
  • Stepwise Regression: rpm, inlet.temp, exh.temp, airflow
  • In this instance all three procedures resulted in the same subset of variables. This may not always happen.
  • We will use this set of predictors to fit a model and determine if our problem of severe multicollinearity has been resolved.
#updated model
gasmod2<-lm(HEATRATE~.-ENGINE-SHAFTS-POWER-CPRATIO,data=gasturbine)
# Syntax Note: We can use the . to indicate all the variables in the data frame
# And use the - to exclude a particular variable from the model
summary(gasmod2)

Call:
lm(formula = HEATRATE ~ . - ENGINE - SHAFTS - POWER - CPRATIO, 
    data = gasturbine)

Residuals:
    Min      1Q  Median      3Q     Max 
-1007.7  -290.5  -106.0   240.1  1414.8 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.362e+04  8.133e+02  16.744  < 2e-16 ***
RPM          8.882e-02  1.344e-02   6.608 1.02e-08 ***
INLET.TEMP  -9.186e+00  7.704e-01 -11.923  < 2e-16 ***
EXH.TEMP     1.436e+01  2.260e+00   6.356 2.76e-08 ***
AIRFLOW     -8.475e-01  4.370e-01  -1.939    0.057 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 455.1 on 62 degrees of freedom
Multiple R-squared:  0.9235,    Adjusted R-squared:  0.9186 
F-statistic: 187.1 on 4 and 62 DF,  p-value: < 2.2e-16
gasmod2vif<-round(vif(gasmod2),3)
gasmod2vif
       RPM INLET.TEMP   EXH.TEMP    AIRFLOW 
     2.840      3.572      3.170      3.128 
mean(gasmod2vif)
[1] 3.1775

The average VIF is slightly greater than 3, and there are not individual VIFs greater than 10. We conclude this is no longer a severe issue and we can continue with our analysis. Moving forward, we might consider adding the remaining qualitative variables, interactions, and higher order (or variable transformations). We then would go forward and assess the model.

Unit 3.2: Residuals and Assumptions

Step 3: Estimate the Model Parameters

Reminder we looked at multicollinearity and variable screening to end up with a starting subset of predictors of: RPM, INLET.TEMP,EXH.TEMP, and AIRFLOW.

#updated model
gasmod2<-lm(HEATRATE~.-ENGINE-SHAFTS-POWER-CPRATIO,data=gasturbine)
# Syntax Note: We can use the . to indicate all the variables in the data frame
# And use the - to exclude a particular variable from the model
summary(gasmod2)

Call:
lm(formula = HEATRATE ~ . - ENGINE - SHAFTS - POWER - CPRATIO, 
    data = gasturbine)

Residuals:
    Min      1Q  Median      3Q     Max 
-1007.7  -290.5  -106.0   240.1  1414.8 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.362e+04  8.133e+02  16.744  < 2e-16 ***
RPM          8.882e-02  1.344e-02   6.608 1.02e-08 ***
INLET.TEMP  -9.186e+00  7.704e-01 -11.923  < 2e-16 ***
EXH.TEMP     1.436e+01  2.260e+00   6.356 2.76e-08 ***
AIRFLOW     -8.475e-01  4.370e-01  -1.939    0.057 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 455.1 on 62 degrees of freedom
Multiple R-squared:  0.9235,    Adjusted R-squared:  0.9186 
F-statistic: 187.1 on 4 and 62 DF,  p-value: < 2.2e-16

Try adding qualitative variable Engine

#updated model
gasmod3<-lm(HEATRATE~.-SHAFTS-POWER-CPRATIO,data=gasturbine)
# Syntax Note: We can use the . to indicate all the variables in the data frame
# And use the - to exclude a particular variable from the model
summary(gasmod3)

Call:
lm(formula = HEATRATE ~ . - SHAFTS - POWER - CPRATIO, data = gasturbine)

Residuals:
     Min       1Q   Median       3Q      Max 
-1002.38  -257.50   -60.08   252.72  1397.12 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.537e+04  1.284e+03  11.968  < 2e-16 ***
ENGINEAeroderiv   -4.245e+02  2.806e+02  -1.513   0.1356    
ENGINETraditional -3.772e+02  2.183e+02  -1.728   0.0891 .  
RPM                9.065e-02  1.459e-02   6.214 5.38e-08 ***
INLET.TEMP        -9.908e+00  9.161e-01 -10.816 1.01e-15 ***
EXH.TEMP           1.314e+01  2.377e+00   5.530 7.36e-07 ***
AIRFLOW           -8.534e-01  4.338e-01  -1.967   0.0538 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 450.8 on 60 degrees of freedom
Multiple R-squared:  0.9274,    Adjusted R-squared:  0.9201 
F-statistic: 127.7 on 6 and 60 DF,  p-value: < 2.2e-16

Perform nested F test to determine if Engine type is significant.

#reduced model
redgasmod3<-lm(HEATRATE~.-ENGINE-SHAFTS-POWER-CPRATIO,data=gasturbine)
#anova(reduced,complete)
anova(redgasmod3,gasmod3)
  • Hypotheses:
    • \(H_0: \beta_1=\beta_2=0\) (the engine type does not contribute to predicting heat rate)
    • \(H_a:\beta_1, \beta_2 \neq 0\) (the engine type contributes to predicting heat rate)
  • Distribution of test statistic: F 2 with 60 DF
  • Test Statistic: F=1.603
  • Pvalue: 0.2098
  • Decision: 0.2098>0.05 -> FAIL TO REJECT H0
  • Conclusion: The engine type is not significant at predicting heat rate. We will remove both dummy variables in the model and not test them individually.

For the sake of this example, we will not consider any further testing.

Step 4: Specify the distribution of the errors and find the estimate of the variance

We have covered this individually and will come back to it in a complete example at the end of the unit.

Step 5: Evaluate the Utility of the model

We have covered this individually and will come back to it in a complete example at the end of the unit.

Step 6: Check the Model Assumptions

First we will evaluate model assumptions

#There are a few options to view plots for assumptions

#Residuals Plots of explanatory variables vs residuals
residualPlots(redgasmod3,tests=F)

#Residual vs Fitted and QQ plot
plot(redgasmod3, which=c(1,2))

#histogram of residuals
hist(residuals(redgasmod3))

  • Lack of Fit: Residual Plots
  • Constant Variance: Residual Plots and Residual vs Fitted
  • Normality: QQ Plot and Histogram of Residuals
  • Independence: We do not have time series data. For the sake of this calss we will not explore this assumption other than stating we do not believe the observations are dependent on each other.

Unit 3.3 Influential Diagnostics

Then we will identify outliers and influential observations.

#Cooks Distance Thresholds
plot(redgasmod3,which=4)

#Leverage vs Studentized Residuals
influencePlot(redgasmod3,fill=F)
#Deleted Studentized Residuals vs PRedicted values
ols_plot_resid_stud_fit(redgasmod3)

#We can use various functions to store and view these statistics for all observations
rstudent(redgasmod3)
            1             2             3             4             5 
 0.8747974427  1.5560964778 -0.7996856609 -1.7098460010 -0.5940324181 
            6             7             8             9            10 
-0.3412340709 -0.2931152339  0.4972900835 -0.2354706794 -0.3311493834 
           11            12            13            14            15 
 3.4522455825  1.9210862780 -0.5410136060 -0.1288260698  0.8352669389 
           16            17            18            19            20 
-0.6395357114 -0.7796092939 -0.8579938744 -0.4185245333  0.3666865968 
           21            22            23            24            25 
 1.1206603654  0.3672529549 -0.4375840299 -0.6539765902 -0.1689423085 
           26            27            28            29            30 
-0.1441743577 -0.5847991132  0.2047194030  0.0464239138 -0.9006937286 
           31            32            33            34            35 
-0.9227536301 -2.3783545874  0.3188332240 -0.1362454337 -1.2230894198 
           36            37            38            39            40 
 2.5890770233  0.2898046600  0.3710733462 -1.6844836671 -0.8878650471 
           41            42            43            44            45 
-0.1370180599 -0.3583036352 -0.6634539981 -0.2406050715  1.9256194391 
           46            47            48            49            50 
-0.0006913961  2.5280837055  0.8806886673 -0.4450685015  0.8795627566 
           51            52            53            54            55 
 0.0180834913  1.2394007710  1.1312075637 -0.4108297552  0.5718074367 
           56            57            58            59            60 
-0.6844366633 -0.2384435371 -0.7815297552  0.5834226891 -0.7829079484 
           61            62            63            64            65 
 1.2699005058 -1.1968672345 -0.8411249929  0.7749438168 -0.6789274575 
           66            67 
 0.4408135249 -0.6462413281 
hatvalues(redgasmod3)
         1          2          3          4          5          6          7 
0.18516671 0.09055176 0.05058714 0.04222701 0.04740169 0.02442905 0.02753493 
         8          9         10         11         12         13         14 
0.03680293 0.03666050 0.06343149 0.04641455 0.05362948 0.08048917 0.04411916 
        15         16         17         18         19         20         21 
0.09715741 0.06230112 0.05045614 0.05908423 0.04799477 0.04943415 0.09104218 
        22         23         24         25         26         27         28 
0.02774620 0.03464087 0.04124602 0.02928731 0.06436712 0.06151213 0.11866326 
        29         30         31         32         33         34         35 
0.10176951 0.03312873 0.03305905 0.06825286 0.06162853 0.04263042 0.04149892 
        36         37         38         39         40         41         42 
0.15868147 0.05313439 0.09215184 0.04256770 0.06449028 0.08767073 0.05038647 
        43         44         45         46         47         48         49 
0.09087757 0.07303413 0.15265733 0.09438350 0.08759824 0.07018133 0.05939823 
        50         51         52         53         54         55         56 
0.07612455 0.07252967 0.13584379 0.05570074 0.08920490 0.03044303 0.03783830 
        57         58         59         60         61         62         63 
0.04271859 0.05120118 0.07468712 0.06875776 0.28843679 0.19696150 0.05017137 
        64         65         66         67 
0.27321053 0.11685791 0.07469814 0.04105443 
cooks.distance(redgasmod3)
           1            2            3            4            5            6 
3.491295e-02 4.713870e-02 6.854665e-03 2.500359e-02 3.548878e-03 5.915837e-04 
           7            8            9           10           11           12 
4.938184e-04 1.913029e-03 4.285386e-04 1.507041e-03 9.864721e-02 4.008823e-02 
          13           14           15           16           17           18 
5.183343e-03 1.556699e-04 1.508924e-02 5.487211e-03 6.500385e-03 9.284774e-03 
          19           20           21           22           23           24 
1.789958e-03 1.418308e-03 2.505464e-02 7.807054e-04 1.392366e-03 3.714128e-03 
          25           26           27           28           29           30 
1.749663e-04 2.905886e-04 4.531159e-03 1.146267e-03 4.963521e-05 5.576280e-03 
          31           32           33           34           35           36 
5.836246e-03 7.708231e-02 1.354886e-03 1.679746e-04 1.285081e-02 2.315621e-01 
          37           38           39           40           41           42 
9.567366e-04 2.834803e-03 2.450483e-02 1.090574e-02 3.666206e-04 1.381810e-03 
          43           44           45           46           47           48 
8.880251e-03 9.262997e-04 1.280157e-01 1.012739e-08 1.129043e-01 1.175094e-02 
          49           50           51           52           53           54 
2.534576e-03 1.279568e-02 5.198401e-06 4.788085e-02 1.502836e-02 3.351071e-03 
          55           56           57           58           59           60 
2.075794e-03 3.716375e-03 5.152710e-04 6.633792e-03 5.553912e-03 9.108151e-03 
          61           62           63           64           65           66 
1.294602e-01 6.978273e-02 7.509579e-03 4.544294e-02 1.230543e-02 3.178680e-03 
          67 
3.609809e-03 
dffits(redgasmod3)
            1             2             3             4             5 
 0.4170177822  0.4910163704 -0.1845915466 -0.3590218226 -0.1325110636 
            6             7             8             9            10 
-0.0539978261 -0.0493223042  0.0972060771 -0.0459352749 -0.0861800315 
           11            12            13            14            15 
 0.7616385631  0.4573177718 -0.1600657725 -0.0276767751  0.2740042045 
           16            17            18            19            20 
-0.1648469880 -0.1797117702 -0.2150030447 -0.0939719424  0.0836214074 
           21            22            23            24            25 
 0.3546691948  0.0620407828 -0.0828917707 -0.1356437040 -0.0293449126 
           26            27            28            29            30 
-0.0378152804 -0.1497175656  0.0751183771  0.0156263192 -0.1667228743 
           31            32            33            34            35 
-0.1706204055 -0.6437066706  0.0817084707 -0.0287502728 -0.2544956112 
           36            37            38            39            40 
 1.1244181724  0.0686513425  0.1182238653 -0.3551835576 -0.2331148430 
           41            42            43            44            45 
-0.0424746021 -0.0825343464 -0.2097625635 -0.0675360767  0.8173343575 
           46            47            48            49            50 
-0.0002232044  0.7833323891  0.2419547113 -0.1118434960  0.2524773824 
           51            52            53            54            55 
 0.0050569669  0.4914004305  0.2747374976 -0.1285718521  0.1013227680 
           56            57            58            59            60 
-0.1357297558 -0.0503702616 -0.1815509201  0.1657531823 -0.2127356355 
           61            62            63            64            65 
 0.8085153666 -0.5927457707 -0.1933151549  0.4751323690 -0.2469659817 
           66            67 
 0.1252472215 -0.1337142764 
dfbetas(redgasmod3)
     (Intercept)           RPM    INLET.TEMP      EXH.TEMP       AIRFLOW
1  -2.279223e-01  1.957424e-01 -1.714323e-02  1.546567e-01 -1.703446e-02
2   4.051086e-01  1.641254e-01  1.763526e-02 -3.046695e-01  1.553495e-01
3   5.328342e-02 -9.167441e-02 -7.190908e-02  1.855818e-02  3.730057e-02
4  -2.298860e-01  1.281016e-02  2.404519e-02  1.222945e-01  3.582115e-02
5   7.090661e-02  1.586245e-02  8.533875e-03 -6.592543e-02  8.135270e-02
6  -1.074914e-02  1.733130e-02 -1.113851e-02  1.021559e-02  2.050504e-02
7  -1.729453e-02  4.882113e-03 -1.901343e-02  2.297319e-02  8.291715e-03
8   6.892439e-02 -1.309476e-02 -2.380980e-03 -4.298349e-02  2.920960e-02
9  -2.384423e-02  1.582593e-02  2.055320e-02 -9.380213e-04 -1.393464e-02
10 -5.215280e-02 -1.201412e-02  1.712361e-02  2.701556e-02 -6.443648e-02
11 -6.310142e-03  2.396598e-01 -1.793847e-01  1.394825e-01 -4.139894e-02
12 -1.263191e-02 -2.386408e-01 -2.793844e-01  2.709710e-01 -2.384336e-01
13  1.075038e-01 -3.471785e-02 -5.779328e-02 -3.575671e-02  7.810913e-02
14 -1.403894e-02  7.576035e-03 -2.217761e-03  8.828123e-03  9.381602e-03
15  1.371472e-01 -1.717714e-01 -2.002165e-01  8.220513e-02 -4.553760e-02
16  3.302202e-02  1.321499e-01  8.525896e-02 -1.111006e-01  1.075346e-01
17 -3.523563e-02  1.207868e-01  1.319771e-01 -9.251472e-02  2.256074e-02
18 -6.259175e-02  9.740242e-02  1.603259e-01 -8.742298e-02 -4.892268e-02
19 -2.612367e-02  6.633212e-03  3.814709e-02 -9.256823e-03 -5.423808e-02
20 -1.137406e-02  3.130572e-02  5.304719e-02 -3.056345e-02 -2.533875e-02
21 -5.406246e-02 -2.970979e-01 -1.106295e-01  1.774639e-01 -2.820510e-01
22  1.580971e-02 -3.234500e-02 -2.955686e-03 -1.044288e-03 -2.631145e-02
23 -4.026990e-02  2.925503e-02  4.101491e-02 -6.763031e-03 -2.338916e-02
24 -2.088700e-02  7.622221e-02  9.300848e-02 -6.593646e-02 -3.339925e-03
25 -1.557716e-02  4.125643e-03  4.827110e-03  6.531960e-03 -1.175689e-02
26 -2.668523e-02  1.525396e-03  1.404215e-02  8.497781e-03 -2.354488e-02
27 -7.031641e-02  1.483705e-02  7.699529e-02 -7.210947e-03 -9.022251e-02
28  1.826501e-02  2.663362e-02 -3.881407e-02  1.354813e-02  2.231275e-02
29  1.310619e-02  5.340019e-03 -1.303871e-03 -8.565375e-03  5.997017e-03
30 -1.255556e-02 -1.628429e-02  3.662579e-02 -2.602052e-02  3.988996e-02
31 -1.533375e-02 -5.853538e-02 -3.706078e-02  3.538696e-02  3.146085e-02
32 -1.951174e-01 -3.749007e-01 -4.055211e-01  4.590853e-01 -5.068469e-02
33  5.296540e-02 -4.558618e-04  2.263107e-02 -4.981505e-02 -8.744398e-03
34 -1.166616e-02  4.563944e-03 -7.686496e-03  1.145432e-02  1.054809e-02
35 -8.877626e-02 -5.853790e-02 -1.262739e-01  1.486379e-01  4.230838e-02
36  2.151594e-01 -8.028265e-01 -9.902407e-01  7.095892e-01 -2.819409e-01
37  3.479171e-02 -3.456257e-02 -4.723973e-02  1.600133e-02  9.019337e-03
38  9.345339e-02 -1.616277e-02 -5.802030e-02 -2.118574e-02  6.274143e-02
39 -2.023247e-01  7.164835e-03 -4.998667e-02  1.572170e-01  6.491780e-02
40  1.015845e-01  5.409207e-02 -1.260661e-01  1.891516e-03  1.601236e-01
41  2.967446e-02  2.513911e-02  4.987646e-03 -2.949294e-02  3.128177e-02
42  5.060180e-02  2.436244e-02  9.652842e-03 -4.654654e-02  1.209514e-02
43  1.607071e-01  7.481794e-02  2.995105e-02 -1.469370e-01  6.740431e-02
44  1.218980e-02 -9.061398e-03  9.873985e-03 -1.290838e-02 -3.699955e-02
45  8.082375e-02  4.844164e-01  5.031571e-01 -5.088269e-01  5.267458e-01
46  9.039503e-05 -5.697722e-05 -3.956386e-05 -2.167776e-05 -9.525738e-05
47 -4.395842e-01 -3.259929e-01  2.277366e-01  2.277862e-01 -6.650165e-01
48 -1.356079e-01 -9.108045e-02  7.774062e-02  6.348874e-02 -1.952836e-01
49  7.499536e-02  3.328899e-02  1.209338e-02 -6.687915e-02  2.125665e-02
50 -9.344513e-02  8.676330e-02  1.375736e-01 -4.954967e-02  7.664740e-02
51 -2.482939e-04  1.638301e-03  5.911439e-04 -6.125042e-04  3.357212e-03
52 -1.845437e-04  2.724614e-01  2.243131e-01 -2.147092e-01  3.408524e-01
53 -4.474193e-02 -1.364584e-02  1.972718e-01 -9.954898e-02 -1.145640e-01
54  9.333002e-02  6.713827e-02 -4.523703e-03 -7.641058e-02  9.796458e-02
55  2.831006e-03  1.272207e-02  4.103671e-02 -3.308989e-02  3.018857e-02
56  5.606741e-02 -2.217269e-03 -5.264625e-02 -1.607383e-03 -4.903711e-03
57  2.827113e-02  5.395006e-03 -1.008651e-02 -1.357048e-02  2.960642e-03
58  4.125379e-02  4.983561e-02  8.206200e-02 -9.466783e-02 -3.937500e-02
59 -6.299498e-02  5.468502e-02  9.553217e-02 -3.448811e-02  4.356264e-02
60  3.272354e-03 -5.313130e-02  5.875525e-03  6.155496e-03 -1.425210e-01
61  1.622022e-01  6.287909e-01 -7.737918e-02 -1.387786e-01  4.375216e-01
62  9.743448e-02 -3.875700e-01  8.006443e-02 -8.228089e-02 -1.681696e-01
63  2.524545e-02 -8.294340e-02  1.213881e-02 -2.426282e-02  1.233960e-02
64  1.699380e-01  1.314209e-01  3.868923e-01 -4.113384e-01  1.441854e-02
65 -1.208286e-01  1.624592e-02 -1.325716e-01  1.707134e-01  4.451658e-02
66 -7.411863e-02  3.233025e-02  5.996087e-02  9.769816e-03 -5.992088e-02
67  3.251465e-02  1.728521e-02  4.640075e-02 -6.703951e-02  6.052772e-02
#We can subset by the observations of interest
dffits(redgasmod3)[c(11,36,61,64)]
       11        36        61        64 
0.7616386 1.1244182 0.8085154 0.4751324 
dfbetas(redgasmod3)[c(11,36,61,64),]
    (Intercept)        RPM  INLET.TEMP   EXH.TEMP     AIRFLOW
11 -0.006310142  0.2396598 -0.17938469  0.1394825 -0.04139894
36  0.215159421 -0.8028265 -0.99024074  0.7095892 -0.28194088
61  0.162202163  0.6287909 -0.07737918 -0.1387786  0.43752165
64  0.169938038  0.1314209  0.38689228 -0.4113384  0.01441854
summary(dfbetas(redgasmod3))
  (Intercept)              RPM              INLET.TEMP        
 Min.   :-0.4395842   Min.   :-0.802826   Min.   :-0.9902407  
 1st Qu.:-0.0377528   1st Qu.:-0.016223   1st Qu.:-0.0379374  
 Median :-0.0001845   Median : 0.007165   Median : 0.0085339  
 Mean   : 0.0012984   Mean   : 0.001273   Mean   :-0.0004318  
 3rd Qu.: 0.0531244   3rd Qu.: 0.041562   3rd Qu.: 0.0497240  
 Max.   : 0.4051086   Max.   : 0.628791   Max.   : 0.5031571  
    EXH.TEMP             AIRFLOW         
 Min.   :-0.5088269   Min.   :-0.665017  
 1st Qu.:-0.0578702   1st Qu.:-0.038187  
 Median :-0.0072109   Median : 0.009019  
 Mean   :-0.0006093   Mean   : 0.001198  
 3rd Qu.: 0.0172798   3rd Qu.: 0.041099  
 Max.   : 0.7095892   Max.   : 0.526746  
  • Cook’s Distance: Identifies influential observations. Threshold is 4/n
  • Leverage (Hat): Identifies outliers in the x direction (extreme vaues of explanatory variables). Threshold is 2(k+1)/n
  • Studentizied Residuals: Identifies outliers in the y direction. Threshold is +/- 2
  • Circles on “influence plot” are indicative of the cook’s distance value
  • Deleted Studentizied Residuals (R Student): Identifies general outliers and influential observations. Threshold is +/- 2

Next Steps to resolve

Variable transformation: We can directly use a function of the response variable within lm() or add a new variable to the table

# directly transform in the lm() function

gasmodsqrt<-lm(sqrt(HEATRATE)~.-ENGINE-SHAFTS-POWER-CPRATIO,data=gasturbine)
summary(gasmodsqrt)

Call:
lm(formula = sqrt(HEATRATE) ~ . - ENGINE - SHAFTS - POWER - CPRATIO, 
    data = gasturbine)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.3749 -1.3484 -0.3872  1.1073  6.1302 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.172e+02  3.616e+00  32.409  < 2e-16 ***
RPM          3.631e-04  5.976e-05   6.076 8.28e-08 ***
INLET.TEMP  -4.344e-02  3.425e-03 -12.683  < 2e-16 ***
EXH.TEMP     6.904e-02  1.005e-02   6.872 3.58e-09 ***
AIRFLOW     -5.240e-03  1.943e-03  -2.697    0.009 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.023 on 62 degrees of freedom
Multiple R-squared:  0.9289,    Adjusted R-squared:  0.9243 
F-statistic: 202.6 on 4 and 62 DF,  p-value: < 2.2e-16
# OR Create new variable
gasturbine$sqrty<-sqrt(gasturbine$HEATRATE)

#remember to remove original response heatrate
gasmodsqrt2<-lm(sqrty~.-ENGINE-SHAFTS-POWER-CPRATIO-HEATRATE,data=gasturbine)
summary(gasmodsqrt2)

Call:
lm(formula = sqrty ~ . - ENGINE - SHAFTS - POWER - CPRATIO - 
    HEATRATE, data = gasturbine)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.3749 -1.3484 -0.3872  1.1073  6.1302 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.172e+02  3.616e+00  32.409  < 2e-16 ***
RPM          3.631e-04  5.976e-05   6.076 8.28e-08 ***
INLET.TEMP  -4.344e-02  3.425e-03 -12.683  < 2e-16 ***
EXH.TEMP     6.904e-02  1.005e-02   6.872 3.58e-09 ***
AIRFLOW     -5.240e-03  1.943e-03  -2.697    0.009 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.023 on 62 degrees of freedom
Multiple R-squared:  0.9289,    Adjusted R-squared:  0.9243 
F-statistic: 202.6 on 4 and 62 DF,  p-value: < 2.2e-16

Removing observations from analysis by subsetting the data

#We can remove particular observations
subsetgas<-gasturbine[-c(11,36,61,64),]
gasmod2obs<-lm(HEATRATE~.-ENGINE-SHAFTS-POWER-CPRATIO-sqrty,data=subsetgas)
# Syntax Note: We can use the . to indicate all the variables in the data frame
# And use the - to exclude a particular variable from the model
summary(gasmod2obs)

Call:
lm(formula = HEATRATE ~ . - ENGINE - SHAFTS - POWER - CPRATIO - 
    sqrty, data = subsetgas)

Residuals:
    Min      1Q  Median      3Q     Max 
-957.18 -214.19  -93.73  233.18 1030.40 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.325e+04  7.218e+02  18.361  < 2e-16 ***
RPM          8.526e-02  1.370e-02   6.226 5.77e-08 ***
INLET.TEMP  -8.347e+00  8.048e-01 -10.372 7.88e-15 ***
EXH.TEMP     1.320e+01  2.316e+00   5.698 4.26e-07 ***
AIRFLOW     -9.429e-01  4.004e-01  -2.355   0.0219 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 391.6 on 58 degrees of freedom
Multiple R-squared:  0.9231,    Adjusted R-squared:  0.9178 
F-statistic:   174 on 4 and 58 DF,  p-value: < 2.2e-16