6 MLR Model Evaluation

I am working through the material in the Penn State online class STAT 501 Regression Methods. This R script is a personal exercise to translate the concepts and examples illustrated in Minitab into R and SAS. Lesson 6 in STAT 501 is MLR Model Evaluation. The SAS version of the code below is posted on my blog.

Overview

For the MLR model, there are several hypotheses tests: a t-test for partial F-test for $H_0: \beta_k=0$; an overall F-test for $H_0: \beta_1=...=\beta_{p-1}=0$; and a partial F-test for any subset of parameters.

This lesson also discusses the partial $R^2$, and the MLR version of the lack of fit test.

R Concepts Worth Noting

Rename a column by name instead of by index. Use names with a filter: names(coolhearts)[names(coolhearts)=="Inf."] <- "Infarc".
Create multiple series in ggplot with the color= aesthetic. Note that the value must be a factor variable. That raises another question - how do you declare a factor variable??? factor(coolhearts$Group, labels = c("Early", "Late", "Control")).

Data Management for this Module

library(readr)
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(alr3)

## Warning: package 'alr3' was built under R version 3.4.3

## Loading required package: car

## Warning: package 'car' was built under R version 3.4.3

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

library(reshape2)

## Warning: package 'reshape2' was built under R version 3.4.3

# coolhearts is a tab-delimited data file.
# size of infarcted area and region at risk of n=32 rabbits in k=3 temperature groups.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/coolhearts.txt'
coolhearts <- read.table(url, sep = '\t', header = TRUE, fileEncoding = "UTF-16LE")
# get rid of "." from column "Inf.", then rename to "Infarc" because "Inf" is reserved.
names(coolhearts)[names(coolhearts)=="Inf."] <- "Infarc"
coolhearts$Group <- factor(coolhearts$Group, labels = c("Early", "Late", "Control"))
# head(coolhearts)
# summarise(coolhearts, n=n())

# alcoholarm is a tab-delimited data file.
# Total lifetime consumption of alcohol and deltoid muscle strength of nondominant arm for n=50 alcoholic men.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/alcoholarm.txt'
alcoholarm <- read.table(url, sep = '\t', header = TRUE)
#head(alcoholarm)
#summarise(alcoholarm, n=n())

# allenstest is a tab-delimited data file.
# scores in Allen Cognitive Level (ACL), vocabulary, abstraction, and symbol-Digit Modalities Test (SDMT) for a sample of n=69 patients.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/allentest.txt'
allenstest <- read.table(url, sep = '\t', header = TRUE, fileEncoding = "UTF-16LE")
#head(allenstest)
#summarise(allenstest, n=n())

# peru is a tab-delimited data file.
# Variables possibly relating to blood pressures of n = 39 Peruvians.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/peru.txt'
peru <- read.table(url, sep = '\t', header = TRUE)
peru$FracLife = peru$Years / peru$Age
#head(peru)
#summarise(peru, n=n())

# anscombe is a fixed-width data file.
# 
physical <- read_fwf(
  file="https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/Physical.txt", 
  skip = 1,
  fwf_widths(c(3, 9, 8, 6, 9, 7, 8, 7, 10, 6, 3), c('Obs', 'Sex', 'Height', 'LeftArm', 'RtArm', 'LeftFoot', 'RtFoot', 'LeftHand', 'RtHand', 'HeadCirc', 'nose')))

## Parsed with column specification:
## cols(
##   Obs = col_integer(),
##   Sex = col_character(),
##   Height = col_double(),
##   LeftArm = col_double(),
##   RtArm = col_double(),
##   LeftFoot = col_double(),
##   RtFoot = col_double(),
##   LeftHand = col_double(),
##   RtHand = col_double(),
##   HeadCirc = col_double(),
##   nose = col_double()
## )

#head(physical)
#summarise(physical, n=n())

6.1 - Three Types of Hypotheses

Consider this research question. Does the mean size of the infarcted area of a rabbit subjected to a heart attack differ among the three treatment groups - no cooling, early cooling, and late cooling - when controlling for the size of the region at risk for infarction?

The researcher fits the following model. \[Inf=\beta_0+\beta_1Area+\beta_2X2+\beta_3X3\] and finds \[Inf=-0.13454+0.61265Area-0.24348X2+-0.06566X3\]

where X2=1 for early-cooling of rabbit, X3=1 for late-cooling of rabbit (and X2=X3=0 is no-cooling of rabbit).

First, note that because $X2$ and $X3$ are indicator variables, the parameter estimators are the difference in the mean size of the infarcted area between early-/no-cooling and late-/no-cooling groups.

coolhearts.lm <- lm(Infarc ~ Area + X2 + X3, data = coolhearts)
summary(coolhearts.lm)

## 
## Call:
## lm(formula = Infarc ~ Area + X2 + X3, data = coolhearts)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29410 -0.06511 -0.01329  0.07855  0.35949 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.13454    0.10402  -1.293 0.206459    
## Area         0.61265    0.10705   5.723 3.87e-06 ***
## X2          -0.24348    0.06229  -3.909 0.000536 ***
## X3          -0.06566    0.06507  -1.009 0.321602    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1395 on 28 degrees of freedom
## Multiple R-squared:  0.6377, Adjusted R-squared:  0.5989 
## F-statistic: 16.43 on 3 and 28 DF,  p-value: 2.363e-06

ggplot(coolhearts) +
  geom_point(aes(y=Infarc, x=Area, color=Group)) +
  geom_smooth(data=coolhearts, aes(y=Infarc, x=Area, color=Group), 
              method = "lm", se=FALSE) +
  ggtitle("Scatterplot of Inf vs Area") +
  xlab("Size of Area at Risk (grams)") +
  ylab("Size of Infarcted Area (grams)")

There are three hypothesis tests for slope parameters that answer various research questions.

Is at least one predictor useful in predicting the reponse variable ($H_0: \beta_1=\beta_2=\beta_3=0$)? Test $H_0$ with an analysis of variance F-test.
Is the response variable significantly (linearly) related to a particular predictor ($H_0: \beta_1=0$)? Test $H_0$ with a t-test.
Is the response variable significantly (linearly) related to a subset of predictors ($H_0: \beta_2=\beta_3=0$)? Test $H_0$ with a general linear F-test. (e.g., controlling for area, do any of the treatment groups create a different response?)

6.2 The General Linear F-Test

The general linear F-test tests whether to reject a smaller reduced model in favor of a larger full model. The full (unrestricted model) is the hypothesized appropriate model. The reduced (restricted) model is described by the null hypothesis. The F* statistic is $F^*=(SSE(R)-SSE(F))/(df_R-df_F) / SSE(F)/df_F)$.

For the SLR, the general linear F-test is equivalent to the AVOVA F-test. For example, in the alcoholarm example, $F=(504.04/1) / (720.27/48) = 33.59$.

alcoholarm.lm <- lm(strength ~ alcohol, data=alcoholarm)
#summary(alcoholarm.lm)
anova(alcoholarm.lm)

## Analysis of Variance Table
## 
## Response: strength
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## alcohol    1 504.04  504.04   33.59 5.136e-07 ***
## Residuals 48 720.27   15.01                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

6.3 Sequential (or Extra) Sums of Squares

The numerator of the general linear F-statistic, SSE(R)-SSE(F) is called the sequential sum of squares or extra sum of squares. It is the reduction in the error sum of squares when additional predictors are added to the model.

Consider an example of test scores of Allen Cognitive Level (ACL), Vocabulary, Abstract, and Symbol-Digit Modalities Test (SDMT) for a sample of n=69 patients. The ANOVA of the SLR of ACL ~ Vocab indicates the SSR for Vocab is 2.6906. Adding SDMT to the model increases the SSR by 9.0872.

anova(lm(ACL ~ Vocab, data=allenstest))

## Analysis of Variance Table
## 
## Response: ACL
##           Df Sum Sq Mean Sq F value  Pr(>F)  
## Vocab      1  2.691 2.69060  4.4667 0.03829 *
## Residuals 67 40.359 0.60237                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(lm(ACL ~ Vocab + SDMT, data=allenstest))

## Analysis of Variance Table
## 
## Response: ACL
##           Df  Sum Sq Mean Sq F value   Pr(>F)    
## Vocab      1  2.6906  2.6906  5.6786  0.02006 *  
## SDMT       1  9.0872  9.0872 19.1789 4.35e-05 ***
## Residuals 66 31.2717  0.4738                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

6.4 The Hypothesis Tests for the Slopes

The sequential sums of squares applies to all three hypothesis tests ($H_0: \beta_1=\beta_2=\beta_3=0$, $H_0: \beta_1=0$, and $H_0: \beta_2=\beta_3=0$).

Testing all slope parameters equal 0

Here the sequence is the change in SSR when going from a reduced model of 0 parameters to a full model of k parameters. Use the coolhearts data set as an example. To test $H_0: \beta_1=\beta_2=\beta_3=0$, simply use the overall F-test and P-value reported in the analysis of variance table. There is sufficient evidence (F = 16.43, P < 0.001) to conclude that at least one of the slope parameters is not equal to 0.

coolhearts.lm <- lm(Infarc ~ Area + X2 + X3, data =coolhearts)
anova(coolhearts.lm)

## Analysis of Variance Table
## 
## Response: Infarc
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## Area       1 0.62492 0.62492 32.1115 4.504e-06 ***
## X2         1 0.31453 0.31453 16.1622  0.000398 ***
## X3         1 0.01981 0.01981  1.0181  0.321602    
## Residuals 28 0.54491 0.01946                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(coolhearts.lm)

## 
## Call:
## lm(formula = Infarc ~ Area + X2 + X3, data = coolhearts)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29410 -0.06511 -0.01329  0.07855  0.35949 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.13454    0.10402  -1.293 0.206459    
## Area         0.61265    0.10705   5.723 3.87e-06 ***
## X2          -0.24348    0.06229  -3.909 0.000536 ***
## X3          -0.06566    0.06507  -1.009 0.321602    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1395 on 28 degrees of freedom
## Multiple R-squared:  0.6377, Adjusted R-squared:  0.5989 
## F-statistic: 16.43 on 3 and 28 DF,  p-value: 2.363e-06

Testing one slope parameter is 0

To test whether one parameter is significant, the reduced model is full model minus the parameter of interest. In the coolhearts example, to test whether the response variable Infarc is significantly related to explanatory variable Area, run the regression model with Area as the last parameter. There is sufficient evidence (F = 32.7536, P < 0.001) to conclude that the size of the infarct is significantly related to the size of the area at risk. In fact, this F-test has the same p-value as the t-test.

coolhearts.lm <- lm(Infarc ~ X2 + X3 + Area, data =coolhearts)
anova(coolhearts.lm)

## Analysis of Variance Table
## 
## Response: Infarc
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## X2         1 0.29994 0.29994 15.4125 0.0005124 ***
## X3         1 0.02191 0.02191  1.1258 0.2977463    
## Area       1 0.63742 0.63742 32.7536 3.865e-06 ***
## Residuals 28 0.54491 0.01946                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The equivalence of the F-test and the t-test illustrates an important point. The t-test is a test for the marginal significance of the $x_1$ predictor after the other predictors $x_2$ and $x_3$ have been taken into account. It does not test for the significance of the relationship between the response y and the predictor $x_1$ alone.

Testing a subset of slope parameters is 0

To test whether a subset of parameters are significant (as in, are the treatment group levels x2 and x3 significant after controlling for x1?), regress with the parameters of interest at the end of the model. There is sufficient evidence (F = 8.5902, P = 0.0012) to conclude that the type of cooling is significantly related to the extent of damage that occurs after taking into account the size of the region at risk.

coolhearts.lmr <- lm(Infarc ~ Area, data = coolhearts)
coolhearts.lmf <- lm(Infarc ~ Area + X2 + X3, data = coolhearts)
anova(coolhearts.lmr, coolhearts.lmf)

## Analysis of Variance Table
## 
## Model 1: Infarc ~ Area
## Model 2: Infarc ~ Area + X2 + X3
##   Res.Df     RSS Df Sum of Sq      F   Pr(>F)   
## 1     30 0.87926                                
## 2     28 0.54491  2   0.33435 8.5902 0.001233 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

6.5 Partial R-squared

The partial R^2 or coefficient of partial determination is the additional R^2 gained by adding the additional regressors to the model.

6.6 Lack of Fit Testing in the Multiple Regression Setting

Formal lack of fit testing can also be performed in the multiple regression setting; however, the ability to achieve replicates can be more difficult as more predictors are added to the model.

6.7 Further Examples

Example: Peruvian Blood Pressure Data

Here is an example of nine possible explanatory variables of blood pressure in a sample of n=39 Peruvians. Start with a full model.

peru.lm <- lm(Systol ~ Age + Years + FracLife + Weight + Height + Chin + Forearm + Calf + Pulse, data = peru)
summary(peru.lm)

## 
## Call:
## lm(formula = Systol ~ Age + Years + FracLife + Weight + Height + 
##     Chin + Forearm + Calf + Pulse, data = peru)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.3442  -6.3972   0.0507   5.7292  14.5257 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  146.81907   48.97096   2.998 0.005526 ** 
## Age           -1.12144    0.32741  -3.425 0.001855 ** 
## Years          2.45538    0.81458   3.014 0.005306 ** 
## FracLife    -115.29395   30.16900  -3.822 0.000648 ***
## Weight         1.41393    0.43097   3.281 0.002697 ** 
## Height        -0.03464    0.03686  -0.940 0.355194    
## Chin          -0.94369    0.74097  -1.274 0.212923    
## Forearm       -1.17085    1.19329  -0.981 0.334612    
## Calf          -0.15867    0.53716  -0.295 0.769810    
## Pulse          0.11455    0.17043   0.672 0.506818    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.655 on 29 degrees of freedom
## Multiple R-squared:  0.6674, Adjusted R-squared:  0.5641 
## F-statistic: 6.465 on 9 and 29 DF,  p-value: 5.241e-05

The p-values for Height, Chin, Forearm, Calf, and Pulse are not statistically significant. These individual tests are affected by correlations amongst the x-variables, so we will use the General Linear F procedure to see whether it is reasonable to declare that all five non-significant variables can be dropped from the model.

peru.lmr <- lm(Systol ~ Age + Years + FracLife + Weight, data = peru)
peru.lmf <- lm(Systol ~ Age + Years + FracLife + Weight + Height + Chin + Forearm + Calf + Pulse, data = peru)
anova(peru.lmr, peru.lmf)

## Analysis of Variance Table
## 
## Model 1: Systol ~ Age + Years + FracLife + Weight
## Model 2: Systol ~ Age + Years + FracLife + Weight + Height + Chin + Forearm + 
##     Calf + Pulse
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     34 2629.7                           
## 2     29 2172.6  5    457.12 1.2204 0.3247

There is insufficient evidence (F = 1.2204, P = 0.3247) to conclude that Height, Chin, Forearm, Calf, and Pulse are significantly related to blood pressure after taking into account Age, Years, FracLife, and Weight.

Example: Measurements of College Students

Here is an example of four possible explanatory variables of height in a sample of n=55 college students. Start with a full model.

physical.lm <- lm(Height ~ LeftArm + LeftFoot + HeadCirc + nose, data=physical)
summary(physical.lm)

## 
## Call:
## lm(formula = Height ~ LeftArm + LeftFoot + HeadCirc + nose, data = physical)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0660 -1.1850 -0.1569  1.2539  5.5071 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 18.50265    7.83031   2.363   0.0221 *  
## LeftArm      0.80205    0.17074   4.697 2.09e-05 ***
## LeftFoot     0.99730    0.16230   6.145 1.30e-07 ***
## HeadCirc     0.08052    0.14952   0.539   0.5926    
## nose        -0.14740    0.49233  -0.299   0.7659    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.183 on 50 degrees of freedom
## Multiple R-squared:  0.774,  Adjusted R-squared:  0.7559 
## F-statistic: 42.81 on 4 and 50 DF,  p-value: 1.447e-15

The p-values for HeadCirc and nose are not statistically significant. Use the General Linear F procedure to see whether it is reasonable to declare that both non-significant variables can be dropped from the model.

physical.lmr <- lm(Height ~ LeftArm + LeftFoot, data=physical)
physical.lmf <- lm(Height ~ LeftArm + LeftFoot + HeadCirc + nose, data=physical)
anova(physical.lmr, physical.lmf)

## Analysis of Variance Table
## 
## Model 1: Height ~ LeftArm + LeftFoot
## Model 2: Height ~ LeftArm + LeftFoot + HeadCirc + nose
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     52 240.18                           
## 2     50 238.35  2    1.8289 0.1918 0.8261

There is insufficient evidence (F = 0.1918, P = 0.8261) to conclude that HeadCirc and nose are significantly related to Height after taking into account LeftArm and LeftFoot.