I am working through the material in the Penn State online class STAT 501 Regression Methods. This R script is a personal exercise to translate the concepts and examples illustrated in Minitab into R and SAS. Lesson 6 in STAT 501 is MLR Model Evaluation. The SAS version of the code below is posted on my blog.
For the MLR model, there are several hypotheses tests: a t-test for partial F-test for \(H_0: \beta_k=0\); an overall F-test for \(H_0: \beta_1=...=\beta_{p-1}=0\); and a partial F-test for any subset of parameters.
This lesson also discusses the partial \(R^2\), and the MLR version of the lack of fit test.
Rename a column by name instead of by index. Use names with a filter: names(coolhearts)[names(coolhearts)=="Inf."] <- "Infarc".
color= aesthetic. Note that the value must be a factor variable. That raises another question - how do you declare a factor variable??? factor(coolhearts$Group, labels = c("Early", "Late", "Control")).
library(readr)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(alr3)
## Warning: package 'alr3' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(reshape2)
## Warning: package 'reshape2' was built under R version 3.4.3
# coolhearts is a tab-delimited data file.
# size of infarcted area and region at risk of n=32 rabbits in k=3 temperature groups.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/coolhearts.txt'
coolhearts <- read.table(url, sep = '\t', header = TRUE, fileEncoding = "UTF-16LE")
# get rid of "." from column "Inf.", then rename to "Infarc" because "Inf" is reserved.
names(coolhearts)[names(coolhearts)=="Inf."] <- "Infarc"
coolhearts$Group <- factor(coolhearts$Group, labels = c("Early", "Late", "Control"))
# head(coolhearts)
# summarise(coolhearts, n=n())
# alcoholarm is a tab-delimited data file.
# Total lifetime consumption of alcohol and deltoid muscle strength of nondominant arm for n=50 alcoholic men.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/alcoholarm.txt'
alcoholarm <- read.table(url, sep = '\t', header = TRUE)
#head(alcoholarm)
#summarise(alcoholarm, n=n())
# allenstest is a tab-delimited data file.
# scores in Allen Cognitive Level (ACL), vocabulary, abstraction, and symbol-Digit Modalities Test (SDMT) for a sample of n=69 patients.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/allentest.txt'
allenstest <- read.table(url, sep = '\t', header = TRUE, fileEncoding = "UTF-16LE")
#head(allenstest)
#summarise(allenstest, n=n())
# peru is a tab-delimited data file.
# Variables possibly relating to blood pressures of n = 39 Peruvians.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/peru.txt'
peru <- read.table(url, sep = '\t', header = TRUE)
peru$FracLife = peru$Years / peru$Age
#head(peru)
#summarise(peru, n=n())
# anscombe is a fixed-width data file.
#
physical <- read_fwf(
file="https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/Physical.txt",
skip = 1,
fwf_widths(c(3, 9, 8, 6, 9, 7, 8, 7, 10, 6, 3), c('Obs', 'Sex', 'Height', 'LeftArm', 'RtArm', 'LeftFoot', 'RtFoot', 'LeftHand', 'RtHand', 'HeadCirc', 'nose')))
## Parsed with column specification:
## cols(
## Obs = col_integer(),
## Sex = col_character(),
## Height = col_double(),
## LeftArm = col_double(),
## RtArm = col_double(),
## LeftFoot = col_double(),
## RtFoot = col_double(),
## LeftHand = col_double(),
## RtHand = col_double(),
## HeadCirc = col_double(),
## nose = col_double()
## )
#head(physical)
#summarise(physical, n=n())
Consider this research question. Does the mean size of the infarcted area of a rabbit subjected to a heart attack differ among the three treatment groups - no cooling, early cooling, and late cooling - when controlling for the size of the region at risk for infarction?
The researcher fits the following model. \[Inf=\beta_0+\beta_1Area+\beta_2X2+\beta_3X3\] and finds \[Inf=-0.13454+0.61265Area-0.24348X2+-0.06566X3\]
where X2=1 for early-cooling of rabbit, X3=1 for late-cooling of rabbit (and X2=X3=0 is no-cooling of rabbit).
First, note that because \(X2\) and \(X3\) are indicator variables, the parameter estimators are the difference in the mean size of the infarcted area between early-/no-cooling and late-/no-cooling groups.
coolhearts.lm <- lm(Infarc ~ Area + X2 + X3, data = coolhearts)
summary(coolhearts.lm)
##
## Call:
## lm(formula = Infarc ~ Area + X2 + X3, data = coolhearts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.29410 -0.06511 -0.01329 0.07855 0.35949
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.13454 0.10402 -1.293 0.206459
## Area 0.61265 0.10705 5.723 3.87e-06 ***
## X2 -0.24348 0.06229 -3.909 0.000536 ***
## X3 -0.06566 0.06507 -1.009 0.321602
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1395 on 28 degrees of freedom
## Multiple R-squared: 0.6377, Adjusted R-squared: 0.5989
## F-statistic: 16.43 on 3 and 28 DF, p-value: 2.363e-06
ggplot(coolhearts) +
geom_point(aes(y=Infarc, x=Area, color=Group)) +
geom_smooth(data=coolhearts, aes(y=Infarc, x=Area, color=Group),
method = "lm", se=FALSE) +
ggtitle("Scatterplot of Inf vs Area") +
xlab("Size of Area at Risk (grams)") +
ylab("Size of Infarcted Area (grams)")
Is at least one predictor useful in predicting the reponse variable (\(H_0: \beta_1=\beta_2=\beta_3=0\))? Test \(H_0\) with an analysis of variance F-test.
Is the response variable significantly (linearly) related to a particular predictor (\(H_0: \beta_1=0\))? Test \(H_0\) with a t-test.
The general linear F-test tests whether to reject a smaller reduced model in favor of a larger full model. The full (unrestricted model) is the hypothesized appropriate model. The reduced (restricted) model is described by the null hypothesis. The F* statistic is \(F^*=(SSE(R)-SSE(F))/(df_R-df_F) / SSE(F)/df_F)\).
For the SLR, the general linear F-test is equivalent to the AVOVA F-test. For example, in the alcoholarm example, \(F=(504.04/1) / (720.27/48) = 33.59\).
alcoholarm.lm <- lm(strength ~ alcohol, data=alcoholarm)
#summary(alcoholarm.lm)
anova(alcoholarm.lm)
## Analysis of Variance Table
##
## Response: strength
## Df Sum Sq Mean Sq F value Pr(>F)
## alcohol 1 504.04 504.04 33.59 5.136e-07 ***
## Residuals 48 720.27 15.01
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The numerator of the general linear F-statistic, SSE(R)-SSE(F) is called the sequential sum of squares or extra sum of squares. It is the reduction in the error sum of squares when additional predictors are added to the model.
Consider an example of test scores of Allen Cognitive Level (ACL), Vocabulary, Abstract, and Symbol-Digit Modalities Test (SDMT) for a sample of n=69 patients. The ANOVA of the SLR of ACL ~ Vocab indicates the SSR for Vocab is 2.6906. Adding SDMT to the model increases the SSR by 9.0872.
anova(lm(ACL ~ Vocab, data=allenstest))
## Analysis of Variance Table
##
## Response: ACL
## Df Sum Sq Mean Sq F value Pr(>F)
## Vocab 1 2.691 2.69060 4.4667 0.03829 *
## Residuals 67 40.359 0.60237
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm(ACL ~ Vocab + SDMT, data=allenstest))
## Analysis of Variance Table
##
## Response: ACL
## Df Sum Sq Mean Sq F value Pr(>F)
## Vocab 1 2.6906 2.6906 5.6786 0.02006 *
## SDMT 1 9.0872 9.0872 19.1789 4.35e-05 ***
## Residuals 66 31.2717 0.4738
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The sequential sums of squares applies to all three hypothesis tests (\(H_0: \beta_1=\beta_2=\beta_3=0\), \(H_0: \beta_1=0\), and \(H_0: \beta_2=\beta_3=0\)).
Here the sequence is the change in SSR when going from a reduced model of 0 parameters to a full model of k parameters. Use the coolhearts data set as an example. To test \(H_0: \beta_1=\beta_2=\beta_3=0\), simply use the overall F-test and P-value reported in the analysis of variance table. There is sufficient evidence (F = 16.43, P < 0.001) to conclude that at least one of the slope parameters is not equal to 0.
coolhearts.lm <- lm(Infarc ~ Area + X2 + X3, data =coolhearts)
anova(coolhearts.lm)
## Analysis of Variance Table
##
## Response: Infarc
## Df Sum Sq Mean Sq F value Pr(>F)
## Area 1 0.62492 0.62492 32.1115 4.504e-06 ***
## X2 1 0.31453 0.31453 16.1622 0.000398 ***
## X3 1 0.01981 0.01981 1.0181 0.321602
## Residuals 28 0.54491 0.01946
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(coolhearts.lm)
##
## Call:
## lm(formula = Infarc ~ Area + X2 + X3, data = coolhearts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.29410 -0.06511 -0.01329 0.07855 0.35949
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.13454 0.10402 -1.293 0.206459
## Area 0.61265 0.10705 5.723 3.87e-06 ***
## X2 -0.24348 0.06229 -3.909 0.000536 ***
## X3 -0.06566 0.06507 -1.009 0.321602
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1395 on 28 degrees of freedom
## Multiple R-squared: 0.6377, Adjusted R-squared: 0.5989
## F-statistic: 16.43 on 3 and 28 DF, p-value: 2.363e-06
To test whether one parameter is significant, the reduced model is full model minus the parameter of interest. In the coolhearts example, to test whether the response variable Infarc is significantly related to explanatory variable Area, run the regression model with Area as the last parameter. There is sufficient evidence (F = 32.7536, P < 0.001) to conclude that the size of the infarct is significantly related to the size of the area at risk. In fact, this F-test has the same p-value as the t-test.
coolhearts.lm <- lm(Infarc ~ X2 + X3 + Area, data =coolhearts)
anova(coolhearts.lm)
## Analysis of Variance Table
##
## Response: Infarc
## Df Sum Sq Mean Sq F value Pr(>F)
## X2 1 0.29994 0.29994 15.4125 0.0005124 ***
## X3 1 0.02191 0.02191 1.1258 0.2977463
## Area 1 0.63742 0.63742 32.7536 3.865e-06 ***
## Residuals 28 0.54491 0.01946
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The equivalence of the F-test and the t-test illustrates an important point. The t-test is a test for the marginal significance of the \(x_1\) predictor after the other predictors \(x_2\) and \(x_3\) have been taken into account. It does not test for the significance of the relationship between the response y and the predictor \(x_1\) alone.
To test whether a subset of parameters are significant (as in, are the treatment group levels x2 and x3 significant after controlling for x1?), regress with the parameters of interest at the end of the model. There is sufficient evidence (F = 8.5902, P = 0.0012) to conclude that the type of cooling is significantly related to the extent of damage that occurs after taking into account the size of the region at risk.
coolhearts.lmr <- lm(Infarc ~ Area, data = coolhearts)
coolhearts.lmf <- lm(Infarc ~ Area + X2 + X3, data = coolhearts)
anova(coolhearts.lmr, coolhearts.lmf)
## Analysis of Variance Table
##
## Model 1: Infarc ~ Area
## Model 2: Infarc ~ Area + X2 + X3
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 0.87926
## 2 28 0.54491 2 0.33435 8.5902 0.001233 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The partial R^2 or coefficient of partial determination is the additional R^2 gained by adding the additional regressors to the model.
Formal lack of fit testing can also be performed in the multiple regression setting; however, the ability to achieve replicates can be more difficult as more predictors are added to the model.
Here is an example of nine possible explanatory variables of blood pressure in a sample of n=39 Peruvians. Start with a full model.
peru.lm <- lm(Systol ~ Age + Years + FracLife + Weight + Height + Chin + Forearm + Calf + Pulse, data = peru)
summary(peru.lm)
##
## Call:
## lm(formula = Systol ~ Age + Years + FracLife + Weight + Height +
## Chin + Forearm + Calf + Pulse, data = peru)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.3442 -6.3972 0.0507 5.7292 14.5257
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 146.81907 48.97096 2.998 0.005526 **
## Age -1.12144 0.32741 -3.425 0.001855 **
## Years 2.45538 0.81458 3.014 0.005306 **
## FracLife -115.29395 30.16900 -3.822 0.000648 ***
## Weight 1.41393 0.43097 3.281 0.002697 **
## Height -0.03464 0.03686 -0.940 0.355194
## Chin -0.94369 0.74097 -1.274 0.212923
## Forearm -1.17085 1.19329 -0.981 0.334612
## Calf -0.15867 0.53716 -0.295 0.769810
## Pulse 0.11455 0.17043 0.672 0.506818
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.655 on 29 degrees of freedom
## Multiple R-squared: 0.6674, Adjusted R-squared: 0.5641
## F-statistic: 6.465 on 9 and 29 DF, p-value: 5.241e-05
The p-values for Height, Chin, Forearm, Calf, and Pulse are not statistically significant. These individual tests are affected by correlations amongst the x-variables, so we will use the General Linear F procedure to see whether it is reasonable to declare that all five non-significant variables can be dropped from the model.
peru.lmr <- lm(Systol ~ Age + Years + FracLife + Weight, data = peru)
peru.lmf <- lm(Systol ~ Age + Years + FracLife + Weight + Height + Chin + Forearm + Calf + Pulse, data = peru)
anova(peru.lmr, peru.lmf)
## Analysis of Variance Table
##
## Model 1: Systol ~ Age + Years + FracLife + Weight
## Model 2: Systol ~ Age + Years + FracLife + Weight + Height + Chin + Forearm +
## Calf + Pulse
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 34 2629.7
## 2 29 2172.6 5 457.12 1.2204 0.3247
There is insufficient evidence (F = 1.2204, P = 0.3247) to conclude that Height, Chin, Forearm, Calf, and Pulse are significantly related to blood pressure after taking into account Age, Years, FracLife, and Weight.
Here is an example of four possible explanatory variables of height in a sample of n=55 college students. Start with a full model.
physical.lm <- lm(Height ~ LeftArm + LeftFoot + HeadCirc + nose, data=physical)
summary(physical.lm)
##
## Call:
## lm(formula = Height ~ LeftArm + LeftFoot + HeadCirc + nose, data = physical)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0660 -1.1850 -0.1569 1.2539 5.5071
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.50265 7.83031 2.363 0.0221 *
## LeftArm 0.80205 0.17074 4.697 2.09e-05 ***
## LeftFoot 0.99730 0.16230 6.145 1.30e-07 ***
## HeadCirc 0.08052 0.14952 0.539 0.5926
## nose -0.14740 0.49233 -0.299 0.7659
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.183 on 50 degrees of freedom
## Multiple R-squared: 0.774, Adjusted R-squared: 0.7559
## F-statistic: 42.81 on 4 and 50 DF, p-value: 1.447e-15
The p-values for HeadCirc and nose are not statistically significant. Use the General Linear F procedure to see whether it is reasonable to declare that both non-significant variables can be dropped from the model.
physical.lmr <- lm(Height ~ LeftArm + LeftFoot, data=physical)
physical.lmf <- lm(Height ~ LeftArm + LeftFoot + HeadCirc + nose, data=physical)
anova(physical.lmr, physical.lmf)
## Analysis of Variance Table
##
## Model 1: Height ~ LeftArm + LeftFoot
## Model 2: Height ~ LeftArm + LeftFoot + HeadCirc + nose
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 52 240.18
## 2 50 238.35 2 1.8289 0.1918 0.8261
There is insufficient evidence (F = 0.1918, P = 0.8261) to conclude that HeadCirc and nose are significantly related to Height after taking into account LeftArm and LeftFoot.