Earlier in the course we studied the multiple regression relationship of SBP (Y) to AGE (X1), SMK (X2), and QUET (X3) using the data in Homework 1 of Week 2.
If you need to refer to the dataset from Week 2 homework, you can download the CSV file.
Three regression models will now be considered:
First, use your computer to generate each of the above models.
url <- "https://d396qusza40orc.cloudfront.net/appliedregression/Homeworks/week2-HW-data.csv"
#download.file(url, "dataw2.csv")
data <- read.csv("dataw2.csv")
head(data)
## Person SBP QUET AGE SMK
## 1 1 135 2.876 45 0
## 2 2 122 3.251 41 0
## 3 3 130 3.100 49 0
## 4 4 148 3.768 52 0
## 5 5 146 2.979 54 1
## 6 6 129 2.790 47 1
nms <- names(data)
(nms <- tolower(nms))
## [1] "person" "sbp" "quet" "age" "smk"
names(data) <- nms
(lm1 <- lm(sbp ~ age, data = data))
##
## Call:
## lm(formula = sbp ~ age, data = data)
##
## Coefficients:
## (Intercept) age
## 59.092 1.605
(lm2 <- lm(sbp ~ age + smk, data = data))
##
## Call:
## lm(formula = sbp ~ age + smk, data = data)
##
## Coefficients:
## (Intercept) age smk
## 48.050 1.709 10.294
(lm3 <- lm(sbp ~ age + smk + quet, data = data))
##
## Call:
## lm(formula = sbp ~ age + smk + quet, data = data)
##
## Coefficients:
## (Intercept) age smk quet
## 45.103 1.213 9.946 8.592
Then, complete the following:
(sbp_s50 <- predict(lm3, newdata = data.frame(age = 50, smk = 1, quet = 3.5)))
## 1
## 145.7581
(sbp_ns50 <- predict(lm3, data.frame(age = 50, smk = 0, quet = 3.5)))
## 1
## 135.8125
(higherQ <- predict(lm3, data.frame(age = 50, smk = 1, quet = 3.5)))
## 1
## 145.7581
(lowerQ <- predict(lm3, data.frame(age = 50, smk = 1, quet = 3.0)))
## 1
## 141.4618
(change_sbp <- higherQ - lowerQ)
## 1
## 4.296224
Increase in quetelet from 3.0 to 3.5 corresponds with an increase in systolic blood pressure of 4.2962243 units(mmHg) on average.
(s1 <- summary(lm1)$r.squared)
## [1] 0.6009414
(anova1 <- anova(lm1))
## Analysis of Variance Table
##
## Response: sbp
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 3861.6 3861.6 45.177 1.894e-07 ***
## Residuals 30 2564.3 85.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(rsq1 <- anova1[1, 2] / sum(anova1[, 2]))
## [1] 0.6009414
(s2 <- summary(lm2)$adj.r.squared)
## [1] 0.7111676
(n <- nrow(data))
## [1] 32
(k2 <- length(lm2$coef)-1)
## [1] 2
((anova2 <- anova(lm2)))
## Analysis of Variance Table
##
## Response: sbp
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 3861.6 3861.6 64.498 7.407e-09 ***
## smk 1 828.1 828.1 13.830 0.0008532 ***
## Residuals 29 1736.3 59.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(rsq2 <- sum(anova2[1:2, 2]) / sum(anova2[, 2]))
## [1] 0.7298019
(adj.rsq2 <- rsq2 - (1 - rsq2) *( k2 / (n - k2 - 1)))
## [1] 0.7111676
(s3 <- summary(lm3)$adj.r.squared)
## [1] 0.7353349
(k3 <- length(lm3$coef) -1)
## [1] 3
(anova3 <- anova(lm3))
## Analysis of Variance Table
##
## Response: sbp
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 3861.6 3861.6 70.3877 3.987e-09 ***
## smk 1 828.1 828.1 15.0933 0.0005715 ***
## quet 1 200.1 200.1 3.6481 0.0664268 .
## Residuals 28 1536.1 54.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(SST <- sum(anova3[, 2]))
## [1] 6425.969
(SSE <- sum(anova3[k3 + 1, 2]))
## [1] 1536.143
(rsq3 <- (SST - SSE) / SST)
## [1] 0.7609476
(adj.rsq3 <- rsq3 - (1 - rsq3) * k3 / (n - k3 - 1))
## [1] 0.7353349
identical(round(s3, 7), round(adj.rsq3, 7))
## [1] TRUE
(MSreg1 <- anova1[1, 3])
## [1] 3861.63
(MSE1 <- anova1[2, 3])
## [1] 85.47795
(Fstatistic1 <- MSreg1 / MSE1)
## [1] 45.17692
(pValue <- pf(Fstatistic1, df1 = anova1[1, 1], df2 = anova1[2, 1], lower.tail = FALSE))
## [1] 1.893578e-07
Overall F-test tests whether the slopes for the predictors in a regression are simultaneously equal to zero. In model 1 we have only one predictor, so the overall F - test is only test whether the slope of age is zero.
Models 2 and 3 have more than one predictor, so the null hypothesis of the F - test states that the slopes for the coefficients are simultaneously equal to zero, i.e. that \(\beta_{age}=\beta_{smk}=\beta_{quet}=0\)
(MSreg2 <- sum(anova2[1:k2, 2]) / sum(anova2[1:k2, 1]))
## [1] 2344.842
(MSE2 <- anova2[k2 + 1, 3])
## [1] 59.87188
(Fstatistic2 <- MSreg2 / MSE2)
## [1] 39.16433
(pValue2 <- pf(Fstatistic2, df1 = sum(anova2[1:k2, 1]), df2 = anova2[k2 + 1, 1], lower.tail = FALSE))
## [1] 5.746364e-09
(MSreg3 <- sum(anova3[1:k3, 2]) / sum(anova3[1:k3, 1]))
## [1] 1629.942
(MSE3 <- anova3[k3 + 1, 3])
## [1] 54.86225
(Fstatistic3 <- MSreg3 / MSE3)
## [1] 29.70972
(pValue3 <- pf(Fstatistic3, df1 = sum(anova3[1:k3, 1]), df2 = anova3[k3 + 1, 1], lower.tail = FALSE))
## [1] 7.602274e-09
Alternative way of obtaining F statistics is from the summary output of the linear regression model:
names(summary(lm1))
## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
summary(lm1)$fstatistic
## value numdf dendf
## 45.17692 1.00000 30.00000
summary(lm2)$fstatistic
## value numdf dendf
## 39.16433 2.00000 29.00000
summary(lm3)$fstatistic
## value numdf dendf
## 29.70972 3.00000 28.00000
The accompanying table presents the weight (X1), age (X2) and plasma lipid levels of total cholesterol (Y) for a hypothetical sample of 25 patients with hyperlipoproteinemia before drug therapy.
url <- "https://d396qusza40orc.cloudfront.net/appliedregression/Homeworks/week5-HW-data.csv"
download.file(url, "cholesterol.csv")
data <- read.csv("cholesterol.csv")
head(data)
## Patient choles weight age
## 1 1 354 84 46
## 2 2 190 73 20
## 3 3 405 65 52
## 4 4 263 70 30
## 5 5 451 76 57
## 6 6 302 69 25
dim(data)
## [1] 25 4
Complete the following six questions:
(lm1 <- lm(choles ~ weight, data = data))
##
## Call:
## lm(formula = choles ~ weight, data = data)
##
## Coefficients:
## (Intercept) weight
## 199.298 1.622
(lm2 <- lm(choles ~ age, data = data))
##
## Call:
## lm(formula = choles ~ age, data = data)
##
## Coefficients:
## (Intercept) age
## 102.575 5.321
(lm3 <- lm(choles ~ weight + age, data = data))
##
## Call:
## lm(formula = choles ~ weight + age, data = data)
##
## Coefficients:
## (Intercept) weight age
## 77.9825 0.4174 5.2166
nd <- data.frame(weight = 70, age = 30)
(patient4_1 <- predict(lm1, newdata = nd))
## 1
## 312.8615
(patient4_2 <- predict(lm2, newdata = nd))
## 1
## 262.1954
(patient4_3 <- predict(lm3, newdata = nd))
## 1
## 263.6956
Model that has only weight for a predictor has the worse results: it predicts 312.8614924 while the observed value of cholesterol is 263. Models 2 and 3 have almost the same precision, although model 3 is the closest to the observed value, predicts 263.6956098, while model 2 with predicted 262.1954319 is the only one that (slightly) underestimates the level of the cholesterol.
#overall F test
(anova3 <- anova(lm3))
## Analysis of Variance Table
##
## Response: choles
## Df Sum Sq Mean Sq F value Pr(>F)
## weight 1 10232 10232 5.2585 0.03177 *
## age 1 92339 92339 47.4571 6.43e-07 ***
## Residuals 22 42806 1946
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(MSreg3 <- sum(anova3[1:2, 2]) / sum(anova3[1:2, 1]))
## [1] 51285.41
(MSE3 <- anova3[3, 3])
## [1] 1945.738
(Fstatistic3 <- MSreg3 / MSE3)
## [1] 26.35782
(pValue3 <- pf(Fstatistic1, df1 = sum(anova3[1:2, 1]), df2 = anova3[3, 1], lower.tail = FALSE))
## [1] 1.622491e-08
#partial F test for the addition of weight given that age is already in the model
(SSboth <- sum(anova3[1:2, 2]))
## [1] 102570.8
(MSEboth <- anova3[3, 3])
## [1] 1945.738
(anova2 <- anova(lm2))
## Analysis of Variance Table
##
## Response: choles
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 101933 101933 53.964 1.794e-07 ***
## Residuals 23 43444 1889
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(SSage <- anova2[1, 2])
## [1] 101932.7
(diffSS <- SSboth - SSage)
## [1] 638.1489
(partialF <- diffSS / MSEboth)
## [1] 0.3279728
(n <- nrow(data))
## [1] 25
(k <- 1)
## [1] 1
(partial.pValue <- pf(partialF, df1 = 1, df2 = n - k - 2, lower.tail = FALSE))
## [1] 0.5726623
partial F - test: \[H_0 = addition of variable weight doesn't improve predictive value of the model\] \[H_A = addition of variable weight improves prediction of model\] p - value of the partial F test says that addition of variable weight to the model that already contains age doesn’t improve the value of prediction and the overall F - test rejects that both slope coefficients are equal to zero
(s1 <- summary(lm1)$r.squared)
## [1] 0.07038062
(anova1 <- anova(lm1))
## Analysis of Variance Table
##
## Response: choles
## Df Sum Sq Mean Sq F value Pr(>F)
## weight 1 10232 10231.7 1.7413 0.2
## Residuals 23 135145 5875.9
(SST1 <- sum(anova1[, 2]))
## [1] 145377
(SSE1 <- anova1[2, 2])
## [1] 135145.3
(SSR1 <- SST1 - SSE1)
## [1] 10231.73
(rsq1 <- SSR1 / SST1)
## [1] 0.07038062
(s2 <- summary(lm2)$r.squared)
## [1] 0.7011607
(anova2 <- anova(lm2))
## Analysis of Variance Table
##
## Response: choles
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 101933 101933 53.964 1.794e-07 ***
## Residuals 23 43444 1889
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(SST2 <- sum(anova2[, 2]))
## [1] 145377
(SSE2 <- anova2[2, 2])
## [1] 43444.37
(SSR2 <- anova2[1, 2])
## [1] 101932.7
(rsq2 <- SSR2 / SST2)
## [1] 0.7011607
(s3 <- summary(lm3)$adj.r.squared)
## [1] 0.6787821
anova3
## Analysis of Variance Table
##
## Response: choles
## Df Sum Sq Mean Sq F value Pr(>F)
## weight 1 10232 10232 5.2585 0.03177 *
## age 1 92339 92339 47.4571 6.43e-07 ***
## Residuals 22 42806 1946
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(SST3 <- sum(anova3[, 2]))
## [1] 145377
(SSE3 <- anova3[3, 2])
## [1] 42806.23
(SSR3 <- sum(anova3[1:2, 2]))
## [1] 102570.8
(rsq3 <- SSR3 / SST3)
## [1] 0.7055503
k <- 2
(adj.rsq3 <- rsq3 - (1 - rsq3) * (k / (n - k - 1)))
## [1] 0.6787821
Predictive model has to be the parsimonious model: model with the minimum number of predictors, in case when we have at least two models with the same explanatory value and the only thing they differ is the number of predictors. In this case, where we have a model with 2 explanatory variables its adjusted \(R^2\) is less than the model with only age as predictor. Therefore, I’d say that the second model cholesterol ~ age is the best predictive model. It explains 70.116069 % of variability in cholesterol level, while the two-predictor model explains 67.8782149 %.