Download and read data
url <- "https://d396qusza40orc.cloudfront.net/appliedregression/Homeworks/week2-HW-data.csv"
download.file(url, "sbp.csv", mode = "wb")
bp <- read.csv("sbp.csv")
names(bp) <- tolower(names(bp))
Three regression models will now be considered:
| Model | Independent Variables Used |
|---|---|
| 1 | AGE \((X_1)\) |
| 2 | AGE \((X_1)\), SMK \((X_2)\) |
| 3 | AGE \((X_1)\), SMK \((X_2)\), QUET \((X_3)\) |
First, use your computer to generate each of the above models.
summary(fit1 <- lm(sbp ~ age, bp))
##
## Call:
## lm(formula = sbp ~ age, data = bp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.548 -6.990 -2.481 5.765 23.892
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.0916 12.8163 4.611 6.98e-05 ***
## age 1.6045 0.2387 6.721 1.89e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.245 on 30 degrees of freedom
## Multiple R-squared: 0.6009, Adjusted R-squared: 0.5876
## F-statistic: 45.18 on 1 and 30 DF, p-value: 1.894e-07
summary(fit2 <- update(fit1, . ~ . + smk))
##
## Call:
## lm(formula = sbp ~ age + smk, data = bp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.639 -5.518 -1.637 4.900 19.616
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.0496 11.1296 4.317 0.000168 ***
## age 1.7092 0.2018 8.471 2.47e-09 ***
## smk 10.2944 2.7681 3.719 0.000853 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.738 on 29 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7112
## F-statistic: 39.16 on 2 and 29 DF, p-value: 5.746e-09
summary(fit3 <- update(fit2, . ~ . + quet))
##
## Call:
## lm(formula = sbp ~ age + smk + quet, data = bp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5420 -6.1812 -0.7282 5.2908 15.7050
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.1032 10.7649 4.190 0.000252 ***
## age 1.2127 0.3238 3.745 0.000829 ***
## smk 9.9456 2.6561 3.744 0.000830 ***
## quet 8.5924 4.4987 1.910 0.066427 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.407 on 28 degrees of freedom
## Multiple R-squared: 0.7609, Adjusted R-squared: 0.7353
## F-statistic: 29.71 on 3 and 28 DF, p-value: 7.602e-09
Then, complete the following:
1. What is the predicted SBP for a 50-year old smoker with a quetelet (QUET) index of 3.5?
ndata <- data.frame(age = 50, smk = 1, quet = 3.5)
predict(fit3, ndata, type="response")
## 1
## 145.7581
2. What is the predicted SBP for a 50-year-old non-smoker with a quetelet index of 3.5?
ndata <- data.frame(age = 50, smk = 0, quet = 3.5)
predict(fit3, ndata, type="response")
## 1
## 135.8125
Really? That less?
As we can remember from the homework 2, the effect of smoking status on SBP estimated alone was very little. But here we estimated this effect after eliminating the variance coming from age, so it is statistically significant after all.
3. For 50-year-old smokers, give an estimate of the change in SBP corresponding to an increase in quetelet index from 3.0 to 3.5.
This is just \(\frac{\beta_3}{2}\).
(beta3 <- fit3[[1]][4]) # the beta3 coefficient from the model
## quet
## 8.592449
beta3 / 2
## quet
## 4.296224
First, write a couple of custom functions to do so.
sumSq <- function(x) {
# This function extracts SS and DFs from 'lm' objects
# make ANOVA table
x <- anova(x)
# extract sums of squares and DFs
sumsq <- x[c(1,2)]
n <- nrow(sumsq)
# add up due regression SS and DFs across all columns
x <- rbind(apply(as.matrix(sumsq[-n, ]), 2, sum), sumsq[n, ])
row.names(x) <- c("Due regression", "Residual")
x
}
rsq_ftest <- function(x, type) {
# This function computes R squared and performs an F-test from a
# 2x2 matrix with DFs in the first column and SS in the second column
switch(type,
rsq = {
# Due regression SS / Total SS
structure(x[1, 2] / (x[1, 2] + x[2, 2]),
names = "R squared")
},
ftest = {
# get DR and residual mean squares:
meansq <- x[, 2] / x[, 1]
# get the F-value:
fval <- meansq[1] / meansq[2]
# get the P-value, compare to the
# F(k, n-2-k):
pval <- pf(fval, x[1, 1], x[2, 1], lower.tail = FALSE)
structure(c(fval, pval), names = c("F-value",
"P-value"))
}
)
}
Now get them to work
(sumsq <- lapply(list(fit1,fit2,fit3), sumSq))
## [[1]]
## Df Sum Sq
## Due regression 1 3861.6
## Residual 30 2564.3
##
## [[2]]
## Df Sum Sq
## Due regression 2 4689.7
## Residual 29 1736.3
##
## [[3]]
## Df Sum Sq
## Due regression 3 4889.8
## Residual 28 1536.1
lapply(sumsq, rsq_ftest, type = "rsq")
## [[1]]
## R squared
## 0.6009414
##
## [[2]]
## R squared
## 0.7298019
##
## [[3]]
## R squared
## 0.7609476
(sbp ~ age)
\(H_0: {\beta_1} = 0\)
rsq_ftest(sumsq[[1]], type = "ftest")
## F-value P-value
## 4.517692e+01 1.893578e-07
Reject the null hypothesis
(sbp ~ age + smk)
\(H_0: {\beta_1} = {\beta_2}= 0\)
rsq_ftest(sumsq[[2]], "ftest")
## F-value P-value
## 3.916433e+01 5.746364e-09
Reject the null hypothesis
(sbp ~ age + smk + quet)
\(H_0: {\beta_1} = {\beta_2} = {\beta_3} = 0\)
rsq_ftest(sumsq[[3]], "ftest")
## F-value P-value
## 2.970972e+01 7.602274e-09
Reject again.
Download and read data
url <- "https://d396qusza40orc.cloudfront.net/appliedregression/Homeworks/week5-HW-data.csv"
download.file(url, "chol.csv", mode = "wb")
chol <- read.csv("chol.csv")
1. Generate the separate straight-line regressions of \(Y\) on \(X_1\) (model 1) and \(Y\) on \(X_2\) (model 2).
2. Generate the regression model of \(Y\) on both \(X_1\) and \(X_2\).
f <- c("choles ~ weight", "choles ~ age", "choles ~ weight + age")
form <- lapply(f, as.formula, env = new.env())
modList <- lapply(form, function(x) lm(x, chol))
3. For each of the models in questions 1 and 2, determine the predicted cholesterol level \(Y\) for patient 4 (with \(Y\) = 263, \(X_1\) = 70, and \(X_2\) = 30) and compare these predicted cholesterol levels with the observed value.
ndata <- data.frame(weight = 70, age = 30)
library(pryr)
# fill in some predict() arguments
pred <- partial(predict, newdata = ndata, type = "response")
# make a prediction
lapply(modList, pred)
## [[1]]
## 1
## 312.8615
##
## [[2]]
## 1
## 262.1954
##
## [[3]]
## 1
## 263.6956
# compare w/observed value
chol[4, 2]
## [1] 263
Looks like the first model (choles ~ age) isn’t particularly good at predictions. We need to know more than weight.
summary(modList[[1]])$r.sq
## [1] 0.07038062
4. Carry out the overall F-test for the two-variable model and the partial F-test for the addition of \(X_1\) to the model, given that \(X_2\) is already in the model.
lapply(modList, sumSq)
## [[1]]
## Df Sum Sq
## Due regression 1 10232
## Residual 23 135145
##
## [[2]]
## Df Sum Sq
## Due regression 1 101933
## Residual 23 43444
##
## [[3]]
## Df Sum Sq
## Due regression 2 102571
## Residual 22 42806
# why am I doing this?..
# Overall:
rsq_ftest(sumSq(modList[[3]]), "ftest")
## F-value P-value
## 2.635782e+01 1.442521e-06
# Partial:
(fval <- (102571 - 101933) / (42806 / 22))
## [1] 0.327898
pf(fval, 1, 22, lower.tail = FALSE)
## [1] 0.5727058
Looks like \(X_1\) isn’t needed.
A better way to do it:
anova(modList[[2]], modList[[3]])
## Analysis of Variance Table
##
## Model 1: choles ~ age
## Model 2: choles ~ weight + age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 23 43444
## 2 22 42806 1 638.15 0.328 0.5727
5. Compute and compare the \(R^2\)-values for each of the three models considered in questions 1 and 2.
sumsq <- lapply(modList, sumSq)
lapply(sumsq, rsq_ftest, "rsq")
## [[1]]
## R squared
## 0.07038062
##
## [[2]]
## R squared
## 0.7011607
##
## [[3]]
## R squared
## 0.7055503
The \(R^2\) of the choles ~ age model is very low, as we already have seen. Models 2 and 3 fit reasonably well, the third one just a little bit better.
6. Based on the results obtained in questions 1-5, what do you consider to be the best predictive model involving either one or both of the independent variables considered? Why?
Age (\(X_2\)) is sufficient, because removal of \(X_1\) does not make two-parameter model fit much worse. We don’t need both parameters.