Exercise One: Blood Pressure

Download and read data

url <- "https://d396qusza40orc.cloudfront.net/appliedregression/Homeworks/week2-HW-data.csv"
download.file(url, "sbp.csv", mode = "wb")
bp <- read.csv("sbp.csv")
names(bp) <- tolower(names(bp))
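
A quick look at the data never hurts (optional; the column names are assumed from their use below):

str(bp)   # expect sbp, age, smk and quet among the columns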

Three regression models will now be considered:

Model   Independent Variables Used
1       AGE \((X_1)\)
2       AGE \((X_1)\), SMK \((X_2)\)
3       AGE \((X_1)\), SMK \((X_2)\), QUET \((X_3)\)

First, use your computer to generate each of the above models.

summary(fit1 <- lm(sbp ~ age, bp))
## 
## Call:
## lm(formula = sbp ~ age, data = bp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.548  -6.990  -2.481   5.765  23.892 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  59.0916    12.8163   4.611 6.98e-05 ***
## age           1.6045     0.2387   6.721 1.89e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.245 on 30 degrees of freedom
## Multiple R-squared:  0.6009, Adjusted R-squared:  0.5876 
## F-statistic: 45.18 on 1 and 30 DF,  p-value: 1.894e-07
summary(fit2 <- update(fit1, . ~ . + smk))
## 
## Call:
## lm(formula = sbp ~ age + smk, data = bp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.639  -5.518  -1.637   4.900  19.616 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  48.0496    11.1296   4.317 0.000168 ***
## age           1.7092     0.2018   8.471 2.47e-09 ***
## smk          10.2944     2.7681   3.719 0.000853 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.738 on 29 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7112 
## F-statistic: 39.16 on 2 and 29 DF,  p-value: 5.746e-09
summary(fit3 <- update(fit2, . ~ . + quet))
## 
## Call:
## lm(formula = sbp ~ age + smk + quet, data = bp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5420  -6.1812  -0.7282   5.2908  15.7050 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.1032    10.7649   4.190 0.000252 ***
## age           1.2127     0.3238   3.745 0.000829 ***
## smk           9.9456     2.6561   3.744 0.000830 ***
## quet          8.5924     4.4987   1.910 0.066427 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.407 on 28 degrees of freedom
## Multiple R-squared:  0.7609, Adjusted R-squared:  0.7353 
## F-statistic: 29.71 on 3 and 28 DF,  p-value: 7.602e-09

Then, complete the following:

A. Use model 3 to determine:

1. What is the predicted SBP for a 50-year-old smoker with a quetelet (QUET) index of 3.5?

ndata <- data.frame(age = 50, smk = 1, quet = 3.5)
predict(fit3, ndata, type="response")
##        1 
## 145.7581
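
Equivalently, by hand from the coefficients reported above (a quick sanity check):

# intercept + beta1 * age + beta2 * smk + beta3 * quet
sum(coef(fit3) * c(1, 50, 1, 3.5))   # reproduces 145.7581, up to rounding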

2. What is the predicted SBP for a 50-year-old non-smoker with a quetelet index of 3.5?

ndata <- data.frame(age = 50, smk = 0, quet = 3.5)
predict(fit3, ndata, type="response")
##        1 
## 135.8125

Really? That much lower?

As we may remember from homework 2, the effect of smoking status on SBP estimated on its own was quite small. Here, however, the effect is estimated after removing the variance due to age, and it turns out to be statistically significant after all.
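
As a quick check, one could refit SBP on smoking status alone and compare that coefficient with the adjusted estimate above (a minimal sketch on the same bp data; output not shown):

summary(lm(sbp ~ smk, bp))$coefficients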

3. For 50-year-old smokers, give an estimate of the change in SBP corresponding to an increase in quetelet index from 3.0 to 3.5.

This is just \(\frac{\beta_3}{2}\): \(\beta_3\) is the change in SBP per unit of QUET (holding AGE and SMK fixed), so an increase of 0.5 units corresponds to \(0.5\,\beta_3\).

(beta3 <- coef(fit3)["quet"])   # the beta3 coefficient from the model
##     quet 
## 8.592449
beta3 / 2
##     quet 
## 4.296224
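
The same number falls out of predict() by differencing fitted values at QUET = 3.0 and 3.5, with age and smoking status held fixed (an optional check):

nd <- data.frame(age = 50, smk = 1, quet = c(3.0, 3.5))
diff(predict(fit3, nd))   # should equal beta3 / 2 = 4.296224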
B. Using the ANOVA tables, compute and compare the \(R^2\)-values for models 1, 2, and 3.

First, write a couple of custom functions to do so.

sumSq <- function(x) {
  # Extract sums of squares and DFs from an 'lm' object
  x <- anova(x)          # make the ANOVA table
  sumsq <- x[c(1, 2)]    # keep the Df and Sum Sq columns
  n <- nrow(sumsq)
  # add up the due-regression SS and DFs across all predictor rows
  x <- rbind(apply(as.matrix(sumsq[-n, ]), 2, sum), sumsq[n, ])
  row.names(x) <- c("Due regression", "Residual")
  x
}


rsq_ftest <- function(x, type) {
  # Compute R squared or perform an F-test from a 2x2 matrix
  # with DFs in the first column and SS in the second column
  switch(type,
         rsq = {
           # Due regression SS / Total SS
           structure(x[1, 2] / (x[1, 2] + x[2, 2]),
                     names = "R squared")
         },
         ftest = {
           # due-regression and residual mean squares:
           meansq <- x[, 2] / x[, 1]
           # the F-value:
           fval <- meansq[1] / meansq[2]
           # the P-value, comparing to F(k, n - k - 1):
           pval <- pf(fval, x[1, 1], x[2, 1], lower.tail = FALSE)
           structure(c(fval, pval),
                     names = c("F-value", "P-value"))
         })
}

Now put them to work:

(sumsq <- lapply(list(fit1,fit2,fit3), sumSq))
## [[1]]
##                Df Sum Sq
## Due regression  1 3861.6
## Residual       30 2564.3
## 
## [[2]]
##                Df Sum Sq
## Due regression  2 4689.7
## Residual       29 1736.3
## 
## [[3]]
##                Df Sum Sq
## Due regression  3 4889.8
## Residual       28 1536.1
lapply(sumsq, rsq_ftest, type = "rsq")
## [[1]]
## R squared 
## 0.6009414 
## 
## [[2]]
## R squared 
## 0.7298019 
## 
## [[3]]
## R squared 
## 0.7609476
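
These agree with the Multiple R-squared values in the summary() output above; a one-line sanity check:

sapply(list(fit1, fit2, fit3), function(m) summary(m)$r.squared)   # 0.6009 0.7298 0.7609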
C. Conduct (separately) the overall F-tests for significant regression under models 1, 2, and 3. Be sure to state your null hypothesis for each model in terms of regression coefficients.

(sbp ~ age)
\(H_0: {\beta_1} = 0\)

rsq_ftest(sumsq[[1]], type = "ftest")
##      F-value      P-value 
## 4.517692e+01 1.893578e-07

Reject the null hypothesis

(sbp ~ age + smk)
\(H_0: {\beta_1} = {\beta_2}= 0\)

rsq_ftest(sumsq[[2]], "ftest")
##      F-value      P-value 
## 3.916433e+01 5.746364e-09

Reject the null hypothesis

(sbp ~ age + smk + quet)
\(H_0: {\beta_1} = {\beta_2} = {\beta_3} = 0\)

rsq_ftest(sumsq[[3]], "ftest")
##      F-value      P-value 
## 2.970972e+01 7.602274e-09

Reject again.
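
These match the F-statistics reported by summary() for each fit; for model 3, for instance:

summary(fit3)$fstatistic   # 29.71 on 3 and 28 DF, as in the summary above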

Exercise Two: Cholesterol Levels

Download and read data

url <- "https://d396qusza40orc.cloudfront.net/appliedregression/Homeworks/week5-HW-data.csv"
download.file(url, "chol.csv", mode = "wb")
chol <- read.csv("chol.csv")

1. Generate the separate straight-line regressions of \(Y\) on \(X_1\) (model 1) and \(Y\) on \(X_2\) (model 2).
2. Generate the regression model of \(Y\) on both \(X_1\) and \(X_2\).

f <- c("choles ~ weight", "choles ~ age", "choles ~ weight + age")
form <- lapply(f, as.formula, env = new.env())
modList <- lapply(form, function(x) lm(x, chol))

3. For each of the models in questions 1 and 2, determine the predicted cholesterol level \(Y\) for patient 4 (with \(Y\) = 263, \(X_1\) = 70, and \(X_2\) = 30) and compare these predicted cholesterol levels with the observed value.

ndata <- data.frame(weight = 70, age = 30)

library(pryr)
# fill in some predict() arguments
pred <- partial(predict, newdata = ndata, type = "response")
# make a prediction
lapply(modList, pred)
## [[1]]
##        1 
## 312.8615 
## 
## [[2]]
##        1 
## 262.1954 
## 
## [[3]]
##        1 
## 263.6956
# compare with the observed value for patient 4
chol[4, "choles"]
## [1] 263

Looks like the first model (choles ~ weight) isn’t particularly good at predictions. We need to know more than just the weight.

summary(modList[[1]])$r.sq
## [1] 0.07038062

4. Carry out the overall F-test for the two-variable model and the partial F-test for the addition of \(X_1\) to the model, given that \(X_2\) is already in the model.
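
Recall the form of the partial F statistic for adding a single variable to a model that already contains the others (the numerator has one degree of freedom):

\[
F = \frac{SS_{reg}(\text{full}) - SS_{reg}(\text{reduced})}{MS_{res}(\text{full})} \sim F(1,\ n - k - 1),
\]

where \(k\) is the number of predictors in the full model; here \(n = 25\) and \(k = 2\), so the reference distribution is \(F(1, 22)\).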

lapply(modList, sumSq)
## [[1]]
##                Df Sum Sq
## Due regression  1  10232
## Residual       23 135145
## 
## [[2]]
##                Df Sum Sq
## Due regression  1 101933
## Residual       23  43444
## 
## [[3]]
##                Df Sum Sq
## Due regression  2 102571
## Residual       22  42806
# the sums of squares above feed the F-tests:
# Overall F-test for the two-variable model:
rsq_ftest(sumSq(modList[[3]]), "ftest")
##      F-value      P-value 
## 2.635782e+01 1.442521e-06
# Partial F-test for adding weight (X1), given age (X2):
(fval <- (102571 - 101933) / (42806 / 22))
## [1] 0.327898
pf(fval, 1, 22, lower.tail = FALSE)
## [1] 0.5727058

Looks like \(X_1\) isn’t needed.
A better way to run the same test:

anova(modList[[2]], modList[[3]])
## Analysis of Variance Table
## 
## Model 1: choles ~ age
## Model 2: choles ~ weight + age
##   Res.Df   RSS Df Sum of Sq     F Pr(>F)
## 1     23 43444                          
## 2     22 42806  1    638.15 0.328 0.5727

5. Compute and compare the \(R^2\)-values for each of the three models considered in questions 1 and 2.

sumsq <- lapply(modList, sumSq)
lapply(sumsq, rsq_ftest, "rsq")
## [[1]]
##  R squared 
## 0.07038062 
## 
## [[2]]
## R squared 
## 0.7011607 
## 
## [[3]]
## R squared 
## 0.7055503

The \(R^2\) of the choles ~ weight model is very low, as we have already seen. Models 2 and 3 fit reasonably well, the third one just a little bit better.
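
Since model 3 buys so little over model 2, the adjusted \(R^2\), which penalizes the extra parameter, is worth a glance (a quick sketch; output omitted):

sapply(modList, function(m) summary(m)$adj.r.squared)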

6. Based on the results obtained in questions 1-5, what do you consider to be the best predictive model involving either one or both of the independent variables considered? Why?

Age (\(X_2\)) alone is sufficient: dropping \(X_1\) from the two-variable model barely worsens the fit, and the partial F-test for adding \(X_1\) given \(X_2\) is far from significant. We don’t need both predictors.