Applied Regression Analysis

Exercise One

Earlier in the course we studied the multiple regression relationship of SBP (Y) to AGE (X1), SMK (X2), and QUET (X3) using the data in Homework 1 of Week 2.

If you need to refer to the dataset from Week 2 homework, you can download the CSV file.

Three regression models will now be considered:

hwpic

First, use your computer to generate each of the above models.

url <- "https://d396qusza40orc.cloudfront.net/appliedregression/Homeworks/week2-HW-data.csv"
#download.file(url, "dataw2.csv")
data <- read.csv("dataw2.csv")
head(data)

##   Person SBP  QUET AGE SMK
## 1      1 135 2.876  45   0
## 2      2 122 3.251  41   0
## 3      3 130 3.100  49   0
## 4      4 148 3.768  52   0
## 5      5 146 2.979  54   1
## 6      6 129 2.790  47   1

nms <- names(data)
(nms <- tolower(nms))

## [1] "person" "sbp"    "quet"   "age"    "smk"

names(data) <- nms

(lm1 <- lm(sbp ~ age, data = data))

## 
## Call:
## lm(formula = sbp ~ age, data = data)
## 
## Coefficients:
## (Intercept)          age  
##      59.092        1.605

(lm2 <- lm(sbp ~ age + smk, data = data))

## 
## Call:
## lm(formula = sbp ~ age + smk, data = data)
## 
## Coefficients:
## (Intercept)          age          smk  
##      48.050        1.709       10.294

(lm3 <- lm(sbp ~ age + smk + quet, data = data))

## 
## Call:
## lm(formula = sbp ~ age + smk + quet, data = data)
## 
## Coefficients:
## (Intercept)          age          smk         quet  
##      45.103        1.213        9.946        8.592

Then, complete the following:

A. Use model 3 to determine:
1. What is the predicted SBP for a 50-year old smoker with a quetelet (QUET) index of 3.5?

(sbp_s50 <- predict(lm3, newdata = data.frame(age = 50, smk = 1, quet = 3.5)))

##        1 
## 145.7581

What is the predicted SBP for a 50-year-old non-smoker with a quetelet index of 3.5?

(sbp_ns50 <- predict(lm3, data.frame(age = 50, smk = 0, quet = 3.5)))

##        1 
## 135.8125

For 50-year-old smokers, give an estimate of the change in SBP corresponding to an increase in quetelet index from 3.0 to 3.5.

(higherQ <- predict(lm3, data.frame(age = 50, smk = 1, quet = 3.5)))

##        1 
## 145.7581

(lowerQ <- predict(lm3, data.frame(age = 50, smk = 1, quet = 3.0)))

##        1 
## 141.4618

(change_sbp <- higherQ - lowerQ)

##        1 
## 4.296224

Increase in quetelet from 3.0 to 3.5 corresponds with an increase in systolic blood pressure of 4.2962243 units(mmHg) on average.

B.Using the ANOVA tables, compute and compare the R2-values for models 1,2, and 3.
To calculate \(R^2\) and adjusted \(R^2\) from anova table we use formulas: \[R^2 = \dfrac{SST - SSE}{SST}\] \[adjusted\ R^2 = R^2 - (1 - R^2)\dfrac{k}{n - k - 1}\]
where: n = number of observations, k = number of predictors in the model, R = correlation coefficient

(s1 <- summary(lm1)$r.squared)

## [1] 0.6009414

(anova1 <- anova(lm1))

## Analysis of Variance Table
## 
## Response: sbp
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## age        1 3861.6  3861.6  45.177 1.894e-07 ***
## Residuals 30 2564.3    85.5                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(rsq1 <- anova1[1, 2] / sum(anova1[, 2]))

## [1] 0.6009414

(s2 <- summary(lm2)$adj.r.squared)

## [1] 0.7111676

(n <- nrow(data))

## [1] 32

(k2 <- length(lm2$coef)-1)

## [1] 2

((anova2 <- anova(lm2)))

## Analysis of Variance Table
## 
## Response: sbp
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## age        1 3861.6  3861.6  64.498 7.407e-09 ***
## smk        1  828.1   828.1  13.830 0.0008532 ***
## Residuals 29 1736.3    59.9                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(rsq2 <- sum(anova2[1:2, 2]) / sum(anova2[, 2]))

## [1] 0.7298019

(adj.rsq2 <- rsq2 - (1 - rsq2) *( k2 / (n - k2 - 1)))

## [1] 0.7111676

(s3 <- summary(lm3)$adj.r.squared)

## [1] 0.7353349

(k3 <- length(lm3$coef) -1)

## [1] 3

(anova3 <- anova(lm3))

## Analysis of Variance Table
## 
## Response: sbp
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## age        1 3861.6  3861.6 70.3877 3.987e-09 ***
## smk        1  828.1   828.1 15.0933 0.0005715 ***
## quet       1  200.1   200.1  3.6481 0.0664268 .  
## Residuals 28 1536.1    54.9                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(SST <- sum(anova3[, 2]))

## [1] 6425.969

(SSE <- sum(anova3[k3 + 1, 2]))

## [1] 1536.143

(rsq3 <- (SST - SSE) / SST)

## [1] 0.7609476

(adj.rsq3 <- rsq3 - (1 - rsq3) * k3 / (n - k3 - 1))

## [1] 0.7353349

identical(round(s3, 7), round(adj.rsq3, 7))

## [1] TRUE

C: Conduct (separately) the overall \(F\) tests for significant regression under models 1,2, and 3. Be sure to state your null hypothesis for each model in terms of regression coefficients.

(MSreg1 <- anova1[1, 3])

## [1] 3861.63

(MSE1 <- anova1[2, 3])

## [1] 85.47795

(Fstatistic1 <- MSreg1 / MSE1)

## [1] 45.17692

(pValue <- pf(Fstatistic1, df1 = anova1[1, 1], df2 = anova1[2, 1], lower.tail = FALSE))

## [1] 1.893578e-07

Overall F-test tests whether the slopes for the predictors in a regression are simultaneously equal to zero. In model 1 we have only one predictor, so the overall F - test is only test whether the slope of age is zero.
Models 2 and 3 have more than one predictor, so the null hypothesis of the F - test states that the slopes for the coefficients are simultaneously equal to zero, i.e. that \(\beta_{age}=\beta_{smk}=\beta_{quet}=0\)

(MSreg2 <- sum(anova2[1:k2, 2]) / sum(anova2[1:k2, 1]))

## [1] 2344.842

(MSE2 <- anova2[k2 + 1, 3])

## [1] 59.87188

(Fstatistic2 <- MSreg2 / MSE2)

## [1] 39.16433

(pValue2 <- pf(Fstatistic2, df1 = sum(anova2[1:k2, 1]), df2 = anova2[k2 + 1, 1], lower.tail = FALSE))

## [1] 5.746364e-09

(MSreg3 <- sum(anova3[1:k3, 2]) / sum(anova3[1:k3, 1]))

## [1] 1629.942

(MSE3 <- anova3[k3 + 1, 3])

## [1] 54.86225

(Fstatistic3 <- MSreg3 / MSE3)

## [1] 29.70972

(pValue3 <- pf(Fstatistic3, df1 = sum(anova3[1:k3, 1]), df2 = anova3[k3 + 1, 1], lower.tail = FALSE))

## [1] 7.602274e-09

Alternative way of obtaining F statistics is from the summary output of the linear regression model:

names(summary(lm1))

##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"

summary(lm1)$fstatistic

##    value    numdf    dendf 
## 45.17692  1.00000 30.00000

summary(lm2)$fstatistic

##    value    numdf    dendf 
## 39.16433  2.00000 29.00000

summary(lm3)$fstatistic

##    value    numdf    dendf 
## 29.70972  3.00000 28.00000

Exercise Two

The accompanying table presents the weight (X1), age (X2) and plasma lipid levels of total cholesterol (Y) for a hypothetical sample of 25 patients with hyperlipoproteinemia before drug therapy.

url <- "https://d396qusza40orc.cloudfront.net/appliedregression/Homeworks/week5-HW-data.csv"
download.file(url, "cholesterol.csv")
data <- read.csv("cholesterol.csv")
head(data)

##   Patient choles weight age
## 1       1    354     84  46
## 2       2    190     73  20
## 3       3    405     65  52
## 4       4    263     70  30
## 5       5    451     76  57
## 6       6    302     69  25

dim(data)

## [1] 25  4

Complete the following six questions:

Generate the separate straight-line regressions of Y on X1 (model 1) and Y on X2 (model 2).

(lm1 <- lm(choles ~ weight, data = data))

## 
## Call:
## lm(formula = choles ~ weight, data = data)
## 
## Coefficients:
## (Intercept)       weight  
##     199.298        1.622

(lm2 <- lm(choles ~ age, data = data))

## 
## Call:
## lm(formula = choles ~ age, data = data)
## 
## Coefficients:
## (Intercept)          age  
##     102.575        5.321

Generate the regression model of Y on both X1 and X2

(lm3 <- lm(choles ~ weight + age, data = data))

## 
## Call:
## lm(formula = choles ~ weight + age, data = data)
## 
## Coefficients:
## (Intercept)       weight          age  
##     77.9825       0.4174       5.2166

For each of the models in questions 1 and 2, determine the predicted cholesterol level Y for patient 4 (with Y = 263, X1 = 70, and X2 = 30) and compare these predicted cholesterol levels with the observed value. Comment on your findings in the homework forum.

nd <- data.frame(weight = 70, age = 30)
(patient4_1 <- predict(lm1, newdata = nd))

##        1 
## 312.8615

(patient4_2 <- predict(lm2, newdata = nd))

##        1 
## 262.1954

(patient4_3 <- predict(lm3, newdata = nd))

##        1 
## 263.6956

Model that has only weight for a predictor has the worse results: it predicts 312.8614924 while the observed value of cholesterol is 263. Models 2 and 3 have almost the same precision, although model 3 is the closest to the observed value, predicts 263.6956098, while model 2 with predicted 262.1954319 is the only one that (slightly) underestimates the level of the cholesterol.

Carry out the overall F-test for the two-variable model and the partial F-test for the addition of X1 to the model, given that X2 is already in the model.

#overall F test 
(anova3 <- anova(lm3))

## Analysis of Variance Table
## 
## Response: choles
##           Df Sum Sq Mean Sq F value   Pr(>F)    
## weight     1  10232   10232  5.2585  0.03177 *  
## age        1  92339   92339 47.4571 6.43e-07 ***
## Residuals 22  42806    1946                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(MSreg3 <- sum(anova3[1:2, 2]) / sum(anova3[1:2, 1]))

## [1] 51285.41

(MSE3 <- anova3[3, 3])

## [1] 1945.738

(Fstatistic3 <- MSreg3 / MSE3)

## [1] 26.35782

(pValue3 <- pf(Fstatistic1, df1 = sum(anova3[1:2, 1]), df2 = anova3[3, 1], lower.tail = FALSE))

## [1] 1.622491e-08

#partial F test for the addition of weight given that age is already in the model
(SSboth <- sum(anova3[1:2, 2]))

## [1] 102570.8

(MSEboth <- anova3[3, 3])

## [1] 1945.738

(anova2 <- anova(lm2))

## Analysis of Variance Table
## 
## Response: choles
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## age        1 101933  101933  53.964 1.794e-07 ***
## Residuals 23  43444    1889                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(SSage <- anova2[1, 2])

## [1] 101932.7

(diffSS <- SSboth - SSage)

## [1] 638.1489

(partialF <- diffSS / MSEboth)

## [1] 0.3279728

(n <- nrow(data))

## [1] 25

(k <- 1)

## [1] 1

(partial.pValue <- pf(partialF, df1 = 1, df2 = n - k - 2, lower.tail = FALSE))

## [1] 0.5726623

partial F - test: \[H_0 = addition of variable weight doesn't improve predictive value of the model\] \[H_A = addition of variable weight improves prediction of model\] p - value of the partial F test says that addition of variable weight to the model that already contains age doesn’t improve the value of prediction and the overall F - test rejects that both slope coefficients are equal to zero

Compute and compare the \(R^2\)-values for each of the three models considered in questions 1 and 2.

(s1 <- summary(lm1)$r.squared)

## [1] 0.07038062

(anova1 <- anova(lm1))

## Analysis of Variance Table
## 
## Response: choles
##           Df Sum Sq Mean Sq F value Pr(>F)
## weight     1  10232 10231.7  1.7413    0.2
## Residuals 23 135145  5875.9

(SST1 <- sum(anova1[, 2]))

## [1] 145377

(SSE1 <- anova1[2, 2])

## [1] 135145.3

(SSR1 <- SST1 - SSE1)

## [1] 10231.73

(rsq1 <- SSR1 / SST1)

## [1] 0.07038062

(s2 <- summary(lm2)$r.squared)

## [1] 0.7011607

(anova2 <- anova(lm2))

## Analysis of Variance Table
## 
## Response: choles
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## age        1 101933  101933  53.964 1.794e-07 ***
## Residuals 23  43444    1889                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(SST2 <- sum(anova2[, 2]))

## [1] 145377

(SSE2 <- anova2[2, 2])

## [1] 43444.37

(SSR2 <- anova2[1, 2])

## [1] 101932.7

(rsq2 <- SSR2 / SST2)

## [1] 0.7011607

(s3 <- summary(lm3)$adj.r.squared)

## [1] 0.6787821

anova3

## Analysis of Variance Table
## 
## Response: choles
##           Df Sum Sq Mean Sq F value   Pr(>F)    
## weight     1  10232   10232  5.2585  0.03177 *  
## age        1  92339   92339 47.4571 6.43e-07 ***
## Residuals 22  42806    1946                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(SST3 <- sum(anova3[, 2]))

## [1] 145377

(SSE3 <- anova3[3, 2])

## [1] 42806.23

(SSR3 <- sum(anova3[1:2, 2]))

## [1] 102570.8

(rsq3 <- SSR3 / SST3)

## [1] 0.7055503

k <- 2
(adj.rsq3 <- rsq3 - (1 - rsq3) * (k / (n - k - 1)))

## [1] 0.6787821

Based on the results obtained in questions 1-5, what do you consider to be the best predictive model involving either one or both of the independent variables considered? Why? Discuss your response in the homework forum.

Predictive model has to be the parsimonious model: model with the minimum number of predictors, in case when we have at least two models with the same explanatory value and the only thing they differ is the number of predictors. In this case, where we have a model with 2 explanatory variables its adjusted \(R^2\) is less than the model with only age as predictor. Therefore, I’d say that the second model cholesterol ~ age is the best predictive model. It explains 70.116069 % of variability in cholesterol level, while the two-predictor model explains 67.8782149 %.

Applied Regression Analysis_HW5

Exercise One

Exercise Two