Earlier in the course we studied the multiple regression relationship of SBP \((Y)\) to AGE \((X_1)\), SMK \((X_2)\) and QUET \((X_3)\) using the data in Homework 1 of Week 2.
That data represents results from an environmental engineering study of a certain chemical reaction. The concentrations of 18 separately prepared solutions were recorded at different times (three measurements at each of six times). The natural logarithms of the concentrations were also computed.
Three regression models will now be considered:
| Model | Independent Variables Used |
|---|---|
| 1 | AGE \((X_1)\) |
| 2 | AGE \((X_1)\), SMK \((X_2)\) |
| 3 | AGE \((X_1)\), SMK \((X_2)\), QUET \((X_3)\) |
First, generate each of the above models.
Then, complete the following:
A. Use model 3 to determine:
What is the predicted SBP for a 50-year-old smoker with a quetelet (QUET) index of 3.5?
What is the predicted SBP for a 50-year-old non-smoker with a quetelet index of 3.5?
For 50-year-old smokers, give an estimate of the change in SBP corresponding to an increase in quetelet index from 3.0 to 3.5.
B. Using the ANOVA tables, compute and compare the \(R^2\)-values for models 1, 2 and 3.
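# Load the Week 2 homework data; the first column holds the row names
data = read.csv("week2-HW-data.csv", header = TRUE, sep = ",", row.names = 1)
attach(data)
# Fit the three nested models for SBP
m1 = lm(SBP ~ AGE)
m2 = lm(SBP ~ AGE + SMK)
m3 = lm(SBP ~ AGE + SMK + QUET)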
Model 3 is given by
\[\hat y = 45.103 + 1.213\times AGE + 9.946 \times SMK + 8.592 \times QUET.\]
summary(m3)
##
## Call:
## lm(formula = SBP ~ AGE + SMK + QUET)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5420 -6.1812 -0.7282 5.2908 15.7050
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.1032 10.7649 4.190 0.000252 ***
## AGE 1.2127 0.3238 3.745 0.000829 ***
## SMK 9.9456 2.6561 3.744 0.000830 ***
## QUET 8.5924 4.4987 1.910 0.066427 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.407 on 28 degrees of freedom
## Multiple R-squared: 0.7609, Adjusted R-squared: 0.7353
## F-statistic: 29.71 on 3 and 28 DF, p-value: 7.602e-09
predict(m3, data.frame(AGE = c(50, 50, 50), SMK = c(1, 0, 1), QUET = c(3.5, 3.5, 3.0)))
## 1 2 3
## 145.7581 135.8125 141.4618
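As a cross-check, the same three predictions can be reproduced by hand from the rounded coefficients in the summary above. This is only a sketch: the names `b`, `newx` and `pred` are introduced here for illustration, and because the coefficients are rounded to four decimals, the results should agree with `predict()` only to about two decimal places.

```r
# Rounded coefficients copied from summary(m3): intercept, AGE, SMK, QUET
b <- c(45.1032, 1.2127, 9.9456, 8.5924)

# One row per hypothetical subject; the leading 1 multiplies the intercept
newx <- rbind(
  smoker_quet35    = c(1, 50, 1, 3.5),
  nonsmoker_quet35 = c(1, 50, 0, 3.5),
  smoker_quet30    = c(1, 50, 1, 3.0)
)

# Predicted SBP = design row times coefficient vector
pred <- drop(newx %*% b)
round(pred, 2)
```

Writing the prediction as a matrix product makes it explicit that SMK enters the model as a 0/1 indicator, so the smoker and non-smoker predictions differ by exactly the SMK coefficient.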
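A. From the predictions above:
The predicted SBP for a 50-year-old smoker with a QUET index of 3.5 is 145.758.
The predicted SBP for a 50-year-old non-smoker with a QUET index of 3.5 is 135.812.
For 50-year-old smokers, the estimated change in SBP when QUET rises from 3.0 to 3.5 can be obtained in two ways, from the QUET coefficient or from the difference of the two predictions: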
round(8.5924*.5, 3)
## [1] 4.296
round(145.7581 - 141.4618, 3)
## [1] 4.296
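The two calculations agree (up to rounding) because AGE and SMK are held fixed, so the intercept, AGE and SMK terms cancel in the difference of the two fitted values and only the QUET term remains. Written out with the rounded QUET coefficient from the summary above, where \(\hat y_{3.5}\) and \(\hat y_{3.0}\) are shorthand introduced here for the two predictions:

\[\hat y_{3.5} - \hat y_{3.0} = 8.5924 \times (3.5 - 3.0) = 8.5924 \times 0.5 \approx 4.296.\]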
B. \(R^2\) values, in percent:
round(100 * summary(m1)$r.squared, 2)
## [1] 60.09
round(100 * summary(m2)$r.squared, 2)
## [1] 72.98
round(100 * summary(m3)$r.squared, 2)
## [1] 76.09
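Part B asks for \(R^2\) from the ANOVA tables rather than from `summary()`. A minimal sketch of that route follows, using R's built-in `mtcars` data as a stand-in because the homework file is not reproduced here; the variables `mpg`, `wt` and `hp` and the object names are illustrative assumptions only. The idea is that \(R^2\) is the regression sum of squares divided by the total sum of squares.

```r
# Stand-in model on built-in data; in the homework this would be m1, m2 or m3
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Sequential ANOVA table: one row per predictor plus a Residuals row
tab <- anova(fit)

sse <- tab["Residuals", "Sum Sq"]   # residual (error) sum of squares
ssy <- sum(tab[, "Sum Sq"])         # total sum of squares about the mean
r2_anova <- (ssy - sse) / ssy

# The two values should coincide
c(anova = r2_anova, summary = summary(fit)$r.squared)
```

Applied to the three SBP models in turn, the same calculation should reproduce the percentages above, and it shows why \(R^2\) cannot decrease when a predictor is added: the total sum of squares is fixed while the residual sum of squares can only shrink or stay the same.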
summary(m1)$fstatistic
## value numdf dendf
## 45.17692 1.00000 30.00000
p1 = pf(45.17692, 1, 30, lower.tail = FALSE); round(p1, 3)
## [1] 0
summary(m2)$fstatistic
## value numdf dendf
## 39.16433 2.00000 29.00000
p2 = pf(39.16433, 2, 29, lower.tail = FALSE); round(p2, 3)
## [1] 0
summary(m3)$fstatistic
## value numdf dendf
## 29.70972 3.00000 28.00000
p3 = pf(29.70972, 3, 28, lower.tail = FALSE); round(p3, 3)
## [1] 0
At the 5% significance level, the overall F-test gives significant evidence to reject the null hypothesis \(H_0\) (all slope coefficients equal zero) for each of the three models, since every p-value is far below 0.05.
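The overall F statistics are tied to the \(R^2\) values through \(F = \frac{R^2/k}{(1-R^2)/(n-k-1)}\), where \(k\) is the number of predictors and \(n = 32\) follows from the degrees of freedom reported above (for model 1, 30 + 1 + 1). The sketch below uses the rounded \(R^2\) values, so it matches the reported statistics only approximately; the variable names are illustrative. Printing with `signif()` also reveals the actual size of the p-values, which rounding to three decimals displays as 0.

```r
r2 <- c(m1 = 0.6009, m2 = 0.7298, m3 = 0.7609)  # rounded R^2 from the output above
k  <- c(1, 2, 3)                                # number of predictors per model
n  <- 32                                        # residual df + k + 1

# Overall F statistic and its upper-tail p-value for each model
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))
p_val  <- pf(f_stat, k, n - k - 1, lower.tail = FALSE)

round(f_stat, 2)
signif(p_val, 3)
```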
detach(data)
The data presents the weight \((X_1)\), age \((X_2)\) and plasma lipid levels of total cholesterol \((Y)\) for a hypothetical sample of 25 patients with hyperlipoproteinemia before drug therapy.
Complete the following six questions:
1. Generate the separate straight-line regressions of \(Y\) on \(X_1\) (model 1) and \(Y\) on \(X_2\) (model 2).
2. Generate the regression model of \(Y\) on both \(X_1\) and \(X_2\).
3. For each of the models in questions 1 and 2, determine the predicted cholesterol level \(Y\) for patient 4 (with \(Y = 263\), \(X_1 = 70\), and \(X_2 = 30\)) and compare these predicted cholesterol levels with the observed value. Comment on your findings in the homework forum.
4. Carry out the overall F-test for the two-variable model and the partial F-test for the addition of \(X_1\) to the model, given that \(X_2\) is already in the model.
5. Compute and compare the \(R^2\)-values for each of the three models considered in questions 1 and 2.
6. Based on the results obtained in questions 1-5, what do you consider to be the best predictive model involving either one or both of the independent variables considered? Why? Discuss your response in the homework forum.
data = read.csv("week5-HW-data.csv", header = TRUE, sep = ",", row.names = 1)
attach(data)
m1 = lm( choles ~ weight); summary(m1)
##
## Call:
## lm(formula = choles ~ weight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -127.729 -53.686 -9.239 46.537 128.404
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 199.298 85.818 2.322 0.0294 *
## weight 1.622 1.229 1.320 0.2000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76.65 on 23 degrees of freedom
## Multiple R-squared: 0.07038, Adjusted R-squared: 0.02996
## F-statistic: 1.741 on 1 and 23 DF, p-value: 0.2
m2 = lm( choles ~ age); summary(m2)
##
## Call:
## lm(formula = choles ~ age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.478 -26.816 -3.854 28.315 90.881
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 102.5751 29.6376 3.461 0.00212 **
## age 5.3207 0.7243 7.346 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.46 on 23 degrees of freedom
## Multiple R-squared: 0.7012, Adjusted R-squared: 0.6882
## F-statistic: 53.96 on 1 and 23 DF, p-value: 1.794e-07
m3 = lm( choles ~ weight + age); summary(m3)
##
## Call:
## lm(formula = choles ~ weight + age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.570 -30.374 -5.449 28.626 89.170
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 77.9825 52.4296 1.487 0.151
## weight 0.4174 0.7288 0.573 0.573
## age 5.2166 0.7572 6.889 6.43e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.11 on 22 degrees of freedom
## Multiple R-squared: 0.7056, Adjusted R-squared: 0.6788
## F-statistic: 26.36 on 2 and 22 DF, p-value: 1.443e-06
Predicted cholesterol for patient 4 (weight 70, age 30) under each model:
predict(m1, data.frame(weight = 70, age = 30))
## 1
## 312.8615
predict(m2, data.frame(weight = 70, age = 30))
## 1
## 262.1954
predict(m3, data.frame(weight = 70, age = 30))
## 1
## 263.6956
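The regression of \(Y\) on both \(X_1\) and \(X_2\) gives the prediction closest to the observed value of 263 (263.70, an error of about 0.70), but only marginally closer than the age-only model (262.20, an error of about 0.80). The weight-only model is far off (312.86, an error of about 49.86).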
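The comparison can be laid out as residuals (observed minus predicted). A small sketch using the three predictions printed above; the names `observed`, `predicted` and `resid4` are illustrative. Note that a single patient is weak evidence for ranking models, so this is a check on one data point rather than a model-selection criterion.

```r
observed  <- 263
predicted <- c(weight_only = 312.8615, age_only = 262.1954, both = 263.6956)

# Residual for patient 4 under each model
resid4 <- observed - predicted
round(resid4, 2)

# Model with the smallest absolute error for this one patient
names(which.min(abs(resid4)))
```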
m4 = lm( choles ~ age + weight); summary(m4)
##
## Call:
## lm(formula = choles ~ age + weight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.570 -30.374 -5.449 28.626 89.170
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 77.9825 52.4296 1.487 0.151
## age 5.2166 0.7572 6.889 6.43e-07 ***
## weight 0.4174 0.7288 0.573 0.573
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.11 on 22 degrees of freedom
## Multiple R-squared: 0.7056, Adjusted R-squared: 0.6788
## F-statistic: 26.36 on 2 and 22 DF, p-value: 1.443e-06
summary(m4)$fstatistic
## value numdf dendf
## 26.35782 2.00000 22.00000
p4 = pf(26.35782, 2, 22, lower.tail = FALSE); round(p4, 3)
## [1] 0
At the 5% significance level, the overall F-test gives significant evidence to reject the null hypothesis \(H_0\) that both slope coefficients are zero: the p-value is far below 0.05, so at least one of weight and age helps to predict cholesterol.
summary(m4)$coefficients[3,4]
## [1] 0.5726623
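At the 5% significance level, there is no significant evidence to reject \(H_0\colon \beta_1 = 0\) (weight adds nothing once age is in the model), because the p-value, 0.573, is well above 0.05. Since only one variable is added, the partial F statistic is the square of the t statistic for weight, so this t-test p-value is also the partial F-test p-value.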
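The link between the t-test and the partial F-test can be checked numerically. The sketch below rebuilds the t statistic for weight from the rounded estimate and standard error in the summary above, squares it, and evaluates it on 1 and 22 degrees of freedom; the variable names are illustrative, and the result matches the reported p-value only up to rounding. With the fitted objects available, `anova(m2, m3)` would be the more direct route to the same test.

```r
# Rounded estimate and standard error for weight from summary(m4)
t_weight <- 0.4174 / 0.7288

# With one added variable, partial F = t^2 on (1, residual df) degrees of freedom
f_partial <- t_weight^2
p_partial <- pf(f_partial, df1 = 1, df2 = 22, lower.tail = FALSE)

# Two-sided t-test p-value on the same residual degrees of freedom
p_t <- 2 * pt(-abs(t_weight), df = 22)

round(c(F = f_partial, p_F = p_partial, p_t = p_t), 3)
```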
round(100 * summary(m1)$r.squared, 2)
## [1] 7.04
round(100 * summary(m2)$r.squared, 2)
## [1] 70.12
round(100 * summary(m3)$r.squared, 2)
## [1] 70.56
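Because \(R^2\) never decreases when a predictor is added, the adjusted \(R^2\) already printed in the summaries is a fairer basis for comparing models of different size. As a hedged sketch, it can be recomputed from the rounded \(R^2\) values with \(1 - (1 - R^2)\frac{n-1}{n-k-1}\), where \(n = 25\) patients and \(k\) is the number of predictors; the variable names are illustrative.

```r
r2 <- c(weight = 0.07038, age = 0.7012, both = 0.7056)  # rounded R^2 from the summaries
k  <- c(1, 1, 2)                                        # predictors in each model
n  <- 25                                                # number of patients

# Adjusted R^2 penalises the extra predictor in the two-variable model
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
round(adj_r2, 4)
```

On this measure the age-only model comes out slightly ahead of the two-variable model, which is consistent with the non-significant partial test for weight.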
detach(data)