Homework 5 - Applied Regression Analysis

Exercise One:

Earlier in the course we studied the multiple regression relationship of SBP \((Y)\) to AGE \((X_1)\), SMK \((X_2)\) and QUET \((X_3)\) using the data in Homework 1 of Week 2.

That data represents results from an environmental engineering study of a certain chemical reaction. The concentrations of 18 separately prepared solutions were recorded at different times (three measurements at each of six times). The natural logarithms of the concentrations were also computed.

Three regression models will now be considered:

Model	Independent Variables Used
1	AGE \((X_1)\)
2	AGE \((X_1)\), SMK \((X_2)\)
3	AGE \((X_1)\), SMK \((X_2)\), QUET \((X_3)\)

First, generate each of the above models.

Then, complete the following:

A. Use model 3 to determine:

1. What is the predicted SBP for a 50-year-old smoker with a Quetelet (QUET) index of 3.5?
2. What is the predicted SBP for a 50-year-old non-smoker with a Quetelet index of 3.5?
3. For 50-year-old smokers, give an estimate of the change in SBP corresponding to an increase in Quetelet index from 3.0 to 3.5.

B. Using the ANOVA tables, compute and compare the \(R^2\)-values for models 1, 2 and 3.

Conduct (separately) the overall F-tests for significant regression under models 1, 2 and 3. Be sure to state your null hypothesis for each model in terms of regression coefficients.

Solution

data = read.csv("week2-HW-data.csv", header = TRUE, sep = ",", row.names = 1)
attach(data)

m1 = lm( SBP ~ AGE)
m2 = lm( SBP ~ AGE + SMK)
m3 = lm( SBP ~ AGE + SMK + QUET)

Model 3 is given by

\[\hat y = 45.103 + 1.213\times AGE + 9.946 \times SMK + 8.592 \times QUET.\]

summary(m3)

## 
## Call:
## lm(formula = SBP ~ AGE + SMK + QUET)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5420  -6.1812  -0.7282   5.2908  15.7050 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.1032    10.7649   4.190 0.000252 ***
## AGE           1.2127     0.3238   3.745 0.000829 ***
## SMK           9.9456     2.6561   3.744 0.000830 ***
## QUET          8.5924     4.4987   1.910 0.066427 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.407 on 28 degrees of freedom
## Multiple R-squared:  0.7609, Adjusted R-squared:  0.7353 
## F-statistic: 29.71 on 3 and 28 DF,  p-value: 7.602e-09

predict(m3, data.frame(cbind(AGE = c(50,50,50), SMK = c(1,0,1), QUET = c(3.5,3.5,3.0))))

##        1        2        3 
## 145.7581 135.8125 141.4618

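1. Predicted SBP for the 50-year-old smoker with QUET = 3.5: 145.758
2. Predicted SBP for the 50-year-old non-smoker with QUET = 3.5: 135.812
3. The estimated change in SBP for 50-year-old smokers when QUET rises from 3.0 to 3.5 can be obtained in two ways, either as half the QUET coefficient or as the difference between the two predictions: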

round(8.5924*.5, 3)

## [1] 4.296

round(145.7581 - 141.4618, 3)

## [1] 4.296
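The two approaches agree because, with AGE and SMK held fixed, their terms cancel in the difference of predictions and only the QUET term remains. Written out with the QUET coefficient reported by summary(m3) (the subscripts on \(\hat y\) are just shorthand for the QUET value used):

\[\hat y_{3.5} - \hat y_{3.0} = \hat\beta_3\,(3.5 - 3.0) = 8.5924 \times 0.5 = 4.2962.\]

Any small discrepancy between the two printed values would come only from rounding of the inputs.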

B. \(R^2\) as a percentage:

round(100*summary(m1)$r.square, 2)

## [1] 60.09

round(100*summary(m2)$r.square, 2)

## [1] 72.98

round(100*summary(m3)$r.square, 2)

## [1] 76.09
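Since the exercise asks for \(R^2\) from the ANOVA tables, the following is a minimal sketch of how that calculation could look. It uses the built-in mtcars data as a stand-in (the homework CSV is not reproduced here), so the variables mpg, wt and hp are assumptions for illustration only and are not part of the homework.

```r
# Stand-in model on a built-in dataset (not the homework data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# anova() lists one sum of squares per term plus the residual row (last)
tab <- anova(fit)
ss <- tab[["Sum Sq"]]

ss_resid <- ss[length(ss)]   # residual sum of squares (SSE)
ss_total <- sum(ss)          # total sum of squares (SSY)

# R^2 = (SSY - SSE) / SSY
r2_anova <- (ss_total - ss_resid) / ss_total

# Should agree with the value reported by summary()
all.equal(r2_anova, summary(fit)$r.squared)
```

For the homework models the same steps would apply to anova(m1), anova(m2) and anova(m3). The values above rise from 60.09% to 72.98% to 76.09% as SMK and then QUET are added, with the larger gain coming from SMK.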

Overall F-tests:

\(H_0: \beta_1 = 0\)

summary(m1)$fstatistic

##    value    numdf    dendf 
## 45.17692  1.00000 30.00000

p1 = pf(45.17692, 1, 30, lower.tail = F); round(p1,3)

## [1] 0

\(H_0: \beta_1 = \beta_2 = 0\)

summary(m2)$fstatistic

##    value    numdf    dendf 
## 39.16433  2.00000 29.00000

p2 = pf(39.16433, 2, 29, lower.tail = F); round(p2,3)

## [1] 0

\(H_0: \beta_1 = \beta_2 = \beta_3 = 0\)

summary(m3)$fstatistic

##    value    numdf    dendf 
## 29.70972  3.00000 28.00000

p3 = pf(29.70972, 3, 28, lower.tail = F); round(p3,3)

## [1] 0

At the 5% significance level, there is significant evidence to reject the null hypothesis (\(H_0\)) in each case: every p-value is far below 0.05 (they print as 0 only because of rounding to three decimals), so each model explains a significant share of the variation in SBP.
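The overall F statistic can also be cross-checked from \(R^2\) alone, using \(F = \frac{R^2/k}{(1-R^2)/(n-k-1)}\). The sketch below does this for model 3 with the rounded values printed by summary(m3); the variable names (r2, k, df_resid, f_stat, p_value) are introduced here for illustration.

```r
# Values taken from the printed summary(m3) output (rounded)
r2       <- 0.7609   # Multiple R-squared
k        <- 3        # number of predictors (AGE, SMK, QUET)
df_resid <- 28       # residual degrees of freedom, n - k - 1

# Overall F statistic expressed through R^2
f_stat <- (r2 / k) / ((1 - r2) / df_resid)

# Upper-tail p-value of the F distribution
p_value <- pf(f_stat, k, df_resid, lower.tail = FALSE)

round(f_stat, 2)
signif(p_value, 3)   # signif() avoids the uninformative 0 that round(p, 3) gives
```

Because the inputs are rounded, the result should match the reported F-statistic of 29.71 only approximately.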

detach(data)

Exercise Two:

The data presents the weight \((X_1)\), age \((X_2)\) and plasma lipid levels of total cholesterol \((Y)\) for a hypothetical sample of 25 patients with hyperlipoproteinemia before drug therapy.

Complete the following six questions:

1. Generate the separate straight-line regressions of \(Y\) on \(X_1\) (model 1) and \(Y\) on \(X_2\) (model 2).
2. Generate the regression model of \(Y\) on both \(X_1\) and \(X_2\).
3. For each of the models in questions 1 and 2, determine the predicted cholesterol level \(Y\) for patient 4 (with \(Y = 263\), \(X_1 = 70\), and \(X_2 = 30\)) and compare these predicted cholesterol levels with the observed value. Comment on your findings in the homework forum.
4. Carry out the overall F-test for the two-variable model and the partial F-test for the addition of \(X_1\) to the model, given that \(X_2\) is already in the model.
5. Compute and compare the \(R^2\)-values for each of the three models considered in questions 1 and 2.
6. Based on the results obtained in questions 1-5, what do you consider to be the best predictive model involving either one or both of the independent variables considered? Why? Discuss your response in the homework forum.

Solution

data = read.csv("week5-HW-data.csv", header = TRUE, sep = ",", row.names = 1)
attach(data)

Generate the separate straight-line regressions of \(Y\) on \(X_1\) (model 1) and \(Y\) on \(X_2\) (model 2):

m1 = lm( choles ~ weight); summary(m1)

## 
## Call:
## lm(formula = choles ~ weight)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -127.729  -53.686   -9.239   46.537  128.404 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  199.298     85.818   2.322   0.0294 *
## weight         1.622      1.229   1.320   0.2000  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76.65 on 23 degrees of freedom
## Multiple R-squared:  0.07038,    Adjusted R-squared:  0.02996 
## F-statistic: 1.741 on 1 and 23 DF,  p-value: 0.2

m2 = lm( choles ~ age); summary(m2)

## 
## Call:
## lm(formula = choles ~ age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -63.478 -26.816  -3.854  28.315  90.881 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 102.5751    29.6376   3.461  0.00212 ** 
## age           5.3207     0.7243   7.346 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.46 on 23 degrees of freedom
## Multiple R-squared:  0.7012, Adjusted R-squared:  0.6882 
## F-statistic: 53.96 on 1 and 23 DF,  p-value: 1.794e-07

Generate the regression model of \(Y\) on both \(X_1\) and \(X_2\):

m3 = lm( choles ~ weight + age); summary(m3)

## 
## Call:
## lm(formula = choles ~ weight + age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.570 -30.374  -5.449  28.626  89.170 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  77.9825    52.4296   1.487    0.151    
## weight        0.4174     0.7288   0.573    0.573    
## age           5.2166     0.7572   6.889 6.43e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.11 on 22 degrees of freedom
## Multiple R-squared:  0.7056, Adjusted R-squared:  0.6788 
## F-statistic: 26.36 on 2 and 22 DF,  p-value: 1.443e-06

Observed patient 4: \(Y = 263\), \(X_1 = 70\), and \(X_2 = 30\).

Predictions for patient 4 under each model:

predict(m1, data.frame(cbind(weight = c(70), age = c(30))))

##        1 
## 312.8615

predict(m2, data.frame(cbind(weight = c(70), age = c(30))))

##        1 
## 262.1954

predict(m3, data.frame(cbind(weight = c(70), age = c(30))))

##        1 
## 263.6956

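The prediction is closest to the observed value of 263 for the regression of \(Y\) on both \(X_1\) and \(X_2\) (263.70), though only marginally closer than for the regression on age alone (262.20); the regression on weight alone (312.86) misses by about 50.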
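The comparison with the observed value can be made explicit by computing the absolute prediction error for each model. This small sketch simply reuses the three predictions printed above; the vector names (predicted, abs_error) and the labels are introduced here for illustration.

```r
# Observed cholesterol for patient 4 and the three printed predictions
observed  <- 263
predicted <- c(weight_only    = 312.8615,
               age_only       = 262.1954,
               weight_and_age = 263.6956)

# Absolute prediction error for each model
abs_error <- abs(predicted - observed)
round(abs_error, 3)

# Label of the model with the smallest error
names(which.min(abs_error))
```

A single patient is of course a weak basis for choosing a model, so this check is best read alongside the F-tests and \(R^2\) values.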

Regression model of \(Y\) on both \(X_2\) and \(X_1\):

m4 = lm( choles ~ age + weight); summary(m4)

## 
## Call:
## lm(formula = choles ~ age + weight)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.570 -30.374  -5.449  28.626  89.170 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  77.9825    52.4296   1.487    0.151    
## age           5.2166     0.7572   6.889 6.43e-07 ***
## weight        0.4174     0.7288   0.573    0.573    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.11 on 22 degrees of freedom
## Multiple R-squared:  0.7056, Adjusted R-squared:  0.6788 
## F-statistic: 26.36 on 2 and 22 DF,  p-value: 1.443e-06

Overall F-test: \(H_0: \beta_1 = \beta_2 = 0\)

summary(m4)$fstatistic

##    value    numdf    dendf 
## 26.35782  2.00000 22.00000

p4 = pf(26.35782, 2, 22, lower.tail = F); round(p4,3)

## [1] 0

At the 5% significance level, there is significant evidence to reject the null hypothesis (\(H_0\)): the p-value (1.443e-06 in the summary, printed as 0 after rounding) is far below 0.05.

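Partial F-test for the addition of \(X_1\) to the model, given that \(X_2\) is already in the model (p-value): \(H_0: \beta_1 = 0\), where \(\beta_1\) is the coefficient of weight \((X_1)\). Because only one variable is added, the partial F statistic equals the square of the t statistic for weight, so the p-value of that t-test is also the p-value of the partial F-test.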

summary(m4)$coefficients[3,4]

## [1] 0.5726623

At the 5% significance level, there is no significant evidence to reject the null hypothesis (\(H_0\)): the p-value (0.573) is well above 0.05, so adding weight to a model that already contains age does not significantly improve the fit.
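The partial F-test can also be obtained directly by comparing the reduced and full models with anova(). The sketch below illustrates this, and the equivalence with the t-test, on the built-in mtcars data as a stand-in; the variables mpg, hp and wt are assumptions for illustration and are not the homework variables.

```r
# Reduced model (one predictor) and full model (adds a second predictor)
reduced <- lm(mpg ~ hp, data = mtcars)
full    <- lm(mpg ~ hp + wt, data = mtcars)

# Partial F-test for adding wt, given hp is already in the model
cmp <- anova(reduced, full)
f_partial <- cmp[2, "F"]
p_partial <- cmp[2, "Pr(>F)"]

# t-test for the wt coefficient in the full model
coefs <- summary(full)$coefficients
t_wt  <- coefs["wt", "t value"]
p_wt  <- coefs["wt", "Pr(>|t|)"]

# For a single added variable, F = t^2 and the p-values coincide
all.equal(f_partial, t_wt^2)
all.equal(p_partial, p_wt)
```

With the homework objects, the analogous call would be anova(m2, m3), since m2 contains age only and m3 contains both weight and age.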

\(R^2\) as a percentage:

round(100*summary(m1)$r.square, 2)

## [1] 7.04

round(100*summary(m2)$r.square, 2)

## [1] 70.12

round(100*summary(m3)$r.square, 2)

## [1] 70.56

Based on the results obtained in questions 1-5, the straight-line regression of total cholesterol \((Y)\) on age \((X_2)\) is the best predictive model: adding weight \((X_1)\) raises \(R^2\) only from 70.12% to 70.56%, the partial F-test for weight is not significant (p = 0.573), and the adjusted \(R^2\) falls from 0.6882 to 0.6788.
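The adjusted \(R^2\) values quoted in the summaries can be reproduced from the ordinary \(R^2\) values with \(R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-k-1}\), which penalizes each extra predictor. This is a minimal sketch using the rounded figures printed above; the names n, r2, k and adj_r2 are introduced here for illustration.

```r
n  <- 25                                   # number of patients
r2 <- c(weight_only    = 0.07038,          # Multiple R-squared, model 1
        age_only       = 0.7012,           # model 2
        weight_and_age = 0.7056)           # two-variable model
k  <- c(1, 1, 2)                           # number of predictors in each model

# Adjusted R^2 penalizes the loss of residual degrees of freedom
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
round(adj_r2, 4)

# Label of the model with the highest adjusted R^2
names(which.max(adj_r2))
```

Unlike \(R^2\), which can never decrease when a predictor is added, the adjusted version can fall, which is what happens here when weight is added to age.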

detach(data)