Problem 1

Refer to Problem 5 of homework 10. In last homework, we have built a multiple regression model for predicting the total billing (TotalBill02) using three predictors: number of architects (N_Arch), engineers (N_Eng), and staff (N_Staff). In this problem, we would like to see whether a reduced model would be enough. The “billing” dataset has already been loaded in a hidden R chunk.

  1. Fit a reduced linear regression model using only the number of architects (N_Arch) as predictor. Report the fitted regression equation and the \(R^2\).
model1 = lm(TotalBill02 ~ N_Arch, data = billing)
sm = summary(model1)
sm
## 
## Call:
## lm(formula = TotalBill02 ~ N_Arch, data = billing)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.359 -2.237 -0.867  1.499 10.781 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.9730     1.4759   1.337    0.197    
## N_Arch        0.5940     0.1079   5.507 2.59e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.294 on 19 degrees of freedom
## Multiple R-squared:  0.6148, Adjusted R-squared:  0.5946 
## F-statistic: 30.33 on 1 and 19 DF,  p-value: 2.593e-05

The reported line is y^ = .5940x + 1.9730 The r-squared is .6148

  1. Perform an ANOVA F test to compare the reduced models in part (a) and the full model (TotalBill02~N_Arch+N_Eng+N_Staff) you fitted in homework 10. Report the F test statistic and the p-value. Based on the result, which model would you choose?
model2 = lm(TotalBill02~ N_Arch + N_Eng + N_Staff, data= billing)
summary(model2)
## 
## Call:
## lm(formula = TotalBill02 ~ N_Arch + N_Eng + N_Staff, data = billing)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5955 -1.1713 -0.4623  1.2073  4.2064 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.77993    0.71258   1.095  0.28899   
## N_Arch       0.01429    0.12518   0.114  0.91046   
## N_Eng       -0.13643    0.15581  -0.876  0.39342   
## N_Staff      0.13773    0.04104   3.356  0.00375 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.935 on 17 degrees of freedom
## Multiple R-squared:   0.93,  Adjusted R-squared:  0.9177 
## F-statistic: 75.31 on 3 and 17 DF,  p-value: 5.058e-10
anova(model1, model2)
## Analysis of Variance Table
## 
## Model 1: TotalBill02 ~ N_Arch
## Model 2: TotalBill02 ~ N_Arch + N_Eng + N_Staff
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1     19 350.36                                  
## 2     17  63.66  2     286.7 38.281 5.064e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The full model is clearly better than the reduced model because the r-squared value which interprets how much of the data can be represented by the linear relationship between the variables is much higher, the f-statistic is extremely large as well and the p-value is much smaller than the reduced model.

F-statistic full model: 75.31

p-value full model: 5.058e-10

  1. The F test in part (b) can also be performed using the \(R^2\) values from the full and reduced models. The test statistic is \[ F=\left(\frac{n-p-1}{q}\right)\left(\frac{R_1^2-R_2^2}{1-R_1^2}\right)\] with \(q\) and \(n-p-1\) degrees of freedom. \(R_1^2\) is the value for the full model and \(R_2^2\) is the value for the reduced model. \(p\) is the number of predictor in the full model and \(q\) is the number of predictors removed from the full model compared to the reduced model. Using the above formula, compute the test statictis and the p-value. Do you get the same result as part (b)?

The test statistics calculated depend on which degree of freedom you choosed for N. With 19 df, the f-statistic ends up being 33.77

P-value is extremely small.

Problem 2

Healthy bones are continually being renewed by two processes. Through bone formation, new bone is built; through bone resorption, old bone is removed. The variables VO+ and VO- measure bone formation and bone resorption, respectively. Osteocalcin (OC) is a biochemial marker for bone formation: higher levels of bone formation are associated with higher levels of OC. Similarly, tartrate-resistant acid phosphatase (TRAP) is a biochemial marker for bone resorption. Both OC and TRAP are measured in blood samples. These variables were measured in a study of 31 healthy women aged 11 to 32 years. We consider a regression model for predicting VO+ using OC, TRAP and VO-. The data has been loaded in the “biomark” data frame.

  1. Run the multiple regression to predict VO+ using OC, TRAP and VO-. Report the fitted regression equation and the estimate of \(\sigma\).
model1 = lm(voplus ~ oc + trap + vominus, data = biomark)
summary(model1)
## 
## Call:
## lm(formula = voplus ~ oc + trap + vominus, data = biomark)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -364.19 -158.57  -15.13  120.08  441.11 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -243.4877    94.2183  -2.584  0.01549 *  
## oc             8.2349     2.8397   2.900  0.00733 ** 
## trap           6.6071    10.3340   0.639  0.52797    
## vominus        0.9746     0.1211   8.048  1.2e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 207.8 on 27 degrees of freedom
## Multiple R-squared:  0.8844, Adjusted R-squared:  0.8715 
## F-statistic: 68.84 on 3 and 27 DF,  p-value: 9.031e-13

Regression equation: y = 8.23x + 6.6071y + .9746z - 243.4877

  1. Are the coefficients of each explanatory variables statistically significant?

According to the computer generated summary, Not all of the coefficients of each explanatory variable are statiscally significant. For the “trap” variable, the results were not significant.

  1. Investigate the residuals. Are there any concerns about linearity, variance, or Normality?
resid = residuals(model1)
resid
##          1          2          3          4          5          6 
##  273.88823  135.16205  441.11484 -264.52623  183.40053 -163.29267 
##          7          8          9         10         11         12 
##  104.99757   36.14774  -93.01310 -271.84524  308.77537   53.68823 
##         13         14         15         16         17         18 
## -364.18695  -20.78584   59.29864  -15.13224 -115.34787  250.53862 
##         19         20         21         22         23         24 
##   24.60105 -116.51778 -212.24008 -191.76039  -20.08015  269.60758 
##         25         26         27         28         29         30 
##  234.33115   75.57292 -221.65710 -154.28218 -105.84925 -162.85226 
##         31 
##   42.24483
fitted = fitted.values(model1)
plot(fitted, resid, xlab = "Fitted")

The residuals seem random and variant enough. No concerns.

Problem 3

Refer to Problem 2. Because the distributions of VO+,VO-, OC and TRAP tend to be skewed, it is common to work with logarithms rather than the measured values.

  1. Do a logarithm transformation to VO+,VO-, OC and TRAP. Rerun the multiple regression using the transformed data. Report the fitted regression equation and the estimate of \(\sigma\).
# logarithm transformation
lvoplus=log(biomark$voplus)
lvominus=log(biomark$vominus)
loc=log(biomark$oc)
ltrap=log(biomark$trap)
# refit the model using "lvoplus","lvominus","loc" and "ltrap"
newModel = lm(lvoplus ~ lvominus + loc + ltrap, data = billing)
summary(newModel)
## 
## Call:
## lm(formula = lvoplus ~ lvominus + loc + ltrap, data = billing)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.44054 -0.14746 -0.00628  0.16318  0.39937 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.87239    0.64020   1.363  0.18424    
## lvominus     0.67248    0.11778   5.709 4.56e-06 ***
## loc          0.39203    0.11538   3.398  0.00212 ** 
## ltrap        0.02746    0.15697   0.175  0.86242    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2327 on 27 degrees of freedom
## Multiple R-squared:  0.8421, Adjusted R-squared:  0.8245 
## F-statistic: 47.99 on 3 and 27 DF,  p-value: 5.943e-11

New Regression Model: y = .67x + .392y + .027z + .872

  1. Are the coefficients of each explanatory variables statistically significant?

Only lvominus and loc are statistically significant with this logartithmic transformation.

  1. Investigate the residuals. Are there any concerns about linearity, variance, or Normality?
resid1 = residuals(newModel)
resid1
##            1            2            3            4            5 
##  0.191638943  0.146877643  0.272642690 -0.202665853  0.292852450 
##            6            7            8            9           10 
## -0.133278483  0.115818239  0.039111738  0.090940661 -0.202900419 
##           11           12           13           14           15 
##  0.294298390  0.015255393 -0.258226958 -0.006282355 -0.112944651 
##           16           17           18           19           20 
## -0.187483287 -0.365876364  0.297095392 -0.094247886 -0.152155122 
##           21           22           23           24           25 
## -0.103486654 -0.440539143 -0.112618735  0.390301599  0.399365777 
##           26           27           28           29           30 
##  0.179478141 -0.235015480  0.074316604 -0.125621766 -0.142765517 
##           31 
##  0.076115012
fitted = fitted.values(newModel)
plot(fitted, resid1, xlab = "Fitted")

The residual plot seems alot less variant and more linear than the previous residual plot. That is something to be concerned about.

Problem 4

  1. For each of the following situations, find the degrees of freedom for the F statistic and the p-value.

df: 6 num, 5 denominator p-value: <.1

df: 5 num, 30 denom p-value: p<.001

  1. For each of the following situations, find the F statistic and the degrees of freedom.

F-Stat: 7.2 df: 4 in numerator, 9 in denominator

F-Stat: 1.72 df: num is 2, denom is 11

Problem 5

Various studies have shown the benefits of massage to manage pain. In one study, 125 adults suffering from osteoarthritis of the knees were randomly assigned to one of five 8-week regimens. The outcome was the change in the Western Ontario and McMaster Universities Arthritis Index (WOMAC-Global), which is used to assess pain and functioning in those suffering from arthritis. Negative values indicate improvement. The following table summarizes the results of those completing the study. The data has also been loaded in a hidden R chunk (in the “massage” data frame).

Regimen n \(\bar{x}\) s
30 min massage 1\(\times\)/wk 22 -17.4 17.9
30 min massage 2\(\times\)/wk 24 -18.4 20.7
60 min massage 1\(\times\)/wk 24 -24.0 18.4
60 min massage 2\(\times\)/wk 25 -24.0 19.8
Usual care, no massage 24 -6.3 14.6
  1. What proportion of adults dropped out of the study before completion?

.048 dropped before completion

  1. Is it reasonable to use the assumption of equal standard deviations when we analyze these data? Give a reason for your answer.

Yes, random assignment to each group.

  1. Find the pooled standard deviation \(s_p\).

(n1-1)s1^2+ (n2-1)s2^2 …/(n1 +n2 +nk -k)

42615.4/(119-4) = 370.57

  1. The SS(Regimen)=5060.346. Test the null hypothesis that the mean change in WOMAC-Global score is the same for all regimens. Report the F statistic, degrees of freedom and the p-value. State your conclusion.

F-Stat = 13.65 df = 21 p-value = very small.

Conclusion: We have sufficient evidence to conclude that the mean change in WOMAC-global score is not the same for all regimens.

  1. There are 10 pairs of means to compare. For the Bonferroni multiple-comparisons method, the critical t-value is 2.863. For the following pairs, determine which pairs of means are found to be significantly different?

Significant

Significant

Not significant