Refer to Problem 5 of homework 10. In last homework, we have built a multiple regression model for predicting the total billing (TotalBill02
) using three predictors: number of architects (N_Arch
), engineers (N_Eng
), and staff (N_Staff
). In this problem, we would like to see whether a reduced model would be enough. The “billing” dataset has already been loaded in a hidden R chunk.
N_Arch
) as predictor. Report the fitted regression equation and the \(R^2\).model1 = lm(TotalBill02 ~ N_Arch, data = billing)
sm = summary(model1)
sm
##
## Call:
## lm(formula = TotalBill02 ~ N_Arch, data = billing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.359 -2.237 -0.867 1.499 10.781
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.9730 1.4759 1.337 0.197
## N_Arch 0.5940 0.1079 5.507 2.59e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.294 on 19 degrees of freedom
## Multiple R-squared: 0.6148, Adjusted R-squared: 0.5946
## F-statistic: 30.33 on 1 and 19 DF, p-value: 2.593e-05
The reported line is y^ = .5940x + 1.9730 The r-squared is .6148
TotalBill02~N_Arch+N_Eng+N_Staff
) you fitted in homework 10. Report the F test statistic and the p-value. Based on the result, which model would you choose?model2 = lm(TotalBill02~ N_Arch + N_Eng + N_Staff, data= billing)
summary(model2)
##
## Call:
## lm(formula = TotalBill02 ~ N_Arch + N_Eng + N_Staff, data = billing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5955 -1.1713 -0.4623 1.2073 4.2064
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.77993 0.71258 1.095 0.28899
## N_Arch 0.01429 0.12518 0.114 0.91046
## N_Eng -0.13643 0.15581 -0.876 0.39342
## N_Staff 0.13773 0.04104 3.356 0.00375 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.935 on 17 degrees of freedom
## Multiple R-squared: 0.93, Adjusted R-squared: 0.9177
## F-statistic: 75.31 on 3 and 17 DF, p-value: 5.058e-10
anova(model1, model2)
## Analysis of Variance Table
##
## Model 1: TotalBill02 ~ N_Arch
## Model 2: TotalBill02 ~ N_Arch + N_Eng + N_Staff
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 19 350.36
## 2 17 63.66 2 286.7 38.281 5.064e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The full model is clearly better than the reduced model because the r-squared value which interprets how much of the data can be represented by the linear relationship between the variables is much higher, the f-statistic is extremely large as well and the p-value is much smaller than the reduced model.
F-statistic full model: 75.31
p-value full model: 5.058e-10
The test statistics calculated depend on which degree of freedom you choosed for N. With 19 df, the f-statistic ends up being 33.77
P-value is extremely small.
Healthy bones are continually being renewed by two processes. Through bone formation, new bone is built; through bone resorption, old bone is removed. The variables VO+ and VO- measure bone formation and bone resorption, respectively. Osteocalcin (OC) is a biochemial marker for bone formation: higher levels of bone formation are associated with higher levels of OC. Similarly, tartrate-resistant acid phosphatase (TRAP) is a biochemial marker for bone resorption. Both OC and TRAP are measured in blood samples. These variables were measured in a study of 31 healthy women aged 11 to 32 years. We consider a regression model for predicting VO+ using OC, TRAP and VO-. The data has been loaded in the “biomark” data frame.
model1 = lm(voplus ~ oc + trap + vominus, data = biomark)
summary(model1)
##
## Call:
## lm(formula = voplus ~ oc + trap + vominus, data = biomark)
##
## Residuals:
## Min 1Q Median 3Q Max
## -364.19 -158.57 -15.13 120.08 441.11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -243.4877 94.2183 -2.584 0.01549 *
## oc 8.2349 2.8397 2.900 0.00733 **
## trap 6.6071 10.3340 0.639 0.52797
## vominus 0.9746 0.1211 8.048 1.2e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 207.8 on 27 degrees of freedom
## Multiple R-squared: 0.8844, Adjusted R-squared: 0.8715
## F-statistic: 68.84 on 3 and 27 DF, p-value: 9.031e-13
Regression equation: y = 8.23x + 6.6071y + .9746z - 243.4877
According to the computer generated summary, Not all of the coefficients of each explanatory variable are statiscally significant. For the “trap” variable, the results were not significant.
resid = residuals(model1)
resid
## 1 2 3 4 5 6
## 273.88823 135.16205 441.11484 -264.52623 183.40053 -163.29267
## 7 8 9 10 11 12
## 104.99757 36.14774 -93.01310 -271.84524 308.77537 53.68823
## 13 14 15 16 17 18
## -364.18695 -20.78584 59.29864 -15.13224 -115.34787 250.53862
## 19 20 21 22 23 24
## 24.60105 -116.51778 -212.24008 -191.76039 -20.08015 269.60758
## 25 26 27 28 29 30
## 234.33115 75.57292 -221.65710 -154.28218 -105.84925 -162.85226
## 31
## 42.24483
fitted = fitted.values(model1)
plot(fitted, resid, xlab = "Fitted")
The residuals seem random and variant enough. No concerns.
Refer to Problem 2. Because the distributions of VO+,VO-, OC and TRAP tend to be skewed, it is common to work with logarithms rather than the measured values.
# logarithm transformation
lvoplus=log(biomark$voplus)
lvominus=log(biomark$vominus)
loc=log(biomark$oc)
ltrap=log(biomark$trap)
# refit the model using "lvoplus","lvominus","loc" and "ltrap"
newModel = lm(lvoplus ~ lvominus + loc + ltrap, data = billing)
summary(newModel)
##
## Call:
## lm(formula = lvoplus ~ lvominus + loc + ltrap, data = billing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.44054 -0.14746 -0.00628 0.16318 0.39937
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.87239 0.64020 1.363 0.18424
## lvominus 0.67248 0.11778 5.709 4.56e-06 ***
## loc 0.39203 0.11538 3.398 0.00212 **
## ltrap 0.02746 0.15697 0.175 0.86242
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2327 on 27 degrees of freedom
## Multiple R-squared: 0.8421, Adjusted R-squared: 0.8245
## F-statistic: 47.99 on 3 and 27 DF, p-value: 5.943e-11
New Regression Model: y = .67x + .392y + .027z + .872
Only lvominus and loc are statistically significant with this logartithmic transformation.
resid1 = residuals(newModel)
resid1
## 1 2 3 4 5
## 0.191638943 0.146877643 0.272642690 -0.202665853 0.292852450
## 6 7 8 9 10
## -0.133278483 0.115818239 0.039111738 0.090940661 -0.202900419
## 11 12 13 14 15
## 0.294298390 0.015255393 -0.258226958 -0.006282355 -0.112944651
## 16 17 18 19 20
## -0.187483287 -0.365876364 0.297095392 -0.094247886 -0.152155122
## 21 22 23 24 25
## -0.103486654 -0.440539143 -0.112618735 0.390301599 0.399365777
## 26 27 28 29 30
## 0.179478141 -0.235015480 0.074316604 -0.125621766 -0.142765517
## 31
## 0.076115012
fitted = fitted.values(newModel)
plot(fitted, resid1, xlab = "Fitted")
The residual plot seems alot less variant and more linear than the previous residual plot. That is something to be concerned about.
df: 6 num, 5 denominator p-value: <.1
df: 5 num, 30 denom p-value: p<.001
F-Stat: 7.2 df: 4 in numerator, 9 in denominator
F-Stat: 1.72 df: num is 2, denom is 11
Various studies have shown the benefits of massage to manage pain. In one study, 125 adults suffering from osteoarthritis of the knees were randomly assigned to one of five 8-week regimens. The outcome was the change in the Western Ontario and McMaster Universities Arthritis Index (WOMAC-Global), which is used to assess pain and functioning in those suffering from arthritis. Negative values indicate improvement. The following table summarizes the results of those completing the study. The data has also been loaded in a hidden R chunk (in the “massage” data frame).
Regimen | n | \(\bar{x}\) | s |
---|---|---|---|
30 min massage 1\(\times\)/wk | 22 | -17.4 | 17.9 |
30 min massage 2\(\times\)/wk | 24 | -18.4 | 20.7 |
60 min massage 1\(\times\)/wk | 24 | -24.0 | 18.4 |
60 min massage 2\(\times\)/wk | 25 | -24.0 | 19.8 |
Usual care, no massage | 24 | -6.3 | 14.6 |
.048 dropped before completion
Yes, random assignment to each group.
(n1-1)s1^2+ (n2-1)s2^2 …/(n1 +n2 +nk -k)
42615.4/(119-4) = 370.57
F-Stat = 13.65 df = 21 p-value = very small.
Conclusion: We have sufficient evidence to conclude that the mean change in WOMAC-global score is not the same for all regimens.
Significant
Significant
Not significant