We’re going to analyze the ToothGrowth data in the R datasets package. Load the ToothGrowth data and perform some basic exploratory data analyses Provide a basic summary of the data. Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose. (Only use the techniques from class, even if there’s other approaches worth considering) State your conclusions and the assumptions needed for your conclusions.
tg_ds<-ToothGrowth
head(tg_ds)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
names(tg_ds)
## [1] "len" "supp" "dose"
length(tg_ds)
## [1] 3
class(tg_ds)
## [1] "data.frame"
str(tg_ds)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
dim(tg_ds)
## [1] 60 3
summary(tg_ds)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
ToothGrowth {datasets} R Documentation The Effect of Vitamin C on Tooth Growth in Guinea Pigs
Description
The response is the length of odontoblasts (teeth) in each of 10 guinea pigs at each of three dose levels of Vitamin C (0.5, 1, and 2 mg) with each of two delivery methods (orange juice or ascorbic acid).
Usage
ToothGrowth Format
A data frame with 60 observations on 3 variables.
[,1] len numeric Tooth length [,2] supp factor Supplement type (VC or OJ). [,3] dose numeric Dose in milligrams. Source
C. I. Bliss (1952) The Statistics of Bioassay. Academic Press.
References
McNeil, D. R. (1977) Interactive Data Analysis. New York: Wiley.
Examples
require(graphics)
coplot(len ~ dose | supp, data = ToothGrowth, panel = panel.smooth,
xlab = "ToothGrowth data: length vs dose, given type of supplement")
____________________________________________________________________
A priori, one thing we can do is the same plot but using different tools. Eample, let’s try to use GNU R ‘anova’. Just to be in the safe side and being very honest, I found this reference and I just wanted to work it stepe by step, which I did but, lack of the right time to explain it step by step! So here it goes:
“anova {stats} R Documentation Anova Tables Description: Compute analysis of variance (or deviance) tables for one or more fitted model objects.”
Being short of time, here is a summary of what I tried/did/got:
table(tg_ds$dose)
##
## 0.5 1 2
## 20 20 20
tg_ds$dose <- factor(tg_ds$dose)
summary(tg_ds)
## len supp dose
## Min. : 4.20 OJ:30 0.5:20
## 1st Qu.:13.07 VC:30 1 :20
## Median :19.25 2 :20
## Mean :18.81
## 3rd Qu.:25.27
## Max. :33.90
We can observe that equal number of observations appear in each level of supp and dose, but this does not guarantee that the design is perfectly balanced, i.e. that each cell of the 2 3 design contains equal number of observations. To verify that it does, we will generate the two-way frequency table for supp and dose and then plot the response versus each of the factors
table( tg_ds$supp, tg_ds$dose)
##
## 0.5 1 2
## OJ 10 10 10
## VC 10 10 10
plot( len ~ supp, data=tg_ds)
plot( len ~ dose, data=tg_ds)
Then we can fit linear models that regress the response variable on each factor separately (one-way ANOVA’s) and on both factors (two-way ANOVA). By default, the coding scheme is dummy coding. Let’s change it so that the terms displayed in the table of coefficients correspond to the \(\alpha_i\)’s, \(\beta_j\)’s, and \(\beta_{ij}\)’s.
contrasts( tg_ds$supp)
## VC
## OJ 0
## VC 1
contrasts( tg_ds$dose)
## 1 2
## 0.5 0 0
## 1 1 0
## 2 0 1
contrasts( tg_ds$supp) <- contr.sum
contrasts( tg_ds$dose) <- contr.sum
contrasts( tg_ds$supp)
## [,1]
## OJ 1
## VC -1
contrasts( tg_ds$dose)
## [,1] [,2]
## 0.5 1 0
## 1 0 1
## 2 -1 -1
And now here is the one-way ANOVA regresses len on supp.
result1 <- lm( len ~ supp, data=tg_ds)
summary( result1 )
##
## Call:
## lm(formula = len ~ supp, data = tg_ds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.7633 -5.7633 0.4367 5.5867 16.9367
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.8133 0.9659 19.477 <2e-16 ***
## supp1 1.8500 0.9659 1.915 0.0604 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.482 on 58 degrees of freedom
## Multiple R-squared: 0.05948, Adjusted R-squared: 0.04327
## F-statistic: 3.668 on 1 and 58 DF, p-value: 0.06039
In the ANOVA shown above, the eect of supplement is not quite signicant at the .05 level.Notice, however, that the test assumes that the other factor (dose) has no eect whatsover, that the \(\beta_j\)’s and \(\alpha_i\)\(\beta_{ij}\)’s are all zero. If the \(\beta_j\)’s and \(\alpha_i\)\(\beta_{ij}\)’s were truly zero, then the denomimator of this F-statistic would have expectation 2, and it would be an appropriate test of the null hypothesis that all \(\alpha_i\)’s are zero. But if the \(\beta_j\)’s and/or \(\alpha_i\)\(\beta_{ij}\)’s are nonzero, the expectation of the denominator becomes \(\sigma^2\) plus some extra terms, and the test is not really valid. Similarly, let’s test the signicance of the dose eects (the \(\beta_j\)’s) in a model that does not include any effects for supp.
result2 <- lm( len ~ dose, data=ToothGrowth)
summary( result2 )
##
## Call:
## lm(formula = len ~ dose, data = ToothGrowth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4496 -2.7406 -0.7452 2.8344 10.1139
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.4225 1.2601 5.89 2.06e-07 ***
## dose 9.7636 0.9525 10.25 1.23e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.601 on 58 degrees of freedom
## Multiple R-squared: 0.6443, Adjusted R-squared: 0.6382
## F-statistic: 105.1 on 1 and 58 DF, p-value: 1.233e-14
result3 <- lm( len ~ supp + dose + supp:dose, data=tg_ds)
result3 <- lm( len ~ supp*dose, data=tg_ds)
summary(result3)
##
## Call:
## lm(formula = len ~ supp * dose, data = tg_ds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.20 -2.72 -0.27 2.65 8.27
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.8133 0.4688 40.130 < 2e-16 ***
## supp1 1.8500 0.4688 3.946 0.000231 ***
## dose1 -8.2083 0.6630 -12.381 < 2e-16 ***
## dose2 0.9217 0.6630 1.390 0.170190
## supp1:dose1 0.7750 0.6630 1.169 0.247568
## supp1:dose2 1.1150 0.6630 1.682 0.098394 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.631 on 54 degrees of freedom
## Multiple R-squared: 0.7937, Adjusted R-squared: 0.7746
## F-statistic: 41.56 on 5 and 54 DF, p-value: < 2.2e-16
anova(result3)
## Analysis of Variance Table
##
## Response: len
## Df Sum Sq Mean Sq F value Pr(>F)
## supp 1 205.35 205.35 15.572 0.0002312 ***
## dose 2 2426.43 1213.22 92.000 < 2.2e-16 ***
## supp:dose 2 108.32 54.16 4.107 0.0218603 *
## Residuals 54 712.11 13.19
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# standardized residuals
r <- rstandard( result3)
# fitted values
yhat <- tg_ds$len - result3$residuals
plot(yhat,r)
qqnorm(r)
# put absolute residuals into the data frame
tg_ds$absres <- abs(result3$residuals)
tmp <- lm( absres ~ supp*dose, data=tg_ds)
summary(tmp)
##
## Call:
## lm(formula = absres ~ supp * dose, data = tg_ds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.400 -1.280 -0.300 0.995 5.120
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.7917 0.2529 11.037 1.85e-15 ***
## supp1 0.1850 0.2529 0.731 0.468
## dose1 0.3213 0.3577 0.898 0.373
## dose2 -0.3697 0.3577 -1.033 0.306
## supp1:dose1 0.5320 0.3577 1.487 0.143
## supp1:dose2 0.4730 0.3577 1.322 0.192
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.959 on 54 degrees of freedom
## Multiple R-squared: 0.1523, Adjusted R-squared: 0.07379
## F-statistic: 1.94 on 5 and 54 DF, p-value: 0.1027
The magnitude of the residuals does not vary over the cells in a statistically significant manner, as judged by the omnibus F-test. Even if that test was significant | which it might have been if the n’s were larger I wouldn’t worry about it too much, because the plot of residuals versus fits did not show large dierences in variances across the six cells of this design.
More Information:
https://en.wikipedia.org/wiki/Omnibus_test
https://en.wikipedia.org/wiki/Omnibus_test#Example_2-_The_multiple_Linear_Regression_Omnibus_F_Test_on_R
And THE Reference I tried to exercise (I swear, to exercise, not to sheat) in order to learn a bit more but ran short of time:
http://sites.stat.psu.edu/~jls/stat512/lectures/lec20.pdf