Overview

We’re going to analyze the ToothGrowth data in the R datasets package. Load the ToothGrowth data and perform some basic exploratory data analyses Provide a basic summary of the data. Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose. (Only use the techniques from class, even if there’s other approaches worth considering) State your conclusions and the assumptions needed for your conclusions.

  1. Load the data set and do a pre process. We’ll execute some commands to know a bit of the ToothGrowth data set:
tg_ds<-ToothGrowth
head(tg_ds)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
names(tg_ds)
## [1] "len"  "supp" "dose"
length(tg_ds)
## [1] 3
class(tg_ds)
## [1] "data.frame"
str(tg_ds)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
dim(tg_ds)
## [1] 60  3
summary(tg_ds)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

¿What is the ToothGrowth Dataset? And the Answer:


ToothGrowth {datasets} R Documentation The Effect of Vitamin C on Tooth Growth in Guinea Pigs

Description

The response is the length of odontoblasts (teeth) in each of 10 guinea pigs at each of three dose levels of Vitamin C (0.5, 1, and 2 mg) with each of two delivery methods (orange juice or ascorbic acid).

Usage

ToothGrowth Format

A data frame with 60 observations on 3 variables.

[,1] len numeric Tooth length [,2] supp factor Supplement type (VC or OJ). [,3] dose numeric Dose in milligrams. Source

C. I. Bliss (1952) The Statistics of Bioassay. Academic Press.

References

McNeil, D. R. (1977) Interactive Data Analysis. New York: Wiley.

Examples

require(graphics)

coplot(len ~ dose | supp, data = ToothGrowth, panel = panel.smooth,
       xlab = "ToothGrowth data: length vs dose, given type of supplement")

____________________________________________________________________

A priori, one thing we can do is the same plot but using different tools. Eample, let’s try to use GNU R ‘anova’. Just to be in the safe side and being very honest, I found this reference and I just wanted to work it stepe by step, which I did but, lack of the right time to explain it step by step! So here it goes:

“anova {stats} R Documentation Anova Tables Description: Compute analysis of variance (or deviance) tables for one or more fitted model objects.”

Being short of time, here is a summary of what I tried/did/got:

table(tg_ds$dose)
## 
## 0.5   1   2 
##  20  20  20
tg_ds$dose <- factor(tg_ds$dose)
summary(tg_ds)
##       len        supp     dose   
##  Min.   : 4.20   OJ:30   0.5:20  
##  1st Qu.:13.07   VC:30   1  :20  
##  Median :19.25           2  :20  
##  Mean   :18.81                   
##  3rd Qu.:25.27                   
##  Max.   :33.90

We can observe that equal number of observations appear in each level of supp and dose, but this does not guarantee that the design is perfectly balanced, i.e. that each cell of the 2 3 design contains equal number of observations. To verify that it does, we will generate the two-way frequency table for supp and dose and then plot the response versus each of the factors

table( tg_ds$supp, tg_ds$dose)
##     
##      0.5  1  2
##   OJ  10 10 10
##   VC  10 10 10
plot( len ~ supp, data=tg_ds)

plot( len ~ dose, data=tg_ds)

Then we can fit linear models that regress the response variable on each factor separately (one-way ANOVA’s) and on both factors (two-way ANOVA). By default, the coding scheme is dummy coding. Let’s change it so that the terms displayed in the table of coefficients correspond to the \(\alpha_i\)’s, \(\beta_j\)’s, and \(\beta_{ij}\)’s.

contrasts( tg_ds$supp)
##    VC
## OJ  0
## VC  1
contrasts( tg_ds$dose)
##     1 2
## 0.5 0 0
## 1   1 0
## 2   0 1
contrasts( tg_ds$supp) <- contr.sum
contrasts( tg_ds$dose) <- contr.sum
contrasts( tg_ds$supp)
##    [,1]
## OJ    1
## VC   -1
contrasts( tg_ds$dose)
##     [,1] [,2]
## 0.5    1    0
## 1      0    1
## 2     -1   -1

And now here is the one-way ANOVA regresses len on supp.

result1 <- lm( len ~ supp, data=tg_ds)
summary( result1 )
## 
## Call:
## lm(formula = len ~ supp, data = tg_ds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.7633  -5.7633   0.4367   5.5867  16.9367 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18.8133     0.9659  19.477   <2e-16 ***
## supp1         1.8500     0.9659   1.915   0.0604 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.482 on 58 degrees of freedom
## Multiple R-squared:  0.05948,    Adjusted R-squared:  0.04327 
## F-statistic: 3.668 on 1 and 58 DF,  p-value: 0.06039

In the ANOVA shown above, the e ect of supplement is not quite signi cant at the .05 level.Notice, however, that the test assumes that the other factor (dose) has no e ect whatsover, that the \(\beta_j\)’s and \(\alpha_i\)\(\beta_{ij}\)’s are all zero. If the \(\beta_j\)’s and \(\alpha_i\)\(\beta_{ij}\)’s were truly zero, then the denomimator of this F-statistic would have expectation 2, and it would be an appropriate test of the null hypothesis that all \(\alpha_i\)’s are zero. But if the \(\beta_j\)’s and/or \(\alpha_i\)\(\beta_{ij}\)’s are nonzero, the expectation of the denominator becomes \(\sigma^2\) plus some extra terms, and the test is not really valid. Similarly, let’s test the signi cance of the dose e ects (the \(\beta_j\)’s) in a model that does not include any effects for supp.

result2 <- lm( len ~ dose, data=ToothGrowth)
summary( result2 )
## 
## Call:
## lm(formula = len ~ dose, data = ToothGrowth)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4496 -2.7406 -0.7452  2.8344 10.1139 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.4225     1.2601    5.89 2.06e-07 ***
## dose          9.7636     0.9525   10.25 1.23e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.601 on 58 degrees of freedom
## Multiple R-squared:  0.6443, Adjusted R-squared:  0.6382 
## F-statistic: 105.1 on 1 and 58 DF,  p-value: 1.233e-14
result3 <- lm( len ~ supp + dose + supp:dose, data=tg_ds)
result3 <- lm( len ~ supp*dose, data=tg_ds)
summary(result3)
## 
## Call:
## lm(formula = len ~ supp * dose, data = tg_ds)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8.20  -2.72  -0.27   2.65   8.27 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  18.8133     0.4688  40.130  < 2e-16 ***
## supp1         1.8500     0.4688   3.946 0.000231 ***
## dose1        -8.2083     0.6630 -12.381  < 2e-16 ***
## dose2         0.9217     0.6630   1.390 0.170190    
## supp1:dose1   0.7750     0.6630   1.169 0.247568    
## supp1:dose2   1.1150     0.6630   1.682 0.098394 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.631 on 54 degrees of freedom
## Multiple R-squared:  0.7937, Adjusted R-squared:  0.7746 
## F-statistic: 41.56 on 5 and 54 DF,  p-value: < 2.2e-16
anova(result3)
## Analysis of Variance Table
## 
## Response: len
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## supp       1  205.35  205.35  15.572 0.0002312 ***
## dose       2 2426.43 1213.22  92.000 < 2.2e-16 ***
## supp:dose  2  108.32   54.16   4.107 0.0218603 *  
## Residuals 54  712.11   13.19                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# standardized residuals
r <- rstandard( result3)
# fitted values
yhat <- tg_ds$len - result3$residuals
plot(yhat,r)

qqnorm(r)

# put absolute residuals into the data frame
tg_ds$absres <- abs(result3$residuals)
tmp <- lm( absres ~ supp*dose, data=tg_ds)
summary(tmp)
## 
## Call:
## lm(formula = absres ~ supp * dose, data = tg_ds)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.400 -1.280 -0.300  0.995  5.120 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.7917     0.2529  11.037 1.85e-15 ***
## supp1         0.1850     0.2529   0.731    0.468    
## dose1         0.3213     0.3577   0.898    0.373    
## dose2        -0.3697     0.3577  -1.033    0.306    
## supp1:dose1   0.5320     0.3577   1.487    0.143    
## supp1:dose2   0.4730     0.3577   1.322    0.192    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.959 on 54 degrees of freedom
## Multiple R-squared:  0.1523, Adjusted R-squared:  0.07379 
## F-statistic:  1.94 on 5 and 54 DF,  p-value: 0.1027

Some Conclusions:

The magnitude of the residuals does not vary over the cells in a statistically significant manner, as judged by the omnibus F-test. Even if that test was significant | which it might have been if the n’s were larger I wouldn’t worry about it too much, because the plot of residuals versus fits did not show large di erences in variances across the six cells of this design.

More Information:
https://en.wikipedia.org/wiki/Omnibus_test
https://en.wikipedia.org/wiki/Omnibus_test#Example_2-_The_multiple_Linear_Regression_Omnibus_F_Test_on_R
And THE Reference I tried to exercise (I swear, to exercise, not to sheat) in order to learn a bit more but ran short of time:
http://sites.stat.psu.edu/~jls/stat512/lectures/lec20.pdf