Q.1. : Load the ToothGrowth data and perform some basic Exploratory Data Analyses

Step 1: Load the data

At first, we have to load the data frame in the working directory.

data(ToothGrowth)

Step 2: Know your data

Before doing any Exploratory Data Analysis, we first should know the data, ie. to know the source of the data, structure of the data, how many rows & collumns the data frame has, how many variables & how many observations the data has, head & tail of the data and overall the summary of the each variables etc. Wuthout knowing these, we cannot do any analysis of the data. Let’s do those.

str(ToothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

summary(ToothGrowth)

##       len       supp         dose     
##  Min.   : 4.2   OJ:30   Min.   :0.50  
##  1st Qu.:13.1   VC:30   1st Qu.:0.50  
##  Median :19.2           Median :1.00  
##  Mean   :18.8           Mean   :1.17  
##  3rd Qu.:25.3           3rd Qu.:2.00  
##  Max.   :33.9           Max.   :2.00

nrow(ToothGrowth)

## [1] 60

ncol(ToothGrowth)

## [1] 3

names(ToothGrowth)

## [1] "len"  "supp" "dose"

head(ToothGrowth)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

tail(ToothGrowth)

##     len supp dose
## 55 24.8   OJ    2
## 56 30.9   OJ    2
## 57 26.4   OJ    2
## 58 27.3   OJ    2
## 59 29.4   OJ    2
## 60 23.0   OJ    2

dim(ToothGrowth)

## [1] 60  3

From the data source, it is clear that, the data collected to measure or rather, to know the effect of Vitamin C on Tooth Growth in Guinea Pigs. The response is the length of odontoblasts (teeth) in each of 10 guinea pigs at each of three dose levels of Vitamin C (0.5, 1, and 2 mg) with each of two delivery methods (orange juice or ascorbic acid). The data frame contains 60 observations on 3 variables. The variables are: 1. len > Tooth Length > a numeric variable : This is the response variable in this case. 2. supp > Supplement Type > a factor (categorical) variable > 2 Levels : VC ( as Ascorbic Asid ) & OJ (as Orange Juice) 3. dose > “Dose” of the supplement in milligrams (0.5, 1.0, and 2.0) > a numerical variable

To begin, we should note that R considers “dose” to be numerical. We will have to be absolutely certain it is treated as a factor when we construct the model formula. Otherwise, we will end up with an analysis of covariance, which is not what we want.

Step 3 : Change ‘dose’ from numerical variable to factor variable

First, let’s end all doubt that we want “dose” to be a factor.

ToothGrowth$dose = factor(ToothGrowth$dose, levels=c(0.5,1.0,2.0),labels=c("low","med","high"))

The function factor( ) is used to declare “ToothGrowth$dose” as a factor, with levels currently coded as 0.5, 1.0, and 2.0, but now recoded with labels of “low”, “med”, and “high”. Just to be sure, I looked at str( ) and a bit of the data frame itself.

You can also embed plots, for example:

str(ToothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: Factor w/ 3 levels "low","med","high": 1 1 1 1 1 1 1 1 1 1 ...

ToothGrowth[seq(1,60,5),]

##     len supp dose
## 1   4.2   VC  low
## 6  10.0   VC  low
## 11 16.5   VC  med
## 16 17.3   VC  med
## 21 23.6   VC high
## 26 32.5   VC high
## 31 15.2   OJ  low
## 36 10.0   OJ  low
## 41 19.7   OJ  med
## 46 25.2   OJ  med
## 51 25.5   OJ high
## 56 30.9   OJ high

It looks fine.

Step 4 : Replication

Now let’s check for a balanced design using the replications( ) function…

replications(len ~ supp * dose, data=ToothGrowth)

##      supp      dose supp:dose 
##        30        20        10

replications(len ~ supp * dose, data=ToothGrowth[1:58,])

## $supp
## supp
## OJ VC 
## 28 30 
## 
## $dose
## dose
##  low  med high 
##   20   20   18 
## 
## $`supp:dose`
##     dose
## supp low med high
##   OJ  10  10    8
##   VC  10  10   10

The fact that we got such a straightforward output when the function was run the first time tells us the design is balanced. To illustrate what would happen if the design were unbalanced, I ran the function a second time on a large subset of the data frame (dropping out the last two cases). In that example, we got complete tables of all the factors and how many replications (subjects) there were in each cell of the design.

Step 5 : Graphical Explanation

There are a number of ways the data can be summarized graphically. First, let’s look at boxplots.

Plot 1 : Box Plot

 boxplot(len ~ supp * dose, data=ToothGrowth, ylab="Tooth Length", main="Boxplots of Tooth Growth Data")

plot of chunk unnamed-chunk-6

We might get a better idea of what our main effects and interactions look like if we look at a traditional interaction plot.

Plot 2 : Interaction Plot

with(ToothGrowth, interaction.plot(x.factor=dose, trace.factor=supp,response=len, fun=mean, type="b", legend=T,ylab="Tooth Length", main="Interaction Plot",pch=c(1,19)))

plot of chunk unnamed-chunk-7

The syntax is straightforward, but there are a lot of options. First, “x.factor=” specifies the factor to be plotted on the horizontal axis. Second, “trace.factor=” specifies the factor to plotted as individual lines on the graph. Third, “response=” specifies the response variable. Fourth, “fun=” specifies the function to be calculated on the response variable. “Mean” is the default, so we could have left this one out. Fifth, “type=” specifies whether to plot lines, points, or both (we plotted “b”oth). We also asked for a legend, a label on the y-axis, and a main title on the graph. Finally, we specified the point characters to use. We could also have changed the line type, line color, and just about everything else about the graph.

Another type of graphic we could look at here is called a conditioning plot. I shall steal, er, ah, I mean, borrow the example right off the help page for the ToothGrowth data set.

Plot 3 : Conditional Plot

 coplot(len ~ dose | supp, data=ToothGrowth, panel=panel.smooth,xlab="ToothGrowth data: length vs dose, given type of supplement")

plot of chunk unnamed-chunk-8

The formula is the tricky part here and should be read as “length by dose given supplement.” The “panel=” option draws the red lines when set to “panel.smooth”. The top panel can be suppressed by setting “show.given =” to FALSE. This gives a much more sensible looking plot in my humble opinion.

Q.2. Provide a basic summary of the data

Step 1 : Using tapply() function

Numerical summaries of the data can be done with the tapply( ) function.

with(ToothGrowth, tapply(len, list(supp,dose), mean))

##      low   med  high
## OJ 13.23 22.70 26.06
## VC  7.98 16.77 26.14

with(ToothGrowth, tapply(len, list(supp,dose), var))

##       low    med   high
## OJ 19.889 15.296  7.049
## VC  7.544  6.327 23.018

Remember, in tapply( ) the first argument is the response variable, the second argument is a list of factors by which the response should be partitioned, and the third argument is the function you wish to calculate. Once again, we make note of some consternation over those cell variances.

Step 2 : Using model.tables() function

Another way to get similar information is to use the model.tables( ) function, but the ANOVA has to be run first and saved into a model object.

aov.out = aov(len ~ supp * dose, data=ToothGrowth)

Note: the model formula specifies “length as a function of supplement and dose with all possible interactions between the factors. Now we can look at some model tables.

model.tables(aov.out, type="means", se=T)

## Tables of means
## Grand mean
##       
## 18.81 
## 
##  supp 
## supp
##    OJ    VC 
## 20.66 16.96 
## 
##  dose 
## dose
##   low   med  high 
## 10.61 19.73 26.10 
## 
##  supp:dose 
##     dose
## supp low   med   high 
##   OJ 13.23 22.70 26.06
##   VC  7.98 16.77 26.14
## 
## Standard errors for differences of means
##           supp   dose supp:dose
##         0.9376 1.1484    1.6240
## replic.     30     20        10

We have requested the various cell and marginal means along with standard errors for the differences between means. Finally, let’s look at a Bartlett test for homogeneity of variance.

bartlett.test(len ~ supp * dose, data=ToothGrowth)

The extension to cases with more factors, for a completely randomized design where no Error structure needs to be specified, would be straightforward. Just list all the factors after the tilde separated by asterisks. This gives a model with all possible main effects and interactions.

options(show.signif.stars=F)

Now we can look at the ANOVA table using either summary( ) or summary.aov( ).

summary(aov.out)

##             Df Sum Sq Mean Sq F value  Pr(>F)
## supp         1    205     205   15.57 0.00023
## dose         2   2426    1213   92.00 < 2e-16
## supp:dose    2    108      54    4.11 0.02186
## Residuals   54    712      13

Q.3.Use confidence intervals and hypothesis tests to compare tooth growth by supp and dose.

We see two significant main effects and a significant interaction. In R lingo, two or more factor names separated by colons refers to the interaction between those factors. For post hoc tests, we have the traditional Tukey HSD test.

TukeyHSD(aov.out)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = len ~ supp * dose, data = ToothGrowth)
## 
## $supp
##       diff   lwr   upr p adj
## VC-OJ -3.7 -5.58 -1.82 2e-04
## 
## $dose
##            diff    lwr    upr p adj
## med-low   9.130  6.362 11.898     0
## high-low 15.495 12.727 18.263     0
## high-med  6.365  3.597  9.133     0
## 
## $`supp:dose`
##                  diff     lwr     upr  p adj
## VC:low-OJ:low   -5.25 -10.048 -0.4519 0.0243
## OJ:med-OJ:low    9.47   4.672 14.2681 0.0000
## VC:med-OJ:low    3.54  -1.258  8.3381 0.2640
## OJ:high-OJ:low  12.83   8.032 17.6281 0.0000
## VC:high-OJ:low  12.91   8.112 17.7081 0.0000
## OJ:med-VC:low   14.72   9.922 19.5181 0.0000
## VC:med-VC:low    8.79   3.992 13.5881 0.0000
## OJ:high-VC:low  18.08  13.282 22.8781 0.0000
## VC:high-VC:low  18.16  13.362 22.9581 0.0000
## VC:med-OJ:med   -5.93 -10.728 -1.1319 0.0074
## OJ:high-OJ:med   3.36  -1.438  8.1581 0.3187
## VC:high-OJ:med   3.44  -1.358  8.2381 0.2936
## OJ:high-VC:med   9.29   4.492 14.0881 0.0000
## VC:high-VC:med   9.37   4.572 14.1681 0.0000
## VC:high-OJ:high  0.08  -4.718  4.8781 1.0000

The degree of confidence required for the confidence intervals can be set using the “conf.level=” option. The default value is 0.95. There is also a “which=” option, which can be used to set the factors we wish to compare, in the event we don’t want all possible comparisons. Feed it a character vector of factor names.

TukeyHSD(aov.out)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = len ~ supp * dose, data = ToothGrowth)
## 
## $supp
##       diff   lwr   upr p adj
## VC-OJ -3.7 -5.58 -1.82 2e-04
## 
## $dose
##            diff    lwr    upr p adj
## med-low   9.130  6.362 11.898     0
## high-low 15.495 12.727 18.263     0
## high-med  6.365  3.597  9.133     0
## 
## $`supp:dose`
##                  diff     lwr     upr  p adj
## VC:low-OJ:low   -5.25 -10.048 -0.4519 0.0243
## OJ:med-OJ:low    9.47   4.672 14.2681 0.0000
## VC:med-OJ:low    3.54  -1.258  8.3381 0.2640
## OJ:high-OJ:low  12.83   8.032 17.6281 0.0000
## VC:high-OJ:low  12.91   8.112 17.7081 0.0000
## OJ:med-VC:low   14.72   9.922 19.5181 0.0000
## VC:med-VC:low    8.79   3.992 13.5881 0.0000
## OJ:high-VC:low  18.08  13.282 22.8781 0.0000
## VC:high-VC:low  18.16  13.362 22.9581 0.0000
## VC:med-OJ:med   -5.93 -10.728 -1.1319 0.0074
## OJ:high-OJ:med   3.36  -1.438  8.1581 0.3187
## VC:high-OJ:med   3.44  -1.358  8.2381 0.2936
## OJ:high-VC:med   9.29   4.492 14.0881 0.0000
## VC:high-VC:med   9.37   4.572 14.1681 0.0000
## VC:high-OJ:high  0.08  -4.718  4.8781 1.0000

You can also plot the confidence intervals from the Tukey test.

TukeyHSD(aov.out, which=c("dose"), conf.level=.99)

##   Tukey multiple comparisons of means
##     99% family-wise confidence level
## 
## Fit: aov(formula = len ~ supp * dose, data = ToothGrowth)
## 
## $dose
##            diff    lwr    upr p adj
## med-low   9.130  5.638 12.622     0
## high-low 15.495 12.003 18.987     0
## high-med  6.365  2.873  9.857     0

You can also plot the confidence intervals from the Tukey test…

plot(TukeyHSD(aov.out))

plot of chunk unnamed-chunk-18

although the y-labels tend to get smushed out of existence. The 15 interaction CIs are plotted here in the same order they appear in the table above. If you want pairwise t-tests on any factor, you can have them, with a variety of different methods available for adjusting the p-value for the number of tests being done (see the help page for details).

with(ToothGrowth, pairwise.t.test(len, dose, p.adjust.method="bonferroni"))

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  len and dose 
## 
##      low     med    
## med  2.0e-08 -      
## high 4.4e-16 4.3e-05
## 
## P value adjustment method: bonferroni

Treatment Contrasts

check to make sure we’re getting treatment contrasts$contrasts

options("contrasts")

## $contrasts
##         unordered           ordered 
## "contr.treatment"      "contr.poly"

summary.lm(aov.out)

## 
## Call:
## aov(formula = len ~ supp * dose, data = ToothGrowth)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8.20  -2.72  -0.27   2.65   8.27 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)        13.23       1.15   11.52  3.6e-16
## suppVC             -5.25       1.62   -3.23   0.0021
## dosemed             9.47       1.62    5.83  3.2e-07
## dosehigh           12.83       1.62    7.90  1.4e-10
## suppVC:dosemed     -0.68       2.30   -0.30   0.7683
## suppVC:dosehigh     5.33       2.30    2.32   0.0241
## 
## Residual standard error: 3.63 on 54 degrees of freedom
## Multiple R-squared:  0.794,  Adjusted R-squared:  0.775 
## F-statistic: 41.6 on 5 and 54 DF,  p-value: <2e-16

Conclusion of the Hypothesis Testing

Interpretation of this table can be confusing for someone (like me) who is not used to looking at ANOVA output in this form. To see what’s here, refer back to the table of cell-by-cell means produced above by tapply( ). The estimated coefficient for the Intercept is the mean in the upper left corner of that table, the mean for the baseline level of “supp” combined with the baseline level of “dose”. In other words, this is the mean of the OJ-low cell. We find this mean to be significantly different from zero, not an especially interesting result. The next line of the table, “suppVC”, shows the difference between the means of the VC-low cell and the OJ-low cell. The standard error is for the difference between these means, so the difference is found to be significant. The “dosemed” line shows the difference between the means of the OJ-med cell and the OJ-low cell, and once again this difference is found to be statistically significant. The “dosehigh” line shows the difference in means between the OJ-high cell and the OJ-low cell, also significant. The last two lines test various elements of the interaction (good luck finding them). Frankly, I haven’t yet found this table very interesting, but it’s here if you want it. I don’t see anything here that isn’t done better by the Tukey HSD output.

Q.4. State your conclusions and the assumptions needed for your conclusions.

For different methods, there are significant difference in length of tooth when the dose is smal, but no significant difference when the dose is large. For different doses, there are always significant difference in length of tooth with any methods.

Here is the assumptions:

There is no other factor that would pose confounding effect on the length of tooth.
All the guinea pigs are randomly selected, and there is no significant difference in any attribute that might influence the length of tooth (such as the size of guinea pigs).
Different delivery methods would not affect the effect of dose size, thats to say the delivery methods are independent of dose size, and vice versa.
Guinea pigs do not have perference to any dose size or delivery methods, or guinea pigs reaction to dose size and dose size will not affect the length of tooth.

Statistical Inference: Project 2: Analyze the ToothGrowth Data

sougata biswas

Sunday, August 24, 2014