At first, we have to load the data frame in the working directory.
data(ToothGrowth)
Before doing any Exploratory Data Analysis, we first should know the data, ie. to know the source of the data, structure of the data, how many rows & collumns the data frame has, how many variables & how many observations the data has, head & tail of the data and overall the summary of the each variables etc. Wuthout knowing these, we cannot do any analysis of the data. Let’s do those.
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
summary(ToothGrowth)
## len supp dose
## Min. : 4.2 OJ:30 Min. :0.50
## 1st Qu.:13.1 VC:30 1st Qu.:0.50
## Median :19.2 Median :1.00
## Mean :18.8 Mean :1.17
## 3rd Qu.:25.3 3rd Qu.:2.00
## Max. :33.9 Max. :2.00
nrow(ToothGrowth)
## [1] 60
ncol(ToothGrowth)
## [1] 3
names(ToothGrowth)
## [1] "len" "supp" "dose"
head(ToothGrowth)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
tail(ToothGrowth)
## len supp dose
## 55 24.8 OJ 2
## 56 30.9 OJ 2
## 57 26.4 OJ 2
## 58 27.3 OJ 2
## 59 29.4 OJ 2
## 60 23.0 OJ 2
dim(ToothGrowth)
## [1] 60 3
From the data source, it is clear that, the data collected to measure or rather, to know the effect of Vitamin C on Tooth Growth in Guinea Pigs. The response is the length of odontoblasts (teeth) in each of 10 guinea pigs at each of three dose levels of Vitamin C (0.5, 1, and 2 mg) with each of two delivery methods (orange juice or ascorbic acid). The data frame contains 60 observations on 3 variables. The variables are: 1. len > Tooth Length > a numeric variable : This is the response variable in this case. 2. supp > Supplement Type > a factor (categorical) variable > 2 Levels : VC ( as Ascorbic Asid ) & OJ (as Orange Juice) 3. dose > “Dose” of the supplement in milligrams (0.5, 1.0, and 2.0) > a numerical variable
To begin, we should note that R considers “dose” to be numerical. We will have to be absolutely certain it is treated as a factor when we construct the model formula. Otherwise, we will end up with an analysis of covariance, which is not what we want.
First, let’s end all doubt that we want “dose” to be a factor.
ToothGrowth$dose = factor(ToothGrowth$dose, levels=c(0.5,1.0,2.0),labels=c("low","med","high"))
The function factor( ) is used to declare “ToothGrowth$dose” as a factor, with levels currently coded as 0.5, 1.0, and 2.0, but now recoded with labels of “low”, “med”, and “high”. Just to be sure, I looked at str( ) and a bit of the data frame itself.
You can also embed plots, for example:
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: Factor w/ 3 levels "low","med","high": 1 1 1 1 1 1 1 1 1 1 ...
ToothGrowth[seq(1,60,5),]
## len supp dose
## 1 4.2 VC low
## 6 10.0 VC low
## 11 16.5 VC med
## 16 17.3 VC med
## 21 23.6 VC high
## 26 32.5 VC high
## 31 15.2 OJ low
## 36 10.0 OJ low
## 41 19.7 OJ med
## 46 25.2 OJ med
## 51 25.5 OJ high
## 56 30.9 OJ high
It looks fine.
Now let’s check for a balanced design using the replications( ) function…
replications(len ~ supp * dose, data=ToothGrowth)
## supp dose supp:dose
## 30 20 10
replications(len ~ supp * dose, data=ToothGrowth[1:58,])
## $supp
## supp
## OJ VC
## 28 30
##
## $dose
## dose
## low med high
## 20 20 18
##
## $`supp:dose`
## dose
## supp low med high
## OJ 10 10 8
## VC 10 10 10
The fact that we got such a straightforward output when the function was run the first time tells us the design is balanced. To illustrate what would happen if the design were unbalanced, I ran the function a second time on a large subset of the data frame (dropping out the last two cases). In that example, we got complete tables of all the factors and how many replications (subjects) there were in each cell of the design.
There are a number of ways the data can be summarized graphically. First, let’s look at boxplots.
boxplot(len ~ supp * dose, data=ToothGrowth, ylab="Tooth Length", main="Boxplots of Tooth Growth Data")
We might get a better idea of what our main effects and interactions look like if we look at a traditional interaction plot.
with(ToothGrowth, interaction.plot(x.factor=dose, trace.factor=supp,response=len, fun=mean, type="b", legend=T,ylab="Tooth Length", main="Interaction Plot",pch=c(1,19)))
The syntax is straightforward, but there are a lot of options. First, “x.factor=” specifies the factor to be plotted on the horizontal axis. Second, “trace.factor=” specifies the factor to plotted as individual lines on the graph. Third, “response=” specifies the response variable. Fourth, “fun=” specifies the function to be calculated on the response variable. “Mean” is the default, so we could have left this one out. Fifth, “type=” specifies whether to plot lines, points, or both (we plotted “b”oth). We also asked for a legend, a label on the y-axis, and a main title on the graph. Finally, we specified the point characters to use. We could also have changed the line type, line color, and just about everything else about the graph.
Another type of graphic we could look at here is called a conditioning plot. I shall steal, er, ah, I mean, borrow the example right off the help page for the ToothGrowth data set.
coplot(len ~ dose | supp, data=ToothGrowth, panel=panel.smooth,xlab="ToothGrowth data: length vs dose, given type of supplement")
The formula is the tricky part here and should be read as “length by dose given supplement.” The “panel=” option draws the red lines when set to “panel.smooth”. The top panel can be suppressed by setting “show.given =” to FALSE. This gives a much more sensible looking plot in my humble opinion.
Numerical summaries of the data can be done with the tapply( ) function.
with(ToothGrowth, tapply(len, list(supp,dose), mean))
## low med high
## OJ 13.23 22.70 26.06
## VC 7.98 16.77 26.14
with(ToothGrowth, tapply(len, list(supp,dose), var))
## low med high
## OJ 19.889 15.296 7.049
## VC 7.544 6.327 23.018
Remember, in tapply( ) the first argument is the response variable, the second argument is a list of factors by which the response should be partitioned, and the third argument is the function you wish to calculate. Once again, we make note of some consternation over those cell variances.
Another way to get similar information is to use the model.tables( ) function, but the ANOVA has to be run first and saved into a model object.
aov.out = aov(len ~ supp * dose, data=ToothGrowth)
Note: the model formula specifies “length as a function of supplement and dose with all possible interactions between the factors. Now we can look at some model tables.
model.tables(aov.out, type="means", se=T)
## Tables of means
## Grand mean
##
## 18.81
##
## supp
## supp
## OJ VC
## 20.66 16.96
##
## dose
## dose
## low med high
## 10.61 19.73 26.10
##
## supp:dose
## dose
## supp low med high
## OJ 13.23 22.70 26.06
## VC 7.98 16.77 26.14
##
## Standard errors for differences of means
## supp dose supp:dose
## 0.9376 1.1484 1.6240
## replic. 30 20 10
We have requested the various cell and marginal means along with standard errors for the differences between means. Finally, let’s look at a Bartlett test for homogeneity of variance.
bartlett.test(len ~ supp * dose, data=ToothGrowth)
The extension to cases with more factors, for a completely randomized design where no Error structure needs to be specified, would be straightforward. Just list all the factors after the tilde separated by asterisks. This gives a model with all possible main effects and interactions.
options(show.signif.stars=F)
Now we can look at the ANOVA table using either summary( ) or summary.aov( ).
summary(aov.out)
## Df Sum Sq Mean Sq F value Pr(>F)
## supp 1 205 205 15.57 0.00023
## dose 2 2426 1213 92.00 < 2e-16
## supp:dose 2 108 54 4.11 0.02186
## Residuals 54 712 13
We see two significant main effects and a significant interaction. In R lingo, two or more factor names separated by colons refers to the interaction between those factors. For post hoc tests, we have the traditional Tukey HSD test.
TukeyHSD(aov.out)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = len ~ supp * dose, data = ToothGrowth)
##
## $supp
## diff lwr upr p adj
## VC-OJ -3.7 -5.58 -1.82 2e-04
##
## $dose
## diff lwr upr p adj
## med-low 9.130 6.362 11.898 0
## high-low 15.495 12.727 18.263 0
## high-med 6.365 3.597 9.133 0
##
## $`supp:dose`
## diff lwr upr p adj
## VC:low-OJ:low -5.25 -10.048 -0.4519 0.0243
## OJ:med-OJ:low 9.47 4.672 14.2681 0.0000
## VC:med-OJ:low 3.54 -1.258 8.3381 0.2640
## OJ:high-OJ:low 12.83 8.032 17.6281 0.0000
## VC:high-OJ:low 12.91 8.112 17.7081 0.0000
## OJ:med-VC:low 14.72 9.922 19.5181 0.0000
## VC:med-VC:low 8.79 3.992 13.5881 0.0000
## OJ:high-VC:low 18.08 13.282 22.8781 0.0000
## VC:high-VC:low 18.16 13.362 22.9581 0.0000
## VC:med-OJ:med -5.93 -10.728 -1.1319 0.0074
## OJ:high-OJ:med 3.36 -1.438 8.1581 0.3187
## VC:high-OJ:med 3.44 -1.358 8.2381 0.2936
## OJ:high-VC:med 9.29 4.492 14.0881 0.0000
## VC:high-VC:med 9.37 4.572 14.1681 0.0000
## VC:high-OJ:high 0.08 -4.718 4.8781 1.0000
The degree of confidence required for the confidence intervals can be set using the “conf.level=” option. The default value is 0.95. There is also a “which=” option, which can be used to set the factors we wish to compare, in the event we don’t want all possible comparisons. Feed it a character vector of factor names.
TukeyHSD(aov.out)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = len ~ supp * dose, data = ToothGrowth)
##
## $supp
## diff lwr upr p adj
## VC-OJ -3.7 -5.58 -1.82 2e-04
##
## $dose
## diff lwr upr p adj
## med-low 9.130 6.362 11.898 0
## high-low 15.495 12.727 18.263 0
## high-med 6.365 3.597 9.133 0
##
## $`supp:dose`
## diff lwr upr p adj
## VC:low-OJ:low -5.25 -10.048 -0.4519 0.0243
## OJ:med-OJ:low 9.47 4.672 14.2681 0.0000
## VC:med-OJ:low 3.54 -1.258 8.3381 0.2640
## OJ:high-OJ:low 12.83 8.032 17.6281 0.0000
## VC:high-OJ:low 12.91 8.112 17.7081 0.0000
## OJ:med-VC:low 14.72 9.922 19.5181 0.0000
## VC:med-VC:low 8.79 3.992 13.5881 0.0000
## OJ:high-VC:low 18.08 13.282 22.8781 0.0000
## VC:high-VC:low 18.16 13.362 22.9581 0.0000
## VC:med-OJ:med -5.93 -10.728 -1.1319 0.0074
## OJ:high-OJ:med 3.36 -1.438 8.1581 0.3187
## VC:high-OJ:med 3.44 -1.358 8.2381 0.2936
## OJ:high-VC:med 9.29 4.492 14.0881 0.0000
## VC:high-VC:med 9.37 4.572 14.1681 0.0000
## VC:high-OJ:high 0.08 -4.718 4.8781 1.0000
You can also plot the confidence intervals from the Tukey test.
TukeyHSD(aov.out, which=c("dose"), conf.level=.99)
## Tukey multiple comparisons of means
## 99% family-wise confidence level
##
## Fit: aov(formula = len ~ supp * dose, data = ToothGrowth)
##
## $dose
## diff lwr upr p adj
## med-low 9.130 5.638 12.622 0
## high-low 15.495 12.003 18.987 0
## high-med 6.365 2.873 9.857 0
You can also plot the confidence intervals from the Tukey test…
plot(TukeyHSD(aov.out))
although the y-labels tend to get smushed out of existence. The 15 interaction CIs are plotted here in the same order they appear in the table above. If you want pairwise t-tests on any factor, you can have them, with a variety of different methods available for adjusting the p-value for the number of tests being done (see the help page for details).
with(ToothGrowth, pairwise.t.test(len, dose, p.adjust.method="bonferroni"))
##
## Pairwise comparisons using t tests with pooled SD
##
## data: len and dose
##
## low med
## med 2.0e-08 -
## high 4.4e-16 4.3e-05
##
## P value adjustment method: bonferroni
check to make sure we’re getting treatment contrasts$contrasts
options("contrasts")
## $contrasts
## unordered ordered
## "contr.treatment" "contr.poly"
summary.lm(aov.out)
##
## Call:
## aov(formula = len ~ supp * dose, data = ToothGrowth)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.20 -2.72 -0.27 2.65 8.27
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.23 1.15 11.52 3.6e-16
## suppVC -5.25 1.62 -3.23 0.0021
## dosemed 9.47 1.62 5.83 3.2e-07
## dosehigh 12.83 1.62 7.90 1.4e-10
## suppVC:dosemed -0.68 2.30 -0.30 0.7683
## suppVC:dosehigh 5.33 2.30 2.32 0.0241
##
## Residual standard error: 3.63 on 54 degrees of freedom
## Multiple R-squared: 0.794, Adjusted R-squared: 0.775
## F-statistic: 41.6 on 5 and 54 DF, p-value: <2e-16
Interpretation of this table can be confusing for someone (like me) who is not used to looking at ANOVA output in this form. To see what’s here, refer back to the table of cell-by-cell means produced above by tapply( ). The estimated coefficient for the Intercept is the mean in the upper left corner of that table, the mean for the baseline level of “supp” combined with the baseline level of “dose”. In other words, this is the mean of the OJ-low cell. We find this mean to be significantly different from zero, not an especially interesting result. The next line of the table, “suppVC”, shows the difference between the means of the VC-low cell and the OJ-low cell. The standard error is for the difference between these means, so the difference is found to be significant. The “dosemed” line shows the difference between the means of the OJ-med cell and the OJ-low cell, and once again this difference is found to be statistically significant. The “dosehigh” line shows the difference in means between the OJ-high cell and the OJ-low cell, also significant. The last two lines test various elements of the interaction (good luck finding them). Frankly, I haven’t yet found this table very interesting, but it’s here if you want it. I don’t see anything here that isn’t done better by the Tukey HSD output.
For different methods, there are significant difference in length of tooth when the dose is smal, but no significant difference when the dose is large. For different doses, there are always significant difference in length of tooth with any methods.
Here is the assumptions: