Statistical Inference Project

Overview

This report analyzed data on tooth growth in guinea pig subjects contained in the ToothGrowth data set from R. The report describes the results of some basic exploratory analyses of the subject data, and includes a summary of the data as well as a summary of the insights gained by interpreting the results of inferential statistics such as hypothesis tests.

Data Description

The following synopsis of the dataset was taken from the R Manual at https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/ToothGrowth.html

The response is the length of odontoblasts (teeth) in each of 10 guinea pigs at each of three dose levels of Vitamin C (0.5, 1, and 2 mg) with each of two delivery methods of orange juice (OJ) or ascorbic acid (VC).

The format is a data frame with 60 observations on the following three variables:

len - Tooth length
supp - Supplement type (VC or OJ).
dose - Dose in milligrams.

For clarity, the column names in the data set associated with these variables were changed as follows:

len to length
supp to supplement
dose to dosage

Interpretation of data description

The data set consists of 60 observations that were split into six groups of 10 based on the dose level and delivery method tested in the study. Each group was given one of six possible combinations of the three dose levels of Vitamin C (0.5 mg, 1 mg, and 2 mg) and the two delivery methods (orange juice or ascorbic acid).

The variable tooth length was the response, or dependent, variable, and the dosage level and delivery method were the independent variables.
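
The six-group structure described above can be checked directly with a cross-tabulation. The following is a minimal sketch that uses the original ToothGrowth column names (supp and dose) rather than the renamed columns; the object name design.counts is introduced here only for illustration.

```r
# Cross-tabulate supplement type against dose level using the built-in data set
design.counts <- table(ToothGrowth$supp, ToothGrowth$dose)
print(design.counts)

# TRUE when each of the six supplement-dose combinations holds 10 observations
print(all(design.counts == 10))
```

A balanced design like this one means that any difference in group means cannot be explained by unequal group sizes.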

Goal of this analysis

The goal of this analysis was to use appropriate statistical analysis techniques to identify one or more noteworthy relationships between tooth length and some combination of the type of supplement and the dosage level of the supplement.

Data summary

After the data was loaded, a quick summary revealed that there were 30 observations for each of the supplements OJ and VC, and 20 observations for each of the dosages 0.5 mg, 1 mg, and 2 mg.

A histogram revealed that the tooth length data was roughly symmetric and mound-shaped, but not clearly normally distributed, with the mean (in red) close to the median (in blue).

# Load ToothGrowth data and change the column names
tooth.data <- ToothGrowth
names(tooth.data) <- c("length", "supplement", "dosage")

summary(tooth.data)

##      length      supplement     dosage     
##  Min.   : 4.20   OJ:30      Min.   :0.500  
##  1st Qu.:13.07   VC:30      1st Qu.:0.500  
##  Median :19.25              Median :1.000  
##  Mean   :18.81              Mean   :1.167  
##  3rd Qu.:25.27              3rd Qu.:2.000  
##  Max.   :33.90              Max.   :2.000

paste("Tooth length ranged from a minimum of ",min(tooth.data$length)," to a maximum of ", max(tooth.data$length),sep="")

## [1] "Tooth length ranged from a minimum of 4.2 to a maximum of 33.9"

hist(tooth.data$length,
        main="Tooth length distribution",
        xlab="Tooth length",
        ylab="Frequency")
abline(v=median(tooth.data$length), col="blue")
abline(v=mean(tooth.data$length), col="red")

table(tooth.data$dosage)

## 
## 0.5   1   2 
##  20  20  20

table(tooth.data$supplement)

## 
## OJ VC 
## 30 30

Given the roughly symmetric shape of the histogram (the mean was close to the median, and frequencies were much lower at the high and low ends of the range) and the size of the data set, a t-test was considered appropriate for checking whether there was a significant difference between the observations with the OJ and VC supplements.

Although the two sets of observations were of equal size, they were treated as independent groups rather than as paired observations.

No assumptions were made about the variances, so the test was done assuming that the variances were not equal.
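
One way to supplement the visual check of the histogram is a normal Q-Q plot together with a Shapiro-Wilk test. The sketch below was not part of the original analysis; it assumes the built-in ToothGrowth data with its original column name len, and the names tooth.length and normality.check are introduced only for illustration.

```r
# Tooth lengths from the built-in data set (original column name: len)
tooth.length <- ToothGrowth$len

# Normal Q-Q plot: points close to the reference line suggest approximate normality
qqnorm(tooth.length, main = "Normal Q-Q plot of tooth length")
qqline(tooth.length, col = "red")

# Shapiro-Wilk test: a small p-value would be evidence against normality
normality.check <- shapiro.test(tooth.length)
print(normality.check)
```

Neither check proves normality; they only indicate whether the data departs from it enough to make a t-test questionable.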

The data was then divided into six groups based on the level of dosage and the supplement type, and the means of those six groups were compared.

# Create a six-item list split by the two variables
tooth.groups <- split(tooth.data, tooth.data[,c("supplement", "dosage")] )
tooth.means <- by(tooth.data$length, tooth.data[,c("supplement", "dosage")],mean )

# Mean, supplement, and dosage for each group; c() coerces all three values to character
mean.vec <- sapply(1:length(tooth.groups), function(i){
     c(mean(tooth.groups[[i]][,1]), 
       as.character(tooth.groups[[i]][1,2]),
       tooth.groups[[i]][1,3])
        })

group.label <- sapply(1:length(tooth.groups),function(i){
         paste(tooth.groups[[i]][1,2],"-",tooth.groups[[i]][1,3],sep="")
         })
six.group.summary <- cbind(mean.vec[1,],group.label)

# Bar plot with labels
barplot(as.numeric(mean.vec[1,]),
        main="Tooth length by supplement-dosage combination",
        xlab="Supplement-dosage combination",
        ylab="Tooth length",
        names.arg = group.label)
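
The same six group means can also be obtained in a single step with aggregate(), which keeps the means numeric instead of coercing them to character. This is a sketch of an alternative, not the code used for the plot above; it works on the original ToothGrowth columns, and the names group.means, mean.length, and label are assumptions made for illustration.

```r
# Mean tooth length for each supplement-dose combination, in one data frame
group.means <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
names(group.means) <- c("supplement", "dosage", "mean.length")

# Build "supplement-dosage" labels in the same style as the bar plot labels
group.means$label <- paste(group.means$supplement, group.means$dosage, sep = "-")
print(group.means)
```

The resulting data frame could be passed to barplot() through group.means$mean.length and group.means$label without any as.numeric() conversion.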

A review of the bar plot suggested that tooth length was more strongly associated with the dosage amount than with the supplement type. To test that idea, two comparisons were made, the first between the supplement types and the second between dosage levels.

The next steps in the analysis were to look at the role of the supplements, and the role of the dosage amounts.

Effects of supplements on tooth growth

As a first step, the data set was split into two groups, one where the guinea pigs received orange juice (OJ), and a second group where they received ascorbic acid (VC), and the means and standard deviations for these groups were computed.

oj.only <- tooth.data$length[tooth.data$supplement=="OJ"]
vc.only <- tooth.data$length[tooth.data$supplement=="VC"]
paste("The 30 OJ observations had a mean of ",round(mean(oj.only),digits=2), " and a standard deviation of ",round(sd(oj.only),digits=2), sep="")

## [1] "The 30 OJ observations had a mean of 20.66 and a standard deviation of 6.61"

paste("The 30 VC observations had a mean of ",round(mean(vc.only),digits=2), " and a standard deviation of ",round(sd(vc.only),digits=2), sep="")

## [1] "The 30 VC observations had a mean of 16.96 and a standard deviation of 8.27"

A t-test was performed on the two sets of supplement data to test the null hypothesis that the true difference in means between the two groups was equal to zero. Given the differences in the standard deviations, it was assumed that these two sets of observations did not have equal variances.

t.test(oj.only,vc.only)

## 
##  Welch Two Sample t-test
## 
## data:  oj.only and vc.only
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean of x mean of y 
##  20.66333  16.96333

paste("Given the p-value of ",round(t.test(oj.only,vc.only)$p.value,digits=3), ", one can't reject the null hypothesis that the true difference in means between these two groups is equal to zero.", sep="")

## [1] "Given the p-value of 0.061, one can't reject the null hypothesis that the true difference in means between these two groups is equal to zero."
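
The same conclusion can be read from the 95 percent confidence interval reported above, which runs from about -0.17 to 7.57 and therefore includes zero. The sketch below repeats the Welch test through the formula interface on the original ToothGrowth columns and checks that condition; the names supp.test, supp.ci, and ci.contains.zero are introduced only for illustration.

```r
# Welch two-sample t-test via the formula interface (var.equal = FALSE is the default)
supp.test <- t.test(len ~ supp, data = ToothGrowth)

# 95% confidence interval for the difference in means (OJ minus VC)
supp.ci <- supp.test$conf.int
print(supp.ci)

# TRUE when the interval contains zero, i.e. no rejection at the 5% level
ci.contains.zero <- supp.ci[1] < 0 && supp.ci[2] > 0
print(ci.contains.zero)
```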

Effects of dosage on tooth growth

After evaluating the effect of the supplements, one can look at the effect of dosage on tooth growth. In order to closely follow the same analysis steps that were used with the supplements, the 60 observations were split into two equal-sized groups, one with a lower dosage and one with a higher dosage.

One way to do that was to first take the 20 observations with the mid-range dosage of 1 mg and randomly assign half to the low dosage group and half to the high dosage group. After that assignment, the means and standard deviations of the two groups were computed, and a t-test was run to see if one could reject the null hypothesis that the true difference in means between these two groups was zero.

lowest.dose <- tooth.data$length[tooth.data$dosage==0.5]
highest.dose <- tooth.data$length[tooth.data$dosage==2.0]
middle.dose <- tooth.data$length[tooth.data$dosage==1.0]

set.seed(0) # ensures consistent shuffle order
middle.shuffle <- sample(middle.dose)
lower.dose <- c(lowest.dose,middle.shuffle[1:10])
higher.dose <- c(highest.dose,middle.shuffle[11:20])
paste("The 30 lower dose observations had a mean of ",round(mean(lower.dose),digits=2), " and a standard deviation of ",round(sd(lower.dose),digits=2), sep="")

## [1] "The 30 lower dose observations had a mean of 13.63 and a standard deviation of 6.29"

paste("The 30 higher dose observations had a mean of ",round(mean(higher.dose),digits=2), " and a standard deviation of ",round(sd(higher.dose),digits=2), sep="")

## [1] "The 30 higher dose observations had a mean of 24 and a standard deviation of 4.88"

A t-test was performed on the two sets of dosage data to test the null hypothesis that the true difference in means between these two groups was equal to zero. Given the differences in the standard deviations, it was assumed that these two sets of observations did not have equal variances.

t.test(lower.dose,higher.dose)

## 
##  Welch Two Sample t-test
## 
## data:  lower.dose and higher.dose
## t = -7.1366, df = 54.663, p-value = 2.324e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -13.286697  -7.459969
## sample estimates:
## mean of x mean of y 
##  13.62667  24.00000

paste("Given the p-value of ",signif(t.test(lower.dose,higher.dose)$p.value,digits=3), ", one can reject the null hypothesis that the true difference in means between these two groups is equal to zero.", sep="")

## [1] "Given the p-value of 2.32e-09, one can reject the null hypothesis that the true difference in means between these two groups is equal to zero."
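
An alternative that avoids randomly reassigning the 1 mg observations is to compare each pair of dose levels directly. The sketch below is not the method used in this report; it runs three Welch t-tests on the original dose groups in ToothGrowth and applies a Bonferroni adjustment for the three comparisons. The names dose.pairs, pair.p, and pair.p.adjusted are introduced only for illustration.

```r
# All pairs of the three dose levels: (0.5, 1), (0.5, 2), (1, 2)
dose.pairs <- combn(sort(unique(ToothGrowth$dose)), 2)

# Welch t-test p-value for each pair of dose levels
pair.p <- apply(dose.pairs, 2, function(d) {
    x <- ToothGrowth$len[ToothGrowth$dose == d[1]]
    y <- ToothGrowth$len[ToothGrowth$dose == d[2]]
    t.test(x, y)$p.value
})
names(pair.p) <- apply(dose.pairs, 2, paste, collapse = " vs ")

# Bonferroni adjustment for making three comparisons
pair.p.adjusted <- p.adjust(pair.p, method = "bonferroni")
print(pair.p.adjusted)
```

Because no observations are shuffled, this version does not depend on the random seed.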

Conclusion

After an initial summary of the data, a histogram plus measurements of the mean and median showed that the 60 tooth length measurements were roughly symmetric about the mean, and that the median was close to the mean.

A second graph that looked at the means of the observations when grouped by dosage and supplements suggested that either the supplement or the magnitude of the dosage provided may have had a significant effect on tooth length.

There were two sets of analyses performed on the data, one that split the data into a pair of equal sized groups based on the supplement provided, and a second split by dosage amounts, with the 20 mid-ranged dosage observations randomly and evenly split between the lowest and highest dosage categories.

A t-test was performed for both analyses, and in both cases it was not assumed that the two groups had equal variances.

While the guinea pigs that received the orange juice supplement had a higher average tooth length, the t-test on the two supplement groups did not provide sufficient evidence to reject the null hypothesis that the true difference in means between these two groups is equal to zero.

In contrast, the t-test for the higher and lower dosage groups showed that there was a statistically significant difference between the lower and higher dosage groups, with the higher dosage group having a higher average tooth length. In this case, one can reject the null hypothesis that the true difference in means is equal to zero.

Neither of the two t-tests provided any clear understanding or insights on the possible effect that a combination of supplement and dosage level had on tooth length.
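
One possible way to start exploring the combined effect is to repeat the supplement comparison separately within each dose level. The sketch below was not part of this analysis; it runs a Welch t-test of OJ against VC at each of the three doses in the original ToothGrowth data, and the names dose.levels and supp.by.dose are introduced only for illustration. With only 10 observations per group and three separate tests, any result should be treated as exploratory.

```r
# Compare OJ and VC within each dose level using a Welch t-test
dose.levels <- sort(unique(ToothGrowth$dose))
supp.by.dose <- sapply(dose.levels, function(d) {
    one.dose <- ToothGrowth[ToothGrowth$dose == d, ]
    t.test(len ~ supp, data = one.dose)$p.value
})
names(supp.by.dose) <- paste(dose.levels, "mg")
print(round(supp.by.dose, digits = 4))
```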

Statistical Inference Project - ToothGrowth data

Todd Curtis

February 13, 2015