This project analyses the ToothGrowth dataset from the R datasets package. It performs a basic exploratory analysis of the data, formulates hypotheses comparing tooth growth by supplement (supp) and dose, and draws conclusions from confidence intervals and p-values based on the t distribution.
The overall conclusion of the analysis is that supplement OJ produces greater tooth growth than supplement VC at dose levels 0.5 and 1.
Let us first load the dataset and take a look.
library(datasets)
data(ToothGrowth)
head(ToothGrowth)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
Let us check whether there are any NAs and then look at the structure of the data.
## Are there any NAs?
## complete.cases() returns one logical value per row, so count the TRUE values
if(sum(complete.cases(ToothGrowth)) == nrow(ToothGrowth)){
print("No NA")} else {
print("NA present")
}
## [1] "No NA"
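A more compact way to run the same check is sketched below. It uses only base R; the variable names `has_na` and `na_per_column` are illustrative and not part of the original analysis.

```r
library(datasets)
data(ToothGrowth)

# TRUE if any cell in the data frame is NA
has_na <- anyNA(ToothGrowth)

# Number of missing values in each column
na_per_column <- colSums(is.na(ToothGrowth))

print(has_na)
print(na_per_column)
```

The per-column count is handy when NAs are present, because it shows which variable needs attention.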
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
There are only two unique values in supp and three unique values in dose. This suggests a table structure.
## Get a view of count of Supp vs Dose
table(ToothGrowth$supp, ToothGrowth$dose)
##
## 0.5 1 2
## OJ 10 10 10
## VC 10 10 10
This indicates that there are 6 groups of data with 10 observations each. Let us plot the length vs Supp and dose
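library(ggplot2)
## qplot() is deprecated as of ggplot2 3.4.0; this is the equivalent ggplot() call
ggplot(ToothGrowth, aes(x = supp, y = len)) + geom_point() + facet_grid(. ~ dose)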
This looks interesting. To get a better view, let us boxplot the data.
len_suppdose <- split(ToothGrowth[,1], ToothGrowth[,c('supp','dose')])
boxplot(len_suppdose)
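To put numbers next to the box plot, the group means and standard deviations can be tabulated in the same supp by dose layout as the count table. This is a minimal sketch using base R `tapply`; the names `group_means` and `group_sds` are introduced here for illustration.

```r
library(datasets)
data(ToothGrowth)

# Mean tooth length for each supp (rows) and dose (columns) combination
group_means <- with(ToothGrowth, tapply(len, list(supp, dose), mean))

# Standard deviation of tooth length for the same six groups
group_sds <- with(ToothGrowth, tapply(len, list(supp, dose), sd))

print(round(group_means, 2))
print(round(group_sds, 2))
```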
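The box plot suggests the following hypotheses: at doses 0.5 and 1, OJ produces greater tooth length than VC, whereas at dose 2 the two supplements produce similar lengths.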
To test these hypotheses, first check whether the data are close enough to normally distributed to justify using either the normal distribution or the t distribution.
par(mfrow = c(3,2))
for(i in 1:6){
qqnorm(len_suppdose[[i]], main = names(len_suppdose)[i])
qqline(len_suppdose[[i]], col = "blue")
}
The data does look approximately normal.
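As a numeric complement to the Q-Q plots, one option is a Shapiro-Wilk test on each of the six groups. This is only a sketch and was not part of the original analysis; with 10 observations per group the test has little power, so a large p-value is weak evidence at best. The name `shapiro_p` is illustrative.

```r
library(datasets)
data(ToothGrowth)

# Same six supp/dose groups as in the Q-Q plots
len_suppdose <- split(ToothGrowth$len, ToothGrowth[, c("supp", "dose")])

# Shapiro-Wilk p-value per group; small values would indicate
# a departure from normality
shapiro_p <- sapply(len_suppdose, function(x) shapiro.test(x)$p.value)

print(round(shapiro_p, 3))
```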
There are only 10 observations in each group, so let us use the t distribution rather than the normal distribution.
For each dose (0.5, 1 and 2), we run a two-sided Welch t.test of OJ against VC, with the null hypothesis that the mean len is the same for both supplements.
analysis <- data.frame(Supp1=character(0),
Supp2=character(0),
Dose=character(0),
LCL=character(0),
UCL=character(0),
PValue=character(0),
Hypothesis=character(0),
stringsAsFactors=FALSE)
i <- 1L
for(j in c(0.5, 1, 2)){
## subset the length for OJ and VC for each of the dose
len_OJ <- ToothGrowth[as.character(ToothGrowth$supp) == "OJ" &
ToothGrowth$dose == j, "len"]
len_VC <- ToothGrowth[as.character(ToothGrowth$supp) == "VC" &
ToothGrowth$dose == j, "len"]
## run the t test
result <- t.test(len_OJ,
len_VC,
paired = FALSE,
var.equal = FALSE )
## Reject the null only when the 95% CI lies entirely above 0 and p < 0.05;
## "Accept Null" here means "fail to reject the null"
hyp_test <- if(result$conf.int[1] < 0 || result$p.value > 0.05)
{"Accept Null"
} else {"Reject Null"}
## add the row to the dataframe
analysis[i,] <- as.vector(c("OJ",
"VC",
j,
round(result$conf.int[1],2),
round(result$conf.int[2],2),
round(result$p.value, 4),
hyp_test ))
i <- i + 1L
}
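To make explicit what `t.test` computes with `var.equal = FALSE`, the sketch below reproduces the Welch statistic, the Welch-Satterthwaite degrees of freedom, the two-sided p-value and the 95% confidence interval by hand for dose 0.5, then compares them with the built-in result. The variable names (`x`, `y`, `se2_x`, `t_stat` and so on) are illustrative only.

```r
library(datasets)
data(ToothGrowth)

x <- ToothGrowth$len[ToothGrowth$supp == "OJ" & ToothGrowth$dose == 0.5]
y <- ToothGrowth$len[ToothGrowth$supp == "VC" & ToothGrowth$dose == 0.5]

# Squared standard error of each group mean
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)
df <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# Two-sided p-value and 95% confidence interval for mean(x) - mean(y)
p_value <- 2 * pt(-abs(t_stat), df)
ci <- (mean(x) - mean(y)) + c(-1, 1) * qt(0.975, df) * sqrt(se2_x + se2_y)

# Compare with the built-in Welch test
builtin <- t.test(x, y, paired = FALSE, var.equal = FALSE)
print(all.equal(p_value, builtin$p.value))
print(all.equal(ci, as.numeric(builtin$conf.int)))
```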
Now we can print the results of the analysis
library(xtable)
print(xtable(analysis,
caption = "T Test on Supp for varying dosage"),
type = "html")
| | Supp1 | Supp2 | Dose | LCL | UCL | PValue | Hypothesis |
|---|---|---|---|---|---|---|---|
| 1 | OJ | VC | 0.5 | 1.72 | 8.78 | 0.0064 | Reject Null |
| 2 | OJ | VC | 1 | 2.8 | 9.06 | 0.001 | Reject Null |
| 3 | OJ | VC | 2 | -3.8 | 3.64 | 0.9639 | Accept Null |
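As seen in the above table, and under the assumptions stated above, OJ produces significantly greater length than VC at doses 0.5 and 1: the 95% confidence intervals for the difference in means lie entirely above zero and the p-values are below 0.05. At dose 2 the confidence interval contains zero and the p-value is 0.9639, so we fail to reject the null hypothesis; there is no evidence that OJ produces greater length than VC at that dose.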
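Because the conclusion is directional (OJ greater than VC), a one-sided test is one possible alternative formulation. The sketch below is not the analysis reported above, which used two-sided tests; it reruns the Welch test per dose with `alternative = "greater"`. The name `one_sided_p` is illustrative.

```r
library(datasets)
data(ToothGrowth)

doses <- c(0.5, 1, 2)

# One-sided Welch test per dose; alternative: mean(OJ) - mean(VC) > 0
one_sided_p <- sapply(doses, function(d) {
  len_OJ <- ToothGrowth$len[ToothGrowth$supp == "OJ" & ToothGrowth$dose == d]
  len_VC <- ToothGrowth$len[ToothGrowth$supp == "VC" & ToothGrowth$dose == d]
  t.test(len_OJ, len_VC, alternative = "greater", var.equal = FALSE)$p.value
})
names(one_sided_p) <- as.character(doses)

print(round(one_sided_p, 4))
```

The choice between a one-sided and a two-sided test should be made before looking at the data, so this is best read as a sensitivity check rather than a replacement for the table above.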