This report will examine the ToothGrowth data set and provide a basic inferential data analysis.
Place the data in ‘df’
df<-ToothGrowth
The notes that accompany the data set (available by calling help(ToothGrowth)) describe the experiment.
Some of the information returned by help() is reproduced below.
The Effect of Vitamin C on Tooth Growth in Guinea Pigs
Description
The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice or ascorbic acid (a form of vitamin C and coded as VC).
Usage
ToothGrowth
Format
A data frame with 60 observations on 3 variables.
[,1] len numeric Tooth length
[,2] supp factor Supplement type (VC or OJ).
[,3] dose numeric Dose in milligrams/day
We call some exploratory functions on the data set below.
dim(df)
## [1] 60 3
names(df)
## [1] "len" "supp" "dose"
head(df)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
summary(df)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
In summary, we are dealing with 60 rows of data, each with 3 columns, with variable types as outlined above.
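Before testing, it can help to confirm how the 60 animals are spread over the supp / dose combinations. The following is a minimal base R sketch; the name `cell_counts` is introduced here for illustration and is not part of the original analysis.

```r
# ToothGrowth ships with base R (datasets package), so no extra packages are needed.
df <- ToothGrowth

# Column types: len and dose are numeric, supp is a factor with levels OJ and VC.
str(df)

# Cross-tabulate supplement type by dose to count the animals in each cell.
# (cell_counts is an illustrative name, not used elsewhere in the report.)
cell_counts <- table(supp = df$supp, dose = df$dose)
print(cell_counts)
```

A balanced table (the same count in every cell) means that each of the per-dose comparisons later in the report rests on equally sized groups.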
# Split the data by supplement type and keep one data frame per supplement
df_split<-split(df,df$supp)
df_OJ<-df_split$OJ
df_VC<-df_split$VC
head(df_VC)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
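# Load the plotting and data manipulation packages used below
library(ggplot2)
library(dplyr)
# Combined supp: tooth length against dose, coloured by supplement type
ggplot(df)+geom_point(aes(x=len,y=dose,colour=supp))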
df_group<-df%>%group_by(supp,dose) %>%summarise(mean_len=(mean(len)),sd_len=sd(len))
#ggplot(df_group)+geom_point(aes(x=mean_len,y=dose,colour=supp))
#df_group<-summarise(df_group,mean_len=(mean(len)))
ggplot(df_group,aes(dose,mean_len))+geom_bar(stat="identity",aes(fill=supp),position="dodge")+xlab("dose amount")+ylab("mean length")+ggtitle("Mean tooth growth (length) for each supplement at each dosage")
ggplot(df_group,aes(dose,sd_len))+geom_bar(stat="identity",aes(fill=supp),position="dodge")+xlab("dose amount")+ylab("standard deviation of length")+ggtitle("Standard deviation of tooth growth (length) for each supplement at each dosage")
The graphs suggest some hypotheses that we would like to test regarding our supp / dose group combinations.
Our summary so far hints that the two supp groups differ at some dosages.
df_group
## Source: local data frame [6 x 4]
## Groups: supp [?]
##
## supp dose mean_len sd_len
## <fctr> <dbl> <dbl> <dbl>
## 1 OJ 0.5 13.23 4.459709
## 2 OJ 1.0 22.70 3.910953
## 3 OJ 2.0 26.06 2.655058
## 4 VC 0.5 7.98 2.746634
## 5 VC 1.0 16.77 2.515309
## 6 VC 2.0 26.14 4.797731
From visual inspection, the mean lengths of the two supp groups differ noticeably at dosages 0.5 and 1, whereas at dosage 2 the means are similar.
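The size of each gap can be read off directly by subtracting the group means. The sketch below uses base R only; `cell_means` and `mean_gap` are illustrative names and not part of the original analysis.

```r
df <- ToothGrowth

# Mean length for every supp / dose cell, arranged as a 2 x 3 matrix
# (rows are supplements, columns are doses).
cell_means <- with(df, tapply(len, list(supp = supp, dose = dose), mean))
print(round(cell_means, 2))

# OJ minus VC at each dose; a positive value means the OJ group
# had the longer mean length at that dose.
mean_gap <- cell_means["OJ", ] - cell_means["VC", ]
print(round(mean_gap, 2))
```

These differences are only descriptive. Whether they are larger than chance variation would produce is what the t-tests that follow address.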
We now want to test our guesses by way of some hypothesis testing.
We will perform four tests.
Test 1: Is there a statistically significant difference in growth length between the two supp groups across all dosage values?
Tests 2, 3 and 4: Is there a statistically significant difference in growth length between the two supp groups at each dosage value (0.5, 1 and 2)?
Because the difference could lie in either direction (either supplement could produce the greater growth), these will be two-sided t-tests. We will use a 95% confidence level (the default value).
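The choices described above correspond to the default arguments of t.test(). As a sketch, the call below spells them out and checks that the result matches the plain call; `test_explicit` and `test_default` are illustrative names.

```r
df <- ToothGrowth

# The same test with the defaults written out:
#   alternative = "two.sided"  tests for a difference in either direction
#   conf.level  = 0.95         gives a 95% confidence interval
#   var.equal   = FALSE        requests the Welch (unequal variance) test
test_explicit <- t.test(len ~ supp, data = df,
                        alternative = "two.sided",
                        conf.level = 0.95,
                        var.equal = FALSE)

# The call relying on the defaults.
test_default <- t.test(len ~ supp, data = df)

# Both calls should report the same p-value and confidence interval.
print(all.equal(test_explicit$p.value, test_default$p.value))
print(test_explicit$conf.int)
```

Writing the arguments explicitly changes nothing numerically, but it makes the assumptions of the analysis visible in the code.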
Test 1:
H0: There is no difference in means across the two supp groups (for all dosage values)
m(VC) = mean of VC group
m(OJ) = mean of OJ group
H0: m(VC)-m(OJ)=0
H1: m(VC)-m(OJ)<>0
t.test(df$len ~ df$supp)
##
## Welch Two Sample t-test
##
## data: df$len by df$supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1710156 7.5710156
## sample estimates:
## mean in group OJ mean in group VC
## 20.66333 16.96333
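To see where the reported t and df come from, the Welch statistic can be reproduced by hand. With group means m, variances s^2 and sizes n, the statistic is t = (m(OJ) - m(VC)) / sqrt(s^2(OJ)/n(OJ) + s^2(VC)/n(VC)), and the degrees of freedom follow the Welch-Satterthwaite approximation. The sketch below uses illustrative variable names.

```r
df <- ToothGrowth

# Lengths for each supplement group, pooled over all doses.
oj <- df$len[df$supp == "OJ"]
vc <- df$len[df$supp == "VC"]

# Squared standard error of each group mean.
se2_oj <- var(oj) / length(oj)
se2_vc <- var(vc) / length(vc)

# Welch t statistic (OJ minus VC, matching the order t.test() uses).
t_stat <- (mean(oj) - mean(vc)) / sqrt(se2_oj + se2_vc)

# Welch-Satterthwaite approximation to the degrees of freedom.
welch_df <- (se2_oj + se2_vc)^2 /
  (se2_oj^2 / (length(oj) - 1) + se2_vc^2 / (length(vc) - 1))

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = welch_df)

print(c(t = t_stat, df = welch_df, p = p_value))
```

The three values should agree with the t, df and p-value printed by t.test() above.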
Test 2:
H0: There is no difference in means across the two supp groups (for dose = 0.5)
m(VC) = mean of VC group for dose 0.5
m(OJ) = mean of OJ group for dose 0.5
H0: m(VC)-m(OJ)=0
H1: m(VC)-m(OJ)<>0
#Create a data frame to hold the dose 0.5 results
df_0.5<-subset(df,df$dose==0.5)
t.test(df_0.5$len ~ df_0.5$supp)
##
## Welch Two Sample t-test
##
## data: df_0.5$len by df_0.5$supp
## t = 3.1697, df = 14.969, p-value = 0.006359
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.719057 8.780943
## sample estimates:
## mean in group OJ mean in group VC
## 13.23 7.98
Test 3:
H0: There is no difference in means across the two supp groups (for dose = 1)
m(VC) = mean of VC group for dose 1
m(OJ) = mean of OJ group for dose 1
H0: m(VC)-m(OJ)=0
H1: m(VC)-m(OJ)<>0
#Create a data frame to hold the dose 1 results
df_1<-subset(df,df$dose==1)
t.test(df_1$len ~ df_1$supp)
##
## Welch Two Sample t-test
##
## data: df_1$len by df_1$supp
## t = 4.0328, df = 15.358, p-value = 0.001038
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.802148 9.057852
## sample estimates:
## mean in group OJ mean in group VC
## 22.70 16.77
Test 4:
H0: There is no difference in means across the two supp groups (for dose = 2)
m(VC) = mean of VC group for dose 2
m(OJ) = mean of OJ group for dose 2
H0: m(VC)-m(OJ)=0
H1: m(VC)-m(OJ)<>0
#Create a data frame to hold the dose 2 results
df_2<-subset(df,df$dose==2)
t.test(df_2$len ~ df_2$supp)
##
## Welch Two Sample t-test
##
## data: df_2$len by df_2$supp
## t = -0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.79807 3.63807
## sample estimates:
## mean in group OJ mean in group VC
## 26.06 26.14
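Tests 2 to 4 repeat the same call on three subsets, so they can also be run in one pass. The following is a compact sketch of that idea rather than a replacement for the output above; `doses` and `p_by_dose` are illustrative names.

```r
df <- ToothGrowth

# The three dose levels present in the data.
doses <- sort(unique(df$dose))

# Run the Welch two-sample t-test of len by supp within each dose
# and keep only the p-value.
p_by_dose <- sapply(doses, function(d) {
  t.test(len ~ supp, data = df[df$dose == d, ])$p.value
})
names(p_by_dose) <- doses

print(signif(p_by_dose, 4))

# Which doses show a difference at the 5% significance level?
print(p_by_dose < 0.05)
```

The p-values should match those printed for Tests 2, 3 and 4.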
From our hypothesis test results, we can say the following.
At the 5% significance level (95% confidence), we fail to reject the null hypothesis that there is no difference in mean length between the two supp groups when all dosage levels are considered together (p = 0.061, and the confidence interval includes 0).
When the data are examined at each of the three dosage levels, there is enough evidence (at 95% confidence) to conclude that the supp groups differ at dosage levels of 0.5 (p = 0.006) and 1 (p = 0.001), but not enough evidence to reject the null hypothesis at a dosage of 2 (p = 0.964).
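The same conclusions can be read from the confidence intervals: a 95% interval for the difference in means that excludes 0 corresponds to rejecting H0 at the 5% level. The sketch below collects the four intervals in one table; `subsets`, `ci` and `excludes_zero` are illustrative names.

```r
df <- ToothGrowth

# The four data subsets used in Tests 1 to 4.
subsets <- list(all      = df,
                dose_0.5 = df[df$dose == 0.5, ],
                dose_1   = df[df$dose == 1, ],
                dose_2   = df[df$dose == 2, ])

# 95% confidence interval for mean(OJ) - mean(VC) in each subset,
# one row per test.
ci <- t(sapply(subsets, function(d) t.test(len ~ supp, data = d)$conf.int))
colnames(ci) <- c("lower", "upper")

# H0 is rejected at the 5% level exactly when the interval excludes 0.
excludes_zero <- ci[, "lower"] > 0 | ci[, "upper"] < 0

# excludes_zero is shown as 1 (TRUE) or 0 (FALSE) in the combined table.
print(cbind(round(ci, 3), excludes_zero))
```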
T-test Assumptions
The observed variable is a continuous measurement.
The sample is randomly selected from the population (we assume the experiment designers adhered to this).
The data are approximately normally distributed. The histograms below show the data for some of the subsets used in the tests.
Some of the supp / dose combinations may not conform to this. The appendix section plots histograms of each of the supp / dose combinations.
The combinations of concern are depicted below:
- dose = 0.5 for both OJ and VC supp values
- dose = 1 for OJ supp group
# dose = 0.5 and supp = OJ
df_0.5OJ <- filter(df_0.5, supp == "OJ")
hist(df_0.5OJ$len)
# dose = 0.5 and supp = VC
df_0.5VC <- filter(df_0.5, supp == "VC")
hist(df_0.5VC$len)
# dose = 1 and supp = OJ
df_1OJ <- filter(df_1, supp == "OJ")
hist(df_1OJ$len)
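Histograms of 10 observations are hard to judge by eye, so a formal check can complement them. As one possible sketch (a technique not used in the original analysis), the Shapiro-Wilk test from base R can be applied to each supp / dose cell; `shapiro_p` is an illustrative name.

```r
df <- ToothGrowth

# Shapiro-Wilk normality test p-value for each supp / dose cell.
# A small p-value suggests a departure from normality; with only
# 10 observations per cell the test has limited power, so a large
# p-value is weak evidence rather than proof of normality.
shapiro_p <- with(df, tapply(len, list(supp = supp, dose = dose),
                             function(x) shapiro.test(x)$p.value))

print(round(shapiro_p, 3))
```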
The sample size is assumed to be sufficiently large.
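The final assumption is homogeneity of variance. Homogeneous, or equal, variance exists when the standard deviations of the samples are approximately equal. The tests above are Welch t-tests (the t.test default, var.equal = FALSE), which do not require the two groups to have equal variances.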
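To gauge how much this assumption matters here, the group standard deviations can be compared and one test rerun with pooled variances. This is a sketch for the dose 0.5 subset only; `sd_table`, `d05`, `welch` and `pooled` are illustrative names.

```r
df <- ToothGrowth

# Standard deviation of length in each supp / dose cell.
sd_table <- with(df, tapply(len, list(supp = supp, dose = dose), sd))
print(round(sd_table, 2))

# Compare the Welch test with the pooled-variance (Student) test
# for the dose 0.5 subset.
d05 <- df[df$dose == 0.5, ]
welch  <- t.test(len ~ supp, data = d05)                    # var.equal = FALSE
pooled <- t.test(len ~ supp, data = d05, var.equal = TRUE)  # assumes equal variances

print(c(welch_p = welch$p.value, pooled_p = pooled$p.value))
print(c(welch_df = unname(welch$parameter), pooled_df = unname(pooled$parameter)))
```

If the two p-values lead to the same decision, the conclusion for that dose does not hinge on the equal variance assumption.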