Overview

A short exploratory analysis and statistical inference on the ToothGrowth data in R.

Exploratory Analysis

We load the data and the libraries we will need later, and display the first few observations to get a sense of the data format.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

data("ToothGrowth")
head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

Now we store the ToothGrowth data in a variable x and group it by supp and dose for easier manipulation. We then summarise each group with two variables, mean and sd, holding the mean and standard deviation of len within that group.

x<-ToothGrowth
a<-group_by(x,supp,dose)

df<-summarise(a,"mean"=mean(len),"sd"=sd(len))
df<-as.data.frame(df)
df
##   supp dose  mean       sd
## 1   OJ  0.5 13.23 4.459709
## 2   OJ  1.0 22.70 3.910953
## 3   OJ  2.0 26.06 2.655058
## 4   VC  0.5  7.98 2.746634
## 5   VC  1.0 16.77 2.515309
## 6   VC  2.0 26.14 4.797731

We can plot six histograms of len, one for each combination of supp and dose:

g<-ggplot(data=a, aes(x=len)) + facet_grid(supp~dose) + geom_histogram(color="black",fill="red")
g
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can calculate the 95% t confidence interval for the mean len of each group. Each group has 10 observations, hence 9 degrees of freedom:

df$lower<-df$mean - qt(0.975,9)*df$sd/sqrt(10)
df$upper<-df$mean + qt(0.975,9)*df$sd/sqrt(10)
df
##   supp dose  mean       sd     lower     upper
## 1   OJ  0.5 13.23 4.459709 10.039717 16.420283
## 2   OJ  1.0 22.70 3.910953 19.902273 25.497727
## 3   OJ  2.0 26.06 2.655058 24.160686 27.959314
## 4   VC  0.5  7.98 2.746634  6.015176  9.944824
## 5   VC  1.0 16.77 2.515309 14.970657 18.569343
## 6   VC  2.0 26.14 4.797731 22.707910 29.572090
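
As a cross-check, the same interval can be obtained by applying t.test() to a single group. The sketch below picks the VC, dose 0.5 group as an example; the names vc_low, ci, n and manual are introduced here for illustration only. Both results should agree with the corresponding row of the table above.

```r
data("ToothGrowth")

# Tooth lengths for one group: supplement VC at dose 0.5 (10 observations)
vc_low <- ToothGrowth$len[ToothGrowth$supp == "VC" & ToothGrowth$dose == 0.5]

# A one-sample t.test() reports a 95% confidence interval for the mean by default
ci <- t.test(vc_low)$conf.int
ci

# Manual computation, as in the text: mean +/- t quantile * standard error
n <- length(vc_low)
manual <- mean(vc_low) + c(-1, 1) * qt(0.975, n - 1) * sd(vc_low) / sqrt(n)
manual
```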

Inference Analysis

We group the data by supp type and summarise each group by the mean and sd of len:

b<-group_by(x,supp)
sup<-summarise(b,mean=mean(len) , sd=sd(len))
sup<-as.data.frame(sup)
sup
##   supp     mean       sd
## 1   OJ 20.66333 6.605561
## 2   VC 16.96333 8.266029

Now we use the t.test() function to calculate the p-value for the difference in mean len between the two supp types. By default, t.test() performs a two-sided Welch test, which does not assume equal variances.

NOTE: the null hypothesis is that mean len is not affected by supp (OJ = VC).

oj<-x[x$supp=="OJ","len"]
vc<-x[x$supp=="VC","len"]
res<-t.test(oj,vc)

res$p.value
## [1] 0.06063451
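
The same test can also be written with the formula interface of t.test(), which avoids splitting the data by hand. This is only a sketch; res2 is a name introduced here for illustration.

```r
data("ToothGrowth")

# Welch two-sample t-test of len between the two levels of supp (OJ vs VC)
res2 <- t.test(len ~ supp, data = ToothGrowth)

res2$p.value    # same p-value as the vector-based call
res2$conf.int   # 95% confidence interval for the difference in means (OJ - VC)
```

If the interval for the difference in means contains 0, that is consistent with a p-value above 0.05.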

Conclusion: our p-value = 0.0606 > 0.05, so we fail to reject the null hypothesis at the 5% significance level.