Coursera Statistical Inference Course Project - part 2. Basic Inferential Data Analysis. Analyze the ToothGrowth data in the R datasets package.

Data exploration.

Load the data and show the header, summary, dimensions and variables:

library(datasets)
df = ToothGrowth
head(df, 5)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5

summary(df)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

dim(df)

## [1] 60  3

str(ToothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

The ToothGrowth dataset contains 60 observations of 3 variables:

len (numeric) - Tooth length.
supp (factor) - Supplement type (VC or OJ).
dose (numeric) - Dose in milligrams per day.
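
A quick cross-tabulation can show how the 60 observations are spread over supplement type and dose. This is only a minimal sketch using base R's `table()`; the name `counts` is introduced here for illustration and is not part of the original analysis.

```r
library(datasets)

# Count the observations for each combination of supplement type and dose.
# `counts` is an illustrative name, not used elsewhere in the report.
counts <- table(supp = ToothGrowth$supp, dose = ToothGrowth$dose)
print(counts)
```

Each of the six cells should hold the same number of animals, which means the group comparisons further down are made on equally sized samples.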

Let’s look at the data.

library(ggplot2)
g <- ggplot(data = df, aes(x = supp, y = len)) +
        geom_boxplot(aes(fill = supp)) + facet_wrap(~ dose) + 
        ggtitle("Tooth growth of guinea pigs by supplement type and dosage (mg)") + 
        ylab('Tooth length') + 
        xlab('Top - dosage (mg) & bottom - medication type')
 
print(g)

The plot suggests that at the highest dosage, 2 mg, tooth growth is independent of the supplement type. At the lower dosages there may be a dependence on the supplement type, but the overall picture is that tooth growth depends more on dosage than on supplement type.

Let us now check whether these assumptions hold with the help of hypothesis testing.
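
Before turning to the tests, the visual impression from the boxplots can be backed with numbers by computing the mean tooth length for each supplement and dose. The sketch below uses base R's `aggregate()`; the name `group_means` is an assumption made for this illustration.

```r
library(datasets)

# Mean tooth length for every supplement/dose combination.
# `group_means` is an illustrative name, not used elsewhere in the report.
group_means <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
print(group_means)
```

The rows for dose 2 are the ones that matter for the first hypothesis: if the two means there are close, the boxplot reading is at least plausible.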

Hypothesis testing.

Split the dataset into its OJ and VC parts. Then split each part into a low-dosage group (below 2 mg) and a high-dosage group (2 mg).

oj = ToothGrowth$len[ToothGrowth$supp == 'OJ']
vc = ToothGrowth$len[ToothGrowth$supp == 'VC']
ojHigh = ToothGrowth$len[ToothGrowth$supp == 'OJ' & ToothGrowth$dose == 2]
vcHigh = ToothGrowth$len[ToothGrowth$supp == 'VC' & ToothGrowth$dose == 2]
ojLow = ToothGrowth$len[ToothGrowth$supp == 'OJ' & ToothGrowth$dose < 2]
vcLow = ToothGrowth$len[ToothGrowth$supp == 'VC' & ToothGrowth$dose < 2]
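
As a sanity check on the split, the sizes of the four dosage subsets can be listed side by side. This is a sketch only: it rebuilds the subsets so that it runs on its own, and the list name `groups` is introduced here for illustration.

```r
library(datasets)

# Rebuild the four dosage subsets described above so the snippet is self-contained.
# `groups` is an illustrative name, not used elsewhere in the report.
groups <- with(ToothGrowth, list(
  ojHigh = len[supp == "OJ" & dose == 2],
  vcHigh = len[supp == "VC" & dose == 2],
  ojLow  = len[supp == "OJ" & dose < 2],
  vcLow  = len[supp == "VC" & dose < 2]
))

# Number of observations in each subset.
group_sizes <- sapply(groups, length)
print(group_sizes)
```

Note that the low-dosage groups pool the 0.5 mg and 1 mg doses, so they are twice as large as the high-dosage groups.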

Test hypothesis 1: at the highest dosage, 2 mg, tooth growth is independent of the supplement type.

The 95% confidence interval for the difference in mean tooth length between the two supplement types at the high dosage, computed with a pooled variance estimate:

mean_vcHigh = mean(vcHigh)
mean_ojHigh = mean(ojHigh)
mean_diff = mean_vcHigh - mean_ojHigh
mean_diff

## [1] 0.08

Sx2 = var(vcHigh)
Sy2 = var(ojHigh)
Sp2 = ((length(vcHigh) - 1) * Sx2 + (length(ojHigh) - 1) * Sy2) /(length(vcHigh) + length(ojHigh) - 2)
conf_int = mean_diff + c(-1, 1) * qt(0.975, df=length(vcHigh) + length(ojHigh) - 2) * sqrt(Sp2 * (1 / length(vcHigh) + 1 / length(ojHigh) ))
conf_int

## [1] -3.562999  3.722999

Zero lies close to the middle of the 95% confidence interval for the difference in mean tooth length between the two supplement types at the high dosage. We therefore cannot reject the hypothesis that the two supplements have the same mean effect at the high dose; the data are consistent with it.
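
The hand-computed pooled-variance interval can be cross-checked against `t.test()` with `var.equal = TRUE`, which uses the same pooled estimate. The sketch below rebuilds the two high-dosage vectors so it runs on its own; the name `pooled` is introduced for illustration.

```r
library(datasets)

# High-dosage (2 mg) tooth lengths for each supplement type.
vcHigh <- ToothGrowth$len[ToothGrowth$supp == "VC" & ToothGrowth$dose == 2]
ojHigh <- ToothGrowth$len[ToothGrowth$supp == "OJ" & ToothGrowth$dose == 2]

# Student's two-sample t-test with a pooled variance estimate.
# `pooled` is an illustrative name, not used elsewhere in the report.
pooled <- t.test(vcHigh, ojHigh, paired = FALSE, var.equal = TRUE, conf.level = 0.95)
print(pooled$conf.int)
```

If the manual formula is right, the two interval endpoints printed here should agree with `conf_int` above.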

Now let’s test the same hypothesis using the p-value:

mean(vcHigh)

## [1] 26.14

mean(ojHigh)

## [1] 26.06

t.test(vcHigh, ojHigh, alternative = "two.sided", paired = FALSE, var.equal = FALSE, conf.level = 0.95)

## 
##  Welch Two Sample t-test
## 
## data:  vcHigh and ojHigh
## t = 0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.63807  3.79807
## sample estimates:
## mean of x mean of y 
##     26.14     26.06

#t.test(vcHigh, ojHigh, alternative = "less", paired = FALSE, var.equal = FALSE, conf.level = 0.95)

The p-value (0.9639) is high, so the t-test also gives no evidence for the alternative hypothesis that the true difference in means between the two supplements at the high dose is not equal to 0.
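
The fractional degrees of freedom in the Welch output come from the Welch-Satterthwaite approximation, which does not assume equal variances. As a rough sketch of what `t.test()` does in this case, the statistic, degrees of freedom and two-sided p-value can be recomputed by hand; the names `se2_vc`, `se2_oj`, `t_stat`, `welch_df` and `p_value` are introduced here for illustration.

```r
library(datasets)

# High-dosage (2 mg) tooth lengths for each supplement type.
vcHigh <- ToothGrowth$len[ToothGrowth$supp == "VC" & ToothGrowth$dose == 2]
ojHigh <- ToothGrowth$len[ToothGrowth$supp == "OJ" & ToothGrowth$dose == 2]

# Squared standard error of the mean for each group.
se2_vc <- var(vcHigh) / length(vcHigh)
se2_oj <- var(ojHigh) / length(ojHigh)

# Welch t statistic: difference in means over the unpooled standard error.
t_stat <- (mean(vcHigh) - mean(ojHigh)) / sqrt(se2_vc + se2_oj)

# Welch-Satterthwaite approximation of the degrees of freedom.
welch_df <- (se2_vc + se2_oj)^2 /
  (se2_vc^2 / (length(vcHigh) - 1) + se2_oj^2 / (length(ojHigh) - 1))

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = welch_df)

print(c(t = t_stat, df = welch_df, p = p_value))
```

The three values should match the `t`, `df` and `p-value` fields printed by the Welch test above.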