Basic Inferential Data Analysis Exercise

Key words: ToothGrowth data, data exploration, t-tests, box plots, statistical inference.

Plot & explore data

Here is a glimpse of the ToothGrowth data set, followed by a basic summary:

library(dplyr); library(ggplot2)  # provide glimpse(), filter(), %>% and ggplot()
glimpse(ToothGrowth); summary(ToothGrowth)

## Rows: 60
## Columns: 3
## $ len  <dbl> 4.2, 11.5, 7.3, 5.8, 6.4, 10.0, 11.2, 11.2, 5.2, 7.0, 16.5, 16...
## $ supp <fct> VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC, VC...
## $ dose <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1....

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

And a quick plot of the ToothGrowth data:

plot(ToothGrowth)

From the quick data plots we see that tooth length varies with both dose and supplement. The following box plots affirm this.
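The pattern in the quick plots can also be checked numerically. The following is a minimal sketch using base R's `aggregate()` to compute the mean tooth length for each supplement and dose combination. The name `group_means` is introduced here for illustration and is not part of the original analysis.

```r
# Mean tooth length for each supplement/dose combination (base R only).
# `group_means` is an illustrative name, not part of the original analysis.
group_means <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)

# Show the groups sorted from the largest to the smallest mean length
group_means[order(-group_means$len), ]
```

If the means rise with dose within each supplement, that agrees with what the plots suggest.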

Closer Inspection: Dose

# Treat dose as a factor so that each dose level gets its own box
ggplot(data = ToothGrowth, aes(x = factor(dose), y = len)) +
        geom_boxplot(aes(fill = supp)) + facet_wrap(~supp) +
        labs(x = "dose") + theme_bw()

It seems that orange juice is linked to better tooth growth results. We will use a t-test at each dose to evaluate the statistical significance.

# Subset by dose, then collect the Welch 95% confidence intervals for
# mean(OJ) - mean(VC); the two-sample test is unpaired by default
dose0.5 <- ToothGrowth %>% filter(dose == 0.5)
dose1.0 <- ToothGrowth %>% filter(dose == 1.0)
dose2.0 <- ToothGrowth %>% filter(dose == 2.0)
rbind(
       t.test(len~supp, var.equal=FALSE, data=dose0.5)$conf.int[1:2],
       t.test(len~supp, var.equal=FALSE, data=dose1.0)$conf.int[1:2],
       t.test(len~supp, var.equal=FALSE, data=dose2.0)$conf.int[1:2] )

##           [,1]     [,2]
## [1,]  1.719057 8.780943
## [2,]  2.802148 9.057852
## [3,] -3.798070 3.638070

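Each row is the 95% confidence interval for the difference in mean tooth length (OJ minus VC) at doses 0.5, 1.0 and 2.0. The intervals at doses 0.5 and 1.0 lie entirely above zero, while the interval at dose 2.0 straddles zero. We can therefore reject the null hypothesis of equal means at doses 0.5 and 1.0, where orange juice is linked to greater tooth length, but not at dose 2.0.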
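The same intervals can be collected into a labelled table, which makes it easier to see which doses have an interval that excludes zero. This is a minimal sketch in base R. The names `doses`, `ci_by_dose` and `excludes_zero` are illustrative and do not appear in the original analysis.

```r
# Welch 95% confidence intervals for mean(OJ) - mean(VC), one row per dose.
# `doses`, `ci_by_dose` and `excludes_zero` are illustrative names.
doses <- sort(unique(ToothGrowth$dose))

ci_by_dose <- t(sapply(doses, function(d) {
  ci <- t.test(len ~ supp, data = ToothGrowth[ToothGrowth$dose == d, ],
               var.equal = FALSE)$conf.int
  c(dose = d, lower = ci[1], upper = ci[2])
}))
ci_by_dose <- as.data.frame(ci_by_dose)

# TRUE where the interval lies entirely above or entirely below zero
ci_by_dose$excludes_zero <- ci_by_dose$lower > 0 | ci_by_dose$upper < 0
ci_by_dose
```

A row with `excludes_zero` equal to `TRUE` corresponds to a dose at which the two-sided Welch test rejects equal means at the 5% level.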

Let’s split the data by supplement, pooling all doses, and run a one-sided Welch t-test:

OJ <- ToothGrowth$len[ToothGrowth$supp == 'OJ']
VC <- ToothGrowth$len[ToothGrowth$supp == 'VC']

t.test(OJ, VC, alternative = "greater", paired = FALSE, var.equal = FALSE, conf.level = 0.95)

## 
##  Welch Two Sample t-test
## 
## data:  OJ and VC
## t = 1.9153, df = 55.309, p-value = 0.03032
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.4682687       Inf
## sample estimates:
## mean of x mean of y 
##  20.66333  16.96333

If we set alpha to 0.05, we can reject the null hypothesis that orange juice does not produce a greater mean tooth length than vitamin C, as the p-value, 0.03, is lower than alpha.
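To see where the test statistic, degrees of freedom and p-value come from, the one-sided Welch test can be reproduced by hand. This is a sketch with illustrative variable names (`oj`, `vc`, `v_oj`, `v_vc`, `t_stat`, `welch_df`, `p_value`). It uses the standard Welch statistic and the Welch-Satterthwaite approximation for the degrees of freedom.

```r
# Reproduce the one-sided Welch test by hand (illustrative variable names).
oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# Estimated variance of each sample mean: s^2 / n
v_oj <- var(oj) / length(oj)
v_vc <- var(vc) / length(vc)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_stat <- (mean(oj) - mean(vc)) / sqrt(v_oj + v_vc)
welch_df <- (v_oj + v_vc)^2 /
  (v_oj^2 / (length(oj) - 1) + v_vc^2 / (length(vc) - 1))

# One-sided p-value for the alternative "mean(OJ) is greater than mean(VC)"
p_value <- pt(t_stat, welch_df, lower.tail = FALSE)
c(t = t_stat, df = welch_df, p = p_value)
```

The three values should agree with the `t`, `df` and `p-value` reported by `t.test()` above, up to rounding.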

Let’s plot tooth length by supplement at each dose:

ggplot(data = ToothGrowth, aes(x = supp, y = len)) +
        geom_boxplot(aes(fill = supp)) + facet_wrap(~dose) + theme_bw()

From the box plots we see that at a dose of 2 mg/day both vitamin C (VC) and orange juice (OJ) produce a tooth length of around 26. We can check this with a two-sided Welch t-test.

OJ2 <- ToothGrowth$len[ToothGrowth$supp == 'OJ' & ToothGrowth$dose == 2]
VC2 <- ToothGrowth$len[ToothGrowth$supp == 'VC' & ToothGrowth$dose == 2]
t.test(OJ2, VC2, alternative = "two.sided", paired = FALSE, var.equal = FALSE, conf.level = 0.95)

## 
##  Welch Two Sample t-test
## 
## data:  OJ2 and VC2
## t = -0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.79807  3.63807
## sample estimates:
## mean of x mean of y 
##     26.06     26.14

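The sample means are nearly identical (26.06 and 26.14), and the p-value of 0.9639 is close to 1, so we fail to reject the null hypothesis: there is no evidence of a difference between the two supplements at this dose.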

After-note:
- This is an R Markdown document, created as a submission for the Statistical Inference course by Johns Hopkins University on Coursera.
- The exercise can be replicated. Hidden code is available here.

Statistical Inference

Linda Angulo Lopez

01/01/2021
