Statistical Inference PA2

In the second portion of the class, I analyze the ToothGrowth data from the R datasets package, using confidence intervals and/or hypothesis tests to compare tooth growth by supplement type (supp) and dose.

Load the ToothGrowth data and perform some basic exploratory data analyses

data("ToothGrowth")

Provide a basic summary of the data.

summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000
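Beyond summary(), a quick look at the structure and the cell counts can confirm how the experiment is laid out. The following is a minimal sketch; the name counts is introduced here only for illustration.

```r
data("ToothGrowth")

# Column types and the first few values of each variable
str(ToothGrowth)

# Number of observations for every supp x dose combination
counts <- table(ToothGrowth$supp, ToothGrowth$dose)
print(counts)
```

A balanced table (the same count in every cell) means the group comparisons later on are made between equally sized groups.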

Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose. I will state my conclusions and the assumptions needed for my conclusions.

The supp factor has two levels, OJ and VC, with 30 observations each.

library(ggplot2)

ggplot(ToothGrowth, aes(x = supp, y = len)) + geom_boxplot()

The mean and standard deviation of len for each level are computed below.

library(dplyr)

g_df <- as_tibble(ToothGrowth)   # tbl_df() is deprecated in current dplyr
g_df_sm <- g_df %>% group_by(supp) %>% summarize(mean = mean(len), sd = sd(len))

print(g_df_sm)

The mean for OJ is 20.66 with a standard deviation of 6.60, while the mean for VC is 16.96 with a standard deviation of 8.26. Consider the 95% confidence interval estimate for the difference of the means. I assume a constant variance. The interval is computed for the difference in this order: (OJ - VC).
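Written out, the interval under the constant-variance assumption is the standard pooled two-sample t interval. The symbols below are notation introduced here for illustration; they mirror the variables m_oj, m_vc, sd_oj, sd_vc, n_oj, n_vc and spsq used in the calculation.

```latex
s_p^2 = \frac{(n_{OJ} - 1)\, s_{OJ}^2 + (n_{VC} - 1)\, s_{VC}^2}{n_{OJ} + n_{VC} - 2}
\qquad
(\bar{x}_{OJ} - \bar{x}_{VC}) \;\pm\; t_{0.975,\; n_{OJ} + n_{VC} - 2} \; s_p \sqrt{\frac{1}{n_{OJ}} + \frac{1}{n_{VC}}}
```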

n_oj <- 30          # number of OJ observations
n_vc <- 30          # number of VC observations
m_oj <- 20.66       # mean of OJ
sd_oj <- 6.60       # standard deviation of OJ
m_vc <- 16.96       # mean of VC
sd_vc <- 8.26       # standard deviation of VC

# find pooled variance
spsq <- ( (n_oj - 1) * sd_oj^2 + (n_vc - 1) * sd_vc^2) / (n_oj + n_vc - 2)

# find confidence intervals
(m_oj - m_vc) + c(-1, 1) * qt(0.975, df=(n_oj + n_vc - 2)) * sqrt(spsq) * sqrt(1/n_oj + 1/n_vc)

## [1] -0.1640165  7.5640165

The interval for (OJ - VC) contains zero, so at the 95% confidence level we cannot conclude that mean tooth length differs between OJ and VC.
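As a cross-check on the hand calculation, the same pooled-variance interval can be obtained from the raw data with t.test, which avoids rounding the means and standard deviations. The sketch below also computes the Welch interval (var.equal = FALSE) to see how much the constant-variance assumption matters. The names ci_pooled and ci_welch are illustrative only.

```r
data("ToothGrowth")

# Pooled-variance 95% interval for (OJ - VC) from the raw data.
# The formula interface orders the groups by factor level, so OJ comes first.
ci_pooled <- t.test(len ~ supp, data = ToothGrowth, var.equal = TRUE)$conf.int

# Welch interval, which does not assume equal variances
ci_welch <- t.test(len ~ supp, data = ToothGrowth, var.equal = FALSE)$conf.int

print(ci_pooled)
print(ci_welch)
```

Both intervals should be close to the hand-computed one; small differences come from using unrounded summary statistics and, for Welch, from the adjusted degrees of freedom.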

Next, dose has three levels. Let A denote dose 0.5, B dose 1.0 and C dose 2.0.

ggplot(ToothGrowth, aes(x = factor(dose), y = len)) + geom_boxplot()

I check whether mean tooth length differs between A and B, A and C, and B and C, using a 95% confidence interval for each difference of means. I assume a constant variance.

A <- ToothGrowth[ToothGrowth$dose == 0.5,]$len
B <- ToothGrowth[ToothGrowth$dose == 1.0,]$len
C <- ToothGrowth[ToothGrowth$dose == 2.0,]$len
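Before comparing the groups, it may help to confirm that each dose vector has the expected size and to look at the group means. A minimal sketch, where the names dose_groups and dose_stats are introduced only for illustration:

```r
data("ToothGrowth")

A <- ToothGrowth[ToothGrowth$dose == 0.5, ]$len
B <- ToothGrowth[ToothGrowth$dose == 1.0, ]$len
C <- ToothGrowth[ToothGrowth$dose == 2.0, ]$len

# Sample size, mean and standard deviation for each dose group
dose_groups <- list(A = A, B = B, C = C)
dose_stats <- sapply(dose_groups, function(x) c(n = length(x), mean = mean(x), sd = sd(x)))
print(dose_stats)
```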

First, compare A and B. (A - B)

t.test(A, B, paired = FALSE, var.equal = TRUE)$conf.int

## [1] -11.983748  -6.276252
## attr(,"conf.level")
## [1] 0.95

The interval for (A - B) lies entirely below zero, so mean tooth length is significantly smaller at dose 0.5 than at dose 1.0.

Second, compare A and C. (A - C)

t.test(A, C, paired = FALSE, var.equal = TRUE)$conf.int

## [1] -18.15352 -12.83648
## attr(,"conf.level")
## [1] 0.95

The interval for (A - C) lies entirely below zero, so mean tooth length is significantly smaller at dose 0.5 than at dose 2.0.

Last, compare B and C. (B - C)

t.test(B, C, paired = FALSE, var.equal = TRUE)$conf.int

## [1] -8.994387 -3.735613
## attr(,"conf.level")
## [1] 0.95

The interval for (B - C) lies entirely below zero, so mean tooth length is significantly smaller at dose 1.0 than at dose 2.0.
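The three pairwise comparisons can also be collected into one table, which makes the intervals easier to read side by side. This is a sketch under the same assumptions (unpaired groups, equal variance); the names doses, pairs and ci_table are hypothetical and not part of the original analysis.

```r
data("ToothGrowth")

# Split len by dose; the list names are "0.5", "1" and "2"
doses <- split(ToothGrowth$len, ToothGrowth$dose)

# All pairs of dose levels, always (lower dose, higher dose)
pairs <- combn(names(doses), 2, simplify = FALSE)

# One pooled-variance 95% interval per pair
ci_table <- do.call(rbind, lapply(pairs, function(p) {
  ci <- t.test(doses[[p[1]]], doses[[p[2]]],
               paired = FALSE, var.equal = TRUE)$conf.int
  data.frame(comparison = paste(p[1], "-", p[2]),
             lower = ci[1], upper = ci[2])
}))

print(ci_table)
```

Each row is (lower dose - higher dose), matching the (A - B), (A - C), (B - C) order used above.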

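In conclusion, the data do not show a significant difference in tooth length by supp at the 95% confidence level, but tooth length increases significantly with dose. These conclusions assume independent (unpaired) groups with a constant variance and approximately normal data, as the t interval requires.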

Kong, Seok-kyu

2015-07-19