There are two parts to this final project. The first is to use simulation to demonstrate how the Central Limit Theorem approximates normal distributions. The second is to do some basic analysis and make inferences on a dataset using statistical techniques learned in this module.

First, load packages and create random data.

set.seed(22222)
library(ggplot2)
library(datasets)
    means<-NULL
expSamp<-rexp(40000,.2)
expMat<-matrix(expSamp,ncol=1000)
means<-colMeans(expMat)
data("ToothGrowth")

I use random sampling function rexp to generate 40000 samples from an exponential distribution with mean 5. Then I put that data into a 1000x40 matrix and find the means of each of the 40 columns.

Part 1: CLT Demonstration using Simulation

First let’s plot the data to see what it looks like.

ggplot() + aes(expSamp) + geom_density(col="black") + stat_function(fun=dexp, args=list(rate=.2), col="red")

ggplot() + aes(means) + geom_histogram(col="black",fill="lightblue") + geom_vline(aes(xintercept=mean(means),col="Sample Mean"))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Notice two important things: the density of our 40,000 randomly sampled exponential variables very closely mirrors the function from which it is sampled ~Exp(mean=5). However, when observations are grouped into sets of 40 and we plot the 1000 sample means, we see that the resulting data is roughly bell-shaped.

Now I’ll check numerically to see how similar the distribution of our sample means is to the expected normal distribution ~N(mean=5, var=25/40).

samples<-c(mean(means), sd(means)); true<-c(5,sqrt(25/40))
data.frame(Sample=samples, True=true, row.names=c("Mean","St_Dev"))

##          Sample      True
## Mean   5.003462 5.0000000
## St_Dev 0.763194 0.7905694

This looks promising. I’ll show one more graphic to convince myself that the two distributions are similar, then I’ll attempt a statistical test. Here look at the density of our sample (in blue) vs the pdf of the normal distribution suggested by the Central Limit Theorem (~N(5,25/40)).

ggplot() + aes(means) + geom_density(col="blue") + stat_function(fun=dnorm, args=list(mean=5,sd=5/sqrt(40)), col="red")

Pretty close! Now I want to be more statistically certain. Finally, we run a T-test to confirm statistical significance. First I’ll set up hypotheses: \(H_0: \mu = 5\) \(H_a: \mu \neq 5\)

t.test(x=means,mu = 5,alternative = "two.sided", conf.level = .95)

## 
##  One Sample t-test
## 
## data:  means
## t = 0.14343, df = 999, p-value = 0.886
## alternative hypothesis: true mean is not equal to 5
## 95 percent confidence interval:
##  4.956102 5.050821
## sample estimates:
## mean of x 
##  5.003462

The output from my first T-test shows that I do NOT have evidence to suggest that the true population mean is different from 5. With \(p = 0.889 > .05\), I fail to reject the null hypothesis. This T-test also develops a 95% confidence interval about the mean: (4.956257, 5.050446).

The assumptions for a single-sample t-test are simple:

The data are continuous (sampled from exponential distribution!)
The data follow normal distribution (assured thanks to CLT)
Sample is random from the population (yes, thanks to rexp())

Part 2: Analysis of ToothGrowth dataset

For this portion of the assignment, I will look at the relationship between tooth growth and two variables, supp and dose. Supp refers to the delivery method of Vitamin C, while dose refers to the dosage.

str(ToothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

For the entire dataset, mean tooth length is equal to 18.81. Also important to note is that supp is a factor variable levels and dose is coded as a numeric variable. There may be a correlation between tooth length and dosage, but as this is not a regression class it is simpler to think of dose as a factor variable with 3 levels.

First I compare tooth length by the supp variable – “OJ” means the dose was administered via orange juice, “VC” indicates it was administered via ascorbic acid. Since we have two groups and we are interested in the difference, the null hypothesis is simply: \(H_0: \mu_{oj} = \mu_{vc}\) or alternatively \(H_0: \mu_{oj}-\mu_{vc}=0\) \(H_a: \mu_{oj} \neq \mu_{vc}\) or alternatively \(H_0: \mu_{oj}-\mu_{vc} \neq 0\)

with(ToothGrowth, t.test(formula=len~supp))

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333

The above test indicates that with \(p=.06063 > .05\) we fail to reject our initial hypothesis. We don’t have sufficient evidence to claim that the two groups are statistically different. Notice also that the 95% confidence interval for the difference of means (-0.171, 7.571) contains zero, indicating that the difference in means may be 0.

Normally this is where I would stop, because two-sided t-tests are often most useful. However, this fringe case has an interesting result. I propose a new hypothesis: \(H_0: \mu_{oj}<=\mu_{vc}\) \(H_a: \mu_{oj}>\mu_{vc}\)

Now I try a one-sided t-test; observe the results!

with(ToothGrowth, t.test(formula = len~supp, alternative = "greater"))

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 1.9153, df = 55.309, p-value = 0.03032
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.4682687       Inf
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333

Interestingly, we see that with \(p=0.0303<.05\) we reject the null hypothesis and conclude that the guinea pigs who were given orange juice had significantly higher tooth length than guinea pigs given ascorbic acid. Note that since this is a one-sided test, the confidence interval is also one-sided. I interpret it as “we are 95% confidence that the mean difference between OJ and VC is at least .468”.

Lastly I want to look at the relationship between tooth length (len) and vitamin C dosage (dose). Since dose is a 3-factor variable, the obvious method to test the hypothesis \(\mu_1=\mu_2=\mu_3\) is ANOVA. However, since we didn’t cover ANOVA in this course, I will simply use multiple t-tests, testing \(\mu_i=\mu_j, i \in \{1,2,3\}, j \in \{1,2,3\}, i \neq j\)

ToothGrowth$dose<-as.factor(ToothGrowth$dose)
with(ToothGrowth[ToothGrowth$dose %in% c(0.5,1),],t.test(formula = len~dose))

## 
##  Welch Two Sample t-test
## 
## data:  len by dose
## t = -6.4766, df = 37.986, p-value = 1.268e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.983781  -6.276219
## sample estimates:
## mean in group 0.5   mean in group 1 
##            10.605            19.735

with(ToothGrowth[ToothGrowth$dose %in% c(0.5,2),],t.test(formula = len~dose))

## 
##  Welch Two Sample t-test
## 
## data:  len by dose
## t = -11.799, df = 36.883, p-value = 4.398e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -18.15617 -12.83383
## sample estimates:
## mean in group 0.5   mean in group 2 
##            10.605            26.100

with(ToothGrowth[ToothGrowth$dose %in% c(1,2),],t.test(formula = len~dose))

## 
##  Welch Two Sample t-test
## 
## data:  len by dose
## t = -4.9005, df = 37.101, p-value = 1.906e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.996481 -3.733519
## sample estimates:
## mean in group 1 mean in group 2 
##          19.735          26.100

Finally, we see that there is a significant difference between all dosage levels. That is, average length for dose 0.5 is different for dose 1 and 2, and dose 1 is different from dose 2 – each with \(p<.001\). Confidence intervals found in the output above all indicate that the difference in the means for each case is different from 0.

To visualize this last result, let’s look at a boxplot:

ggplot(data=ToothGrowth,aes(x=dose,y=len,fill=dose)) + geom_boxplot()

We can clearly see that the visualizations intuitively match the earlier inference: each of the three doses results in statistically different toothlength.

The assumptions for a two-sample t-tests are the same as one-sample, with the addition of:

Samples are independent.

Note: since I used a Welch test, equal variance is NOT assumed.

Statistical Inference: Final Project

DRWatkins

June 30, 2017

Part 1: CLT Demonstration using Simulation

Part 2: Analysis of ToothGrowth dataset