Overview

There are two parts to this project: 1) an exponential distribution simulation and 2) a basic inferential data analysis that tests for a significant effect of vitamin C on tooth growth in guinea pigs. Part 1 compares the distribution of sample means with the distribution of the underlying population data, as described by the Central Limit Theorem (CLT). Part 2 uses hypothesis testing to compare tooth growth by supplement and dose.

Part 1: Simulation exercise

The Central Limit Theorem (CLT), according to Wikipedia, states that, given certain conditions, the mean of a sufficiently large number of independent random variables will be approximately normally distributed, regardless of the underlying distribution. Furthermore, the distribution of means will be centered at the true population mean. Here, we explore this theorem with simulated draws from an exponential distribution. The CLT requires the underlying distribution to have a finite expected value and a finite variance, which holds for the exponential distribution.

For this analysis, the rate parameter, lambda, equals 0.2 for all simulations. Note that for an exponential distribution, the mean and the standard deviation both equal 1/lambda.

lambda <- 0.2
mn <- sd <- 1/lambda
print(mn)

## [1] 5

Our theoretical mean is 1/lambda, or 5.
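The claim that both the mean and the standard deviation equal 1/lambda can be checked numerically. The following is a minimal sketch, assuming base R's integrate and dexp; the names m1, m2 and sd_num are illustrative and not part of the original analysis.

```r
lambda <- 0.2

# First and second moments of the exponential density by numerical integration
m1 <- integrate(function(x) x * dexp(x, rate = lambda), 0, Inf)$value
m2 <- integrate(function(x) x^2 * dexp(x, rate = lambda), 0, Inf)$value

# Standard deviation from the two moments
sd_num <- sqrt(m2 - m1^2)

# Both numerical values should be close to 1/lambda
print(c(mean = m1, sd = sd_num, theoretical = 1 / lambda))
```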

Here is what a simulation of 1000 exponentials (i.e. individual scores) looks like.

set.seed(0)
y <- rexp(1000, lambda)
mean(y)

## [1] 5.148383

hist(y, col="gray", breaks=30)
abline(v=mn, col="red", lwd=2)

Notice that the data are skewed to the right. The sample mean of the 1000 simulated exponentials is 5.15, slightly larger than the theoretical population mean of 5 (the vertical red line).

Now we simulate the distribution of means of 40 exponentials (rather than individual scores). In this case, we calculate 1000 such means.

set.seed(0)
mns = NULL
for (i in 1:1000) mns = c(mns, mean(rexp(40, lambda)))
mean(mns)

## [1] 4.989678

hist(mns, col="gray", breaks=30)
abline(v=mn, col="red", lwd=2)
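The loop above grows mns one element at a time. As a hedged alternative, the same simulation can be sketched with replicate, which evaluates the expression repeatedly in sequence and should therefore consume the random number stream in the same order. The names mns_loop and mns_rep are illustrative.

```r
lambda <- 0.2

# Loop version, as in the text
set.seed(0)
mns_loop <- NULL
for (i in 1:1000) mns_loop <- c(mns_loop, mean(rexp(40, lambda)))

# replicate() version with the same seed
set.seed(0)
mns_rep <- replicate(1000, mean(rexp(40, lambda)))

# Compare the two vectors of 1000 means
print(all.equal(mns_loop, mns_rep))
```

For 1000 iterations the difference in speed is negligible; the main benefit is that the result is allocated once rather than rebuilt on every pass.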

The mean of our distribution of means, mns, is 4.99, much closer to the theoretical mean (the vertical red line). Notice too that this distribution is much more Gaussian than the previous one; that is, it is approximately normal and bell-shaped. This is in line with the CLT. Recall that the CLT states that, regardless of the underlying distribution of the data, if each mean is taken over enough independent random variables, then the distribution of those means will be approximately normal and centered on the true population mean. This holds in our example: the mean of the sample means is approximately equal to the mean of the population.

Now, let’s look at our theoretical variance versus our sample variance.

## Theoretical variance 
variance <- sd^2
print(variance)

## [1] 25

## Mean of the squared averages of 40 exponentials
## (a second raw moment; a variance would also subtract mean(mns)^2)
sds <- mns
vars <- sum(sds^2)/(1000)
mean(vars)

## [1] 25.51442

The value computed above, 25.51, is the mean of the squared sample means rather than their variance; it is close to 25 because the squared mean of the means (4.989678^2, about 24.90) dominates it. Subtracting that squared mean leaves a variance of roughly 0.62 for the sample means. This is in line with the CLT, under which the variance of the sample means equals the population variance divided by the sample size, here 25/40 = 0.625 (https://people.richland.edu/james/lecture/m170/ch07-clt.html). In summary, our distribution of means of 40 exponentials behaves as predicted by the Central Limit Theorem.
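The variance and shape of the distribution of means can also be checked directly. This is a minimal sketch, assuming the same seed and settings as the simulation above; the names means, theoretical_var, sample_var and probs are illustrative.

```r
lambda <- 0.2
n <- 40

set.seed(0)
means <- replicate(1000, mean(rexp(n, lambda)))

# Variance of the sample means versus the CLT value sigma^2 / n
theoretical_var <- (1 / lambda)^2 / n
sample_var <- var(means)
print(c(theoretical = theoretical_var, sample = sample_var))

# Empirical quantiles of the means versus those of a normal distribution
# with mean 1/lambda and variance sigma^2 / n
probs <- c(0.025, 0.25, 0.5, 0.75, 0.975)
print(rbind(empirical = quantile(means, probs),
            normal = qnorm(probs, mean = 1 / lambda, sd = sqrt(theoretical_var))))
```

If the CLT approximation is reasonable for n = 40, the two rows of quantiles should be close, with some remaining right skew in the tails.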

Part 2: Basic Inferential Data Analysis

Now we’ll perform a basic inferential analysis of the ToothGrowth data from the datasets package in R. This dataset records tooth growth in guinea pigs by dose and supplement, either orange juice (OJ) or ascorbic acid (VC). We begin by loading the data and exploring it.

library(datasets)
data(ToothGrowth)
head(ToothGrowth)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

We see that there are two treatments in the supp column, VC and OJ, each administered in doses ranging from 0.5 to 2.0 mg/day. The len column is the tooth length measured after each treatment.

OJ <- ToothGrowth[ToothGrowth$supp=="OJ",-2]
VC <- ToothGrowth[ToothGrowth$supp=="VC",-2]
range(OJ$len)

## [1]  8.2 30.9

range(VC$len)

## [1]  4.2 33.9

For the OJ supplement, tooth length ranges from 8.20 to 30.90. For the VC supplement, it ranges from 4.20 to 33.90.
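Ranges alone say little about how length varies with dose. A short sketch of per-group summaries, assuming base R's aggregate and table, is shown here; the name cell_stats is illustrative.

```r
library(datasets)

# Mean and standard deviation of len for each supplement/dose combination
cell_stats <- aggregate(len ~ supp + dose, data = ToothGrowth,
                        FUN = function(x) c(mean = mean(x), sd = sd(x)))
print(cell_stats)

# Number of animals in each supplement/dose combination
print(table(ToothGrowth$supp, ToothGrowth$dose))
```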

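## Fit a separate linear regression of length on dose for each supplement
reg1 <- with(OJ, lm(len ~ dose))
reg2 <- with(VC, lm(len ~ dose))

## Helper for semi-transparent colors (requires the scales package)
alpha <- scales::alpha 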

cols <- alpha(palette()[c(2, 4)], .5)
cols_supp <- alpha(cols[ToothGrowth$supp], .5)

with(ToothGrowth, plot(dose, len, pch=16, col=cols_supp)) 
temp <- legend(1.75, 10, levels(ToothGrowth$supp), cex=0.8, lty=c(1,1), lwd=c(2, 2), bty="n", col=rep(cols, times=1))
abline(reg1, col=alpha(cols[1], .5))
abline(reg2, col=alpha(cols[2], .5))
title(main="Tooth Growth by Supplement and Dose")

After plotting the regression lines for tooth growth versus dose for each supplement, we want to test whether the difference between supplements is significant. To do this, we use an independent, two-sided Welch two-sample t-test, which does not assume equal variances, at the 95% confidence level.

## Welch Two Sample t-test
t.test(OJ, VC, alternative="two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  OJ and VC
## t = 0.97615, df = 116.88, p-value = 0.331
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.903398  5.603398
## sample estimates:
## mean of x mean of y 
##    10.915     9.065

Our null hypothesis is that there is no difference in mean tooth growth between the two supplements. Our alternative hypothesis is that there is a difference. Or, as the t.test output puts it, “true difference in means is not equal to 0.” At the 5% significance level (p = 0.331), we fail to reject the null hypothesis. That is, there is not enough evidence to conclude that the supplement (orange juice or ascorbic acid) produces a true difference in guinea pig tooth growth. Note that OJ and VC are passed to t.test as whole data frames, so the len and dose columns are pooled (hence the 116.88 degrees of freedom and the means of 10.915 and 9.065); the result therefore does not isolate tooth length.
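A test restricted to the len column, both overall and within each dose level, could be sketched as follows. This uses the formula interface of t.test on the original ToothGrowth data frame; the names overall, doses and p_by_dose are illustrative and not part of the original analysis.

```r
library(datasets)

# Welch two-sample t-test on tooth length only, OJ versus VC
overall <- t.test(len ~ supp, data = ToothGrowth, alternative = "two.sided")
print(overall)

# The same test within each dose level
doses <- sort(unique(ToothGrowth$dose))
p_by_dose <- sapply(doses, function(d) {
  t.test(len ~ supp, data = ToothGrowth[ToothGrowth$dose == d, ])$p.value
})
names(p_by_dose) <- doses
print(p_by_dose)
```

Because the per-dose version runs three tests, a multiple-comparison adjustment such as p.adjust may be worth considering before interpreting the individual p-values.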

Statistical Inference Course Project

Kairsten Fay

11/25/2016
