Part 1: Simulation Exercise

In this project I will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. I will investigate the distribution of averages of 40 exponentials. Note that I will need to do a thousand simulations.

Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials. I should:

Show the sample mean and compare it to the theoretical mean of the distribution.
Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
Show that the distribution is approximately normal.

In point 3, focus on the difference between the distribution of a large collection of random exponentials and the distribution of a large collection of averages of 40 exponentials.

Question 1:

# initialise the parameters and generate the sample
n <- 40
lambda <- 0.2
samp <- rexp(n, lambda)

Calculate the sample mean, theoretical mean and visually inspect the sample distribution.

mean(samp) # calculate the mean

## [1] 5.892383

1/lambda # calculate the theoretical mean

## [1] 5

Question 2:

var(samp) # calculate the sample variance

## [1] 29.86344

(1/lambda)^2 # calculate the theoretical variance

## [1] 25

We see that the sample mean and theoretical mean are not the same. Also, we see that the theoretical variance and sample variance are not the same. This sample is not representative of the population. See the appendix for its distribution.

Question 3:

# run 1000 simulations
mns = NULL #initialise mns
for (i in 1 : 1000) #repeat 1000 times
mns = c(mns, mean(rexp(n, lambda))) #add the mean of each sample to mns

In the histogram we see that the distribution resembles a normal distribution centered much closer to the theoretical mean of 5.

hist(mns, main = "Histogram of the sample means", xlab = "Sample means") #plot the histogram
abline(v=mean(mns),col = "blue", lwd = 2)

Part 2: Basic Inferential Data Analysis Instructions

Now in the second portion of the project, I am going to analyze the ToothGrowth data in the R datasets package.

Load the ToothGrowth data and perform some basic exploratory data analyses.
Provide a basic summary of the data.
Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose. (Only use the techniques from class, even if there’s other approaches worth considering)
State your conclusions and the assumptions needed for your conclusions.

Step 1: (see additional graphs in appendix)

library(datasets)
boxplot(len~supp, data=ToothGrowth, main="Tooth Growth",
        col = c("yellow"), xlab="Suppliment", ylab="Length")

mean(ToothGrowth[ToothGrowth$supp == "VC", c("len")])

## [1] 16.96333

mean(ToothGrowth[ToothGrowth$supp == "OJ", c("len")])

## [1] 20.66333

Step 2: There are 2 supplements and 3 doses. Above we see the mean teeth lengths may be different for different supplements. In the appendix we also see there may be a difference in means for different doses.

Looking at the histogram of teeth lengths in the appendix the results do not appear to be normally distributed. Given there are so few examples in each group we use t tests.

Step 3:

#1. H0: m_oj = m_vc, Ha: m_oj not equal to m_vc
t.test(len ~ supp, data=ToothGrowth)$conf

## [1] -0.1710156  7.5710156
## attr(,"conf.level")
## [1] 0.95

#2. H0: m_0.5 = m_1, Ha: m_1 > m_0.5
subset = ToothGrowth[ToothGrowth$dose != 2,]
t.test(len ~ dose, data=subset, alternative = c("less") )$p.value

## [1] 6.341504e-08

#3. H0: m_0.5 = m_1, Ha: m_2 > m_1
subset = ToothGrowth[ToothGrowth$dose != 0.5,]
t.test(len ~ dose, data=subset, alternative = c("less"))$p.value

## [1] 9.532148e-06

Step 4: In the first hypothesis test we test if the mean for the supplement OJ is the same as the mean for the supplement VC. Given that 0 falls within the confidence interval we cannot reject the null hypothesis at a 95% level of confidence that the difference in means is 0. So we cannot say that the different supplements preform better or worse than each other.

In the second test we test if the mean of the 1 dose is larger than the mean of the 0.5 dose. Then the third test we test if the mean of the 2 dose is larger than the mean of the 1 dose. Here we are doing multiple tests so I apply the Bonferroni adjustment and require that the p-value be 0.025 instead of 0.05 in order to decide with a 95% level of confidence. In both cases I still reject the null hypotheses and conclude that the higher doses result in longer teeth. I conclude that a higher dose results in longer teeth.

APPENDIX

Part 1. Question 1. Looking at the histogram we see that the sample data is clearly not normally distributed given its nonsymmetrical distribution.

hist(samp, main = "Histogram of the sample", xlab = "Value") #plot the histogram
abline(v=mean(samp),col = "blue", lwd = 2) #adding a line at the mean

Part 1. Question 3.

If we plot the density plot of the simulations it is easier to see the shape of the distribution which is still not perfectly symmetrical but much more so than the sample distribution. Also below we see that the curve is centered around the mean of ~5.

d <- density(mns)
plot(d, main= "Density plot of the sample means", xlab= "Sample means")

Part 2. Step 1 additional calculations and graphs

mean(ToothGrowth[ToothGrowth$supp == "VC", c("len")])

## [1] 16.96333

mean(ToothGrowth[ToothGrowth$supp == "OJ", c("len")])

## [1] 20.66333

hist(ToothGrowth$len, main = "Histogram of tooth length", xlab="Lengths")

boxplot(len~dose,data=ToothGrowth, main="Tooth Growth",
        col = c("yellow"), xlab="Dose", ylab="Length")

Statistical Inference Course Project

Brian

Part 1: Simulation Exercise

Part 2: Basic Inferential Data Analysis Instructions

APPENDIX