Part 1: Simulation Exercises

Overview

The purpose of this exercise is to show the Central Limit Theorem in action. This section of the homework will simulate 1000 random samples of sample size forty random variables belonging to the exponential distribution. Next, we will show that the mean of sample means and the mean of the sample variances are about equivalent to the true mean and variance of the distribution. In the last section, the exercise will standardize the means by subtracting each from the the sample mean, then dividing by the standard error. The result is a plot that shows that the distribution of sample means is approximately Normal.

Simulations

First, we need to generate random data. The below code samples random variables from samples n = 40 and a lambda rate of 0.2 one thousand times. We then store the means and variances of this data in the appropriate variable.

set.seed(20910)
lambda <- 0.2; n <- 40; numSims <- 1000
simData <- data.frame(matrix(rexp(numSims*n,lambda),numSims,n))
simVariances <- data.frame(variances = apply(simData,1, var))
meanVarianceMeans <- mean(simVariances$variances)

Sample Mean versus the Theoretical Mean

A property of the Exponential Distribution is that the true mean is equal to the inverse of lambda. So with our rate of 0.2, our true mean should be equivalent to 5.

simMeans <- data.frame(means = apply(simData,1, mean))
meanSampleMeans <- mean(simMeans$means)
theoryMean <- 1/lambda
test <- t.test(simMeans$means, mu = theoryMean)
print(test)
## 
##  One Sample t-test
## 
## data:  simMeans$means
## t = 0.37373, df = 999, p-value = 0.7087
## alternative hypothesis: true mean is not equal to 5
## 95 percent confidence interval:
##  4.960940 5.057438
## sample estimates:
## mean of x 
##  5.009189

As we can see, 5.01 is roughly equivalent to our theoretical mean of 5. We also show this with a confidence interval of 95%. Given our data, if the true value is equal to 5 we were 70.87% likely to see this result in the data.

Sample Variance versus the Theoretical Variance

We repeat the above for seeing how our sample variance compares to the true variance. In an exponential distribution, the variance is equal to the inverse square of lambda.

simVariances <- data.frame(variances = apply(simData,1, var))
meanVarianceMeans <- mean(simVariances$variances)
theoryVariance <- (1/lambda)^2

So, our sample variance of 25.0624412 does a pretty good job the approximating the true variance of 25.

Distribution

To show how the distribution of sample means is normally distributed, we use the below figure. The figure plots the standardized histogram of the sample means, overlaying the density function and a standard normal distribution bell curve.

The sample values for the mean and distribution are given by the dashed lines, while those belonging to the standard normal are given in solid black. As we can see, the means almost exactly overlap and the distribution of sample means almost 1-to-1 follows the standard normal distribution.

standardError <- meanSampleMeans/sqrt(n)
standFunc <- function(b){
        round((b - meanSampleMeans)/standardError,2)
}
normalizedMeans <- apply(simMeans,1,FUN = standFunc)
normalizedHist <- ggplot() + 
        geom_histogram(aes(x = normalizedMeans, y = ..density..),
                       bins = 40,
                       color = "black",
                       fill = 'darkorchid'
        ) +
        labs(
                title = 'Comparing the Standardized Sample Means and the Standard Normal',
                xlab = 'Mean',
                ylab = 'Frequency'
        ) +
        geom_vline(
                xintercept = meanSampleMeans - (1/lambda), 
                color = "red",
                linetype = 'dashed',
                size = 2
        ) +
        geom_vline(
                xintercept = 0, 
                color = "black",
                size = 1
        ) + 
        stat_function(
                aes(x = normalizedMeans),
                size = 2,
                fun = dnorm,
                args = list(
                        mean = 0,
                        sd = 1
                )
        ) +
        stat_function(
                aes(x = normalizedMeans),
                size = 2,
                color = 'blue',
                linetype = 'dashed',
                fun = dnorm,
                args = list(
                        mean = mean(normalizedMeans),
                        sd = sd(normalizedMeans)
                )
        )
## Warning: `mapping` is not used by stat_function()

## Warning: `mapping` is not used by stat_function()
print(normalizedHist)

The Central Limit Theorem in action!

Basic Inferential Data Analysis

Summary

First, we upload the data and provide a basic structure and 5 number summary of the data

data <- ToothGrowth
str(data)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
summary(data)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

Methods

First, we look at general boxplots of our variables of concern.

suppBox <- ggplot(data = ToothGrowth, aes(x = supp, y = len))
suppBox + geom_boxplot(aes(fill = supp)) + 
        labs(title = "Boxplots of Tooth Length by Supplement", x = "Supplement", y = "Tooth Length")

suppDoseBox <- ggplot(data = ToothGrowth, aes(x = supp, y = len))
suppDoseBox + geom_boxplot(aes(fill = supp)) + 
        facet_wrap(~ dose) +
        labs(title = "Tooth Length by Supplement and Dose", 
             x = "Supplement", y = "Tooth Length")

So, we can see in general that the OJ supplement seems to have a bigger effect on tooth length than VJ, and that this effect is particularly large at medium to small doses. Let’s use the student’s t-test to check more rigorously.

smallDose <- subset (ToothGrowth, dose == 0.5)
medDose <- subset (ToothGrowth, dose == 1)
largeDose <- subset (ToothGrowth, dose == 2)
t.test(len ~ supp, paired = FALSE, var.equal = FALSE, data = ToothGrowth)
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333
t.test(len ~ supp, var.equal = F, data = smallDose)
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 3.1697, df = 14.969, p-value = 0.006359
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.719057 8.780943
## sample estimates:
## mean in group OJ mean in group VC 
##            13.23             7.98
t.test(len ~ supp, var.equal = F, data = medDose)
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 4.0328, df = 15.358, p-value = 0.001038
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.802148 9.057852
## sample estimates:
## mean in group OJ mean in group VC 
##            22.70            16.77
t.test(len ~ supp, var.equal = F, data = largeDose)
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = -0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.79807  3.63807
## sample estimates:
## mean in group OJ mean in group VC 
##            26.06            26.14

The p-value for the general test between the two supplements is just short of the .05 alpha threshold, so we fail to reject this null hypothesis. However, when we look more at the different types of data, we do see that the p-value for the effectiveness of OJ over VJ at the small and medium dose levels is .006 and .001 respectively, so we can reject the null hypothesis in this case.

Conclusions

In conclusion, we reject the null hypothesis that small and medium doses have the same affect as small and medium doses as VJ in this trial. However, the general effect is not stastically significant, and a look at the graph suggests that the size of the dose, rather than the paritcular supplement, is a dominating effect here.