1 A simulation exercise - Exponential Distribution

1.1 Overview

The central limit theorem is a powerful tool from statistical asymptotics. It describes the theory that the distribution of sample means is approximately normal. We will compare the theory to a simulation of samples from the random exponential distribution.

1.2 Theory

The central limit states that the mean of a sufficiently large number of iid random variables will be approximately normally distributed, with mean \(\mu\) and variance \(\frac{\sigma^2}{n}\) regardless of the underlying distribution.

1.3 Simulations

The following code is run to setup the simulation.

set.seed(987654321)
lambda <- 0.2 # lambda of exponential distribution
B <- 1000 # number of simulations
n <- 40 # sample size in each simulation

randoms <- matrix(data = rexp(n * B, rate = lambda), B, n, byrow = TRUE)
meanforty <- apply(randoms, 1, mean)
fig1 <- qplot(as.vector(randoms),xlab = "x", ylab = "Count", 
              main = "Ramdomly Generated Data", bins=100)
fig1

Figure 1.

First the randomly generated data are potted to view the shape. They appear to come from the exponential distribution.

Using the \(set.seed()\) function is important to provide for reproducibility. Then, the important parameters \(lambda\), \(B\), and \(n\) are defined. The random variables are generated in the \(randoms\) matrix using the \(rexp()\) function. The matrix of random variables is then of dimensions \(1000, 40\). The mean of each simulation is then calculated using the \(apply()\) function. This generates \(1000\) sample mean values.

1.4 Sample Mean versus Theoretical Mean

The following code is run to compare the observed sample mean to the theoretical mean.

theoretical <- 1 / lambda
obsmean <- mean(meanforty)
df <- data.frame(meanforty)

fig2 <- ggplot(data=df, aes(meanforty))
fig2 + geom_histogram(aes(y=..density.., fill="observed simulation"), bins=40) + 
     geom_vline(aes(xintercept = theoretical, color="mean-theory", show.legend = TRUE)) + 
     geom_vline(aes(xintercept = obsmean, color="mean-observed",show.legend = TRUE)) + 
     stat_function(aes(color="Theory-Normal Dist", show.legend = TRUE), fun=dnorm, args=list(mean=(1/lambda),sd=(1/lambda)/sqrt(n))) + 
     ggtitle("Sample means (n=40) of Exponential Distribution") + 
     xlab("x") + 
     ylab("Density")

Figure 2.

The theoretical mean and observed mean of the distribution is calculated. To compare the theory to the simulation, the figure was constructed. The histrogram displays the observed density from the simulation. The mean observed (from the simulation) and the theoretical mean are shown. They are very close together. The observed mean was \(4.979\) compared to the theoretical mean of \(5\). The sampling distribution is also shown, which the Central Limit Theorem predicts. It is a normal distribution of mean \(5\) and standard deviation of \(0.791\). The simulations reasonably follow the theoretical distribution, which provides evidence that the Central Limit Theorem is true.

1.5 Sample Variance versus Theoretical Variance

We desire to show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution. Figure 2. depicts graphically how the variability compares between the simulation and the theoretical distribution of n=40 means of random exponentials (the simulation being the histrogram and the theoretical being the solid-line distribution). To compare analytically, the variance of the simulation is \(0.631\). This compares to the variance of the theoretical of \(0.625\).

1.6 Distribution

We desire to show that the distribution is approximately normal. (focus on the difference between the distribution of a large collection of random exponentials and the distribution of a large collection of averages of 40 exponentials.) The distribution of the random exponentials is displayed in Figure 1. The n=40 mean distributions are shown in Figure 2–both the simulation and the theoretical. The shape of the simulation data confirms that it is approximately normally distributed.

2 Basic inferential data analysis

2.1 Loading Data and Exploratory Analysis

data("ToothGrowth")
str(ToothGrowth)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
ToothGrowth$dose <- factor(ToothGrowth$dose)

The ‘ToothGrowth’ data set is loaded. The \(str()\) function is used to examine the structure of the data. The variables of interest are: (1) \(supp\), a two level factor that describes how the dose was delievered; (2) \(dose\), the dose of level (one of three options); and (3) \(len\) the length of the tooth. Therefore, there are two independent variables and \(len\) as the response variable. Since the \(dose\) variable is really a factor variable, it is converted to a factor in the data frame.

To perform visual exploratory analysis, a faceted histogram is produced by the code below. By breaking up the two independent variables into facets, their relationship to tooth length can be explored. It appears that the dose method (OJ vs. VC) does not have an effect on length. However, there appears to be a relationship between increasing doseage and longer tooth lengths. This exploratory analysis provides suggestions for more formal analysis later.

fig3 <- ggplot(data = ToothGrowth, aes(len))
fig3 + geom_histogram(bins=20) + 
     facet_grid(supp ~ dose, labeller = label_both) + 
     xlab("Length") +  ylab("Count") + 
     ggtitle("ToothGrowth - Exploratory")

Figure 3.

2.2 Basic Summary of the data

The following code evaluates the mean values for each combination of the factors. The mean values are used to further investigate the data. As previously thought, It appears that higher doses correlate with greater tooth length. Furthermore, it also appears that the \(OJ\) supplement gives higher values of tooth length than the \(VC\) supplement.

ag <- aggregate(len~dose+supp, data = ToothGrowth, FUN=mean)
xtabs(len~., data=ag)
##      supp
## dose     OJ    VC
##   0.5 13.23  7.98
##   1   22.70 16.77
##   2   26.06 26.14

2.3 Analysis

2.3.1 Assumptions

  • The response variable, \(len\), was only measured after the treatments were done. There is no baseline data to compare to.
  • The measurements of the response variable are independent and identically distributed from some population. The groups have the same variance.
  • There is no measurement error; the reported values of \(len\) are the truth length.
  • The data will only be analyzed with t-tests and confidence intervals. The data will not be analyzed using a three-dimensional method.
  • A 95% confidence level is appropriate for this analysis.
  • Multiple comparison ajustments to the p-values will not be considered.

2.3.2 Supplement

To answer the question: Is tooth length statistically different when considering the two method of supplementation?, the following analysis is performed. A two-sided t-test will be performed on the groups. The null hypothesis is that tooth growth has the same mean value for the two supplementation types.

supplement <- t.test(len~supp, data = ToothGrowth, alternative = "two.sided", 
                     paired = FALSE, var.equal = TRUE, conf.level = 0.95)
supplement$p.value
## [1] 0.06039337
fig4 <- ggplot(data = ToothGrowth, aes(supp,len))
fig4 + geom_dotplot(binaxis = "y", stackdir = "center", binwidth = 1) + 
     ggtitle("Tooth Length vs. Supplement Type") + 
     xlab("Supplement Type") + ylab("Length") + 
     stat_summary(fun.y = "mean", color = "red", size = 2, geom = "point")

Figure 4. Length by Supplement Type. Mean values of each group are shown in red.

The p-value of the test of \(0.0604\) indicates that we cannot reject the null hypothesis.

2.3.3 Dose

To answer the question: Is tooth length statistically different when considering the three dosages?, the following analysis is performed. 95% confidence intervals are calculated for each of the three dose groups. And the data, means, and intervals are plotted.

alpha <- 0.05
toothsummary <- group_by(ToothGrowth,dose) %>% 
     summarize(mean=mean(len), sd=sd(len), n=length(len)) %>% 
     mutate(int.half.width=qt(1-alpha/2,n-1)*sd/sqrt(n)) %>% 
     mutate(ci.max=mean+int.half.width, ci.min=mean-int.half.width)

fig5 <- ggplot(data = ToothGrowth, aes(dose,len))
fig5 + geom_dotplot(binaxis = "y", stackdir = "center", binwidth = 1) + 
     geom_errorbar(data=toothsummary, aes(mean, x=dose, ymin=ci.min, ymax=ci.max)) + 
     geom_point(data=toothsummary, mapping=aes(x=dose, y=mean), size=4, shape=18, col="red") + 
     ggtitle("Length vs. Dose") + 
     xlab("Dose") + ylab("Length")

Figure 5. Plot of Tooth length vs. dose. Individual data points are shown in black. Mean values of each group are shown in red. 95% confidence intervals of the mean are shown by black lines.

The mean length values of the three dose values are separated. The confidence intervals for the mean show where we can reasonably say the mean lies. Since none of the three error ranges for the means overlap, we conclude that the different levels of dose correlate with different values of tooth length. Also, there is a positive correlation between dose and length–higher values of dose have higher values of length.

2.4 Conclusions

2.4.1 Supplement Type

A mean difference of \(3.7\) was found between the two supplement types with regard to tooth length. A two-sided 95% confidence t-test was performed, and concluded that the null hypothesis cannot be rejected. Therefore, it is concluded that the supplement type does not have a relationship with the tooth length.

2.4.2 Dose

The mean length values of the three dose values are separated. The confidence intervals for the mean show where we can reasonably say the mean lies. Since none of the three error ranges for the means overlap, we conclude that the different levels of dose correlate with different values of tooth length. Also, there is a positive correlation between dose and length–higher values of dose have higher values of length.