Please note that blank space was used to make the document more readable; overall, the page length requirements are met!

Part 1: Simulation Exercise

Overview

Wikipedia describes the exponential distribution as “the probability distribution of the time between events in a Poisson point process, i.e., a process in which events occur continuously and independently at a constant average rate”[1].
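
For reference, the density of an exponential distribution with rate parameter \(\lambda\) is \(f(x; \lambda) = \lambda e^{-\lambda x}\) for \(x \ge 0\) (and 0 otherwise).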

My goal is to perform a simulation showing how the sampling distribution of the mean of exponentially distributed samples tends to the normal distribution.

I start by setting up the parameters for the simulation exercise.

lambda <- 0.2 # rate parameter of the exponential distribution 
n <- 40       # sample size for each exponential sample (input 'n' in the rexp() function) 
B <- 1000     # number of replications 

Theoretical (population) Mean & Variance

From the above, it is straightforward to compute the mean and standard deviation (the positive square root of the variance) of the theoretical[2] exponential distribution with parameter \(\lambda\): both are equal to \(\frac{1}{\lambda}\), while the variance is \(\frac{1}{\lambda^2}\).
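
With \(\lambda = 0.2\), this gives \(\mu = \sigma = \frac{1}{\lambda} = 5\) and \(\sigma^2 = \frac{1}{\lambda^2} = 25\).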

In the Appendix I report the computed theoretical population mean, variance and standard deviation for an exponential distribution with the given \(\lambda\) parameter.

The CLT[3] tells us that as the sample size grows (\(n \rightarrow \infty\)), the sampling distribution of the sample mean becomes approximately normal, with mean equal to the population mean \(\mu\) and variance equal to \(\frac{\sigma^2}{n}\), where \(\sigma^2\) is the population variance.
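
Applied to this setup, the mean of \(n = 40\) exponential draws should therefore be approximately distributed as \(N\left(\mu, \frac{\sigma^2}{n}\right) = N\left(5, \frac{25}{40}\right) = N(5, 0.625)\).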

Simulations

A basic for loop is used to iterate up to the desired number of simulations; at each iteration, the mean of a sample of \(n = 40\) draws from an exponential distribution with \(\lambda = 0.2\) is computed. The computed means are stored in a vector. Code and output are in the Appendix[4].
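
For reference, a more compact, equivalent way to generate the same vector of simulated means is R's replicate(); this is just a sketch, and the Appendix keeps the explicit loop described above.

# Repeat the expression B times and collect the B sample means into a vector 
sample.means <- replicate(B, mean(rexp(n, lambda))) 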

Results

The average of the simulated means is 4.99: it is essentially equal to the “theoretical” mean \(\mu\) of 5. The simulated and theoretical variances of the sampling distribution are quite close as well: 0.6344 and 0.625 respectively[5].

In order to compare the simulated sampling distribution with the theoretical (normal) distribution under the CLT, I superimpose the corresponding Normal PDF (solid red curve), as well as a kernel density estimate (dashed green curve), on the histogram[6] of the simulated sample means. The result is Figure 1 in the Appendix.

A simple plot comparing the quantiles of the simulated distribution with those of the normal shows some discrepancies, especially in the right tail; recall, however, that while the support of a Normal is the whole real line, an exponential distribution is defined only for non-negative values. Please refer to Figure 2 in the Appendix.

Part 2: Basic Inferential Data Analysis

I load the ‘ToothGrowth’ data and perform some basic EDA[7]. A routine summary() call, in addition to the usual descriptive statistics for the two variables ‘len’ (the outcome, or response, variable: length of odontoblast cells[8]) and ‘dose’ (formally a numeric variable indicating vitamin C intake in mg/day), tells us that the data set is split into 2 groups of 30 observations each, based on the factor variable ‘supp’ (delivery method, either orange juice OJ or ascorbic acid VC).
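
A quick cross-tabulation (a minimal sketch, not part of the Appendix output) makes the design explicit by showing how the 60 observations break down across the supp and dose combinations:

# Counts of observations for each delivery method / dose combination 
with(ToothGrowth, table(supp, dose)) 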

The boxplot[9] suggests that OJ is better at the lower dosages, while there is no stark difference at the high dosage (2 mg/day).

To properly assess the effectiveness of supp-dose combinations, we test for differences in means between the OJ and VC groups both overall and separately for low and medium (0.5 and 1 mg/day) and high (2 mg/day) dosages; we also test for a difference in means between the medium and high dosages for OJ delivery. Given the setup, we perform unpaired two-sample Welch t-tests[10] (i.e. with unequal variances, since just eyeballing the data makes it difficult to argue for equal variances[11]).

Results

For each test, the Null Hypothesis is that there is no difference between the means, versus the Alternative that the means do differ[12]. Here I report the p-values only[13], rounded to 4 digits.

Test            | p-value | Significant at        | H_0 action
Overall         | 0.0606  | 10% (\(\alpha=.1\))   | Reject (caveat)
Low/Medium Dose | 0.0042  | 1% (\(\alpha=.01\))   | Reject
High Dose       | 0.9639  | Not significant       | Fail to reject
OJ Med. vs High | 0.0392  | 5% (\(\alpha=.05\))   | Reject
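
The p-values above are extracted from the four test objects reported in full in the Appendix; a minimal sketch of how they can be collected in one place, using the objects defined there:

# Pull the p-value component from each htest object and round to 4 digits 
round(c(Overall         = overall$p.value, 
        Low.Medium.Dose = lm$p.value, 
        High.Dose       = high$p.value, 
        OJ.Med.vs.High  = OJmedVShigh$p.value), digits = 4) 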

Conclusions

Assuming normality of the data, unequal-variances Welch t-tests formally assess what visual inspection of the boxplot brought to our attention. Overall, the difference between the delivery methods is statistically significant only at the 10% level[14]. While for a low or medium dosage the delivery method (OJ vs VC) makes a statistically significant (at 1%) difference, this is not the case for the high dosage. This suggests that the mg/day vitamin C intake might be a more important factor than the delivery method in explaining tooth cell length. The best results were obtained with the high dosage-OJ combination, which is statistically different from the medium dosage-OJ one.
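
As noted in footnotes 10 and 11, the normality and equal-variance assumptions could be checked formally; a minimal sketch follows (these checks go slightly beyond the assignment's scope, so the output is not reported here):

# Shapiro-Wilk normality tests, one per delivery method 
shapiro.test(ToothGrowth$len[ToothGrowth$supp == "OJ"]) 
shapiro.test(ToothGrowth$len[ToothGrowth$supp == "VC"]) 
# F-test for equality of variances between the two groups 
var.test(len ~ supp, data = ToothGrowth) 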

Appendix[15]

Code chunks and output Part 1

Theoretical mean, variance, standard deviation

muExp <- 1/lambda 
sdExp <- 1/lambda 
varExp <- 1/(lambda^2) 
cat(" Theoretical mean =", muExp, "= Theoretical standard deviation", "\n", 
    "Theoretical variance =", varExp) 
##  Theoretical mean = 5 = Theoretical standard deviation 
##  Theoretical variance = 25

Theoretical variance of the sampling distribution of the (sample) mean

sampling.var <- varExp/n 
sampling.sd <- sdExp/sqrt(n) 
cat("Theoretical variance of sampling distribution =", 
       round(sampling.var, digits = 4)) 
## Theoretical variance of sampling distribution = 0.625

Simulation details

set.seed(42) # Answer to the Ultimate Question of Life, the Universe, and Everything 
sample.means <- NULL # Creates a null vector; to be filled as the loop iterates 
for (b in 1:B) { # B is the n° of simulations we are asked to perform  
                 # (i.e., we want to end up with 1000 means from the simulation) 
    sample.means <- c(sample.means, mean(rexp(n, lambda))) 
} 

Results Part 1

avg.sample.means <- mean(sample.means) 
sd.sample.means <- sd(sample.means) 
var.sample.means <- var(sample.means) 
cat(" Average of the simulated sampling distribution of means =", 
    round(avg.sample.means, digits = 2), "\n", 
    "Standard deviation of the simulated sampling distribution of means =", 
    round(sd.sample.means, digits = 4), "\n", 
    "Variance of the simulated sampling distribution of means = ", 
    round(var.sample.means, digits = 4) 
)
##  Average of the simulated sampling distribution of means = 4.99 
##  Standard deviation of the simulated sampling distribution of means = 0.7965 
##  Variance of the simulated sampling distribution of means =  0.6344

Code chunks and output Part 2

Exploratory Data Analysis (see also Figure 3)

library(datasets) 
data("ToothGrowth") 
str(ToothGrowth) 
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
summary(ToothGrowth) 
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

Subsetting data for testing

OJ.all <- ToothGrowth[ToothGrowth$supp=="OJ", ] 
VC.all <- ToothGrowth[ToothGrowth$supp=="VC", ] 
OJ.lm <- ToothGrowth[ToothGrowth$supp=="OJ" & 
                         (ToothGrowth$dose==0.5 | ToothGrowth$dose==1), ] 
VC.lm <- ToothGrowth[ToothGrowth$supp=="VC" & 
                         (ToothGrowth$dose==0.5 | ToothGrowth$dose==1), ] 
OJ.h <- ToothGrowth[ToothGrowth$supp=="OJ" & ToothGrowth$dose==2, ] 
VC.h <- ToothGrowth[ToothGrowth$supp=="VC" & ToothGrowth$dose==2, ] 
OJ.med <- ToothGrowth[ToothGrowth$supp=="OJ" & ToothGrowth$dose==1, ] 

Full code and output for Welch tests

library(ggpubr) 
overall <- t.test(x = OJ.all$len, y = VC.all$len, 
                  alternative = "two.sided", var.equal = FALSE) 
lm <- t.test(x = OJ.lm$len, y = VC.lm$len, 
             alternative = "two.sided", var.equal = FALSE) 
high <- t.test(x = OJ.h$len, y = VC.h$len, 
               alternative = "two.sided", var.equal = FALSE) 
OJmedVShigh <- t.test(x = OJ.med$len, y = OJ.h$len, 
                      alternative = "two.sided", var.equal = FALSE) 
print("Test for overall difference in means (OJ vs VC)") 
## [1] "Test for overall difference in means (OJ vs VC)"
overall 
## 
##  Welch Two Sample t-test
## 
## data:  OJ.all$len and VC.all$len
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean of x mean of y 
##  20.66333  16.96333
print("Test for difference in means for low-medium dosage (OJ vs VC)") 
## [1] "Test for difference in means for low-medium dosage (OJ vs VC)"
lm 
## 
##  Welch Two Sample t-test
## 
## data:  OJ.lm$len and VC.lm$len
## t = 3.0503, df = 36.553, p-value = 0.004239
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.875234 9.304766
## sample estimates:
## mean of x mean of y 
##    17.965    12.375
print("Test for difference in means for high dosage (OJ vs VC)") 
## [1] "Test for difference in means for high dosage (OJ vs VC)"
high 
## 
##  Welch Two Sample t-test
## 
## data:  OJ.h$len and VC.h$len
## t = -0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.79807  3.63807
## sample estimates:
## mean of x mean of y 
##     26.06     26.14
print("Test for difference in means for OJ, medium vs high dosage") 
## [1] "Test for difference in means for OJ, medium vs high dosage"
OJmedVShigh 
## 
##  Welch Two Sample t-test
## 
## data:  OJ.med$len and OJ.h$len
## t = -2.2478, df = 15.842, p-value = 0.0392
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.5314425 -0.1885575
## sample estimates:
## mean of x mean of y 
##     22.70     26.06

Plots

Figure 1

library(kdensity) 
kdensity.means <- kdensity(sample.means, start = "gumbel", 
                           kernel = "gaussian", na.rm = TRUE) 
xSeq <- seq(0, 8, length = 80) 
NormalPDF <- dnorm(xSeq, mean = muExp, sd = sampling.sd) # Normal under CLT 
hist(sample.means, col = "lightblue", breaks = 20, freq = FALSE, 
     main = "Sampling Distribution vs Normal PDF", 
     xlab = "Simulated Means", ylab = "Density") 
abline(v = muExp, lty = 1, lwd = 2.5, col = "darkorange2") 
legend(legend = "mean", "topright", lty = 1, lwd = 2.25, 
       bty = "n", col = "darkorange2") 
lines(xSeq, NormalPDF, lty = 1, col = "red") 
lines(kdensity.means, lty = 2, col = "darkgreen") 

Figure 2

qqnorm(sample.means) 
qqline(sample.means, col = "red") 

Figure 3

library(ggplot2) 
library(RColorBrewer) 
CBfriendly <- c("#E69F00", "#56B4E9") # We should all try to make colorblind-friendly palettes the default 
gg <- ggplot(data = ToothGrowth, 
             aes(x = as.factor(dose), y = len, fill = supp)) + 
    scale_fill_manual(values = CBfriendly) + 
    geom_boxplot() + 
    xlab("Vitamin C Dose (mg/day)") + 
    ylab("Tooth Cells Length") + 
    ggtitle("Boxplot of Tooth Cells Length by Delivery Method & Dose") + 
    theme_light() + 
    theme(legend.position = "top", 
          legend.title = element_text(color = "darkgreen", size = 14), 
          legend.text = element_text(color = "darkred", size = 14) 
          ) + 
    theme(plot.title = element_text(size = 16, hjust = 0.5)) + 
    theme(axis.text = element_text(size = 12), 
          axis.title = element_text(size = 12)) 
gg 


  [1] https://en.wikipedia.org/wiki/Exponential_distribution

  [2] Here “theoretical” refers to the population mean of a process following an exponential distribution (with a given \(\lambda\)), as well as to the “theoretical” mean and variance from the CLT

  [3] Shorthand for Central Limit Theorem

  [4] A seed is set for reproducibility. For the choice of seed, please do read Adams, D. (2017). The Ultimate Hitchhiker’s Guide to the Galaxy: The Complete Trilogy in Five Parts (Vol. 6). Pan Macmillan.

  [5] See the Appendix for code and output

  [6] For the histogram, 20 break points were specified in the function call. This is often used as a “rule of thumb” number of bins

  [7] Exploratory Data Analysis

  [8] Henceforth “tooth cells” for simplicity

  [9] See Figure 3 in the Appendix

  [10] Note that, if the data are not normally distributed, we should either use the non-parametric two-sample Wilcoxon rank-sum test or resort to, say, a log transformation. A Shapiro-Wilk test could be conducted to check normality. These go slightly beyond the scope of this assignment, however, so let’s pretend everything is fine and dandy

  [11] For the sake of completeness, we could perform an F-test for equality of variances. Paraphrasing Fermat, I have a truly marvelous output of this, which this margin is too narrow to contain

  [12] As argued in class, this test is more restrictive than the one-tailed version (e.g. \(H_0: \mu_1 \le \mu_2\) vs \(H_A: \mu_1 > \mu_2\))

  [13] See the Appendix for full output from the tests

  [14] Considered “low” in many applications, hence the caveat

  [15] For the full code (so as to see where my results come from, and for reproducibility), please find the full .Rmd HTML file at 