Part 1: Simulation Exercise

In this project you will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. You will investigate the distribution of averages of 40 exponentials. Note that you will need to do a thousand simulations.

Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials. You should

  1. Show the sample mean and compare it to the theoretical mean of the distribution.
  2. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
  3. Show that the distribution is approximately normal.

Overview

Here we’ll just load ggplot2 since this is my preferred plotting tool.

require("ggplot2")
require("tidyverse")
require("data.table")

Simulations

Here we define a function that will start without elements and will concatenate new elements, which are the averages of the nsim exponentials. The function will do n simulations

sim <- function(n, lambda, nsim) {
    mns = NULL
    for (i in 1:n) {
        mns = c(mns, mean(rexp(nsim, rate = lambda)))
    }

    mns
}

1.1. Show the sample mean and compare it to the theoretical mean of the distribution

Here, we will compare the sample mean of the 1000 simulations of 40 exponentials agains the mean of the exponential distribution, which is equal to 1/lambda.

First, we set the seed for repeatiblity and enter our simulation parameters (lambda = 0.2, nsim = 40, and n = 1000). We then run the simulation and create a dataframe. This is required for plotting in ggplot.

We compare the means in two ways, the first as a percent difference. The second way is graphically.

set.seed(1989)
lambda <- 0.2
nsim <- 40
n <- 1000

means <- data.table::data.table(sim(n = n, lambda = lambda, nsim = nsim))
sample_mean <- mean(means$V1)
theore_mean <- 1/lambda

perc_error <- (sample_mean - theore_mean)/theore_mean * 100

We conclude that the sample mean and the theoretical mean differ by 0.38%. The size of this difference becomes all the more obvious through the graphical comparison below.

simplot <- ggplot(data = means, 
                  aes(x = V1))

simplot + 
    geom_histogram(
        bins = 60,
        fill = "grey",
        color = "black") + 
    geom_vline(
        xintercept = 1/lambda, 
        color = "blue") + 
    geom_vline(
        xintercept = sample_mean, 
        color = "red") +
    labs(
        title = "Distribution of Averages of 40 Exponentionals",
        subtitle = "Comparison of 1/lambda (blue) and sample mean (red)",
        x = "Average",
        y = "Count")

1.2. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.

sample_var <- var(means$V1)
theore_var <- (1/lambda) ^ 2 / 40

perc_error_var <- (sample_var - theore_var) / theore_var * 100

We conclude that the sample deviation and the theoretical standard deviation differ by -1.4%. We can illustrate the normal distribution for the theoretical mean and standard deviation together with the normal distribution for the sample mean and standard deviations.

varplot <- simplot + 
    geom_histogram(
        bins = 60, 
        aes(y = ..density..),
        fill = "grey",
        color = "black") +
    stat_function(
        fun = dnorm, 
        n = nsim*n, 
        args = list(
            mean = 1/lambda, 
            sd = 1/lambda/sqrt(40)),
        color = "blue") +
    stat_function(
        fun = dnorm,
        n = nsim*n,
        args = list(
            mean = sample_mean,
            sd = sd(means$V1)),
        color = "red") +
     geom_density(color = "forestgreen") +
     labs(
        title = "Density of Averages of 40 Exponentionals",
        subtitle = "Comparison of theoretical 1/lambda (blue) and sample mean (red)",
        x = "Average",
        y = "Density")

varplot

We can also explore this relationship visually. We see in the density plot that the theoretical and sample densities are very close to each other throughout the data range. In this case, we use the theoretical mean and standard deviation (both equal to 1/lambda) and the sample mean and sample deviations to calculate the appropriate densities. If we include the density plot we can see that it is qualitatively close throughout the data range as well.

qqplot <- ggplot(means, aes(sample = means$V1))

y <- quantile(means$V1, c(0.25, 0.75), type = 5)
x <- qnorm(c(0.25, 0.75))

slope <- diff(y) / diff(x)
int <- y[1] - slope * x[1]

qqplot + 
    stat_qq(
        distribution = qnorm) +
    geom_abline(
        intercept = int, 
        slope = slope, 
        color = "red") + 
    labs(
        title = "Normal Q-Q Plot",
        subtitle = "Comparison of theoretical quantiles (red) versus sample quartiles",
        x = "Theoretical Quantiles",
        y = "Sample Quantiles")

We can further explore the relationship between the sample data and the theoretical values in a Q-Q plot. This plot compares the distribution of the sample values to the normal distribution. We can see from this plot that near the data follows a normal distribution, especially near the median of the data (Q0) and diverges from a normal distribution at the tails.

Part 2: Analysis of the ToothGrowth Dataset

Now in the second portion of the project, we’re going to analyze the ToothGrowth data in the R datasets package.

  1. Load the ToothGrowth data and perform some basic exploratory data analyses
  2. Provide a basic summary of the data.
  3. Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose. (Only use the techniques from class, even if there’s other approaches worth considering)
  4. State your conclusions and the assumptions needed for your conclusions.

2.1 Load the ToothGrowth data and perform some basic exploratory data analyses

Here we load the ToothGrowth dataset and the dplyr package to facilitate any future analyses on the dataset.

tooth <- data.table::data.table(ToothGrowth)
library("dplyr")

We then do some exploratory analysis to see what we’re working with.

str(ToothGrowth)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

2.2 Provide a Basic Summary of the Data

We see that we’re working with a dataset composed of 60 observations of 3 variables, which are len, the tooth length, supp, which is a factor with two levels (OJ) for orange juice and (VC) for ascorbic acid and dosage data.

range(ToothGrowth$len)
## [1]  4.2 33.9
unique(ToothGrowth$dose)
## [1] 0.5 1.0 2.0
summary(ToothGrowth)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

From here we see that ToothGrowth ranges from 4.2 to 33.9 (units not mentioned) and that the dosages include 0.5 mg, 1 mg, and 2 mg of Vitamin C.

We then move on to graphically analyzing this dataset, we can use a boxplot, which is a favorite of mine.

tooth.group <- tooth.summary <- tooth %>% 
    group_by(supp, dose)

tooth.plot <- tooth.group %>% 
    ggplot(aes(x = supp, y = len, fill = supp)) + scale_fill_manual(values = c("orange", "deepskyblue"))

tooth.plot + 
    geom_boxplot(alpha = 0.5) + 
    facet_wrap(~tooth.group$dose) +
    labs(
        title = "ToothGrow Data: Length versus Dose & Supplement",
        subtitle = "Supplement is orange juice (OJ) or ascorbic acid (VC)",
        x = "Supplement Method",
        y = "Length of Odontoblasts")

From this plot we can conclude that the length of odontoblasts increases with increasing dosage of Vitamin C. We can also conclude that the OJ route of administration appears to be more effective than the VC route at 0.5 mg and 1 mg; whereas at 2 mg the effect of Vitamin C can be said to be nearly equal for both routes.

2.3 Use confidence intervals and/or hypothesis tests to compare tooth growth by supp and dose.

The analysis now moves on to quantify the comments and observations noted from the plots above. As we claimed, the supplement type has a strong influence on the growth of odontoblasts, as we will explore in the subsequent analysis.

First, we analyze the relationship between odontoblast growth and supplement method. We analyze the possibility of equal and unequal variances. Starting with equal variance assumption.

t.test(data = ToothGrowth, len ~ supp, paired = FALSE, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  len by supp
## t = 1.9153, df = 58, p-value = 0.06039
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1670064  7.5670064
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333

And now with unequal variance assumption.

t.test(data = ToothGrowth, len ~ supp, paired = FALSE, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean in group OJ mean in group VC 
##         20.66333         16.96333

The outcomes of these t-tests show that in OJ performs better than ascorbic acid (Vitamin C) in contributing to tooth growth under the equal and unequal variance assumptions. In both these cases the test results appear to be relatively close in value. As the confidence range includes zero, a more conclusive test may require a larger population size to be more definitive.

We can also analyze the dependence of dosage, which will require a bit more of data wrangling, but nothing too complicated. For the low dose (0.5 mg):

low.dose <- filter(tooth.group, dose == 0.5)

t.test(data = low.dose, len ~ supp, paired = FALSE, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  len by supp
## t = 3.1697, df = 18, p-value = 0.005304
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.770262 8.729738
## sample estimates:
## mean in group OJ mean in group VC 
##            13.23             7.98
t.test(data = low.dose, len ~ supp, paired = FALSE, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 3.1697, df = 14.969, p-value = 0.006359
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.719057 8.780943
## sample estimates:
## mean in group OJ mean in group VC 
##            13.23             7.98

As we can see, under both the equal and unequal variance assumptions, the results are nearly the same. For the low dosage (0.5 mg), the OJ supplement is more effective than the ascorbic acid in encouraging odontoblast growth.

Medium dose (1.0 mg):

med.dose <- filter(tooth.group, dose == 1.0)

t.test(data = med.dose, len ~ supp, paired = FALSE, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  len by supp
## t = 4.0328, df = 18, p-value = 0.0007807
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.840692 9.019308
## sample estimates:
## mean in group OJ mean in group VC 
##            22.70            16.77
t.test(data = med.dose, len ~ supp, paired = FALSE, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 4.0328, df = 15.358, p-value = 0.001038
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.802148 9.057852
## sample estimates:
## mean in group OJ mean in group VC 
##            22.70            16.77

In the medium dosage scenario, we also conclude that OJ is more effective than the ascorbic acid in encouraging odontoblast growth.

High dose:

hi.dose <- filter(tooth.group, dose == 2.0)
t.test(data = hi.dose, len ~ supp, paired = FALSE, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  len by supp
## t = -0.046136, df = 18, p-value = 0.9637
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.722999  3.562999
## sample estimates:
## mean in group OJ mean in group VC 
##            26.06            26.14
t.test(data = hi.dose, len ~ supp, paired = FALSE, var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = -0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.79807  3.63807
## sample estimates:
## mean in group OJ mean in group VC 
##            26.06            26.14

For the high dose, we can conclude that the OJ and the ascorbic acid supplements have a more or less equal effect in encouraging odontoblast growth.

Conclusion

The data shows that OJ is more effective than ascorbic acid in encouraging odontoblast growth in the low and medium dosage experiments and are equally effective in the high dosage experiments.