Overview

The aim of this project is to investigate the exponential distribution in R and to compare the distribution of its sample means with what the Central Limit Theorem predicts. The second part concerns the analysis of the ToothGrowth data from the R datasets package.

library(ggplot2)
library(datasets)
library(formatR)

Part I

Simulation

I investigate the distribution of averages of 40 exponentials by running a thousand simulations with the rate parameter lambda set to 0.2.

lambda <- 0.2
vari <- 1/lambda^2    # theoretical variance of the distribution
moyenne <- 1/lambda   # theoretical mean of the distribution
n <- 40
nsim <- 1000
set.seed(777)
moyenneDist <- replicate(nsim, mean(rexp(n,lambda)))
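
As a rough sketch of why the sample size matters here, the same simulation can be repeated for several values of n; the variance of the averages should shrink roughly like (1/lambda^2)/n. The helper name sim_means and the extra sample sizes 10 and 160 are illustrative assumptions, not part of the analysis above.

```r
# Hypothetical helper: nsim averages of n exponentials with rate lambda
sim_means <- function(n, nsim = 1000, lambda = 0.2) {
  replicate(nsim, mean(rexp(n, lambda)))
}

set.seed(777)
sizes <- c(10, 40, 160)                  # illustrative sample sizes
obs_var   <- sapply(sizes, function(n) var(sim_means(n)))
theor_var <- (1 / 0.2^2) / sizes         # (1/lambda^2) / n

# Side-by-side table of simulated and theoretical variances
data.frame(n = sizes, simulated = obs_var, theoretical = theor_var)
```

Quadrupling n should divide the variance of the averages by about four, which is the scaling the Central Limit Theorem relies on.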

Results

The mean and the variance of the 1000 simulated averages are, respectively:

c(mean(moyenneDist), var(moyenneDist))
## [1] 4.9697890 0.6384832

Both are close to their theoretical values, 1/lambda for the mean and (1/lambda^2)/n for the variance of an average of n exponentials:

c(moyenne, vari/n)
## [1] 5.000 0.625
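
These theoretical values follow from the mean and variance of the exponential distribution and from the fact that the variance of an average of n independent draws is the variance of one draw divided by n. Written out for lambda = 0.2 and n = 40:

```latex
\begin{aligned}
E[\bar{X}_n] &= \frac{1}{\lambda} = \frac{1}{0.2} = 5, \\
\operatorname{Var}(\bar{X}_n) &= \frac{1}{\lambda^{2} n} = \frac{25}{40} = 0.625, \\
\operatorname{sd}(\bar{X}_n) &= \sqrt{0.625} \approx 0.79 .
\end{aligned}
```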

The following plot shows the histogram of the simulated averages and compares their estimated density (in blue) with the normal density predicted by the Central Limit Theorem (in red). The dashed lines mark the mean of each distribution, while the dotted lines mark one standard deviation on either side of that mean, so all these parameters can be compared at a glance.

ggplot(data.frame(moyenneDist), aes(x=moyenneDist)) +
    geom_histogram(aes(y=..density..), color="black", fill="white", bins = 30, size = 1) +
    geom_density(color = "blue", size = 1) +
    geom_vline(xintercept=mean(moyenneDist), color="blue", linetype="dashed", size=.8) +
    stat_function(fun = dnorm, args = list(mean = moyenne, sd = sqrt(vari/n)), color = "red", size = 1) +
    geom_vline(xintercept=moyenne, color="red", linetype="dashed", size=.8) +
    geom_vline(xintercept = mean(moyenneDist)+c(-1,1)*sd(moyenneDist), color = "blue", linetype = "dotted", size = 0.8) +
    geom_vline(xintercept = moyenne+c(-1,1)*sqrt(vari/n), color = "red", linetype = "dotted", size = 0.8) +
    annotate("text", mean(moyenneDist)+sd(moyenneDist), 0.45, vjust = 1.3, color = "blue", label = "sd_sample", angle = 90) +
    annotate("text", moyenne-sqrt(vari/n), 0.45, vjust = -.8, color = "red", label = "-sd_theor", angle = 90) +
    labs(title="Comparison between sample (in blue) and theoretical distributions (in red)", x="Means") +
    theme(plot.title = element_text(hjust = 0.5, size = 10))

To check whether the distribution is approximately normal, we use a Shapiro-Wilk test together with a normal Q-Q plot:

shapiro.test(moyenneDist)
## 
##  Shapiro-Wilk normality test
## 
## data:  moyenneDist
## W = 0.98704, p-value = 9.767e-08
qqnorm(moyenneDist)
qqline(moyenneDist, col = 2)

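The Q-Q plot shows that normality is probably a reasonably good approximation. The test itself, however, gives a p-value of 9.767e-08, far below 0.05, so the null hypothesis of normality is rejected. With 1000 observations the test is sensitive to even small departures from normality, and averages of only 40 exponentials are still slightly right-skewed.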
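
One way to see where the departure from normality comes from: the mean of n independent exponentials follows exactly a gamma distribution with shape n and rate n*lambda, whose skewness is 2/sqrt(n), about 0.32 for n = 40. The sketch below regenerates the simulation with the same seed so that it runs on its own, compares the sample skewness with that value, and runs a Kolmogorov-Smirnov test against the gamma distribution. The variable names are illustrative, and no result is quoted here.

```r
lambda <- 0.2; n <- 40; nsim <- 1000
set.seed(777)
means <- replicate(nsim, mean(rexp(n, lambda)))

# Sample skewness (third standardized moment) versus the gamma value 2/sqrt(n)
sample_skew <- mean((means - mean(means))^3) / sd(means)^3
theor_skew  <- 2 / sqrt(n)
c(sample = sample_skew, theoretical = theor_skew)

# Goodness of fit against the exact distribution of the mean:
# Gamma(shape = n, rate = n * lambda)
ks_gamma <- ks.test(means, "pgamma", shape = n, rate = n * lambda)
ks_gamma$p.value
```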

Part II

The data used in this part are the ToothGrowth data available in the R datasets package:

head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
summary(ToothGrowth)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

According to the R documentation, the ToothGrowth dataset records tooth growth in 60 guinea pigs, each of which received one of three dose levels of vitamin C (0.5, 1 or 2 mg/day) by one of two delivery methods: orange juice (OJ) or ascorbic acid (VC). A simple boxplot shows how tooth length varies with each of these two factors:

ggplot(ToothGrowth, aes(x=factor(dose), y=len, fill=supp)) +
    geom_boxplot(position=position_dodge(1)) +
    scale_x_discrete("Dosage (mg/day)") +
    scale_y_continuous("Tooth length")
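
The group means behind the boxplot can also be tabulated directly. A minimal sketch using aggregate() from base R; the object name group_means is an illustrative choice:

```r
library(datasets)

# Mean tooth length for each combination of supplement type and dose
group_means <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
group_means
```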

A two-sample t-test compares tooth length between the two supplement types. Assuming equal variances:

test1 <- t.test(len ~ supp, var.equal = TRUE, data = ToothGrowth) # unpaired two-sample test (the default)
c("Confidence intervals" = test1$conf.int, "P-value" = test1$p.value)
## Confidence intervals1 Confidence intervals2               P-value 
##           -0.16700642            7.56700642            0.06039337

If the variances are not assumed to be equal (Welch's t-test):

test2 <- t.test(len ~ supp, var.equal = FALSE, data = ToothGrowth) # unpaired two-sample test (the default)
c("Confidence intervals" = test2$conf.int, "P-value" = test2$p.value)
## Confidence intervals1 Confidence intervals2               P-value 
##           -0.17101562            7.57101562            0.06063451
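
To make explicit what the unequal-variance test computes, the Welch statistic and its degrees of freedom can be reproduced by hand from the group means and variances. This is a sketch for illustration; the variable names are assumptions, and the result is checked against t.test() instead of being quoted.

```r
library(datasets)

oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# Squared standard error contributed by each group
se2_oj <- var(oj) / length(oj)
se2_vc <- var(vc) / length(vc)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_manual  <- (mean(oj) - mean(vc)) / sqrt(se2_oj + se2_vc)
df_manual <- (se2_oj + se2_vc)^2 /
  (se2_oj^2 / (length(oj) - 1) + se2_vc^2 / (length(vc) - 1))

# Two-sided p-value and 95% confidence interval for the difference in means
p_manual  <- 2 * pt(-abs(t_manual), df_manual)
ci_manual <- (mean(oj) - mean(vc)) +
  c(-1, 1) * qt(0.975, df_manual) * sqrt(se2_oj + se2_vc)

# Compare with the built-in Welch test
welch <- t.test(len ~ supp, var.equal = FALSE, data = ToothGrowth)
all.equal(c(t_manual, df_manual, p_manual),
          unname(c(welch$statistic, welch$parameter, welch$p.value)))
```

The equal-variance version differs only in using a pooled variance estimate and n1 + n2 - 2 degrees of freedom, which is why the two sets of results above are so close when the group sizes are equal.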

Conclusions

  1. The boxplot suggests that, for both supplement types, the larger the dose of vitamin C, the longer the teeth.

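  2. Both t-tests give a 95% confidence interval for the difference in mean tooth length that contains zero and a p-value of about 0.06, so at the 5% level it is not possible to conclude that the choice between orange juice and ascorbic acid has an effect on tooth growth.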
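
The first conclusion rests on the boxplot alone. One possible way to check it formally, sketched here under the assumption that comparing the lowest and highest doses is of interest, is a Welch t-test on the subset of the data with doses 0.5 and 2. The object names are illustrative, and no result is quoted here.

```r
library(datasets)

# Keep only the lowest and highest dose levels
dose_subset <- subset(ToothGrowth, dose %in% c(0.5, 2))

# Welch two-sample t-test of tooth length between the two dose levels
dose_test <- t.test(len ~ dose, var.equal = FALSE, data = dose_subset)
c("CI lower" = dose_test$conf.int[1],
  "CI upper" = dose_test$conf.int[2],
  "P-value"  = dose_test$p.value)
```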