Statistical Inference Final Project

Part 1: Simulation Exercise Instructions

In this project you will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. You will investigate the distribution of averages of 40 exponentials. Note that you will need to do a thousand simulations.

Illustrate via simulation and associated explanatory text the properties of the distribution of the mean of 40 exponentials. You should

Show the sample mean and compare it to the theoretical mean of the distribution. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution. Show that the distribution is approximately normal. In point 3, focus on the difference between the distribution of a large collection of random exponentials and the distribution of a large collection of averages of 40 exponentials.

Set the variables

set.seed(2017) # set seed to enable replicability
lambda <- 0.2 # the rate parameter lambda 
n <- 40 # number of exponentials
sim <- 1000 # the required number of simulations

Set the simulation

Simulation <- matrix(data=rexp(n*sim, lambda), nrow=sim)
SimuMeans <- apply(Simulation, 1, mean)

Analysis of the Moments of the Distribution

Moments of a distribution are specific quantitave measure of the shape of a distribution which enable us to define how normally distributed is our distributions. There are 4 moments:

Mean
Variance
Skewness
Kurtosis

DistMean <- mean(SimuMeans)
TheoMean <- 1/lambda
DistVar <- var(SimuMeans)
TheoVar <- ((1/lambda)/sqrt(n))^2

The distribution mean is 4.98 and the theoretical mean is 5 which makes a difference of 0.02.

The variance of the distribution is 0.6 and the theoretical variance is 0.62 which makes a difference of -0.02.

Analysing the first two moments of the distribution we can see that we have a mean lower and a lower variance than the theoretical mean and variance. This might be because our distribution might be more heavily distributed on the left side and might have lighter tails than a normal distribution. The next two moments enable us to prove this.

library(e1071)
Skew <- skewness(SimuMeans)
Kurto <- kurtosis(SimuMeans)

The skewness is 0.39 which shows that the distribution is more heavily weighted on the left of the distribution. This can explain why the mean of the real distribution is below the theoretical distribution.

The excess of kurtosis is 0.48. The positive kurtosis shows that the distribution has lighter tails than a normal distribution and thus explains why we had a standard deviation lower than the theoretical one.

While the mean, kurtosis and skewness shows that our simulated distribution is not normally distributed, the values proves that our distribution is not far away to be normally distributed.

Visualisation of the distribution

hist(SimuMeans, breaks=n, prob=TRUE, col="green", xlab = "distribution of means", main="Density of our distribution", ylab="density")
lines(density(SimuMeans), col=4, lwd=2)
x <- seq(min(SimuMeans), max(SimuMeans), length=2*n)
y <- dnorm(x, mean=TheoMean, sd=sqrt(TheoVar))
lines(x, y, pch=22, col=2, lwd=2)
abline(v = mean(SimuMeans), lty = 1, lwd = 5, col = "blue")
legend("topright", lty = 1, lwd = 5, col = "blue", legend = "distribution mean")

The red line shows a normally distributed density and the blue line shows our density. As you can see the top of our simulated distribution is on the left of the normal distribution. We can also see that the blue line has lighter tail than a normal distribution on the left side but not on the right side on the distribution. However, as expected our distribution is not far away to be normally distributed.

Part 2: Basic Inferential Data Analysis

In this part we will analyse the ToothGrowth dataset available in R. Description from R Documentation : The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, (orange juice or ascorbic acid (a form of vitamin C and coded as VC).

Understand the format and structure

head(ToothGrowth)

##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

As we can see there are 3 columns: Tooth length, delivery method (with 30 doses through orange juice and 30 doses through ascorbic acid), and the dose ( which ranges from 0.5 to 2)

We our data we can answer to 3 questions:

Do odontoblasts response better to one delivery method
Do odontoblasts response better to a higher dose without taking into consideration the delivery method
Do odontoblasts response better to a higher dose given a supplementary method

Do odontoblasts response better to one delivery method

We first need to assess if it is possible to compare the two delivery method. To make the comparison possible and avoid any biases we need to make sure that the distribution of doses is similar for the orange juice and the ascorbic acid. For example, if the data has much higher distribution of doses for orange juice and we see a higher impact on odontoblasts length, we could conclude that orange juice is a better method of delivery while in fact it is the level of dose that really impact odontoblasts length.

data_vc <- subset(ToothGrowth , supp == 'VC')
data_oj <- subset(ToothGrowth , supp == 'OJ')
summary(data_vc[,3])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.500   0.500   1.000   1.167   2.000   2.000

summary(data_oj[,3])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.500   0.500   1.000   1.167   2.000   2.000

We can see above that the distribution of doses is equal for both methods of delivery and are thus comparable.

summary(data_vc[,1])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.20   11.20   16.50   16.96   23.10   33.90

summary(data_oj[,1])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.20   15.53   22.70   20.66   25.73   30.90

The summary proves that odontoblasts reponse better to orange juice as a delivery method as all quartiles ( except the max) and the mean result in a higher length of odontoblasts.

Do odontoblasts response better to a higher dose without taking into consideration the delivery method

As we will not have sufficient data per dose level, it will be impossible to have robust results on quartile analysis however, the mean can be a good indicator to estimate the impact.

aggregate(len ~ dose, ToothGrowth, mean)

##   dose    len
## 1  0.5 10.605
## 2  1.0 19.735
## 3  2.0 26.100

The above results show that higher dose has a positive impact on odontoblasts length.

Do odontoblasts response better to a higher dose given a supplementary method

We have earlier proven that the choice of a delivery method and the level of dose has a positive impact on the odontoblasts length. We will now reiterate the same computation than in the last question by splitting the result by distribution method

data_vc <- subset(ToothGrowth , supp == 'VC')
data_oj <- subset(ToothGrowth , supp == 'OJ')
aggregate(len ~ dose, data_vc, mean)

##   dose   len
## 1  0.5  7.98
## 2  1.0 16.77
## 3  2.0 26.14

aggregate(len ~ dose, data_oj, mean)

##   dose   len
## 1  0.5 13.23
## 2  1.0 22.70
## 3  2.0 26.06

This result is very interesting because it shows that while the delivery method is important for lower doses 0.5 and 1, we can see that the impact on odontoblasts length is higher for ascorbic acid for a dose of 2 (although very close). This can explain why the max was higher for the ascorbic acid than for the orange juice at the beginning of the analysis.

The same can be concluded using the

Visualisation of our Statistical Analysis

By dose

library(ggplot2)
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
g <- ggplot(ToothGrowth, aes(x = dose, y = len))
g <- g + geom_boxplot(aes(fill = dose))
g <- g + facet_grid(. ~ supp)
print(g)

By delivery Method

g <- ggplot(ToothGrowth, aes(x = supp, y = len))
g <- g + geom_boxplot(aes(fill = supp))
g <- g + facet_grid(. ~ dose)
print(g)

Compare odontoblasts length impact using confidence interval and hypothesis testing with Confidence interval of 95%

Our first null hypothesis is that odontoblasts react equally to the 2 delivery methods. We will thus use the Welch Two Sample t-test on each of the different level of doses.

t.test(len~supp, data = ToothGrowth[ToothGrowth$dose == 0.5,])

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 3.1697, df = 14.969, p-value = 0.006359
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.719057 8.780943
## sample estimates:
## mean in group OJ mean in group VC 
##            13.23             7.98

t.test(len~supp, data = ToothGrowth[ToothGrowth$dose == 1,])

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 4.0328, df = 15.358, p-value = 0.001038
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.802148 9.057852
## sample estimates:
## mean in group OJ mean in group VC 
##            22.70            16.77

t.test(len~supp, data = ToothGrowth[ToothGrowth$dose == 2,])

## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = -0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.79807  3.63807
## sample estimates:
## mean in group OJ mean in group VC 
##            26.06            26.14

For dose level of 0.5 and 1, the p-value is lower than our alpha (0.05) which means that we can reject our null hypothesis for those two doses. For a dose level of 2, the p-value is higher and thus the null hypothesis is not rejected. This is also translated on the mean of the t.testing as the means for orange juice are higher than ascorbic acid except for a dose level of 2.

Our second null hypothesis is that odontoblasts length does not react positively to a higher level of dose. To do this we will compare the dose 0.5 and the dose 2.

# subset to isolate 0.5 and 2 in separate dataframe
data_0.5 <- subset(ToothGrowth , dose == 0.5) 
data_2.0 <- subset(ToothGrowth , dose == 2.0)
# test run
t.test(data_2.0[,1] , data_0.5[,1] , paired = FALSE , var.equal = FALSE , conf.level = 0.95)

## 
##  Welch Two Sample t-test
## 
## data:  data_2.0[, 1] and data_0.5[, 1]
## t = 11.799, df = 36.883, p-value = 4.398e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  12.83383 18.15617
## sample estimates:
## mean of x mean of y 
##    26.100    10.605

We can reject this null hypothesis as the p-value is lower than our alpha set at 0.05. We can also see the mean for dose level of 2 is much higher than the mean for dose level of 0.5.

conclusion

Our hypothesis testing helped us to conclude two results:

odontoblasts length reacts better to orange juice than ascorbic acid as a delivery method for lower doses (0.5 to 1) but not to higher dose where results are similar (2)
odontoblasts length reacts positively to higher level of doses