Overview

This report has two parts:
1. Simulate an exponential distribution to verify the Central Limit Theorem (Abbreviation as CLT in this article).

The exponential distribution has only one parameter lambda as the rate. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda.

In this experiment, set lambda = 0.2 for all of the simulations. And for each sample, 40 exponentials will be drawn and then calculate their averages. With the generated data set, calculate its mean and standard deviation to comapre with the theoretical (CLT).

2.Analyze the ToothGrowth Data, comparing tooth growth by different supplements. Here, Hypothesis Testing will be used and output confidence intervals.Detailed information, please see Part Two below.

Part One: Simulate an exponential distribution to verify CLT

Simulation:

Use R function rexp(n, lambda) to generate random exponentials. For each simulation, 40 exponentials value will be drawn as sample. Caculate the mean of these 40 exponentials, and do this simulation 1000 times.

In this report, assume that the parameter lambda=0.2 for all simulations. see the R code below:

nosim <- 1000;n<-40;lambda<-0.2

SimulatesMeans<-apply(matrix(rexp(n*nosim,lambda), nosim), 1,mean)
smean<-mean(SimulatesMeans) # Calculate the mean of the averages of 1000 samples
smean
## [1] 4.956786
ssd<-sd(SimulatesMeans)  # Calculate the standard deviation of the averages of 1000 samples
ssd
## [1] 0.7637221

Sample Mean versus Theoretical Mean

As CLT states, the sample mean distribution is an Asymptopia of a normal distribution N(μ,σ2/n). The sample mean distribution’s mean equals the normal distribution’s mean and its standard deviation equals normal distribution’s standard deviation divided by squre root of n.

For the experiments above: The sample distribution mean caculated is the original mean of the exponential distribution 1/lambda=5

compare the output via R code above smean=4.9567859, they are almost the same

Sample Variance versus Theoretical Variance

Continued from above, the sample distribution’s standard deviation should equal the original exponential distribution’s standard deviation devided by squre root of sample counts, here is 40, so it should be 1/lamba/sqrt(40)=0.7906

compare the output via R code above ssd=0.7637221, they are almost the same.

Distribution histogram plot

Now, draw a basic histogram, and easyly to observate it looks like a normal distribution

hist(SimulatesMeans,main = "Histiogram of averaged exponential distribution value - Plot 1.1")

Standard normal distrobution plot

Further more, let’s draw a Standard normal distribution curve with the histogram.

library(ggplot2)
#Create function cfunc to convert the averaged expotential values into standard normal distribution value.(also called Normalized)
cfunc <- function(x, n) sqrt(n) * (mean(x) - 1/lambda) / (1/lambda)
#For ggplot2 Graphic system, a matrix of nosim rows with averaged expotential distribution values needed, then apply cfunc to convert them into standard normal distribution values.  
dat <- data.frame(x = c(apply(matrix(rexp(n*nosim,lambda), nosim), 1, cfunc, n)))

#begin histogram plot
g <- ggplot(dat, aes(x = x, fill = nosim)) + geom_histogram(alpha = 0.1, binwidth=0.2, colour = "blue", aes(y = ..density..)) 
#attach standard normal curve
g + stat_function(fun = dnorm, size = 1)+ labs(title = "Normalized Exponential with Standard Normal Curve - Plot 1.2")

Part Two: Analyze the ToothGrowth Data

Firstly, Load/Observe/Summary the data

library(datasets)
summary(ToothGrowth)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000
str(ToothGrowth)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

Let’s draw a basic point plot to get the clue of the possible ditribution from the dataset

Since the data reflects small sample measurement of tooth length, so it should follw a T distribution. Let’s draw a basic point plot from the dataset to verify this.
ggplot(ToothGrowth, aes(x=dose, y=len, colour=supp)) + geom_point() + labs(title = "Point of 2 supplements - Plot 2.1")

Based on the plot above (Plot 2.1), obviously it cannot give us any clue of what kind of distribution it is. That’s because 2 supplements with 3 dose integrated together, so let’s focus on one supplement’s one dose (different supplements and dose will show the same result).
tg<-ToothGrowth[ToothGrowth$dose==0.5,] # if choose 1.0 or 2.0, they are same distribution.
tg<-tg[tg$supp=="VC",] # Choose "OJ" will get same distribution
tg$LN<- c(1:nrow(tg))
ggplot(tg, aes(x=LN, y=len, colour=supp))+ 
        geom_point(size=3)+
        labs(title = "Point of 2 supplements - Plot 2.2")

Though the measurement points are scatted, but we could observe, it follows a bell curve if we rearrrange the points position for their occurance order doesn’t impact its distribution. Now, we can be confident that it’s a T distrbution from both theory and real data.
Let’s go back to observe Plot 2.1, which inspires us by giveing the clue that the different supplements (VC and OJ) have different influence on the length of the tooth, and looks like that OJ has more infulence than VC on the length of tooth. Let’s draw another plot to show the averaged length of 2 supplements.

A Line plot to show different supplements (VC and OJ) impact the length from their averaged values.

library(plyr)
SumMeanLen <- ddply(ToothGrowth, c("supp", "dose"), summarise, length=mean(len))
ggplot(SumMeanLen, aes(x=dose, y=length, colour=supp)) + geom_line()+ labs(title = "Lines of 2 supplements on an averaged Tooth Length - Plot 2.3")

Now, we assume that OJ has more infulence than VC on the length of tooth.

There are several kinds of methods to verify assumption and here, Hypothesis Testing will be used, and also, confidence intervals will be output via T Test confidence intervals. ### Use Hypothesis Testing as analysis means. Here, let’s make clear what I do and what I expect. Let’s assume that OJ has more infulence than VC on the length of tooth in a view of averaged value, so in statistical language, we assume that the averaged tooth length with supplement OJ has big value, and we could verufy that via a T Test against two supplement as different groups. If the confidence intervals is greater than 0 (OJ as the first parameter in the t.test function), then our assumption will be verified, otherwise if the interval contains 0, then verified faliue. in formular: Ho:μ2-μ1>0

Implement T Test and get confidence intervals

#create two sub groups based on different supplements
gVC<-ToothGrowth[ToothGrowth$supp=="VC",]$len
gOJ<-ToothGrowth[ToothGrowth$supp=="OJ",]$len
nVC <- length(gVC);nOJ <- length(gOJ)
sp <- sqrt( ((nOJ - 1) * sd(gOJ)^2 + (nVC-1) * sd(gVC)^2) /(nOJ + nVC-2))
meanDifference <- mean(gOJ) - mean(gVC)
semd <- sp * sqrt(1 / nOJ + 1/nVC)
meanDifference + c(-1, 1) * qt(.975, nVC + nOJ - 2) * semd
## [1] -0.1670064  7.5670064
#Easy way to call t.test function directly.
# Since the two groups are indenpendent different people, without same variance, thus the paried parameter and var.equal both should be "FALSE"
t.test(gOJ,gVC, paired=FALSE,var.equal=FALSE)$conf.int
## [1] -0.1710156  7.5710156
## attr(,"conf.level")
## [1] 0.95

Since the intervals cover 0, thus the assumption that OJ has more infulence than VC on the length of tooth failed. Let’s investigate, why? We could observe via the plot and it looks means of OJ is greater than VC. How it is if we seperated the groups agian by their dose. See below.

gOJ<-ToothGrowth[ToothGrowth$supp=="OJ" & ToothGrowth$dose==0.5 ,]$len
gVC<-ToothGrowth[ToothGrowth$supp=="VC" & ToothGrowth$dose==0.5 ,]$len
t.test(gOJ,gVC, paired=FALSE,var.equal=FALSE)$conf
## [1] 1.719057 8.780943
## attr(,"conf.level")
## [1] 0.95
gOJ<-ToothGrowth[ToothGrowth$supp=="OJ" & ToothGrowth$dose==1.0 ,]$len
gVC<-ToothGrowth[ToothGrowth$supp=="VC" & ToothGrowth$dose==1.0 ,]$len
t.test(gOJ,gVC, paired=FALSE,var.equal=FALSE)$conf
## [1] 2.802148 9.057852
## attr(,"conf.level")
## [1] 0.95
gOJ<-ToothGrowth[ToothGrowth$supp=="OJ" & ToothGrowth$dose==2.0 ,]$len
gVC<-ToothGrowth[ToothGrowth$supp=="VC" & ToothGrowth$dose==2.0 ,]$len
t.test(gOJ,gVC, paired=FALSE,var.equal=FALSE)$conf
## [1] -3.79807  3.63807
## attr(,"conf.level")
## [1] 0.95

Now, we know the reason that the initial assuption fail, when does increase to 2.0, the infulence of OJ is lower(not greater) than VC, which leads to the whole mean value down.