This report has two parts:
1. Simulate an exponential distribution to verify the Central Limit Theorem (Abbreviation as CLT in this article).
The exponential distribution has only one parameter lambda as the rate. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda.
In this experiment, set lambda = 0.2 for all of the simulations. And for each sample, 40 exponentials will be drawn and then calculate their averages. With the generated data set, calculate its mean and standard deviation to comapre with the theoretical (CLT).
2.Analyze the ToothGrowth Data, comparing tooth growth by different supplements. Here, Hypothesis Testing will be used and output confidence intervals.Detailed information, please see Part Two below.
Use R function rexp(n, lambda) to generate random exponentials. For each simulation, 40 exponentials value will be drawn as sample. Caculate the mean of these 40 exponentials, and do this simulation 1000 times.
In this report, assume that the parameter lambda=0.2 for all simulations. see the R code below:
nosim <- 1000;n<-40;lambda<-0.2
SimulatesMeans<-apply(matrix(rexp(n*nosim,lambda), nosim), 1,mean)
smean<-mean(SimulatesMeans) # Calculate the mean of the averages of 1000 samples
smean
## [1] 4.956786
ssd<-sd(SimulatesMeans) # Calculate the standard deviation of the averages of 1000 samples
ssd
## [1] 0.7637221
As CLT states, the sample mean distribution is an Asymptopia of a normal distribution N(μ,σ2/n). The sample mean distribution’s mean equals the normal distribution’s mean and its standard deviation equals normal distribution’s standard deviation divided by squre root of n.
For the experiments above: The sample distribution mean caculated is the original mean of the exponential distribution 1/lambda=5
compare the output via R code above smean=4.9567859, they are almost the same
Continued from above, the sample distribution’s standard deviation should equal the original exponential distribution’s standard deviation devided by squre root of sample counts, here is 40, so it should be 1/lamba/sqrt(40)=0.7906
compare the output via R code above ssd=0.7637221, they are almost the same.
Now, draw a basic histogram, and easyly to observate it looks like a normal distribution
hist(SimulatesMeans,main = "Histiogram of averaged exponential distribution value - Plot 1.1")
Further more, let’s draw a Standard normal distribution curve with the histogram.
library(ggplot2)
#Create function cfunc to convert the averaged expotential values into standard normal distribution value.(also called Normalized)
cfunc <- function(x, n) sqrt(n) * (mean(x) - 1/lambda) / (1/lambda)
#For ggplot2 Graphic system, a matrix of nosim rows with averaged expotential distribution values needed, then apply cfunc to convert them into standard normal distribution values.
dat <- data.frame(x = c(apply(matrix(rexp(n*nosim,lambda), nosim), 1, cfunc, n)))
#begin histogram plot
g <- ggplot(dat, aes(x = x, fill = nosim)) + geom_histogram(alpha = 0.1, binwidth=0.2, colour = "blue", aes(y = ..density..))
#attach standard normal curve
g + stat_function(fun = dnorm, size = 1)+ labs(title = "Normalized Exponential with Standard Normal Curve - Plot 1.2")
library(datasets)
summary(ToothGrowth)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
ggplot(ToothGrowth, aes(x=dose, y=len, colour=supp)) + geom_point() + labs(title = "Point of 2 supplements - Plot 2.1")
tg<-ToothGrowth[ToothGrowth$dose==0.5,] # if choose 1.0 or 2.0, they are same distribution.
tg<-tg[tg$supp=="VC",] # Choose "OJ" will get same distribution
tg$LN<- c(1:nrow(tg))
ggplot(tg, aes(x=LN, y=len, colour=supp))+
geom_point(size=3)+
labs(title = "Point of 2 supplements - Plot 2.2")
library(plyr)
SumMeanLen <- ddply(ToothGrowth, c("supp", "dose"), summarise, length=mean(len))
ggplot(SumMeanLen, aes(x=dose, y=length, colour=supp)) + geom_line()+ labs(title = "Lines of 2 supplements on an averaged Tooth Length - Plot 2.3")
There are several kinds of methods to verify assumption and here, Hypothesis Testing will be used, and also, confidence intervals will be output via T Test confidence intervals. ### Use Hypothesis Testing as analysis means. Here, let’s make clear what I do and what I expect. Let’s assume that OJ has more infulence than VC on the length of tooth in a view of averaged value, so in statistical language, we assume that the averaged tooth length with supplement OJ has big value, and we could verufy that via a T Test against two supplement as different groups. If the confidence intervals is greater than 0 (OJ as the first parameter in the t.test function), then our assumption will be verified, otherwise if the interval contains 0, then verified faliue. in formular: Ho:μ2-μ1>0
#create two sub groups based on different supplements
gVC<-ToothGrowth[ToothGrowth$supp=="VC",]$len
gOJ<-ToothGrowth[ToothGrowth$supp=="OJ",]$len
nVC <- length(gVC);nOJ <- length(gOJ)
sp <- sqrt( ((nOJ - 1) * sd(gOJ)^2 + (nVC-1) * sd(gVC)^2) /(nOJ + nVC-2))
meanDifference <- mean(gOJ) - mean(gVC)
semd <- sp * sqrt(1 / nOJ + 1/nVC)
meanDifference + c(-1, 1) * qt(.975, nVC + nOJ - 2) * semd
## [1] -0.1670064 7.5670064
#Easy way to call t.test function directly.
# Since the two groups are indenpendent different people, without same variance, thus the paried parameter and var.equal both should be "FALSE"
t.test(gOJ,gVC, paired=FALSE,var.equal=FALSE)$conf.int
## [1] -0.1710156 7.5710156
## attr(,"conf.level")
## [1] 0.95
Since the intervals cover 0, thus the assumption that OJ has more infulence than VC on the length of tooth failed. Let’s investigate, why? We could observe via the plot and it looks means of OJ is greater than VC. How it is if we seperated the groups agian by their dose. See below.
gOJ<-ToothGrowth[ToothGrowth$supp=="OJ" & ToothGrowth$dose==0.5 ,]$len
gVC<-ToothGrowth[ToothGrowth$supp=="VC" & ToothGrowth$dose==0.5 ,]$len
t.test(gOJ,gVC, paired=FALSE,var.equal=FALSE)$conf
## [1] 1.719057 8.780943
## attr(,"conf.level")
## [1] 0.95
gOJ<-ToothGrowth[ToothGrowth$supp=="OJ" & ToothGrowth$dose==1.0 ,]$len
gVC<-ToothGrowth[ToothGrowth$supp=="VC" & ToothGrowth$dose==1.0 ,]$len
t.test(gOJ,gVC, paired=FALSE,var.equal=FALSE)$conf
## [1] 2.802148 9.057852
## attr(,"conf.level")
## [1] 0.95
gOJ<-ToothGrowth[ToothGrowth$supp=="OJ" & ToothGrowth$dose==2.0 ,]$len
gVC<-ToothGrowth[ToothGrowth$supp=="VC" & ToothGrowth$dose==2.0 ,]$len
t.test(gOJ,gVC, paired=FALSE,var.equal=FALSE)$conf
## [1] -3.79807 3.63807
## attr(,"conf.level")
## [1] 0.95