The aim of this project is to show the power of the Central Limit Theorem, one of the most important concepts and central tenets of Statistics. Normally, we assume that the theoretical distribution of the population we sample from is normal or approximately so, and in that case the mean of a sample from it will also be normally distributed. The power of the Central Limit Theorem is that even if the theoretical distribution is not normal, we can still treat the sample mean as approximately normally distributed. This will be shown by simulating from the exponential distribution, which is very different from the normal, as can be seen in Figure 3: most of its mass sits just to the right of 0, it is heavily skewed to the right, and it is certainly not “bell” shaped. The exponential density function is:
\[
f(x;\lambda)=\lambda e^{-\lambda x},
\] where the mean of the distribution is \(\frac{1}{\lambda}\), the variance is \(\frac{1}{\lambda^2}\), and the standard deviation is again \(\frac{1}{\lambda}\). In this report \(\lambda\) is set to 0.2, so the theoretical mean is \(\mu = 1/0.2 = 5\), the variance is \(\sigma^2 = 1/0.2^2 = 25\), and the standard deviation is \(\sigma = 1/0.2 = 5\). The simulation experiment is simple: sample from the exponential distribution and calculate the mean and variance. Then repeat this 1000 times (in this report the number of simulations is set to 1000), recording the mean and the variance at each iteration. Plot the histogram of the 1000 results and draw conclusions. Finally, compare the distribution of 1000 exponentially distributed numbers with the distribution of 1000 means of samples of size 40.
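These theoretical values can be verified directly with a trivial check:

lambda <- 0.2  # rate parameter used throughout the report
1/lambda       # theoretical mean (and standard deviation): 5
1/lambda^2     # theoretical variance: 25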
According to the Law of Large Numbers, the sample mean should converge in the limit to the mean of the population. We can see here that even with one sample of 40 exponentials the sample mean is pretty close to the theoretical \(\mu = 5\).
nosim = 1000; lambda = .2; n = 40
set.seed(0); mean(rexp(n,lambda))
## [1] 4.776259
In order to investigate this relation we can look at the distribution of 1000 means of samples of different sizes, i.e. n = 10, 40, 90. The histograms of the simulation can be seen in Figure 1 in the appendix, with the accompanying code. It can clearly be seen that the distributions are centered at the theoretical mean (marked by the vertical lines). It can also be seen that as n increases the results cluster more and more tightly around the theoretical mean.
The sample variance is centered at the variance of the population from which we are sampling. The sample variance is itself a random variable, and its nice property is that it estimates the variance of the population. We take a single sample of 40 exponentials here and calculate the sample variance and standard deviation as estimates of \(\sigma^2\) and \(\sigma\).
set.seed(0); var(rexp(n,lambda))
## [1] 23.90902
set.seed(0); sd(rexp(n,lambda))
## [1] 4.889685
As can be seen, the results are pretty close to the theoretical parameters. In order to investigate this further we will simulate 1000 samples of n = 10, 40, 90 and measure their variances. The histograms of the different cases are shown in Figure 2. It can be concluded that the sample variances track the theoretical value even more closely than the means did, and of course as the sample size increases the relation gets even stronger. So the sample variance estimates the population variance quite well.
The sample mean is itself a random variable whose mean equals the mean of the population, but the variance of its distribution is \(\frac{\sigma^2}{n}\), i.e. its standard deviation is \(\frac{\sigma}{\sqrt n}\). With our parameters that theoretical variance is 25/40 = 0.625, which is much smaller than the variance of the population.
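A one-line check of that arithmetic, using the lambda and n set above:

(1/lambda)^2 / n  # theoretical variance of the sample mean

## [1] 0.625

And we can take one sample: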
set.seed(0); var(rexp(n,lambda))/n
## [1] 0.5977255
Again - pretty close. Or we can simulate 1000 samples of n = 40 and calculate the variance of those 1000 means:
means = NULL; set.seed(0)
for (i in 1 : nosim) {means = c(means, mean(rexp(n,lambda)))}
var(means)
## [1] 0.6181582
Of course the number is now much closer to the theoretical value. In order to investigate this relation further we can plot the histogram of the means and compare it to the distribution of the population from which we are sampling. That is illustrated in Figure 3. It clearly shows that both distributions are centered at the population mean \(\mu = 5\), as indicated by the red vertical line. Also, the distribution of the sample means is clearly bell shaped, i.e. approximately normal, whereas the population distribution is very different from normal. It can also be observed that the exponential distribution is much more spread out than the distribution of the sample means.
Loading the data, we see three variables.
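ToothGrowth ships with base R, so a call along the following lines (a minimal sketch) loads and previews it:

library(datasets)     # ToothGrowth is one of R's built-in datasets
data(ToothGrowth)
head(ToothGrowth, 4)  # first four observations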
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
The data record the tooth length of 60 guinea pigs responding to different factors, namely three doses of Vitamin C delivered by two different methods: orange juice (OJ) and ascorbic acid (VC). In order to perform the basic exploratory analysis and statistical inference, 5 new variables will be created, as can be seen here:
supplement_VC <- subset(ToothGrowth, supp == "VC")[,1]  # tooth lengths for ascorbic acid
supplement_OJ <- subset(ToothGrowth, supp == "OJ")[,1]  # tooth lengths for orange juice
dose_05 <- subset(ToothGrowth, dose == .5)[,1]          # tooth lengths for dose 0.5
dose_1 <- subset(ToothGrowth, dose == 1)[,1]            # tooth lengths for dose 1
dose_2 <- subset(ToothGrowth, dose == 2)[,1]            # tooth lengths for dose 2
To get a feel for the data, we can plot how tooth length responds to the different doses of Vitamin C; that is shown in Figure 4 in the appendix. From the figure it is evident that increasing the quantity of Vitamin C clearly increases the tooth length of the guinea pigs, and the red regression line highlights that point. So our question is: is that increase statistically significant, in other words, how many standard errors away from zero does the observed difference lie? On the other hand, the data can be looked at from the point of view of the supplement with which the vitamin is given. That can be done by plotting supplement vs. length, shown in Figure 5. Eyeballing the graph, the data in that dimension look relatively similar, with the caveat that one group is more spread out. Again our task is to see whether the difference in the means is statistically significant.
A boxplot will be used for a quick look at all the data; it is illustrated in Figure 6. Again, it can be seen that there are differences between the group means for the different factors. That is why a statistical inference analysis needs to be performed, in order to decide whether these differences are significant.
After examining the data, it can be concluded that the observations are not paired, since 60 different guinea pigs are treated, not the same 10 animals under different treatments. That is why, when performing the t.test, the option paired = FALSE will be selected. The other decision to be made is whether the equal- or unequal-variance formula should be used.
First, the hypothesis that the difference between the two supplement means is 0 will be tested. That can easily be done:
t.test(supplement_VC, supplement_OJ, paired=F)$statistic
## t
## -1.915268
The p-value for that statistic is:
t.test(supplement_VC, supplement_OJ, paired=F)$p.value
## [1] 0.06063451
suggesting that the \(H_{0}\) hypothesis that the difference is 0 cannot be rejected at the 5% significance level. It can, however, be rejected at the 10% level. The 95% confidence interval can also be calculated easily:
t.test(supplement_VC, supplement_OJ, paired=F)$conf.int[1:2]
## [1] -7.5710156 0.1710156
From here we can see that 0 is contained in the interval, hence the \(H_{0}\) hypothesis cannot be rejected at this level. It should be noted that the confidence interval gives more information and already contains the decision of the hypothesis test. All these calculations were made assuming unequal variances. However, it is good to test the equal-variance option as well:
t.test(supplement_VC, supplement_OJ, paired=F, var.equal=T)$conf.int[1:2]
## [1] -7.5670064 0.1670064
We can see that the interval is slightly narrower, but 0 is still in it, so again we cannot reject the \(H_{0}\) hypothesis. As a check, a manual calculation of this confidence interval will be performed; the code is in the appendix, and the result is:
## [1] -7.5670064 0.1670064
Similarly, we test the hypothesis that there is no difference between the mean tooth lengths for the different doses, so the \(H_{0}\) hypothesis is again that nothing is going on - no difference.
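These comparisons reuse the dose_05, dose_1 and dose_2 variables defined earlier; the calls shown with each result in this part are a sketch following the same t.test pattern used for the supplement comparison. The statistic for comparing dose 0.5 and dose 1.0 is:

t.test(dose_05, dose_1, paired=F)$statistic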
## t
## -6.476648
Obviously, with this statistic 6.48 standard errors away from 0, the null hypothesis is rejected overwhelmingly. The p-value is:
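t.test(dose_05, dose_1, paired=F)$p.value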
## [1] 1.268301e-07
so small that \(H_{0}\) is rejected at any meaningful level. The associated confidence interval is:
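t.test(dose_05, dose_1, paired=F)$conf.int[1:2]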
## [1] -11.983781 -6.276219
Zero is really far away from the limits of this interval, so obviously there is a difference between these means. One more check for the case with equal variances:
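t.test(dose_05, dose_1, paired=F, var.equal=T)$conf.int[1:2]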
## [1] -11.983748 -6.276252
The difference is so small that it can really be neglected. Because this experiment has three different doses of vitamin C, it is good to test the other possible combinations as well, to see if they differ from each other. Here is the confidence interval for dose 1.0 vs. dose 2.0:
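t.test(dose_1, dose_2, paired=F)$conf.int[1:2]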
## [1] -8.996481 -3.733519
Again, we see that 0 is not in the interval, suggesting that we reject the \(H_{0}\) hypothesis of “nothing is going on”. The confidence interval for dose 0.5 vs. dose 2.0 is:
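t.test(dose_05, dose_2, paired=F)$conf.int[1:2]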
## [1] -18.15617 -12.83383
Obviously much wider than the rest, and 0 is far away, so again we reject the null.
After all the tests performed, we can conclude that there is no statistically significant difference between using orange juice or ascorbic acid as the supplement when adding different amounts of vitamin C, although it should be said that we only narrowly failed to reject the null hypothesis at the 5% level. On the contrary, there is strong evidence that higher doses of vitamin C lead to greater tooth length in the guinea pigs. As for the assumptions, we should suppose that the hypothetical “population” is approximately normally distributed. We do know that, whatever the population, the sample mean will eventually be approximately normally distributed, but in this case the samples are really small, and the Central Limit Theorem only guarantees “normality” as n increases. For a single small-sample test we therefore cannot always be so confident; if we had evidence that the “population” was really not normal, other techniques should perhaps be sought. Nevertheless, in our case normality can, in my view, be safely assumed. The other assumption is unequal variance of the samples, which is the default of the t.test and is to be preferred since it is more conservative.
Figure 1: Distribution of the means of 1000 samples of size 10, 40 and 90 from the exponential distribution.
Figure 2: Distribution of the variances of 1000 samples of size 10, 40 and 90 from the exponential distribution.
Figure 3: Histogram of 1000 draws from the exponential distribution (left) and histogram of the means of 1000 samples of 40 exponentials (right).
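# Figure 1: histograms of the means of 1000 samples of size 10, 40 and 90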
library(ggplot2); set.seed(0)
dat <- data.frame(
x = c(apply(matrix(rexp(nosim * (n-30), lambda),
nosim), 1, mean),
apply(matrix(rexp(nosim * n, lambda),
nosim), 1, mean),
apply(matrix(rexp(nosim * (n+50), lambda),
nosim), 1, mean)
),
size = factor(rep(c(10, 40, 90), rep(nosim, 3))))
g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(alpha = .20,
binwidth=.2, colour = "black")
g <- g + geom_vline(xintercept = 1/lambda, size = 2)
g + facet_grid(. ~ size) #+ annotate("text", x = 7, y = 40, label = "Some text")
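# Figure 2: histograms of the variances of 1000 samples of size 10, 40 and 90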
set.seed(0)
dat <- data.frame(
x = c(apply(matrix(rexp(nosim * (n-30), lambda),
nosim), 1, var),
apply(matrix(rexp(nosim * n, lambda),
nosim), 1, var),
apply(matrix(rexp(nosim * (n+50), lambda),
nosim), 1, var)
),
size = factor(rep(c(10, 40, 90), rep(nosim, 3))))
g <- ggplot(dat, aes(x = x, fill = size)) + geom_histogram(alpha = .20,
binwidth=7, colour = "black")
g <- g + geom_vline(xintercept = (1/lambda)^2, size = 2)
g + facet_grid(. ~ size)
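# Figure 3: exponential draws vs. means of samples of 40 exponentials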
set.seed(0)
dat <- data.frame(
x = c(rexp(nosim, lambda),
apply(matrix(rexp(nosim * n, lambda),
nosim), 1, mean)
),
simulation = factor(rep(c("exp dist", "sample dist"), rep(nosim, 2))))
g <- ggplot(dat, aes(x = x, fill = simulation)) + geom_histogram( alpha = .20,
binwidth=.5, colour = "black")
g <- g + geom_vline(xintercept = 1/lambda, size = 1,color="red")
g + facet_grid(.~simulation)#+ annotate("text", x = 20, y = 150, label = "Some text")
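# Figure 4: tooth length vs. dose, faceted by supplement, with linear fit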
library(ggplot2)
g <- ggplot(ToothGrowth,aes(dose,len))
g + geom_point(color="steelblue") + facet_grid(.~supp) + geom_smooth(size = 1,
linetype = 1, method = "lm", se = FALSE, color="red")
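# Figure 5: tooth length vs. supplement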
g<-ggplot(ToothGrowth,aes(supp,len))
g+geom_point(color="steelblue",size=3.5)
#+ geom_smooth(size = 1, linetype = 1, method = "lm", se = FALSE, color="red")
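# Figure 6: boxplot summary of all groups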
boxplot(supplement_OJ, supplement_VC, dose_05, dose_1, dose_2, names = c("Supplement OJ",
"Supplement VC", "Dose 0.5", "Dose 1", "Dose 2"), col="red")
title("Basic Summary of the Data")
ns1 <- length(supplement_VC); ns2 <- length(supplement_OJ)   # group sizes
sp_s <- sqrt( ((ns1 - 1) * sd(supplement_VC)^2 + (ns2-1) * sd(supplement_OJ)^2) / (ns1 + ns2-2))  # pooled standard deviation
md_s <- mean(supplement_VC) - mean(supplement_OJ)            # difference in sample means
semd_s <- sp_s * sqrt(1 / ns1 + 1/ns2)                       # standard error of the difference
md_s + c(-1, 1) * qt(.975, ns1 + ns2 - 2) * semd_s           # 95% equal-variance confidence interval
#df <- ((var(supplement_VC)/ns1 + var(supplement_OJ)/ns2)^2) /
#      ((var(supplement_VC)/ns1)^2/(ns1-1) + (var(supplement_OJ)/ns2)^2/(ns2-1))  # Welch-Satterthwaite degrees of freedom
#md_s + c(-1,1) * qt(.975, df) * sqrt(var(supplement_VC)/ns1 + var(supplement_OJ)/ns2)  # unequal-variance 95% CI