Confirming the Central Limit Theorem by Simulation and Analysis of ToothGrowth dataset

Introduction

This project is a part of Coursera’s Statistical Inference course. Statistical inference is the process of using data analysis to deduce properties of an underlying probability distribution. Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates. It is assumed that the observed data set is sampled from a larger population. Inferential statistics can be contrasted with descriptive statistics. Descriptive statistics is solely concerned with properties of the observed data, and it does not rest on the assumption that the data come from a larger population (Source: Wikipedia).

The project consists of two parts:

A simulation exercise.
Basic inferential data analysis.

Part 1: Simulation Exercise

The Central Limit Theorem states that the sampling distribution of the sample means approaches a normal distribution as the sample size gets larger — no matter what the shape of the population distribution. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions.

For more information on Central Limit Theorem:

Wikipedia Page: Central Limit Theorem
Youtube Link: Central limit theorem | Inferential statistics | Probability and Statistics | Khan Academy

In this part, we will investigate the exponential distribution in R and compare it with the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. You will investigate the distribution of averages of 40 exponentials. Note that you will need to do a thousand simulations.

Design parameters for the simulation:

set.seed(100)
lambda <- 0.2
simulations <- 1000
n <- 40
sd <- 1/lambda

We need to illustrate the following through simulation and associated explanatory text:

Show the sample mean and compare it to the theoretical mean of the distribution.
Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
Show that the distribution is approximately normal.

The theoretical mean is:

theoretical_mean <- 1/lambda
theoretical_mean

## [1] 5

The simulated mean is:

sim <- data.frame(x = sapply(1:simulations, function(x) mean(rexp(n, lambda))))
head(sim)

##          x
## 1 4.137412
## 2 6.051703
## 3 4.415869
## 4 4.404714
## 5 3.210413
## 6 5.475307

simulation_mean <- mean(sim$x)
simulation_mean

## [1] 4.999702

The theoretical variance is:

theoretical_var <- (1/lambda)^2 / n
theoretical_var

## [1] 0.625

The simulated variance is:

simulation_var <- var(sim$x)
simulation_var

## [1] 0.6432442

Thus, we see that our simulation results converges to the theoretical values. Now, to visualize the results, we make use of ggplot library.

library(ggplot2)
g <- ggplot(data = sim, aes(x = x)) + 
        geom_histogram( aes(y = ..density..),
                        binwidth = 0.15,
                        color = I("black"),
                        fill = I("green")) + 
        stat_function(fun = dnorm,
                      args = list(mean = theoretical_mean, sd = sqrt(theoretical_var)),
                      size = 1.2) + 
        geom_vline(xintercept = simulation_mean,
                   size=0.7,
                   color="blue",
                   linetype = 'longdash') +
        labs(title = "Sampling Distribution of the Sample Means",
             x = "Sample Means",
             y = "Frequency")
g

Inference: We can see that our sampling distribution approximates the normal distribution reasonably well for a sample size of 40. It is important to note that a greater sample size would result in an even closer approximation of the normal distribution, whereas a smaller sample size would result the opposite.

Part 2: Basic Inferential Data Analysis

Now in the second portion of the project, we’re going to analyze the ToothGrowth data in the R datasets package. We begin by loading the dataset.

library(datasets)
data("ToothGrowth")
ToothGrowth <- as.data.frame(ToothGrowth)
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
str(ToothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: Factor w/ 3 levels "0.5","1","2": 1 1 1 1 1 1 1 1 1 1 ...

summary(ToothGrowth)

##       len        supp     dose   
##  Min.   : 4.20   OJ:30   0.5:20  
##  1st Qu.:13.07   VC:30   1  :20  
##  Median :19.25           2  :20  
##  Mean   :18.81                   
##  3rd Qu.:25.27                   
##  Max.   :33.90

Now, we plot Supplementary Type vs Tooth Length.

library(ggplot2)
g1 <- ggplot(aes(x = supp, y = len), data = ToothGrowth) +
      geom_boxplot(aes(fill = supp)) +
      ggtitle("Comparison between Length and Supplement") +
      ylab("Length") + xlab("Supplement")
g1

From the figure, it appears that a longer length is detected for subjects tested with supplement OJ(Orange Juice) compared to VC(Vit C). Because of this, it might be worth it to do a hypothesis test for these. Let us check if the same can be said about a possible doses.

g2 <- ggplot(aes(dose, len), data = ToothGrowth) + 
      geom_boxplot(aes(fill=dose)) +
      ggtitle("Boxplot of tooth lengths per dose") +
      xlab("Dose in mg per day") + ylab("Tooth length in mm")
g2

Again, there exists a clear correlation, based on the boxplots. This motivates us to do a hypothesis test for the different doses as well.

Perform Confidence Intervals and Hypothesis Test to Compare Tooth Growth by Supplement and Dose

To test the hyphothesis, a series of two-sided unpaired t-tests will be use to obtain the confidence intervals and p-values. Significance level to be tested will be at 0.05. The p-values will be adjusted using Bonferroni correction method (for conservative) and the comparative results show in the table below:

library(pander)
ttests <- lapply(c(.5, 1, 2), function(x) {
    t.test(len ~ supp, data=subset(ToothGrowth, dose == x), paired=FALSE, var.equal=FALSE)})
ttests

## [[1]]
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 3.1697, df = 14.969, p-value = 0.006359
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.719057 8.780943
## sample estimates:
## mean in group OJ mean in group VC 
##            13.23             7.98 
## 
## 
## [[2]]
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = 4.0328, df = 15.358, p-value = 0.001038
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.802148 9.057852
## sample estimates:
## mean in group OJ mean in group VC 
##            22.70            16.77 
## 
## 
## [[3]]
## 
##  Welch Two Sample t-test
## 
## data:  len by supp
## t = -0.046136, df = 14.04, p-value = 0.9639
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.79807  3.63807
## sample estimates:
## mean in group OJ mean in group VC 
##            26.06            26.14

pvals <- c(ttests[[1]]$p.value, ttests[[2]]$p.value, ttests[[3]]$p.value)
stats <- c(ttests[[1]]$statistic, ttests[[2]]$statistic, ttests[[3]]$statistic)
adjusted_p <- p.adjust(pvals, method = "bonferroni")
lower <- sapply(c(ttests[[1]]$conf.int[1], ttests[[2]]$conf.int[1], ttests[[3]]$conf.int[1]), round, 3)
upper <- sapply(c(ttests[[1]]$conf.int[2], ttests[[2]]$conf.int[2], ttests[[3]]$conf.int[2]), round, 3)
df <- data.frame(dose=c(0.5, 1, 2), t=stats, p=pvals, adj=adjusted_p,
                 ci=paste0("[",paste(lower, upper, sep=", "), "]"))
colnames(df) <- c("Dose", "t", "p-value", "adj. p-value", "conf. int.")
pander(df, round=3, split.tables=120,
       caption="**_Two-sided t-test comparison of Supplement by Dose_**")

***Two-sided t-test comparison of Supplement by Dose***
Dose	t	p-value	adj. p-value	conf. int.
0.5	3.17	0.006	0.019	[1.719, 8.781]
1	4.033	0.001	0.003	[2.802, 9.058]
2	-0.046	0.964	1	[-3.798, 3.638]

At the 0.5 mg and 1 mg dose levels, there is a statistically significant difference (reject null hypothesis) between the means of the OJ and VC groups. The adjusted p-values are significant at the α = 0.05 level, and the 95% confidence intervals do not include zero.
For the 2 mg dose level, we fail to reject the null hypothesis, the adjusted p-value is much greater than 0.5, and the 95% confidence interval includes zero. So, it seems that dose level, there is no significative influence of the supplement of Orange Juice or Vit C on tooth growth.

Conclusions

There is a clear indications that the amount of dosage have clear effects on the length of teeth guinea pigs.
Supplement type has a clear influence too, OJ has a greater avarage teethgrowth comapre to VC.
Supplement OJ with dosages 0.5 and 1 has a clear influence than VC supplement, on growth of teeth length. VC supplement vs the OJ in combiantion with dosage 2mg/day has no significant effect(almost same mean & same confidence interval).

Assumptions

The sampling method for each sample is simple random sampling.
The samples are independent.
The sampling distribution is approximately normal, without outliers.