Statistical Inference Course Project

Part 1. Simulation Exercise

Overview

This project is dedicated to the exponential distribution investigation in R and its comparison with the Central Limit Theorem. The parameters of exponential distribution are: mean = \(1/\lambda\), standard deviation = \(1/\lambda\). In this project we set \(\lambda=0.2\) and investigate the distribution of averages of 40 exponentials for all of the 1000 simulations.

Simulations

The following code creates a matrix 40x1000, each column is a separate independent simulation of 40 observations of an exponentially distributed random variable with \(\lambda=0.2\)

n<-40; lambda<-0.2; simulations<-1000; set.seed(10176)
m<-matrix(nrow=n,ncol=simulations)
for (i in 1:simulations){ 
    m[,i]<-rexp(n, lambda) 
}

Sample Mean versus Theoretical Mean

Theoretical mean for the exponential distribution with \(\lambda=0.2\) is \(mean=1/\lambda=1/0.2=5\). Let’s now calculate the vector of sample means for our simulated data.

mns<-data.frame(x=apply(m,2,mean))
str(mns)

## 'data.frame':    1000 obs. of  1 variable:
##  $ x: num  3.42 5.57 4.61 4.86 4.79 ...

summary(mns)

##        x        
##  Min.   :2.554  
##  1st Qu.:4.456  
##  Median :4.976  
##  Mean   :5.009  
##  3rd Qu.:5.509  
##  Max.   :7.716

According to the Central Limit Theorem (CLT), the averages should approximately follow normal distribution and be centered at the population mean, e.g. \(\bar X_n\) is approximately \(N(\mu, \sigma^2 / n)\). The mean of the distribution of the averages is the estimate of the population mean. Its value is very close to the theoretical mean: 5.0088453.
The following histogram is built using the vector of means. As we can see the shape of the histogram resembles bell curve of normal distribution and the most of the sample means appear to be close to the theoretical mean - yellow line x=5. The sample mean - vertical red line - is almost equal the theoretical mean.
The normalized distribution of the sample means shown on the figure below is approximately standard normal \(\frac{\bar X_n - \mu}{\sigma / \sqrt{n}}\) ~ \(N(0, 1)\). To make it clear, the density function for \(N(0, 1)\) is drawn on the same figure.

The estimates of the variance and the standard error of the population mean are the variance and the standard error of the sample mean respectively: \(var=\sigma^2/n=\) 0.6097241, \(std\_err=\sigma/\sqrt n=\) 0.7808483. The theoretical value for mean variance is 0.625 and for standard error is 0.7905694. So, the theoretical and sample values are pretty close.
We showed on the example that the CLT allows to estimate population mean using large number of the sample means.

Sample Variance versus Theoretical Variance

In this section we’re going to illustrate that the sample variance estimates population (theoretical) variance. Theoretical variance for the exponential distribution with \(\lambda=0.2\) is \(var=1/\lambda^2=1/0.2^2=25\). Let’s now calculate the vector of sample means for our simulated data.

variances<-data.frame(x=apply(m,2,var))
str(variances)

## 'data.frame':    1000 obs. of  1 variable:
##  $ x: num  13.6 18.4 22 17.8 19.3 ...

summary(variances)

##        x         
##  Min.   : 4.108  
##  1st Qu.:17.300  
##  Median :23.297  
##  Mean   :25.043  
##  3rd Qu.:30.925  
##  Max.   :87.547

As we know, the distribution of a sample variance is centered around the theoretical variance. So let’s have a look at the histogram and the distribution graph for the sample variance. Here, yellow vertical line is the theoretical variance and red is the estimate.
The shape of the variances distribution doesn’t look like normal distribution and this is correct because CLT doesn’t apply for variances. The distribution is centered around the population variance, e.g. mean of the sample variances estimates theoretical variance very precisely: its value is 25.0432038 and theoretical is 25.

Distribution

Let’s compare the distribution of a large collection (1000) of random exponentials and the distribution of 1000 means of 40 exponentials. This distribution of the means has the shape that is very close to Gaussian but the exponential distribution looks absolutely different. Again we see that according to CLT the distribution of averages of iid variables (properly normalized) becomes that of a standard normal.

Part 2. Basic Inferential Data Analysis

Now we’re going to analyze the ToothGrowth data in the R datasets package that shows the effect of vitamin C on tooth growth in guinea pigs. The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of 3 dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice or ascorbic acid (coded as OJ and VC).

Let’s load the data and have a look at its summaries.

library(datasets) 
data("ToothGrowth") 
str(ToothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

summary(ToothGrowth)

##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

with(ToothGrowth,table(supp,dose))

##     dose
## supp 0.5  1  2
##   OJ  10 10 10
##   VC  10 10 10

There are 10 animals for each combination of dose and method.

with(ToothGrowth,table(supp,dose))

##     dose
## supp 0.5  1  2
##   OJ  10 10 10
##   VC  10 10 10

On the figure we clearly see that dose increase leads to the increasing of the tooth length. Also it looks like OJ is more effective in small doses than VC, but approximately same effective for dose=2.0. Let’s compare the OJ-VC difference for each dose.

\(H_0=\{ mean\ of\ the\ difference = 0\}\)
\(H_a=\{ mean\ of\ the\ difference > 0\}\).

The results of t-test (95% confidence) for the hypotheses comparison for each dose are shown in the table.

Dose	[	]	P-value	t statistic	Mean in group OJ	Mean in group VC
0.5	1.719057	8.780943	0.0063586	3.1697328	13.23	7.98
1.0	2.802148	9.057852	0.0010384	4.0327696	22.70	16.77
2.0	-3.798071	3.638070	0.9638516	-0.0461361	26.06	26.14

For small doses (0.5 and 1) confidence interval for the difference mean is above 0 and p-values are very small, which means that for these two cases \(H_0\) should be rejected and alternative hypothesis is accepted. For the last case (dose=2) there is not enough statistical evidence in order to reject \(H_0\).

Conclusion

Increasing the dose of vitamin C leads to the increasing of the length of odontoblasts in the teeth.
OJ in small doses (0.5mg and 1mg) is more effective than AC.
For bigger doses (2mg) delivery method doesn’t matter.

Appendix 1.

A1.1. Mean estimate

mean_est<-mean(mns$x)

A1.2. Multiplot function

Function that allows to put plots in a row taken from http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/

# Multiple plot function 
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }
 if (numPlots==1) {
    print(plots[[1]])
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

A1.3. Histogram and distribution of sample means

library(ggplot2)

g1 <- ggplot(mns, aes(x = x)) + theme_bw() + 
    geom_histogram(alpha = .7, binwidth=.25, colour = "blue", fill="lightslateblue") +
    geom_vline(xintercept = 5, colour="goldenrod1",size = 1.5) + 
    geom_vline(xintercept = mean_est, colour="firebrick2",size = 0.6) + 
    labs(title="Histogram of the sample means") +
    xlab("Mean") + ylab("Frequency")

mu<-5
sigma<-5
f<-function(x,n) (mean(x) - mu)/(sigma/sqrt(n))
dat<-data.frame(x=apply(m,2,f,n))

g2<-ggplot(dat,aes(x=x)) + theme_bw() +
    geom_histogram(alpha=.5, binwidth=.1, colour="forestgreen", 
                   fill="forestgreen", aes(y= ..density..))+
    stat_function(fun = dnorm, size = 1) +
    labs(title="Distribution of the sample means (normalized)") +
    xlab("Mean") + ylab("Density") 

multiplot(g1, g2, cols=2)

A1.4. Variance and standard error of the mean

v<-var(mns$x)
std_err<-sd(mns$x)
teor_v<-sigma^2/n
teor_std_err<-sigma/sqrt(n)

A1.5. Histogram and distribution of sample variances

var_est<-mean(variances$x)

g3 <- ggplot(variances, aes(x = x)) + theme_bw() +
  geom_histogram(alpha = .7, binwidth=1.5, colour ="blue", fill="lightslateblue") +
  geom_vline(xintercept = 25, colour="goldenrod1", size=1.5) +
  geom_vline(xintercept =var_est, color="firebrick2", size=0.6) + 
  labs(title="Histogram of the sample variances") +
  xlab("Variance") + ylab("Frequency")

g4 <- ggplot(variances, aes(x = x)) + theme_bw() +
  geom_histogram(alpha=.5, binwidth=1, colour="forestgreen", 
                 fill="forestgreen", aes(y= ..density..)) +
  geom_vline(xintercept =var_est, color="firebrick2", size=0.6) +
  labs(title="Distribution of the sample variances") +
  xlab("Variance")+ ylab("Density")

multiplot(g3, g4, cols=2)

A1.6. Exponential distribution and distribution of the sample means

exp_distr<-data.frame(x=rexp(simulations, lambda))

g5 <- ggplot(exp_distr, aes(x = x)) + theme_bw() + 
    geom_histogram(alpha = .7,binwidth=0.5,aes(y= ..density..), 
                   colour ="blue",fill="lightslateblue") + 
    labs(title="Exponential distribution with lambda=0.2") + 
    xlab("x") + ylab("Density")

# Another simulation
set.seed(113659)
m1<-matrix(nrow=n,ncol=simulations)
for (i in 1:simulations){ 
    m1[,i]<-rexp(n, lambda) 
}
dat1<-data.frame(x=apply(m1,2,f,n))

g6 <- ggplot(dat1,aes(x=x)) + theme_bw() +
    geom_histogram(alpha=.5, binwidth=.1, colour="forestgreen", 
                   fill="forestgreen", aes(y= ..density..)) +
    stat_function(fun = dnorm, size = 1) +  
    labs(title="Distribution of the sample means (normalized)") +
    xlab("Mean") + ylab("Density")

multiplot(g5, g6, cols=2)

Appendix 2.

A2.1. Creating exploratory plot

library(ggplot2)

gt<-ggplot(ToothGrowth, aes(x=dose,y=len)) + 
  geom_point(alpha=0.6, aes(color=dose), size=2) + 
    facet_grid(.~supp) + theme_bw() + geom_smooth(method = "loess", size=0.5) +
    labs(title="    Tooth growth by dose of the vitamin C for each delivery method") +
    xlab("Dose of vitamin C, mg/day)") + ylab("Length of odontoblasts")+
    theme(text= element_text(size = 7.5)) + 
    theme(plot.title = element_text(size = 10))
gt

A2.2. T test

t1<-t.test(len~supp, paired=FALSE, var.equal=FALSE, 
           data=ToothGrowth[ToothGrowth[,"dose"]==0.5,])
t2<-t.test(len~supp, paired=FALSE, var.equal=FALSE, 
           data=ToothGrowth[ToothGrowth[,"dose"]==1,])
t3<-t.test(len~supp, paired=FALSE, var.equal=FALSE, 
           data=ToothGrowth[ToothGrowth[,"dose"]==2,])

df<-data.frame()
df<-cbind(rbind(0.5,1,2), 
          rbind(t1$conf.int, t2$conf.int, t3$conf.int), 
          rbind(t1$p.value, t2$p.value, t3$p.value),
          rbind(t1$statistic, t2$statistic, t3$statistic),
          rbind(t1$estimate, t2$estimate, t3$estimate)
    )
colnames(df)<-c("Dose","[","]","P-value", "t statistic", 
                "Mean in group OJ", "Mean in group VC")
library(knitr)
kable(df)