Q1

Example

#hist(runif(1000))
#mns = NULL
#for (i in 1:1000) mns = c(mns, mean(runif(40)))
#hist(mns)

The following code simulates the process of drawing 1000 samples of size 40 from an exponential distribution with lambda = 0.2. Then it calculates the sample mean and variance and theoretical mean and variance. Finally, it plots histograms showing the distribution of the random exponentials and the sample means.

set.seed(1234)
lambda = 0.2
n = 40

sampleMeans = NULL
for (i in 1:1000) sampleMeans <- c(sampleMeans, mean(rexp(n, lambda)))

#Average of sample means vs theoretical mean
Avg_Means = mean(sampleMeans)
The_Mean = 1/lambda
#4.994 vs 5

#Variance of sample means vs theoretical variance of the mean
Var_Means = var(sampleMeans)
The_Var = 1/lambda

#Compare distribution of sample means vs that of original random exponentials
hist(rexp(1000, lambda), main = 'Histogram of 1000 random exponentials',
     xlab = paste('rexp(1000, ', lambda, ')', sep = ''), col = 'grey')
abline(v = The_Mean, col = 'green', lwd = 2)

hist(sampleMeans, main = 'Histogram of 1000 sample means',
     xlab = 'Sample Means', col = 'grey')
abline(v = Avg_Means, col = 'green', lwd = 2)

The average of the sample means is 4.9742388, which is very close to the theoretical mean 5. The variance of the sample means is about 10 times smaller than the variance of the original sample (0.5706551 vs. 5).

As shown in the histrograms, the distribution of the original random exponentials is very skewed to the right. But with repeated sampling, the sampling distribution is approximately normal and tighter.

Q2

The following code calculates the means and standard deviations of tooth length per vitamin C supplement types and doses and plots line graphs to show the data.

#Check data
#?ToothGrowth
table(ToothGrowth$supp, ToothGrowth$dose)

##     
##      0.5  1  2
##   OJ  10 10 10
##   VC  10 10 10

sum(is.na(ToothGrowth$len)) #no missing

## [1] 0

hist(ToothGrowth$len, main = 'Histogram to show tooth growth', xlab = 'Tooth length')
ToothGrowth$supp <- as.character(ToothGrowth$supp)

#Summary of data
library(plyr)
ag <- ddply(ToothGrowth, c('supp', 'dose'), summarize, n = length(len), mn = mean(len), sd = sd(len))
ag

##   supp dose  n    mn       sd
## 1   OJ  0.5 10 13.23 4.459709
## 2   OJ  1.0 10 22.70 3.910953
## 3   OJ  2.0 10 26.06 2.655058
## 4   VC  0.5 10  7.98 2.746634
## 5   VC  1.0 10 16.77 2.515309
## 6   VC  2.0 10 26.14 4.797731

#ag$dose <- as.character(ag$dose)
names(ag)[1] <- 'Supplement'

#Line graphs
library(ggplot2)

ggplot(data = ag, aes(x = dose, y = mn, group = Supplement, color = Supplement)) + 
  geom_point(size = 5, alpha = 1/2) +
  geom_errorbar(aes(ymin = mn - sd, ymax = mn + sd), width = 0.2, size = 1) +
  geom_line(size = 1) +
  labs(x = 'Dose', y = 'Mean tooth length and SD', 
      title = 'Tooth growth per vitamin C dose')

Graphically, vitamin C supplement only affects tooth growth at dose <2 and has no effect at dose = 2. Now, let’s perform statistical tests to support this observation. We will test the null hypotheseis, H0: supplement has no effect on tooth growth, against the alternative hypothesis, Ha: tooth growth differs by supplement. Assume tooth length is iid sampled from a normal distribution.

#Hand-calculate 95% CI, with pooled standard error
sp <- with(ag, sqrt(sum((n-1)*sd^2)/(sum(n)-6)))

CI = list(rep(numeric(2), 3)) #Create a list of 3 empty numeric vectors
for (i in 1:3){
  print(CI[[i]] <- ag$mn[i]-ag$mn[i+3] + c(-1, 1)*qt(0.975, 18)*sp*(1/9+1/9)^0.5)
}

## [1] 1.653508 8.846492
## [1] 2.333508 9.526492
## [1] -3.676492  3.516492

#2-sample t-test
TG05<-subset(ToothGrowth, dose == 0.5)
TG1<-subset(ToothGrowth, dose == 1)
TG2<-subset(ToothGrowth, dose == 2)

TG <- list(TG05, TG1, TG2)

#Assume equal variance between supp
ttest_sameVar <- 
  lapply(TG, function(x){
        conf.int <- t.test(len ~ supp, data= x, paired = F, var.equal = TRUE)$conf.int
        pvalue <- t.test(len ~ supp, data = x, paired = F, var.equal = TRUE)$p.value
        c(conf.int = round(conf.int, 6), pvalue = round(pvalue, 6))
})
data.frame(Reduce(rbind, ttest_sameVar))

## Warning in data.row.names(row.names, rowsi, i): some row.names duplicated:
## 3 --> row.names NOT used

##   conf.int1 conf.int2   pvalue
## 1  1.770262  8.729738 0.005304
## 2  2.840692  9.019308 0.000781
## 3 -3.722999  3.562999 0.963710

#Assume different variance between supp
ttest_diffVar <- 
  lapply(TG, function(x){
    conf.int <- t.test(len ~ supp, data= x, paired = F, var.equal = FALSE)$conf.int
    pvalue <- t.test(len ~ supp, data = x, paired = F, var.equal = FALSE)$p.value
    c(round(conf.int, 6), round(pvalue, 6))
    c(conf.int = round(conf.int, 6), pvalue = round(pvalue, 6))
  })
data.frame(Reduce(rbind, ttest_diffVar))

## Warning in data.row.names(row.names, rowsi, i): some row.names duplicated:
## 3 --> row.names NOT used

##   conf.int1 conf.int2   pvalue
## 1  1.719057  8.780943 0.006359
## 2  2.802148  9.057852 0.001038
## 3 -3.798070  3.638070 0.963852

In the previous summary table, the standard deviations of tooth length for the two supplements are quite different. There is no reason to assume that the two groups have equal variance. So another set of t-tests are run again with unequal variance assumption. But the change in assumption has almost no effect on the confidence intervals.

When 95% confidence intervals include zero, we fail to reject H0 and vice versa. Hence, we conclude that supplement does impact tooth growth at dose = 0.5 and 1 but not so at dose = 2. This is consistent with the above line graphs.

Title: Statistical Inference Course Project

Author: Tuttoasposto

Overview: Simulate data to see sampling distritbution of estimated statistics

Q1

Example

Q2