Simulation and inference exercise

For our statistical inference class, I use simulation to investigate the exponential distribution in R, and then use inference to analyze the ToothGrowth data from the R datasets package.

Simulation

First I set up a vector of 1000 NAs to store the sample means, then use a for loop to draw 1000 samples of 40 exponentials each (rate 0.2) and store the mean of each sample in “sample_mean”.

sample_mean=rep(NA,1000)
for (i in 1:1000){
  samp=rexp(40,0.2)
  sample_mean[i]=mean(samp)
}
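
The same simulation can also be sketched without an explicit loop. This is only an alternative formulation, not the code used for the results in this report; the seed and the names n_sims, n, rate and sample_mean_vec are my own assumptions for illustration.

```r
# Sketch: the simulation written with replicate() instead of a for loop.
# Assumptions (not in the original): the seed and all variable names below.
set.seed(2024)   # arbitrary seed, only so the sketch is reproducible

n_sims <- 1000   # number of simulated samples
n      <- 40     # exponentials per sample
rate   <- 0.2    # rate (lambda) passed to rexp()

# replicate() evaluates the expression n_sims times and
# returns the results as a numeric vector
sample_mean_vec <- replicate(n_sims, mean(rexp(n, rate)))

length(sample_mean_vec)   # one mean per simulated sample
```

Because the seed differs, the numbers will not match the output shown in this report, but the distribution of the means should look the same.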

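Next we use box plots to demonstrate the difference between the simulated sample means and draws from the exponential distribution. Note that samp holds only the last sample of 40 exponentials drawn in the loop, so the left-hand plot shows one sample rather than the theoretical distribution itself.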

par(mfrow=c(1,2))
boxplot(samp,ylim=c(0,15),main="one sample of 40 exponentials")
boxplot(sample_mean,ylim=c(0,15),main="simulation sample mean")

mean(samp)

## [1] 7.176523

mean(sample_mean)

## [1] 4.979948

var(samp)

## [1] 38.2815

var(sample_mean)

## [1] 0.5869743

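The mean of the 1000 sample means (4.98) is very close to the theoretical mean of the exponential distribution, 1/0.2 = 5, whereas the mean of a single sample of 40 (7.18 here) can be well off. The individual exponential draws are also far more variable than the sample means: the variance of the single sample is 38.28, against 0.587 for the sample means.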
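
As a sanity check, the theoretical values can be computed directly from the rate. For an exponential distribution with rate lambda, the mean is 1/lambda and the variance is 1/lambda^2, so the mean of n independent draws has variance 1/(lambda^2 * n). The sketch below is minimal, and the variable names are my own rather than part of the original analysis.

```r
# Theoretical values for the exponential distribution with rate 0.2
# and for the mean of n = 40 independent draws.
# Variable names are illustrative assumptions.
rate <- 0.2
n    <- 40

theo_mean     <- 1 / rate            # mean of one exponential draw: 5
theo_var      <- 1 / rate^2          # variance of one exponential draw: 25
theo_var_mean <- theo_var / n        # variance of the mean of 40 draws: 0.625
theo_se       <- sqrt(theo_var_mean) # standard error of the mean

c(mean = theo_mean, var = theo_var, var_of_mean = theo_var_mean, se = theo_se)
```

The simulated var(sample_mean) of 0.587 is close to the theoretical 0.625, and mean(sample_mean) of 4.98 is close to 5.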

par(mfrow=c(1,2))
hist(samp,main="exponential distribution",xlab="exponential")
hist(sample_mean,main="sample_mean dis",xlab="sample mean")

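This illustrates the central limit theorem: even though the exponential distribution is strongly right-skewed, the distribution of the means of 40 draws is approximately normal, as in the right-hand histogram above. Strictly speaking, this is repeated sampling from a known distribution rather than a bootstrap, and the result relies on the underlying distribution having finite variance and on the sample size being reasonably large.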
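
One way to check the normal approximation visually is to overlay the normal density implied by the theory (mean 1/rate, standard deviation (1/rate)/sqrt(n)) on the histogram of sample means, and to draw a normal Q-Q plot. This is a sketch under my own assumptions: the seed, the variable names and the plot titles are not from the original analysis, and the means are re-simulated so that the block runs on its own.

```r
# Sketch: visual check of the normal approximation for the sample means.
# Assumptions (not in the original): seed, variable names, plot titles.
set.seed(2024)
rate  <- 0.2
n     <- 40
means <- replicate(1000, mean(rexp(n, rate)))

par(mfrow = c(1, 2))

# Histogram on the density scale, with the theoretical normal curve on top
hist(means, freq = FALSE, breaks = 30,
     main = "sample means with normal overlay", xlab = "sample mean")
curve(dnorm(x, mean = 1 / rate, sd = (1 / rate) / sqrt(n)),
      add = TRUE, lwd = 2)

# Normal Q-Q plot: points near the line indicate approximate normality
qqnorm(means)
qqline(means)
```

With n = 40 some right skew may still be visible in the upper tail of the Q-Q plot, since the approximation only improves as n grows.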

Inferential data analysis

Next I’m going to analyze the ToothGrowth data in the R datasets package.

library(ggplot2)
ToothGrowth<-ToothGrowth
ToothGrowth$dose<-as.factor(ToothGrowth$dose)
summary(ToothGrowth)

##       len        supp     dose   
##  Min.   : 4.20   OJ:30   0.5:20  
##  1st Qu.:13.07   VC:30   1  :20  
##  Median :19.25           2  :20  
##  Mean   :18.81                   
##  3rd Qu.:25.27                   
##  Max.   :33.90

ggplot(aes(dose,len,fill=supp),data=ToothGrowth)+facet_grid(.~supp)+geom_boxplot()

The box plot suggests that orange juice (OJ) is more effective for tooth length than ascorbic acid (VC) at the lower dose levels (0.5 and 1 mg). At a dose of 2 mg, the two supplements appear roughly equally effective.
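
The visual impression from the box plot can be backed up with a table of group means. The sketch below uses aggregate() from base R on a fresh copy of the data; the name group_means is my own, and no output is shown because I have not run it as part of this report.

```r
# Sketch: mean tooth length for each supplement and dose combination.
# group_means is an illustrative name, not part of the original analysis.
data(ToothGrowth)   # load a fresh copy from the datasets package

group_means <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
group_means         # one row per supplement/dose combination (6 rows)
```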

suppressPackageStartupMessages(library(dplyr, quietly=TRUE))
tooth<-arrange(ToothGrowth,supp)
OJ<-tooth[1:30,1]
VC<-tooth[31:60,1]
t.test(OJ,VC)

## 
##  Welch Two Sample t-test
## 
## data:  OJ and VC
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean of x mean of y 
##  20.66333  16.96333

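If we set the alpha level to 0.05 (an assumption), the p-value (0.06063) is larger than alpha, so we fail to reject the null hypothesis: the data show no significant difference in mean tooth length between orange juice and ascorbic acid. The 95% confidence interval for the difference in means (OJ minus VC) runs from -0.1710156 to 7.5710156 and contains 0, which is consistent with that conclusion.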
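
The same Welch test can be written with the formula interface of t.test(), which avoids sorting and indexing the rows by hand. Since the box plot suggested that the difference depends on the dose, a natural follow-up is one test per dose level. This is a sketch of how that could look, not an analysis carried out in this report; the names overall, doses and p_by_dose are my own, and running three tests would call for a multiple-comparison adjustment (for example with p.adjust()).

```r
# Sketch: Welch t-tests via the formula interface.
# Assumptions (not in the original): the names overall, doses, p_by_dose.
data(ToothGrowth)   # fresh copy, dose is still numeric here

# Overall comparison of the two supplements (OJ vs VC),
# equivalent to t.test(OJ, VC) above
overall <- t.test(len ~ supp, data = ToothGrowth)
overall$p.value

# One Welch test per dose level
doses <- sort(unique(ToothGrowth$dose))
p_by_dose <- sapply(doses, function(d) {
  t.test(len ~ supp, data = ToothGrowth[ToothGrowth$dose == d, ])$p.value
})
names(p_by_dose) <- doses
p_by_dose
```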