Overview

This file included 2 parts of exploration: 1. simulation on exponential distribution 2. ToothGrowth dataset


Part1

The goal of this simulation is to know the Central Limit Theorem by practice. Exponential distribution will be used in this simulation, it can be simulated in R with rexp(n,rate = lambda) where lambda is the rate parameter in the function. In this simulation, lambda will be set as 0.2, which means both the theoretical mean and standard deviation of the distribution will equal to 5, 1/lambda.

Let’s get started! First, set the number of simulation nosim to 1000, and make each sample size n to 40, and lambda to 0.2.

nosim <- 1000
n <- 40
lambda <- 0.2

Histogram of the Random Exponential Samples

draw the distribution of random exponential samples.

hist(rexp(nosim * n, rate = lambda),col = "lightblue", breaks = 40,
     main = "Histogram of the Simulations")

Simulations

simulation data will have 40 columns and 1000 rows, each row represents a simulation. calculate each simulation’s mean to meanbyrow. draw the distribution of 1000 averages of 40 random exponential. density curve for the distribution is fitted as the black line on plot.

simulation <- matrix(rexp(nosim * n, rate = lambda),nrow = nosim)
meanbyrow <- apply(simulation,1,mean)
hist(meanbyrow, xlim = c(0,10), ylim = c(0,0.7),
     breaks = 40, col = "pink",
     prob =T, xlab = "simulation sample mean", 
     main = "Distribution of the Simulation Means")
lines(density(meanbyrow),lwd = 2)

Sample Mean versus Theoretical Mean

The sample mean is 5.018, which nearly equals to the theoretical mean 5 of the distribution.

# sample mean
simulationmean <- mean(meanbyrow)
# theoretical mean 1/lambda
theoreticalmean <- 1/lambda

hist(meanbyrow, xlim = c(0,10), ylim = c(0,0.7),breaks = 40,
     col = "pink",
     prob =T, xlab = "simulation sample mean", 
     main = "Distribution of the Simulation Means")

abline(v = simulationmean, col = "black", lwd = 2)
abline(v = theoreticalmean, col = "red", lwd = 2,lty = 2)

legend(6,0.6,legend = c("simulation","theoretical"),
       col = c("black","red"),
       cex = 0.8,lwd = 2,lty = 1:2)

Sample Variance versus Theoretical Variance

variance of the sample is also nearly equal to the theoretical variance of the distribution.

# sample variance
(sd(simulation))^2
## [1] 24.94565
# theoretical variance (1/lambda)^2
(1/lambda)^2
## [1] 25

Central Limit Theorem

By CLT, we knows that in many situations, the standardized sample mean tends towards the standard normal distribution even if the original variables themselves are not normally distributed.

the black line is fitted density curve for the simulation. the pink line represented for the normal distribution X ~ N(5,1).

by comparing 2 lines, we can see that the distribution is approximately normal.

# Distribution of 1000 simulations
hist(meanbyrow, breaks = 50, xlim = c(0,10), ylim = c(0,0.6),prob =T,
     main = "Distribution of 1000 simulations")

# fitted density curve
lines(density(meanbyrow),lwd = 2)

# normal distribution
x <- seq(2,8,length=100)
curve(dnorm(x,5,1), lwd = 3, col = 2, add = T)

Part 2

In part2, we’re going to analyze the ToothGrowth data in the R datasets package.

ToothGrowth data exploration and summary

Load in ToothGrowth data. By str(), we can see 60 observations and 3 variable in the dataset. suppvariable contains OJ and VC, 2 kinds of supplement. dosevarible contains 0.5, 1, and 2, 3 kinds of dosage.

data("ToothGrowth")
str(ToothGrowth)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

Compare Tooth Growth by Supplement type

Independent t test is used for compare the means, alpha is set to 0.05.

# subset the data
OJ <- subset(ToothGrowth, ToothGrowth$supp == "OJ")
VC <- subset(ToothGrowth, ToothGrowth$supp == "VC")


t.test(OJ$len, VC$len, paired = F,alternative = "two.sided", conf.level = 0.95)
## 
##  Welch Two Sample t-test
## 
## data:  OJ$len and VC$len
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1710156  7.5710156
## sample estimates:
## mean of x mean of y 
##  20.66333  16.96333
# there is no true difference on length between two supp type

a visualization for how 2 groups look like is as the following plot, light blue represents for OJ, and light pink represents for VC.

# setting the color
c1 <- rgb(173,216,230,max = 255, alpha = 80, names = "lt.blue")
c2 <- rgb(255,192,203, max = 255, alpha = 80, names = "lt.pink")

# histogram
hist(OJ$len, prob =T, xlim = c(0,40), ylim = c(0,0.1), 
     col = c1, breaks = 10, main = "Histogram of Supplement type",
     xlab = "Tooth Length")
hist(VC$len, prob =T, add = T, col = c2, breaks = 15)
lines(density(OJ$len),lwd = 2, col = "lightblue")
lines(density(VC$len), lwd = 2, col = "lightpink")

legend(30,0.08,legend = c("OJ","VC"),
       col = c("lightblue","lightpink"),
       cex = 0.8,lwd = 2)

Compare Tooth Growth by Dosage

# subsetting dosage
dose2 <- subset(ToothGrowth,dose == 2)
dose1 <- subset(ToothGrowth,dose == 1)
dose0.5 <- subset(ToothGrowth,dose == 0.5)
# 2 v.s. 1
t1 <- t.test(dose2$len, dose1$len, alternative = "two.sided")
t1
## 
##  Welch Two Sample t-test
## 
## data:  dose2$len and dose1$len
## t = 4.9005, df = 37.101, p-value = 1.906e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  3.733519 8.996481
## sample estimates:
## mean of x mean of y 
##    26.100    19.735
# 2 v.s. 0.5
t2 <- t.test(dose2$len, dose0.5$len, alternative = "two.sided")
t2
## 
##  Welch Two Sample t-test
## 
## data:  dose2$len and dose0.5$len
## t = 11.799, df = 36.883, p-value = 4.398e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  12.83383 18.15617
## sample estimates:
## mean of x mean of y 
##    26.100    10.605
# 1 v.s. 0.5
t3 <- t.test(dose1$len, dose0.5$len, alternative = "two.sided")
t3
## 
##  Welch Two Sample t-test
## 
## data:  dose1$len and dose0.5$len
## t = 6.4766, df = 37.986, p-value = 1.268e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   6.276219 11.983781
## sample estimates:
## mean of x mean of y 
##    19.735    10.605

This is a plot for how 3 groups look like, light blue for 2, light pink for 1, and yellow for 0.5.

# setting the color
c1 <- rgb(173,216,230,max = 255, alpha = 80, names = "lt.blue")
c2 <- rgb(255,192,203, max = 255, alpha = 80, names = "lt.pink")
c3 <- rgb(255,255,0, max = 255, alpha = 80, names = "lt.yellow")

# histogram
hist(dose2$len, prob =T,col = c1, breaks = 10, 
     xlim = c(0,40),ylim = c(0,0.15),
     main = "Histogram of Dosage",
     xlab = "Tooth Length")
hist(dose1$len, prob =T, add = T, col = c2, breaks = 5)
hist(dose0.5$len, prob =T, add = T, col = c3, breaks = 10)

# density lines
lines(density(dose2$len),lwd = 2, col = "lightblue")
lines(density(dose1$len), lwd = 2, col = "lightpink")
lines(density(dose0.5$len), lwd = 2, col = "yellow")

legend(30,0.15,legend = c("2","1","0.5"),
       col = c("lightblue","lightpink","yellow"),
       cex = 0.8,lwd = 2)

Conclusions

The Hypothesis for comparing tooth length by supplement type is: H0: No tooth length differences between OJ and VC. MUoj == MUvc Ha: There are difference between OJ and VC MUoj ne MUvc

By the results of the t test, p-value = 0.06 > alpha, that we failed to reject the null hypothesis, which means no significant difference on tooth length between 2 supplement type.

For the comparison between dosage, after taking t test, all p-values are < 0.05, which means there are significant difference between dosage groups.

## # A tibble: 3 × 4
##   group    conf.low conf.high  p.value
##   <chr>       <dbl>     <dbl>    <dbl>
## 1 2 vs 1       3.73      9.00 1.91e- 5
## 2 2 vs 0.5    12.8      18.2  4.40e-14
## 3 1 vs 0.5     6.28     12.0  1.27e- 7