This file included 2 parts of exploration: 1. simulation on exponential distribution 2. ToothGrowth dataset
The goal of this simulation is to know the Central Limit Theorem by
practice. Exponential distribution will be used in this simulation, it
can be simulated in R with rexp(n,rate = lambda) where
lambda is the rate parameter in the function. In this simulation, lambda
will be set as 0.2, which means both the theoretical mean and standard
deviation of the distribution will equal to 5, 1/lambda.
Let’s get started! First, set the number of simulation
nosim to 1000, and make each sample size n to
40, and lambda to 0.2.
nosim <- 1000
n <- 40
lambda <- 0.2
draw the distribution of random exponential samples.
hist(rexp(nosim * n, rate = lambda),col = "lightblue", breaks = 40,
main = "Histogram of the Simulations")
simulation data will have 40 columns and 1000 rows, each
row represents a simulation. calculate each simulation’s mean to
meanbyrow. draw the distribution of 1000 averages of 40
random exponential. density curve for the distribution is fitted as the
black line on plot.
simulation <- matrix(rexp(nosim * n, rate = lambda),nrow = nosim)
meanbyrow <- apply(simulation,1,mean)
hist(meanbyrow, xlim = c(0,10), ylim = c(0,0.7),
breaks = 40, col = "pink",
prob =T, xlab = "simulation sample mean",
main = "Distribution of the Simulation Means")
lines(density(meanbyrow),lwd = 2)
The sample mean is 5.018, which nearly equals to the
theoretical mean 5 of the distribution.
# sample mean
simulationmean <- mean(meanbyrow)
# theoretical mean 1/lambda
theoreticalmean <- 1/lambda
hist(meanbyrow, xlim = c(0,10), ylim = c(0,0.7),breaks = 40,
col = "pink",
prob =T, xlab = "simulation sample mean",
main = "Distribution of the Simulation Means")
abline(v = simulationmean, col = "black", lwd = 2)
abline(v = theoreticalmean, col = "red", lwd = 2,lty = 2)
legend(6,0.6,legend = c("simulation","theoretical"),
col = c("black","red"),
cex = 0.8,lwd = 2,lty = 1:2)
variance of the sample is also nearly equal to the theoretical variance of the distribution.
# sample variance
(sd(simulation))^2
## [1] 24.94565
# theoretical variance (1/lambda)^2
(1/lambda)^2
## [1] 25
By CLT, we knows that in many situations, the standardized sample mean tends towards the standard normal distribution even if the original variables themselves are not normally distributed.
the black line is fitted density curve for the simulation. the pink
line represented for the normal distribution
X ~ N(5,1).
by comparing 2 lines, we can see that the distribution is approximately normal.
# Distribution of 1000 simulations
hist(meanbyrow, breaks = 50, xlim = c(0,10), ylim = c(0,0.6),prob =T,
main = "Distribution of 1000 simulations")
# fitted density curve
lines(density(meanbyrow),lwd = 2)
# normal distribution
x <- seq(2,8,length=100)
curve(dnorm(x,5,1), lwd = 3, col = 2, add = T)
In part2, we’re going to analyze the ToothGrowth data in the R datasets package.
Load in ToothGrowth data. By str(), we can see 60
observations and 3 variable in the dataset. suppvariable
contains OJ and VC, 2 kinds of supplement.
dosevarible contains 0.5, 1, and 2, 3 kinds of dosage.
data("ToothGrowth")
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
Independent t test is used for compare the means, alpha is set to 0.05.
# subset the data
OJ <- subset(ToothGrowth, ToothGrowth$supp == "OJ")
VC <- subset(ToothGrowth, ToothGrowth$supp == "VC")
t.test(OJ$len, VC$len, paired = F,alternative = "two.sided", conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: OJ$len and VC$len
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1710156 7.5710156
## sample estimates:
## mean of x mean of y
## 20.66333 16.96333
# there is no true difference on length between two supp type
a visualization for how 2 groups look like is as the following plot, light blue represents for OJ, and light pink represents for VC.
# setting the color
c1 <- rgb(173,216,230,max = 255, alpha = 80, names = "lt.blue")
c2 <- rgb(255,192,203, max = 255, alpha = 80, names = "lt.pink")
# histogram
hist(OJ$len, prob =T, xlim = c(0,40), ylim = c(0,0.1),
col = c1, breaks = 10, main = "Histogram of Supplement type",
xlab = "Tooth Length")
hist(VC$len, prob =T, add = T, col = c2, breaks = 15)
lines(density(OJ$len),lwd = 2, col = "lightblue")
lines(density(VC$len), lwd = 2, col = "lightpink")
legend(30,0.08,legend = c("OJ","VC"),
col = c("lightblue","lightpink"),
cex = 0.8,lwd = 2)
# subsetting dosage
dose2 <- subset(ToothGrowth,dose == 2)
dose1 <- subset(ToothGrowth,dose == 1)
dose0.5 <- subset(ToothGrowth,dose == 0.5)
# 2 v.s. 1
t1 <- t.test(dose2$len, dose1$len, alternative = "two.sided")
t1
##
## Welch Two Sample t-test
##
## data: dose2$len and dose1$len
## t = 4.9005, df = 37.101, p-value = 1.906e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.733519 8.996481
## sample estimates:
## mean of x mean of y
## 26.100 19.735
# 2 v.s. 0.5
t2 <- t.test(dose2$len, dose0.5$len, alternative = "two.sided")
t2
##
## Welch Two Sample t-test
##
## data: dose2$len and dose0.5$len
## t = 11.799, df = 36.883, p-value = 4.398e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 12.83383 18.15617
## sample estimates:
## mean of x mean of y
## 26.100 10.605
# 1 v.s. 0.5
t3 <- t.test(dose1$len, dose0.5$len, alternative = "two.sided")
t3
##
## Welch Two Sample t-test
##
## data: dose1$len and dose0.5$len
## t = 6.4766, df = 37.986, p-value = 1.268e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 6.276219 11.983781
## sample estimates:
## mean of x mean of y
## 19.735 10.605
This is a plot for how 3 groups look like, light blue for 2, light pink for 1, and yellow for 0.5.
# setting the color
c1 <- rgb(173,216,230,max = 255, alpha = 80, names = "lt.blue")
c2 <- rgb(255,192,203, max = 255, alpha = 80, names = "lt.pink")
c3 <- rgb(255,255,0, max = 255, alpha = 80, names = "lt.yellow")
# histogram
hist(dose2$len, prob =T,col = c1, breaks = 10,
xlim = c(0,40),ylim = c(0,0.15),
main = "Histogram of Dosage",
xlab = "Tooth Length")
hist(dose1$len, prob =T, add = T, col = c2, breaks = 5)
hist(dose0.5$len, prob =T, add = T, col = c3, breaks = 10)
# density lines
lines(density(dose2$len),lwd = 2, col = "lightblue")
lines(density(dose1$len), lwd = 2, col = "lightpink")
lines(density(dose0.5$len), lwd = 2, col = "yellow")
legend(30,0.15,legend = c("2","1","0.5"),
col = c("lightblue","lightpink","yellow"),
cex = 0.8,lwd = 2)
The Hypothesis for comparing tooth length by supplement type is: H0: No tooth length differences between OJ and VC. MUoj == MUvc Ha: There are difference between OJ and VC MUoj ne MUvc
By the results of the t test, p-value = 0.06 > alpha, that we failed to reject the null hypothesis, which means no significant difference on tooth length between 2 supplement type.
For the comparison between dosage, after taking t test, all p-values are < 0.05, which means there are significant difference between dosage groups.
## # A tibble: 3 × 4
## group conf.low conf.high p.value
## <chr> <dbl> <dbl> <dbl>
## 1 2 vs 1 3.73 9.00 1.91e- 5
## 2 2 vs 0.5 12.8 18.2 4.40e-14
## 3 1 vs 0.5 6.28 12.0 1.27e- 7