In Part 1, I investigate the exponential distribution through the mean, variance and distribution of its sample means. In Part 2, I do some basic inferential data analysis on the dataset “ToothGrowth” by applying t-tests.
1st, I assign values for the parameters in simulation
lambda <- 0.2
n <- 40
nosim <- 1000
2nd, I create a dataset with 1000 values, where each value is the mean of 40 exponentials
sampleExp <- matrix(rexp(n * nosim, rate = lambda), nosim)
sampleData <- rowMeans(sampleExp)
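Because rexp() draws random numbers, each run gives slightly different results. A minimal sketch, assuming a fixed seed is acceptable (the seed value 2024 is an arbitrary choice and not part of the original analysis), that makes the simulation reproducible and checks the layout of the matrix:

```r
# Assumption: the seed 2024 is arbitrary; any fixed value makes the run repeatable
set.seed(2024)

lambda <- 0.2   # rate of the exponential distribution
n <- 40         # number of exponentials averaged per simulation
nosim <- 1000   # number of simulations

# One row per simulation, one column per exponential draw
sampleExp <- matrix(rexp(n * nosim, rate = lambda), nrow = nosim, ncol = n)
dim(sampleExp)

# One mean per row, so one value per simulation
sampleData <- rowMeans(sampleExp)
length(sampleData)
```

Passing nrow and ncol explicitly is optional here, but it makes clear that rows are simulations and columns are draws.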
3rd, I calculate sample mean and theoretical mean
sample_mean <- mean(sampleData)
sample_mean
## [1] 5.001388
theoretical_mean <- 1 / lambda
theoretical_mean
## [1] 5
It shows that sample mean and theoretical mean are very close to each other.
##Sample Variance versus Theoretical Variance
4th, I calculate sample variance and theoretical variance
sample_variance <- var(sampleData)
sample_variance
## [1] 0.5861573
theoretical_variance <- (1 / lambda)^2 / (n)
theoretical_variance
## [1] 0.625
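The theoretical value can be traced back to the variance of a mean of independent draws. As a short worked derivation, assuming the 40 exponentials in each row are independent with rate $\lambda = 0.2$ (the symbols $X_i$ and $\bar{X}$ are notation introduced here, not taken from the original):

```latex
\begin{aligned}
\operatorname{Var}(X_i) &= \frac{1}{\lambda^2} = \frac{1}{0.2^2} = 25, \\
\operatorname{Var}(\bar{X})
  &= \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
   = \frac{1}{n^2}\cdot n\cdot\frac{1}{\lambda^2}
   = \frac{1}{n\lambda^2}
   = \frac{25}{40} = 0.625 .
\end{aligned}
```

Taking the square root gives $1/(\lambda\sqrt{n}) \approx 0.79$, the standard deviation of the mean of 40 exponentials.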
It shows that the sample variance and the theoretical variance are close to each other too.
5th, I calculate the sample standard deviation and the theoretical one
sample_sd <- sd(sampleData)
sample_sd
## [1] 0.7656091
theoretical_sd <- 1 / (lambda * sqrt(n))
theoretical_sd
## [1] 0.7905694
It shows that the 2 values are still close to each other.
##Distribution
6th, I plot the distribution of the means of exponentials to show that it is approximately normal. On the plot, I draw vertical lines at the sample mean and the theoretical mean to illustrate their difference, and I overlay two normal density curves, one using the sample mean and standard deviation and one using the theoretical values, to illustrate the difference in variance.
library(ggplot2)
Data <- data.frame(sampleData)
distribution <- ggplot(Data, aes(x = sampleData))
distribution <- distribution + geom_histogram(aes(y = ..density..), col = "black", fill = "pink")
distribution <- distribution + geom_vline(xintercept = c(sample_mean, theoretical_mean), col = c("red", "blue"), size = 1.5)
distribution <- distribution + stat_function(fun = dnorm, args = list(mean = sample_mean, sd = sample_sd), col = "red", size = 1.5)
distribution <- distribution + stat_function(fun = dnorm, args = list(mean = theoretical_mean, sd = theoretical_sd), col = "blue", size = 1.5)
distribution <- distribution + xlab("sample mean") + ylab("density") + ggtitle("Distribution of sample mean")
distribution
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
On the plot, the pink histogram shows the distribution of the sample means. The red vertical line marks the sample mean and the blue one marks the theoretical mean; the red and blue curves are the normal densities based on the sample and theoretical values respectively. Because the sample variance (0.586) is slightly smaller than the theoretical variance (0.625), the red curve is slightly narrower and taller than the blue one. The histogram follows both curves closely, so the distribution of the sample means is approximately normal in shape.
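The visual impression of normality can also be checked numerically. The following is only a sketch under stated assumptions: it reruns the simulation with an arbitrary seed (2024, not from the original analysis) and measures how many simulated means fall within 1.96 theoretical standard deviations of the theoretical mean. If the means are approximately normal, that share should be roughly 95%.

```r
# Assumption: the seed 2024 is arbitrary and only used to make this sketch repeatable
set.seed(2024)

lambda <- 0.2
n <- 40
nosim <- 1000

# Means of 40 exponentials, one per simulation
means <- rowMeans(matrix(rexp(n * nosim, rate = lambda), nosim))

mu <- 1 / lambda                 # theoretical mean
se <- 1 / (lambda * sqrt(n))     # theoretical standard deviation of the mean

# Share of simulated means inside the central 95% band of a normal distribution
inside <- abs(means - mu) <= qnorm(0.975) * se
coverage <- mean(inside)
coverage
```

A share far from 0.95 would suggest that the normal approximation is poor for this sample size.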
1st, I load the ToothGrowth data and perform some basic exploratory data analyses
library(datasets)
data(ToothGrowth)
head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
summary(ToothGrowth)
##       len        supp         dose
##  Min.   : 4.20   OJ:30   Min.   :0.500
##  1st Qu.:13.07   VC:30   1st Qu.:0.500
##  Median :19.25           Median :1.000
##  Mean   :18.81           Mean   :1.167
##  3rd Qu.:25.27           3rd Qu.:2.000
##  Max.   :33.90           Max.   :2.000
unique(ToothGrowth$supp)
## [1] VC OJ
## Levels: OJ VC
unique(ToothGrowth$dose)
## [1] 0.5 1.0 2.0
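It shows that supp has 2 types, “VC” and “OJ”, and dose has 3 values: “0.5”, “1.0” and “2.0”.
##2. Provide a basic summary of the data
2nd, I draw a bar chart of len by dose, separately for the 2 supplement types, as a basic summary of the data. Each bar stacks the lengths of all observations in its group, so its height is the total tooth length for that dose and supplement.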
basic_summary <- ggplot(aes(x = as.factor(dose), y = len, fill = supp), data = ToothGrowth)
basic_summary <- basic_summary + geom_bar(stat = "identity")
basic_summary <- basic_summary + facet_grid(.~ supp)
basic_summary <- basic_summary + xlab("dose") + ylab("tooth length") + ggtitle("Tooth length on Dose amount and Supplement type")
basic_summary
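Since geom_bar(stat = "identity") stacks the individual observations, the bar heights are sums rather than averages. As a complementary sketch (the names group_means and group_sizes are introduced here for illustration), a table of group means and group sizes can be computed with base R:

```r
library(datasets)

# Mean tooth length for every supplement/dose combination
group_means <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = mean)
group_means

# Number of observations behind each mean
group_sizes <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = length)
group_sizes
```

Means are easier to compare across groups than totals, and the group sizes show whether the design is balanced.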
3rd, I apply t-tests on len, grouped by supp and by dose, to test the null hypothesis that the group means are equal, that is, that supplement type (or dose) makes no difference to tooth length
##Apply t-test of len on supp, the whole dataset
t.test(len ~ supp, data = ToothGrowth)
##
## Welch Two Sample t-test
##
## data: len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1710156 7.5710156
## sample estimates:
## mean in group OJ mean in group VC
## 20.66333 16.96333
##Apply t-test of len on dose, sub-dataset when dose is 0.5 or 1.0
t.test(len~dose, data = subset(ToothGrowth, ToothGrowth$dose %in% c(0.5, 1.0)))
##
## Welch Two Sample t-test
##
## data: len by dose
## t = -6.4766, df = 37.986, p-value = 1.268e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.983781 -6.276219
## sample estimates:
## mean in group 0.5 mean in group 1
## 10.605 19.735
##Apply t-test of len on dose, sub-dataset when dose is 1.0 or 2.0
t.test(len~dose, data = subset(ToothGrowth, ToothGrowth$dose %in% c(1.0, 2.0)))
##
## Welch Two Sample t-test
##
## data: len by dose
## t = -4.9005, df = 37.101, p-value = 1.906e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -8.996481 -3.733519
## sample estimates:
## mean in group 1 mean in group 2
## 19.735 26.100
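The three tests above can also be collected in one table, which makes the decisions easier to compare. This is a minimal sketch rather than part of the original analysis: the list name tests, its element names, and the variable alpha are introduced here for illustration, and alpha = 0.05 is assumed as the significance level.

```r
library(datasets)

alpha <- 0.05  # assumed significance level

# The same three Welch t-tests as above, stored in a named list
tests <- list(
  supp       = t.test(len ~ supp, data = ToothGrowth),
  dose_0.5_1 = t.test(len ~ dose, data = subset(ToothGrowth, dose %in% c(0.5, 1.0))),
  dose_1_2   = t.test(len ~ dose, data = subset(ToothGrowth, dose %in% c(1.0, 2.0)))
)

# Extract the p-value of each test and compare it with alpha
p_values <- sapply(tests, function(tt) tt$p.value)
data.frame(p_value = signif(p_values, 4), reject_null = p_values < alpha)
```

The same pattern works for other components of the test result, such as tt$conf.int for the confidence interval.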
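It shows that p = 0.06 for the t-test on the whole dataset. This is not significant at the 5% level, so I cannot reject the null hypothesis that the two supplements have the same effect. Then I apply the t-test to the sub-datasets with dose = (0.5, 1.0) and dose = (1.0, 2.0) to find out whether an increase in dose has an effect on tooth length. Both p-values are close to 0, that is to say p << .05, and both confidence intervals lie entirely below 0, so I can reject the null hypothesis and say that an increase in dose has a positive effect on tooth growth.
##4. State your conclusions and the assumptions needed for your conclusions
1. Across all doses, the type of supplement has no statistically significant effect on tooth growth at the 5% level.
2. An increase in dosage has a positive effect on tooth growth.
These conclusions assume that the observations are independent, that the groups are unpaired, and that tooth length is approximately normally distributed within each group. Equal variances are not assumed, because t.test uses the Welch test by default.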