In this report we do 2 things. Firstly we do a simulation exercise where we explore the exponential distribution and find how the sample mean is cenitered around the true mean and the sample variance is close to the real variance.
In the second part we look at the R dataset TeethGrowth which compares the growth of the teeth of 60 guinea pigs who were given different doses of different Vitamin C supplements.
In this forst part we’re going to work with the exponential distribution f(x, lambda) where f(x) = lambda * exp(-lambda*x) when x >= 0 and f(x) = 0 otherwise. We’ll fix lambda at the level 0.2. Lets look at the densitiy and distribution functions. Note that the mean in an exponentially distributed variable is given by 1/lambda, 5 in our case of lambda = 0.2. We’ll denote this by a blue line in the plot.
lambda <- 0.2
par(mfrow = c(1,2))
curve(exp(-lambda*x), from=0, to = 25, main = "density function")
abline(v=c(1/lambda), col = "blue")
curve(1-exp(-lambda*x), from=0, to = 25, main = "cumulative distribution function")
abline(v=c(1/lambda), col = "blue")
Now to the fun part. Lets calculate 1000 medians. Each will be calculated from 40 simulated exponentially distributed random variables. Since we’ll want this to be reproducible, we’ll set the seed before creating random numbers.
set.seed(42)
means <- NULL
for (i in 1:1000) means <- c(means, mean(rexp(40,lambda)))
mean(means)
## [1] 4.986508
As we can see the mean of those 1000 means is quite near to 5.
Lets plot a histogram to see how the means are distributed. As we know the true mean is 1/lambda = 5, we’ll mark this in blue
hist(means, breaks = 25)
abline(v=5, col="blue")
We can cleary see that the sample means are centered around 5.
For the exponential distribution the standard deviation is, same as the mean 1/lambda, thus the variance is 1/lambda^2, in our case that means the true variance be 1/0.04 which equals 25. Baiscally we’re going to do the same as before. We’ll create 1000 times 40 random variables which are distributed exponentially with lambda = 0.2. Then we’re goint to take the median of this sample and expect to see a value close to 25. Note that we set the seed to the same value as before, such that the sample will be the same as before.
set.seed(42)
vars <- NULL
for (i in 1:1000) vars <- c(vars, var(rexp(40,lambda)))
mean(vars)
## [1] 25.16591
Great, we can see that the difference to the real variance is very small. But since pictures say more then 1000 words, let’s look at a visualisation. We’ll mark the real variance of 25 in red.
hist(vars, breaks = 25)
abline(v=25, col="red")
Recall that histogramm from before? We’ll plot it again, this time we set the paramter prob to TRUE so that we can overlay the histogramm with the density function in green,
hist(means, breaks = 25, prob = TRUE, main = "Histogram and density of means")
lines(density(means), col = "green")
See, how the green curve resembles the curve of a normal distribution.
In this second part we’ll look at the ToothGrowth data from r. First we load the data and let r give us a short summary
data("ToothGrowth")
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
The description given in the R Documentation to this dataset is:
The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice or ascorbic acid (a form of vitamin C and coded as VC).
First let’s look at the whole dataset
summary(ToothGrowth)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
we see that half of the guinea pigs recieved the orange juice and the other half recieved the ascorbic acid. We also can deduce that each dose (0.5 mg/day, 1mg/day and 1.5 mg/day) was given to 20 guinea pigs. Lets split the dataset by this factor and look at the summaries.
ToothGrowth %>% group_by(supp) %>% summarise(min(len), median(len), mean(len), max(len))
## # A tibble: 2 x 5
## supp `min(len)` `median(len)` `mean(len)` `max(len)`
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 OJ 8.2 22.7 20.7 30.9
## 2 VC 4.2 16.5 17.0 33.9
The variable dose is denoted as numeric, but we can also view it as a factor since there are just 3 levels each with 20 observations. So compute the same code as before, but this time we group the data by dose instead of supp.
ToothGrowth %>% group_by(dose) %>% summarise(min(len), median(len), mean(len), max(len))
## # A tibble: 3 x 5
## dose `min(len)` `median(len)` `mean(len)` `max(len)`
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.5 4.2 9.85 10.6 21.5
## 2 1 13.6 19.2 19.7 27.3
## 3 2 18.5 26.0 26.1 33.9
From the summary above we could think that the mean tooth Grwoth for guinea pigs who recieved orange juice is bigger than the growth of guinea pugs who recieved ascorbic acid. Let’s look at a plot.
color <- c("orange", "blue")[ToothGrowth$supp]
plot(ToothGrowth$len, col = color, xlab = "", ylab="length")
legend("bottomright", legend = c("Orange Juice, OJ", "Ascorbic Acid, VC"), col = c("orange", "blue"), pch = 1)
Yeah. After seeing this plot we’re not so sure anymore, so lets use a two sided test to decide if the means are different at all. We define our hypothesis
H_0 : mu_OJ = mu_VC
H_1 : mu_OJ != mu_VC
We have 30 observations each. We could use a z-test, but let’s stick with the t-test. Firs we’re going to get the values from the table and then we’re going to compute a one-sided t-test with alpha = 0.05.
OJ <- filter(ToothGrowth, supp == "OJ") %>% select(len)
VC <- filter(ToothGrowth, supp == "VC") %>% select(len)
t.test(OJ, VC, alternative = "two.sided", paired = FALSE, conf.level = 0.95)
##
## Welch Two Sample t-test
##
## data: OJ and VC
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1710156 7.5710156
## sample estimates:
## mean of x mean of y
## 20.66333 16.96333
Since 0 is in the interval we can’t reject our Null-hypothesis. We could have seen this also by the p-value which is around 0.6, and thus greater than our specified alpha = 0.05.
From the summary we could deduce that the bigger the dose, the greater the growth. But as we have seen in the example above those numbers can be decieving. So lets look at a plot first.
color <- c("red", "green", "blue")[as.factor(ToothGrowth$dose)]
plot(ToothGrowth$len, col = color, xlab = "", ylab="length")
legend("bottomright", legend = c("0.5 mg/day", "1 mg/day", "2 mg/day"),pch=1, col = c("red", "green", "blue"))
That looks straightforward. Since we have 3 samples, we’ll test 2 hypothsises. Firstly we’re going to compare the growth of the pigs who got the 1mg/day to those who got the 0.5 mg per day. Then the 1 mg/day to the 2 mg per day. If we can accept both these hypothsis, then we can automatically deduce, that the growth with 2mg/day is bigger than with 0.5 mg/day (with security of 95%). We’ll be conservative and use the bonferroni correction und thus select alpha = 0.025 for both tests.
H_0 : mu_1 <= mu_0.5
H_1 : mu_1 > mu_0.5
point5 <- filter(ToothGrowth, dose == 0.5) %>% select(len)
one <- filter(ToothGrowth, dose == 1) %>% select(len)
t.test(one, point5, alternative = "greater" , paired = FALSE, conf.level = 0.975)
##
## Welch Two Sample t-test
##
## data: one and point5
## t = 6.4766, df = 37.986, p-value = 6.342e-08
## alternative hypothesis: true difference in means is greater than 0
## 97.5 percent confidence interval:
## 6.276219 Inf
## sample estimates:
## mean of x mean of y
## 19.735 10.605
As we thought, straightforward, With a p-value this small (6.3e-8) and 0 not in the confidence interval for the difference we can positively reject the Nullhypothesis and say that the growth with 1 mg/day is greater than with 0.5 mg per day.
H_0 : mu_2 <= mu_1
H_1 : mu_2 > mu_1
two <- filter(ToothGrowth, dose == 2) %>% select(len)
t.test(two, one, alternative = "greater" , paired = FALSE, conf.level = 0.975)
##
## Welch Two Sample t-test
##
## data: two and one
## t = 4.9005, df = 37.101, p-value = 9.532e-06
## alternative hypothesis: true difference in means is greater than 0
## 97.5 percent confidence interval:
## 3.733519 Inf
## sample estimates:
## mean of x mean of y
## 26.100 19.735
Again we have a positively small p-Value and 0 not in the confidence interval. So again, we reject the Null hypothesis and say, that with 2 mg/day the guineapig’s teeth grow more than with 1 mg per day.
In the above tests we saw, that the guinea pig’s teeth grew more depending on how much of the Vitamin C they got. It seems that the more Vitamin the more growth. What we don’t know, and can’t deduce from the given data, is where the limit of this correlation is. Since the teeth won’t get infinetly long, even when the pigs get loads of vitamin. We also saw that there is not significant difference in wether the vitamin was given in the form of orange juice or ascorbic acid.
Since we conducted both tests on the same sample and both tests had alpha = 0.05, our conclusions include a type I - error probability of 10%.