In this project, the exponential distribution has been investigated in R and compared with the predictions of the Central Limit Theorem. The exponential distribution can be simulated in R with rexp(n, lambda), where lambda is the rate parameter. The mean of the exponential distribution is 1/lambda and the standard deviation is also 1/lambda. lambda has been set to 0.2 for all of the simulations. The distribution of averages of 40 exponentials has been investigated.
In the second part, the ToothGrowth data in the R datasets package has been analyzed.
Two packages are needed: ggplot2 and datasets, loaded with library(ggplot2) and library(datasets).
The following notation is used, as stated in the synopsis:
- lambda = 0.2
- n = 40 (the number of exponentials per average)
- simulations = 1000
- mean = standard deviation = 1/lambda = 5
lambda <- 0.2
n <- 40
simulations <- 1000
theoretical_sd <- 1/lambda #renamed from sd so that the name of the base sd() function is not reused
set.seed(20) #needed for reproducibility
Then simulations have been run:
rexp.simulations <- rexp(simulations*n, rate=lambda)
The sample mean and the theoretical mean have been calculated as below. The preplotting data frame and the base graph have also been prepared for the final plot:
sample_mean <- mean(rexp.simulations)
theoretical_mean <- 1/lambda
new_data <- replicate(simulations, mean(rexp(n, lambda))) #1000 averages of 40 exponentials, same draws as the loop
preplotting <- data.frame(header=c("Sample mean", "Theoretical mean"), values=c(sample_mean, theoretical_mean))
graph <- ggplot(NULL, aes(x=new_data))
Finally, we got:
- sample_mean: 4.963828
- theoretical_mean: 5
The two means are very close, which is expected: sample_mean is computed over 40,000 simulated values (simulations * n), so by the law of large numbers it should lie near 1/lambda.
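The sample mean above is taken over all 40,000 raw draws, while the histogram further down uses the 1,000 averages stored in new_data, which come from separate draws. As a minimal sketch, the same draws can be arranged in a matrix so that both views come from one set of numbers. The names draws and row_means are illustrative assumptions and are not part of the original analysis, and the values will not match the figures reported here exactly.

```r
# Minimal sketch: one set of draws, viewed as raw values and as averages of 40.
lambda <- 0.2
n <- 40
simulations <- 1000
set.seed(20)

# Each row holds one simulated sample of n exponentials (hypothetical layout).
draws <- matrix(rexp(simulations * n, rate = lambda), nrow = simulations)

# One average per simulated sample.
row_means <- rowMeans(draws)

# Because every row has the same size, the mean of the averages
# equals the mean of all raw draws.
mean(row_means)
mean(draws)
1 / lambda  # theoretical mean
```

This layout also avoids growing a vector inside a loop, which matters little for 1,000 runs but scales better for larger simulations.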
The sample variance and the theoretical variance of the averages have been calculated as below:
sample_variance <- var(new_data) #variance of the 1000 simulated averages, identical to sd(new_data)^2
theoretical_variance <- ((1/lambda)^2)/n
Finally, we got:
- sample_variance: 0.5955671
- theoretical_variance: 0.625
The two variances are not identical, but they are close: the sample variance is within about 5% of the theoretical value.
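The theoretical variance used above follows from the variance of an average of independent draws. As a short worked derivation, assuming the 40 exponentials X_1, ..., X_n in each average are independent with common variance sigma^2 = (1/lambda)^2:

```latex
\operatorname{Var}(\bar{X})
  = \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
  = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
  = \frac{\sigma^2}{n}
  = \frac{(1/\lambda)^2}{n}
  = \frac{25}{40}
  = 0.625
```

The corresponding standard deviation of the averages is sqrt(0.625), roughly 0.79, which is much smaller than the standard deviation of 5 for a single exponential.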
Following the ‘motivating example’ from the assignment, we can create a plot to observe the distribution of the averages together with both means:
final_graph_1 <- graph + geom_histogram(aes(y=after_stat(density)), color="green", fill="white", binwidth=0.1) + geom_density(color="blue") #after_stat(density) replaces the deprecated ..density.. notation (needs ggplot2 >= 3.3.0)
final_graph_1 <- final_graph_1 + geom_vline(data=preplotting, aes(xintercept=values, linetype=header, color=header), show.legend=TRUE)
final_graph_1 <- final_graph_1 + theme(legend.title=element_blank()) + labs(title = "Sample and theoretical means comparison", x ="Data", y = "Density")
final_graph_1
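The density looks approximately normal, as the Central Limit Theorem predicts for averages of 40 exponentials. Increasing n would bring the distribution of the averages closer to a normal distribution; increasing simulations would not change its shape, it would only make the histogram a smoother estimate of it.
A Q-Q plot is another way to support this conclusion: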
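The role of n can be illustrated with a small experiment. The sketch below measures the skewness of the simulated averages for several sample sizes; a normal distribution has skewness 0, and the average of n exponentials has theoretical skewness 2/sqrt(n). The helper names skewness and simulate_means, the sample sizes, and the use of 5,000 runs are assumptions made for this illustration only.

```r
# Minimal sketch: skewness of simulated averages for several sample sizes.
lambda <- 0.2
simulations <- 5000  # assumption: more runs than in the report, for stabler estimates
set.seed(20)

# Sample skewness (third standardized moment); 0 for a symmetric distribution.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Averages of `size` exponentials, repeated `simulations` times.
simulate_means <- function(size) {
  replicate(simulations, mean(rexp(size, rate = lambda)))
}

sizes <- c(5, 40, 200)
skew_by_size <- sapply(sizes, function(size) skewness(simulate_means(size)))
names(skew_by_size) <- paste0("n=", sizes)

skew_by_size     # observed skewness of the averages
2 / sqrt(sizes)  # theoretical skewness of the average of n exponentials
```

The skewness shrinks as n grows, which is the sense in which the averages become "more normal"; changing simulations only changes how precisely that skewness is estimated.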
qqnorm(new_data)
qqline(new_data)
As can be seen, the points lie almost perfectly along the straight line, which indicates that the distribution is approximately normal.
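The Q-Q plot can be complemented with a numerical comparison of quantiles. The sketch below regenerates the averages (so the values differ slightly from those in this report) and sets a few of their quantiles next to those of a normal distribution with mean 1/lambda and standard deviation (1/lambda)/sqrt(n). The probabilities chosen and the name comparison are assumptions for illustration.

```r
# Minimal sketch: simulated quantiles of the averages vs. normal quantiles.
lambda <- 0.2
n <- 40
simulations <- 1000
set.seed(20)

# Regenerated here so the block is self-contained.
new_data <- replicate(simulations, mean(rexp(n, rate = lambda)))

probs <- c(0.025, 0.25, 0.5, 0.75, 0.975)
comparison <- data.frame(
  prob = probs,
  simulated = unname(quantile(new_data, probs)),
  normal = qnorm(probs, mean = 1 / lambda, sd = (1 / lambda) / sqrt(n))
)
comparison$difference <- comparison$simulated - comparison$normal
comparison
```

Small differences across the table correspond to points lying close to the line in the Q-Q plot; larger differences in the tails would show up as points bending away from it.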
First we load the ToothGrowth data as below:
data(ToothGrowth)
We are interested in three variables:
- len: tooth length
- supp: supplement type (OJ or VC)
- dose: dosage in mg per day
The dose takes only the values 0.5, 1 and 2, so summary statistics such as its mean and median are not meaningful. Hence, we convert dose to a factor with as.factor as below:
ToothGrowth$dose <- as.factor(ToothGrowth$dose)
We can finally summarize the dataset:
summary(ToothGrowth)
##       len        supp     dose
##  Min.   : 4.20   OJ:30   0.5:20
##  1st Qu.:13.07   VC:30   1  :20
##  Median :19.25           2  :20
##  Mean   :18.81
##  3rd Qu.:25.27
##  Max.   :33.90
The dataset is small (60 observations), so box plots give a convenient visual comparison:
final_graph_2 <- ggplot(data = ToothGrowth, aes(dose, len)) + geom_boxplot(mapping=aes(group=dose)) + facet_wrap(~ supp, nrow=1)
final_graph_2
We use the Welch two-sample t-test to assess the impact of the supplement type and of the dose on tooth length:
t.test(len~supp, data=ToothGrowth)
##
## Welch Two Sample t-test
##
## data: len by supp
## t = 1.9153, df = 55.309, p-value = 0.06063
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1710156 7.5710156
## sample estimates:
## mean in group OJ mean in group VC
## 20.66333 16.96333
The p-value is 0.061, which is higher than 0.05, so we fail to reject the null hypothesis at the 5% level; consistently, the 95% confidence interval (-0.17, 7.57) contains 0. In other words, this test does not provide sufficient evidence that the supplement type affects tooth length.
Because dose is now a factor, the dataset is reloaded to restore the numeric dose column:
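The decision above can also be read directly from the object returned by t.test rather than from the printed output. The following is a minimal sketch; the names res, ci_contains_zero and reject_null are illustrative assumptions, and 0.05 is the significance level used in this report.

```r
# Minimal sketch: extract the p-value and confidence interval programmatically.
data("ToothGrowth", package = "datasets")

res <- t.test(len ~ supp, data = ToothGrowth)  # Welch test by default

res$p.value   # p-value of the test
res$conf.int  # 95% confidence interval for the difference in means (OJ - VC)

# The two decision rules agree: p >= 0.05 exactly when the 95% interval contains 0.
ci_contains_zero <- res$conf.int[1] < 0 && res$conf.int[2] > 0
reject_null <- res$p.value < 0.05
ci_contains_zero
reject_null
```

Working with the returned object makes it easier to reuse the numbers in the text or in later comparisons without copying them by hand.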
data(ToothGrowth)
t.test(ToothGrowth$len, ToothGrowth$dose)
##
## Welch Two Sample t-test
##
## data: ToothGrowth$len and ToothGrowth$dose
## t = 17.81, df = 59.798, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 15.66453 19.62881
## sample estimates:
## mean of x mean of y
## 18.813333 1.166667
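The p-value is lower than 2.2e-16, far below 0.05, hence the null hypothesis can be rejected. Note, however, that this call compares the mean of len (18.81) with the mean of the dose values themselves (1.17), two quantities measured on different scales. The rejection therefore does not by itself show that the dose affects tooth length; that conclusion requires comparing len between the dose groups.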
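To compare tooth length between dose levels directly, one option is a Welch t-test for each pair of doses. The sketch below is one possible way to do this and is not the test that was run above; the names dose_pairs and p_values are assumptions, and the Bonferroni adjustment is an added precaution for running three tests.

```r
# Minimal sketch: Welch t-tests of len between each pair of dose levels.
data("ToothGrowth", package = "datasets")  # fresh copy with numeric dose

# Mean tooth length at each dose, for orientation.
tapply(ToothGrowth$len, ToothGrowth$dose, mean)

dose_pairs <- list(c(0.5, 1), c(1, 2), c(0.5, 2))

# One p-value per pair; each subset contains exactly two dose levels.
p_values <- sapply(dose_pairs, function(pair) {
  subset_data <- ToothGrowth[ToothGrowth$dose %in% pair, ]
  t.test(len ~ dose, data = subset_data)$p.value
})
names(p_values) <- sapply(dose_pairs, paste, collapse = " vs ")

p_values                                   # raw p-values
p.adjust(p_values, method = "bonferroni")  # adjusted for three comparisons
```

If every adjusted p-value is below 0.05, each increase in dose is associated with a significant difference in mean tooth length, which is the kind of evidence the conclusion about dose needs.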