The purpose of this section is to investigate the exponential
distribution in R and to compare it with the Central Limit Theorem
(CLT). The exponential distribution can be simulated in R with the
command rexp(n, lambda) where n is the number
of simulations and and lambda is the rate parameter. From
the theory, it is known that the mean of an exponential random variable
\(X\) is \(E(X) = 1/\lambda\) and the variance is
\(\operatorname{Var}(X) = 1 /
\lambda^2\). For the following simulation, we shall take \(\lambda = 0.2\) and we shall investigate
the distribution of the average of 40 exponentials. In particular we
shall:
Before we start, let us load the libraries that we shall need throughout.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
First of all we need to simulate the data: the code below initialises
the seed to 1, then creates a data frame with \(n = 40\) rows and \(\texttt{nosim} = 1000\) columns in which
every row represents \(n = 1000\)
simulations of an exponential random variable with \(\lambda = 2\).
set.seed(1)
nosim <- 1000
n <- 40
lambda <- 0.2
ExpSim <- data.frame(matrix(rexp(n * nosim, rate = lambda), nosim, n))
Once the data frame is created, we can consider the data frame
meanSim with the sample mean of each row from the data
frame ExpSim. The sample mean is therefore
sampleMean and the theoretical mean, \(1/\lambda\), is stored in the variable
theoryMean. It follows a t-test to check whether these two
means are compatible.
meanSim <- data.frame(means = apply(ExpSim, 1, mean))
sampleMean <- mean(meanSim$means)
theoryMean <- 1 / lambda
t.test(meanSim$means, mu = theoryMean)
##
## One Sample t-test
##
## data: meanSim$means
## t = -0.40134, df = 999, p-value = 0.6883
## alternative hypothesis: true mean is not equal to 5
## 95 percent confidence interval:
## 4.941254 5.038797
## sample estimates:
## mean of x
## 4.990025
Since the p-value of the test is 0.688, the null hypothesis cannot be rejected, i.e. the two sample mean and the theoretical mean are compatible. Let us also see this comparison in a graph. The dashed line in the graph represents the theoretical mean, \(\lambda = 5\).
ggplot(data = meanSim, aes(x = means)) +
geom_histogram(binwidth = 0.2, colour = "black", fill = "pink") +
geom_vline(xintercept = theoryMean, linewidth = 1, linetype = "dashed") +
ggtitle("The distribution of 1000 averages of 40 Exponential RVs") +
xlab("Value of the means") + ylab("Frequency") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
A similar approach is taken for the variance: in this case we shall only store the two quantities into two variables and compare them ‘by eye’.
varSim <- data.frame(vars = apply(ExpSim, 1, var))
sampleVar <- sum(varSim$vars) / (length(varSim$vars) - 1)
theoryVar <- (1 / lambda) ^ 2
print(sampleVar)
## [1] 25.08292
print(theoryVar)
## [1] 25
At first glance, we notice how the theoretical value, 25, and the sample variance, 25.08, are indeed very close. Let us also see this in a graph. The dashed line in the graph represents the theoretical variance, \(\lambda^2 = 25\).
ggplot(data = varSim, aes(x = vars)) +
geom_histogram(binwidth = 3, colour = "black", fill = "lightskyblue") +
geom_vline(xintercept = theoryVar, linewidth = 1, linetype = "dashed") +
ggtitle("The distribution of 1000 variance of 40 Exponential RVs") +
xlab("Variance") + ylab("Frequency") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
As final part of the activity, we wish to show that the sample mean is normally distributed. We can plot the histogram of the sample means the \(n = 40\) groups of simulations, overlaying the normal distribution with mean 4.99 and standard deviation 0.786. In the graph, the vertical dashed line represents the mean of the distribution.
ggplot(data = data.frame(meanSim), aes(x = means)) +
geom_histogram(aes(y = after_stat(density)),
colour = "black", fill = "lavenderblush",
binwidth = 0.25) +
stat_function(fun = dnorm, args = list(mean = mean(meanSim$means),
sd = sd(meanSim$means)),
colour = "purple", linewidth = 1) +
geom_vline(xintercept = theoryMean, linewidth = 1, linetype = "dashed") +
xlab("Means") + ylab("Density") +
ggtitle("Comparison between Sample Mean of Data and Normal Distribution") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
The second part of this work analyses the data frame
ToothGrowth in R providing a basic data summary and
comparing the tooth growth by supplement given and dosage.
Let us start by loading the data frame and check the variables involved.
data("ToothGrowth")
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
Let us also check the head and the tail of the data frame:
head(ToothGrowth)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
tail(ToothGrowth)
## len supp dose
## 55 24.8 OJ 2
## 56 30.9 OJ 2
## 57 26.4 OJ 2
## 58 27.3 OJ 2
## 59 29.4 OJ 2
## 60 23.0 OJ 2
The summary of the data can be obtained simply taking:
summary(ToothGrowth)
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
In order to understand the relationship between the two different supplements and the tooth growth, let us start by analysing a boxplot of the tooth length by supplement.
ggplot(data = ToothGrowth) +
geom_boxplot(aes(x = supp, colour = supp, y = len)) +
xlab("Supplement") +
ylab("Tooth Length") +
ggtitle("Boxplot of Tooth Length by Supplement") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
At a glance, it looks like the OJ supplement has a slightly better effect than VC. Let us also check the sample relationship by dosage:
ggplot(data = ToothGrowth) +
geom_boxplot(aes(x = supp, colour = supp, y = len)) +
facet_grid(. ~ dose) +
xlab("Supplement") +
ylab("Tooth Length") +
ggtitle("Boxplot of Tooth Length by Supplement and Dose") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
This second plot shows that for low and medium dosage (0.5 and 1), supplement OJ has bigger effect than VC on tooth length.
Let us prove this rigorously with the help of the t-test. To this
end, we will need to filter the data set by dose This can be done
effectively with the aid of the package dplyr.
lowDosage <- filter(ToothGrowth, dose == 0.5)
medDosage <- filter(ToothGrowth, dose == 1)
highDosage <- filter(ToothGrowth, dose == 2)
Let us now check the t-test for the whole data set and for the data sets filtered by dosage.
## t.test for overall group
totTest <- t.test(len ~ supp, paired = FALSE, var.equal = FALSE, data = ToothGrowth)
## t.test for small dosage
lowTest <- t.test(len ~ supp, paired = FALSE, var.equal = FALSE, data = lowDosage)
## t.test for med dosage
medTest <- t.test(len ~ supp, paired = FALSE, var.equal = FALSE, data = medDosage)
## t.test for high dosage
highTest <- t.test(len ~ supp, paired = FALSE, var.equal = FALSE, data = highDosage)
Let us store the p-values of the tests carried out into a variable
pValues and with check whether the test allows to reject
the null hypothesis \(H_0\): the
two supplements have identical effect.
pValues <- c(totTest$p.value, lowTest$p.value, medTest$p.value, highTest$p.value)
round(pValues, 3)
## [1] 0.061 0.006 0.001 0.964
names(pValues) <- c("tot", "low", "med", "high")
pValues < 0.05
## tot low med high
## FALSE TRUE TRUE FALSE
As the code shows, the null hypothesis,
For the overall study, with a p-value of 0.061, there is not enough evidence to reject the null hypothesis.
For low and medium dose, there is actually enough evidence to reject the null hypothesis and infer that the effect of OJ is bigger than the effect of VC in the tooth length.
For the high dose, there is not enough evidence to reject the null hypothesis.
In the analysis, we have assumed that the overall population from which the sample was taken is at least approximately normally distributed, and that the population is also normally distributed with respect under each dose. We have also assumed that the sample is taken randomly and it represents the whole population.
In conclusion, we have enough evidence to reject the null hypothesis for low and medium dose, and we may infer that, for low and medium dosage, the supplement OJ has a bigger effect on tooth growth. However, in general, the overall effect is not statistically significant, as shown by the result of the test on the overall sample, at least at 5% significance level.