Part 1: Simulation of Exponential Distribution

Overview

The purpose of this section is to investigate the exponential distribution in R and to compare it with the Central Limit Theorem (CLT). The exponential distribution can be simulated in R with the command rexp(n, lambda) where n is the number of simulations and and lambda is the rate parameter. From the theory, it is known that the mean of an exponential random variable \(X\) is \(E(X) = 1/\lambda\) and the variance is \(\operatorname{Var}(X) = 1 / \lambda^2\). For the following simulation, we shall take \(\lambda = 0.2\) and we shall investigate the distribution of the average of 40 exponentials. In particular we shall:

  1. Show the sample mean and compare it to the theoretical mean of the distribution.
  2. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.
  3. Show that the distribution is approximately normal.

Preliminary actions

Before we start, let us load the libraries that we shall need throughout.

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Simulation

First of all we need to simulate the data: the code below initialises the seed to 1, then creates a data frame with \(n = 40\) rows and \(\texttt{nosim} = 1000\) columns in which every row represents \(n = 1000\) simulations of an exponential random variable with \(\lambda = 2\).

set.seed(1)
nosim <- 1000
n <- 40
lambda <- 0.2

ExpSim <- data.frame(matrix(rexp(n * nosim, rate = lambda), nosim, n))

Sample mean against theoretical mean

Once the data frame is created, we can consider the data frame meanSim with the sample mean of each row from the data frame ExpSim. The sample mean is therefore sampleMean and the theoretical mean, \(1/\lambda\), is stored in the variable theoryMean. It follows a t-test to check whether these two means are compatible.

meanSim <- data.frame(means = apply(ExpSim, 1, mean))
sampleMean <- mean(meanSim$means)
theoryMean <- 1 / lambda
t.test(meanSim$means, mu = theoryMean)
## 
##  One Sample t-test
## 
## data:  meanSim$means
## t = -0.40134, df = 999, p-value = 0.6883
## alternative hypothesis: true mean is not equal to 5
## 95 percent confidence interval:
##  4.941254 5.038797
## sample estimates:
## mean of x 
##  4.990025

Since the p-value of the test is 0.688, the null hypothesis cannot be rejected, i.e. the two sample mean and the theoretical mean are compatible. Let us also see this comparison in a graph. The dashed line in the graph represents the theoretical mean, \(\lambda = 5\).

ggplot(data = meanSim, aes(x = means)) + 
    geom_histogram(binwidth = 0.2, colour = "black", fill = "pink") +
    geom_vline(xintercept = theoryMean, linewidth = 1, linetype = "dashed") +
    ggtitle("The distribution of 1000 averages of 40 Exponential RVs") +
    xlab("Value of the means") + ylab("Frequency") +
    theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Sample variance against theoretical variance

A similar approach is taken for the variance: in this case we shall only store the two quantities into two variables and compare them ‘by eye’.

varSim <- data.frame(vars = apply(ExpSim, 1, var))
sampleVar <- sum(varSim$vars) / (length(varSim$vars) - 1)
theoryVar <- (1 / lambda) ^ 2
print(sampleVar)
## [1] 25.08292
print(theoryVar)
## [1] 25

At first glance, we notice how the theoretical value, 25, and the sample variance, 25.08, are indeed very close. Let us also see this in a graph. The dashed line in the graph represents the theoretical variance, \(\lambda^2 = 25\).

ggplot(data = varSim, aes(x = vars)) + 
    geom_histogram(binwidth = 3, colour = "black", fill = "lightskyblue") +
    geom_vline(xintercept = theoryVar, linewidth = 1, linetype = "dashed") +
    ggtitle("The distribution of 1000 variance of 40 Exponential RVs") +
    xlab("Variance") + ylab("Frequency") +
    theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Distribution

As final part of the activity, we wish to show that the sample mean is normally distributed. We can plot the histogram of the sample means the \(n = 40\) groups of simulations, overlaying the normal distribution with mean 4.99 and standard deviation 0.786. In the graph, the vertical dashed line represents the mean of the distribution.

ggplot(data = data.frame(meanSim), aes(x = means)) + 
    geom_histogram(aes(y = after_stat(density)), 
                   colour = "black", fill = "lavenderblush", 
                   binwidth = 0.25) +
    stat_function(fun = dnorm, args = list(mean = mean(meanSim$means), 
                                           sd = sd(meanSim$means)),
                  colour = "purple", linewidth = 1) +
    geom_vline(xintercept = theoryMean, linewidth = 1, linetype = "dashed") +
    xlab("Means") + ylab("Density") +
    ggtitle("Comparison between Sample Mean of Data and Normal Distribution") +
    theme(plot.title = element_text(hjust = 0.5, face = "bold"))

Part 2: Data Analysis of the Data Frame “ToothGrowth”

Overview

The second part of this work analyses the data frame ToothGrowth in R providing a basic data summary and comparing the tooth growth by supplement given and dosage.

Loading data and basic data exploration

Let us start by loading the data frame and check the variables involved.

data("ToothGrowth")
str(ToothGrowth)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

Let us also check the head and the tail of the data frame:

head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
tail(ToothGrowth)
##     len supp dose
## 55 24.8   OJ    2
## 56 30.9   OJ    2
## 57 26.4   OJ    2
## 58 27.3   OJ    2
## 59 29.4   OJ    2
## 60 23.0   OJ    2

The summary of the data can be obtained simply taking:

summary(ToothGrowth)
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000

Investigation

In order to understand the relationship between the two different supplements and the tooth growth, let us start by analysing a boxplot of the tooth length by supplement.

ggplot(data = ToothGrowth) +
    geom_boxplot(aes(x = supp, colour = supp, y = len)) +
    xlab("Supplement") +
    ylab("Tooth Length") +
    ggtitle("Boxplot of Tooth Length by Supplement") +
    theme(plot.title = element_text(hjust = 0.5, face = "bold"))

At a glance, it looks like the OJ supplement has a slightly better effect than VC. Let us also check the sample relationship by dosage:

ggplot(data = ToothGrowth) +
    geom_boxplot(aes(x = supp, colour = supp, y = len)) +
    facet_grid(. ~ dose) +
    xlab("Supplement") +
    ylab("Tooth Length") +
    ggtitle("Boxplot of Tooth Length by Supplement and Dose") + 
    theme(plot.title = element_text(hjust = 0.5, face = "bold"))

This second plot shows that for low and medium dosage (0.5 and 1), supplement OJ has bigger effect than VC on tooth length.

Statistical testing:

Let us prove this rigorously with the help of the t-test. To this end, we will need to filter the data set by dose This can be done effectively with the aid of the package dplyr.

lowDosage <- filter(ToothGrowth, dose == 0.5)
medDosage <- filter(ToothGrowth, dose == 1)
highDosage <- filter(ToothGrowth, dose == 2)

Let us now check the t-test for the whole data set and for the data sets filtered by dosage.

## t.test for overall group
totTest <- t.test(len ~ supp, paired = FALSE, var.equal = FALSE, data = ToothGrowth)

## t.test for small dosage
lowTest <- t.test(len ~ supp, paired = FALSE, var.equal = FALSE, data = lowDosage)

## t.test for med dosage
medTest <- t.test(len ~ supp, paired = FALSE, var.equal = FALSE, data = medDosage)

## t.test for high dosage
highTest <- t.test(len ~ supp, paired = FALSE, var.equal = FALSE, data = highDosage)

Let us store the p-values of the tests carried out into a variable pValues and with check whether the test allows to reject the null hypothesis \(H_0\): the two supplements have identical effect.

pValues <- c(totTest$p.value, lowTest$p.value, medTest$p.value, highTest$p.value)
round(pValues, 3)
## [1] 0.061 0.006 0.001 0.964
names(pValues) <- c("tot", "low", "med", "high")
pValues < 0.05
##   tot   low   med  high 
## FALSE  TRUE  TRUE FALSE

As the code shows, the null hypothesis,

  • For the overall study, with a p-value of 0.061, there is not enough evidence to reject the null hypothesis.

  • For low and medium dose, there is actually enough evidence to reject the null hypothesis and infer that the effect of OJ is bigger than the effect of VC in the tooth length.

  • For the high dose, there is not enough evidence to reject the null hypothesis.

Assumptions and Conclusions

  • In the analysis, we have assumed that the overall population from which the sample was taken is at least approximately normally distributed, and that the population is also normally distributed with respect under each dose. We have also assumed that the sample is taken randomly and it represents the whole population.

  • In conclusion, we have enough evidence to reject the null hypothesis for low and medium dose, and we may infer that, for low and medium dosage, the supplement OJ has a bigger effect on tooth growth. However, in general, the overall effect is not statistically significant, as shown by the result of the test on the overall sample, at least at 5% significance level.